Three Python Web Scraper Projects with Example Code

May 14, 2023, 8:37 PM • python

yizhihongxing

A Complete Guide to Three Python Web Scraper Projects

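Project overview

This project provides three example scrapers for beginners learning web scraping with Python:

Scraping book information from the Douban Books Top 250 list
Scraping product information and review counts from Tmall
Scraping open-source project information from GitHub

Each project's code covers both fetching the data and storing it, so beginners can study it and use it for hands-on practice.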

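Project goals

Across the three scraper projects, we will learn:

How to fetch a page's HTML with an HTTP request
How to pick out the data we need with regular expressions
How to locate page elements with XPath and CSS selectors
How to store, process, and visualize data with Python code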
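
To make the difference between the regular-expression and XPath approaches concrete, here is a minimal sketch that extracts the same fields both ways. The HTML snippet, its class names, and both function names are invented for illustration. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed markup, so real pages usually need a dedicated HTML parser such as lxml.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet invented for this example.
SAMPLE = """
<ul id="books">
  <li class="book"><a href="/b/1" title="Book One">Book One</a><span class="score">9.1</span></li>
  <li class="book"><a href="/b/2" title="Book Two">Book Two</a><span class="score">8.7</span></li>
</ul>
"""


def extract_with_regex(html):
    """Pull (href, title) pairs straight out of the raw text."""
    return re.findall(r'<a href="(.*?)" title="(.*?)">', html)


def extract_with_xpath(html):
    """Parse the snippet into a tree and select the fields by path."""
    root = ET.fromstring(html)
    rows = []
    for li in root.findall(".//li[@class='book']"):
        link = li.find("a")
        score = li.find("span[@class='score']")
        rows.append((link.get("href"), link.get("title"), float(score.text)))
    return rows


print(extract_with_regex(SAMPLE))
print(extract_with_xpath(SAMPLE))
```

The regular expression depends on the exact order and spacing of the attributes, while the path-based version depends only on the element structure, which tends to survive small markup changes better.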

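Implementation

Project 1: Scraping book information from the Douban Books Top 250 list

Target: the Douban Books Top 250 list, including each book's title, author, publisher, publication date, and rating
Steps:
1. Send an HTTP request with the requests library to get the HTML
2. Filter out the required data with a regular expression
3. Store the data in a CSV file
Sample code (it extracts only each book's title and link):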

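import csv
import re

import requests


def get_html(url):
    """Fetch a page and return its HTML, or None on a non-200 response."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        return response.text
    return None


def parse_html(html):
    """Return a list of (link, title) tuples, one per book entry."""
    # re.S lets '.' match newlines, because each entry spans several lines.
    pattern = re.compile(
        r'<div class="pl2">.*?<a href="(.*?)" title="(.*?)">', re.S)
    return pattern.findall(html)


def save_csv(items):
    """Append one row per book (title, link) to books.csv."""
    with open('books.csv', 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for link, title in items:
            writer.writerow([title, link])


if __name__ == '__main__':
    # The list shows 25 books per page, selected by the start parameter.
    for i in range(0, 250, 25):
        url = 'https://book.douban.com/top250?start=' + str(i)
        html = get_html(url)
        if html is None:
            # Skip a failed page instead of passing None to the regex.
            continue
        items = parse_html(html)
        save_csv(items)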
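
The target above also lists author, publisher, publication date, and rating. The following is a rough sketch of how the regular expression could be extended to capture them. It assumes each entry carries a `<p class="pl">` line of the form "author / publisher / date / price" and a `<span class="rating_nums">` element; that markup, the sample data, and the names BOOK_PATTERN and parse_books are assumptions made for this example and should be checked against the live page.

```python
import re

# Hypothetical entry; the markup layout is an assumption, not a copy of the real page.
SAMPLE = '''
<div class="pl2">
  <a href="https://book.douban.com/subject/0000000/" title="Example Book">Example Book</a>
</div>
<p class="pl">Jane Doe / Example Press / 2006-5 / 29.00元</p>
<span class="rating_nums">9.0</span>
'''

BOOK_PATTERN = re.compile(
    r'<div class="pl2">.*?<a href="(.*?)" title="(.*?)">'
    r'.*?<p class="pl">(.*?)</p>'
    r'.*?<span class="rating_nums">(.*?)</span>',
    re.S,
)


def parse_books(html):
    """Return one dict per book with the fields named in the project target."""
    books = []
    for link, title, info, rating in BOOK_PATTERN.findall(html):
        parts = [p.strip() for p in info.split('/')]
        # Translators may add extra fields in the middle, so count from the end.
        has_all = len(parts) >= 4
        books.append({
            'title': title,
            'link': link,
            'author': parts[0],
            'publisher': parts[-3] if has_all else '',
            'pub_date': parts[-2] if has_all else '',
            'rating': float(rating),
        })
    return books


print(parse_books(SAMPLE))
```

Counting the publisher and date from the end of the info line is a design choice: the number of names at the front can vary, while the trailing fields are assumed to keep their order.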

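Project 2: Scraping product information and review counts from Tmall

Target: Tmall product listings, including product name, price, sales volume, and number of reviews
Steps:
1. Use the Selenium library to drive a browser and get the rendered HTML
2. Extract the required data with XPath or CSS selectors
3. Store the data in MongoDB
Sample code: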

from lxml import etree
from pymongo import MongoClient
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def get_html(url):
    """Load the page in headless Chrome and return the rendered HTML."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    # The chrome_options= keyword is deprecated; pass options= instead.
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        assert '天猫tmall.com' in driver.title
        # Wait until the product list is present in the DOM.
        WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, 'J_ItemList')))
        return driver.page_source
    finally:
        # Always close the browser, even if the assert or the wait fails.
        driver.quit()


def parse_html(html):
    """Extract name, price, sales, and review count for each product."""
    # page_source is a plain string; build an lxml tree before running XPath.
    tree = etree.HTML(html)
    items = []
    product_list = tree.xpath('//div[@id="J_ItemList"]/div[@class="product "]/div[@class="product-iWrap"]')
    for product in product_list:
        item = {}
        item['name'] = product.xpath('.//p[@class="productTitle"]/a/@title')[0]
        item['price'] = product.xpath('.//p[@class="productPrice"]/em/@title')[0]
        item['sales'] = product.xpath('.//div[@class="productStatus"]/span[@class="productStatus-sellCount"]/em/text()')[0]
        item['comments'] = product.xpath('.//div[@class="productStatus"]/span[@class="productStatus-comment"]/a/text()')[0]
        items.append(item)
    return items


def save_mongodb(items):
    """Insert the items into the goods collection of the tmall database."""
    if not items:
        # insert_many raises an error when given an empty list.
        print('No items to insert')
        return
    client = MongoClient('localhost', 27017)
    db = client['tmall']
    collection = db['goods']
    result = collection.insert_many(items)
    print(f'Inserted {len(result.inserted_ids)} items')
    client.close()


if __name__ == '__main__':
    url = 'https://list.tmall.com/search_product.htm?q=%CD%AF%D2%A9%C6%B7&sort=d&style=w&from=mallfp..pc_1_searchbutton'
    html = get_html(url)
    items = parse_html(html)
    save_mongodb(items)
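
Indexing an XPath result with `[0]` raises IndexError as soon as one product lacks a field, which is common on listing pages. The sketch below shows one way to fall back to a default instead. To keep it runnable without a browser or third-party packages, it swaps lxml for the standard library's xml.etree.ElementTree and parses an invented, well-formed snippet; the markup and the helper names attr_or_default, text_or_default, and parse_products are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing markup; the second product is missing price and sales.
SAMPLE = """
<div id="J_ItemList">
  <div class="product">
    <p class="productTitle"><a title="Sample item A">Sample item A</a></p>
    <p class="productPrice"><em title="19.90">19.90</em></p>
    <span class="productStatus-sellCount"><em>120</em></span>
  </div>
  <div class="product">
    <p class="productTitle"><a title="Sample item B">Sample item B</a></p>
  </div>
</div>
"""


def attr_or_default(node, path, attr, default=None):
    """Return an attribute of the first match, or default if nothing matches."""
    found = node.find(path)
    return found.get(attr, default) if found is not None else default


def text_or_default(node, path, default=None):
    """Return the stripped text of the first match, or default if absent."""
    found = node.find(path)
    if found is None or found.text is None:
        return default
    return found.text.strip()


def parse_products(markup):
    """Build one dict per product, tolerating missing fields."""
    root = ET.fromstring(markup)
    items = []
    for product in root.findall("./div[@class='product']"):
        items.append({
            'name': attr_or_default(product, ".//p[@class='productTitle']/a", 'title'),
            'price': attr_or_default(product, ".//p[@class='productPrice']/em", 'title'),
            'sales': text_or_default(product, ".//span[@class='productStatus-sellCount']/em"),
        })
    return items


print(parse_products(SAMPLE))
```

The same idea carries over to lxml: check whether the list returned by `xpath()` is empty before taking its first element.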

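Project 3: Scraping open-source project information from GitHub

Target: open-source projects on GitHub, including project name, author, description, and star count
Steps:
1. Fetch the project information through the GitHub API
2. Parse the JSON data and extract the required fields
3. Store the data in a MySQL database
Sample code: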

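import mysql.connector
import requests


def get_data(url):
    """Call the GitHub API and return the decoded JSON, or None on failure."""
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        return response.json()
    return None


def parse_data(data):
    """Keep only the name, author, description, and star count of each repo."""
    items = []
    for repo in data:
        item = {}
        item['name'] = repo['name']
        item['author'] = repo['owner']['login']
        # description can be null in the API response, which becomes None here.
        item['description'] = repo['description']
        item['stars'] = repo['stargazers_count']
        items.append(item)
    return items


def save_mysql(items):
    """Insert the items into the repos table of the github database."""
    # Assumes the github database and a repos table with these four columns already exist.
    conn = mysql.connector.connect(user='root', password='root', database='github')
    cursor = conn.cursor()
    rows = [(item['name'], item['author'], item['description'], item['stars']) for item in items]
    # Parameterized placeholders let the driver escape the values.
    cursor.executemany(
        'INSERT INTO repos (name, author, description, stars) VALUES (%s, %s, %s, %s)', rows)
    conn.commit()
    cursor.close()
    conn.close()


if __name__ == '__main__':
    url = 'https://api.github.com/search/repositories?q=language:python&sort=stars'
    data = get_data(url)
    if data is None:
        # Unauthenticated requests are rate-limited, so the call can fail.
        raise SystemExit('Request to the GitHub API failed')
    items = parse_data(data['items'])
    save_mysql(items)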
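
To try the storage step without a running MySQL server, the same insert logic can be exercised against an in-memory SQLite database. This is a stand-in written for illustration, not the article's MySQL code: the sample records and the save_sqlite function are invented, and SQLite uses `?` or `:name` placeholders where MySQL Connector uses `%s`.

```python
import sqlite3

# Hypothetical records shaped like the dicts built by parse_data.
SAMPLE_ITEMS = [
    {'name': 'repo-a', 'author': 'alice', 'description': 'First sample', 'stars': 120},
    {'name': 'repo-b', 'author': 'bob', 'description': None, 'stars': 45},
]


def save_sqlite(items, conn):
    """Create the repos table if needed and insert one row per item."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS repos '
        '(name TEXT, author TEXT, description TEXT, stars INTEGER)')
    # Named placeholders are filled from the keys of each dict.
    conn.executemany(
        'INSERT INTO repos (name, author, description, stars) '
        'VALUES (:name, :author, :description, :stars)', items)
    conn.commit()


conn = sqlite3.connect(':memory:')
save_sqlite(SAMPLE_ITEMS, conn)
rows = conn.execute('SELECT name, stars FROM repos ORDER BY stars DESC').fetchall()
print(rows)
```

A missing description is stored as NULL, so the table should allow NULL in that column, in MySQL as well.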

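Summary

This article walked through the implementation of three example Python scraper projects, with the aim of helping beginners consolidate what they have learned and gain hands-on skills. The projects also cover the basic techniques of Python web scraping, such as HTTP requests, data extraction, and data storage. If you still have questions while working through the projects, feel free to ask.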

Unless otherwise noted, articles on this site are original. When reposting, please credit the source: 三个python爬虫项目实例代码 - Python技术站
