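Implementing a Simple Web Crawler Framework in Python

This walkthrough explains, step by step, how to build a simple web crawler framework in Python. It covers the following parts in order: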

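Choosing a Target and Installing the Required Libraries

First, decide which site you want to scrape and pick suitable libraries. In Python, Requests and BeautifulSoup4 are a common pairing: the former sends HTTP requests and returns the responses, and the latter parses HTML and XML documents.

Install Requests: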

pip install requests

Install BeautifulSoup4:

pip install beautifulsoup4

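In this article, the target is Douban Movies.

Fetching the Page Content

Use the requests library to send an HTTP request and retrieve the HTML of the Douban Top 250 movie list: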

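import requests

url = "https://movie.douban.com/top250"

# Douban tends to reject the default requests User-Agent, so send a browser-like one
headers = {"User-Agent": "Mozilla/5.0"}

# Send a GET request and receive the HTML response (with a timeout so it cannot hang)
response = requests.get(url, headers=headers, timeout=10)

# Print the HTTP status code
print(response.status_code)

# Print the HTML source
print(response.text)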

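If the status code is 200, we now have the HTML of the Douban movie list and can extract the information we need from it.

Parsing the HTML

Use BeautifulSoup4 to parse the HTML and pull out the fields of interest: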
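A request can fail for reasons other than a bad status code, such as timeouts or transient server errors. The following is a minimal sketch, not part of the original script, of a reusable session that sets a User-Agent once and retries a few times. The helper names `make_session` and `fetch`, the retry count, and the list of retried status codes are illustrative assumptions.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(user_agent="Mozilla/5.0"):
    """Build a Session that sends a fixed User-Agent and retries transient failures."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    # Assumed policy: up to 3 retries with exponential backoff on these status codes.
    retry = Retry(total=3, backoff_factor=0.5,
                  status_forcelist=[429, 500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


def fetch(session, url, timeout=10):
    """Return the response body, raising requests.HTTPError on a 4xx/5xx status."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

With this in place, `fetch(make_session(), "https://movie.douban.com/top250")` would either return the HTML or raise an exception that the caller can log.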

from bs4 import BeautifulSoup

# Parse the HTML source into a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# Get the list of movies
movie_list = soup.find("ol", class_="grid_view").find_all("li")

# Walk the list and extract each movie's title, rating, number of ratings, and one-line quote
for movie in movie_list:
    title = movie.find("span", class_="title").text.strip()
    star = movie.find("span", class_="rating_num").text.strip()
    comments = movie.find("div", class_="star").find_all("span")[3].text.strip()
    # Some entries have no quote, so guard against a missing tag
    quote_tag = movie.find("span", class_="inq")
    quote = quote_tag.text.strip() if quote_tag is not None else ""

    print(title, star, comments, quote)

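At this point the HTML has been parsed and the required fields extracted; here we simply print them.

Building the Simple Crawler Framework

Next, we wrap the two steps above into a small crawler framework so that they can be called repeatedly.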
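Because `find` returns `None` when a tag is absent, chained calls such as `.find(...).text` raise `AttributeError` on any entry that lacks a field. The sketch below shows one way to parse defensively. It runs against a small hand-written HTML fragment that only imitates the class names used above (the fragment, `text_or_default`, and `parse_movies` are hypothetical and not taken from the live page).

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the class names used above; not real Douban markup.
SAMPLE_HTML = """
<ol class="grid_view">
  <li>
    <span class="title">Movie A</span>
    <div class="star"><span class="rating_num">9.7</span></div>
    <span class="inq">A short quote.</span>
  </li>
  <li>
    <span class="title">Movie B</span>
    <div class="star"><span class="rating_num">9.0</span></div>
  </li>
</ol>
"""


def text_or_default(tag, default=""):
    """Return the stripped text of a tag, or a default when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default


def parse_movies(html):
    """Extract title, rating, and quote for each list item, tolerating missing tags."""
    soup = BeautifulSoup(html, "html.parser")
    movie_list = soup.find("ol", class_="grid_view")
    if movie_list is None:  # e.g. an error page or a changed layout
        return []
    movies = []
    for item in movie_list.find_all("li"):
        movies.append({
            "title": text_or_default(item.find("span", class_="title")),
            "star": text_or_default(item.find("span", class_="rating_num")),
            "quote": text_or_default(item.find("span", class_="inq")),
        })
    return movies


if __name__ == "__main__":
    for movie in parse_movies(SAMPLE_HTML):
        print(movie)
```

The second sample entry has no quote, so its `quote` field comes back as an empty string instead of raising an exception.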

import requests
from bs4 import BeautifulSoup

def get_html(url):
    # Send a GET request with a browser-like User-Agent and a timeout
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    # Return the HTML only if the request succeeded
    if response.status_code == 200:
        return response.text
    else:
        return None

def parse_html(html):
    # Parse the HTML source into a BeautifulSoup object
    soup = BeautifulSoup(html, "html.parser")
    # Get the list of movies
    movie_list = soup.find("ol", class_="grid_view").find_all("li")
    # Walk the list and extract each movie's title, rating, number of ratings, and one-line quote
    for movie in movie_list:
        title = movie.find("span", class_="title").text.strip()
        star = movie.find("span", class_="rating_num").text.strip()
        comments = movie.find("div", class_="star").find_all("span")[3].text.strip()
        # Some entries have no quote, so guard against a missing tag
        quote_tag = movie.find("span", class_="inq")
        quote = quote_tag.text.strip() if quote_tag is not None else ""
        yield {"title": title, "star": star, "comments": comments, "quote": quote}

def main():
    url = "https://movie.douban.com/top250"
    html = get_html(url)
    if html:
        for item in parse_html(html):
            print(item)

if __name__ == "__main__":
    main()

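The code above is a simple crawler framework: each call to main() fetches the first page of the Douban Top 250 list and prints one record per movie.

Here are two more examples.

Example 1: Fetching the price of a JD.com product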
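The framework above requests a single URL, so it only sees one page of the list. As a rough sketch of how it could be extended, the helper below generates one URL per page. It assumes the list is paginated through a `start` query parameter in steps of 25, which should be checked against the live site; `page_urls` and its parameters are hypothetical names.

```python
from urllib.parse import urlencode


def page_urls(base_url="https://movie.douban.com/top250", total=250, page_size=25):
    """Yield one URL per page, assuming offset-based pagination via ?start=N."""
    for start in range(0, total, page_size):
        yield f"{base_url}?{urlencode({'start': start})}"


if __name__ == "__main__":
    for url in page_urls():
        print(url)
```

Each generated URL could then be passed to `get_html` and `parse_html` in turn, ideally with a short `time.sleep` between requests to avoid overloading the server.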

import requests
from bs4 import BeautifulSoup

def get_html(url):
    # Send a GET request with a browser-like User-Agent and a timeout
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    # Return the HTML only if the request succeeded
    if response.status_code == 200:
        return response.text
    else:
        return None

def parse_html(html):
    # Parse the HTML source into a BeautifulSoup object
    soup = BeautifulSoup(html, "html.parser")
    # Look up the price element; it may be absent if the page fills in the price with JavaScript
    price_tag = soup.find("span", class_="p-price")
    price = price_tag.find("i") if price_tag is not None else None
    if price is not None:
        yield {"price": price.text}

def main():
    url = "https://item.jd.com/100020357988.html"
    html = get_html(url)
    if html:
        for item in parse_html(html):
            print(item)

if __name__ == "__main__":
    main()

Example 2: Fetching the upvote count of a Zhihu answer

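import requests
from bs4 import BeautifulSoup

def get_html(url):
    # Send a GET request with a browser-like User-Agent and a timeout
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    # Return the HTML only if the request succeeded
    if response.status_code == 200:
        return response.text
    else:
        return None

def parse_html(html):
    # Parse the HTML source into a BeautifulSoup object
    soup = BeautifulSoup(html, "html.parser")
    # Find the upvote button; a multi-word class_ string matches only that exact class attribute value
    button = soup.find("button", class_="Button VoteButton VoteButton--up")
    icon = button.find("span", class_="Icon ContentItemVoteArrowUp") if button is not None else None
    # The upvote count is the span that follows the arrow icon
    count = icon.find_next_sibling("span") if icon is not None else None
    if count is not None:
        yield {"upvotes": count.text}

def main():
    url = "https://www.zhihu.com/question/48285414/answer/210963224"
    html = get_html(url)
    if html:
        for item in parse_html(html):
            print(item)

if __name__ == "__main__":
    main()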
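The three scripts share the same get_html() and main() and differ only in the URL and in parse_html(). One possible way to remove that duplication, sketched here under the assumption that a fetcher returns None on failure as get_html() does, is a runner that takes the fetcher and parser as arguments. `run_crawler`, `fake_fetch`, and `word_parser` are hypothetical names, and the stand-ins exist only so the sketch runs without network access.

```python
def run_crawler(urls, fetch, parse):
    """Fetch each URL and yield every item the parser extracts from it."""
    for url in urls:
        html = fetch(url)
        if html is None:  # same convention as get_html: None means the request failed
            continue
        yield from parse(html)


# Stand-in fetcher (hypothetical): returns canned text instead of calling the network.
def fake_fetch(url):
    pages = {"https://example.com/a": "alpha beta", "https://example.com/b": None}
    return pages.get(url)


# Stand-in parser (hypothetical): yields one record per whitespace-separated word.
def word_parser(html):
    for word in html.split():
        yield {"word": word}


if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/b"]
    for item in run_crawler(urls, fake_fetch, word_parser):
        print(item)
```

In real use, `fetch` would be get_html() and `parse` would be whichever site-specific parse_html() applies, so adding a new site only requires writing a new parser.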

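That completes the walkthrough of building a simple crawler framework in Python. I hope it helps.

Unless otherwise noted, articles on this site are original. If you repost, please credit the source: 使用Python实现简单的爬虫框架 - Python技术站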
