Recursive Crawling with the Python Scraping Package BeautifulSoup: Worked Examples

This article walks through recursive crawling with BeautifulSoup in detail.

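1. Introduction

To get the most out of this article, you should have a basic grasp of Python programming and HTML. If you do not, it is worth learning the basics first.

In this article we use BeautifulSoup, a Python package widely used for web scraping, to crawl target data recursively. Recursive crawling means repeatedly following links into the next level of pages according to some rule, until there are no further pages to enter.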
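
The idea can be sketched without any network access. The snippet below is a minimal illustration, not code from the later examples: SITE is a made-up in-memory "site" in which each page maps to the pages it links to, and crawl is a hypothetical helper that keeps descending until it finds no unvisited page.

```python
# Hypothetical in-memory site: each page maps to the pages it links to.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": ["/"],      # links back to the home page
    "/a/1": [],       # leaf page: nothing further to follow
}

def crawl(page, visited=None):
    """Visit page, then recurse into every linked page not seen yet."""
    if visited is None:
        visited = []
    if page in visited:
        return visited          # already handled, stop here
    visited.append(page)
    for child in SITE.get(page, []):
        crawl(child, visited)
    return visited

order = crawl("/")
print(order)  # ['/', '/a', '/a/1', '/b']
```

The visited list is what keeps the back-link from "/b" to "/" from causing endless recursion.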

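2. Environment Setup

First, set up a Python environment on your local machine or on a server. You can download and install a suitable version of Python from the official Python website.

Next, use pip to install BeautifulSoup and the HTTP library requests with the following commands: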

pip install beautifulsoup4
pip install requests

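3. Introduction to BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.

For scraping, the methods we use most are find() and find_all(). Both look up elements in an HTML page by tag name, attribute values, and similar criteria: find() returns the first match (or None), and find_all() returns every match. For example: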

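from bs4 import BeautifulSoup
import requests

url = 'http://www.example.com/'
# A timeout keeps the request from hanging indefinitely
response = requests.get(url, timeout=10)
html_doc = response.text
soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first <p> tag (None if the page has none)
p = soup.find('p')

# Find all <p> tags (an empty result if the page has none)
ps = soup.find_all('p')

# Find the first <div> tag whose class attribute is "content"
div = soup.find('div', class_='content')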

With the code above we can locate the HTML tags we need and then extract information from them.
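
To make the extraction step concrete, here is a small offline sketch. The HTML string is invented for illustration, so the tag names and the "/next" link are assumptions rather than content from a real page.

```python
from bs4 import BeautifulSoup

# Invented sample markup, used instead of a live page
html_doc = """
<html><body>
  <div class="content">
    <p>First paragraph</p>
    <p>Second <a href="/next">link</a></p>
  </div>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Text of the first <p>
first_text = soup.find("p").get_text(strip=True)

# Text of every <p>, with nested strings joined by a space
all_texts = [p.get_text(" ", strip=True) for p in soup.find_all("p")]

# Attribute access works like a dictionary lookup
link_href = soup.find("a")["href"]

# find() returns None when nothing matches
missing = soup.find("table")

print(first_text, all_texts, link_href, missing)
```

Because find() can return None, it is worth checking the result before calling .text or indexing into it.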

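4. Recursive Crawling Examples

Below, two examples show how to use BeautifulSoup for recursive crawling.

Example 1: Scraping all notes of a Jianshu user

First we collect all article links from the target user's home page, then we visit each article page in turn and fetch that article's information.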

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

def get_article_urls(user_id, page):
    """
    Return the links to all articles on the given page of a user's article list.
    """
    url = f'https://www.jianshu.com/u/{user_id}?order_by=shared_at&page={page}'
    response = requests.get(url, timeout=10)
    html_doc = response.text
    soup = BeautifulSoup(html_doc, 'html.parser')

    article_elements_list = soup.find_all('a', class_='title', href=True)
    # href values may be relative (e.g. '/p/...'), so resolve them against the page URL
    article_urls = [urljoin(url, article['href']) for article in article_elements_list]

    return article_urls

def get_article_info(article_url):
    """
    Return the title, author, and content of a single article.
    """
    response = requests.get(article_url, timeout=10)
    html_doc = response.text
    soup = BeautifulSoup(html_doc, 'html.parser')

    # These selectors assume the page markup described in this article;
    # find() returns None if an element is missing, which would raise here
    title = soup.find('h1', class_='title').text
    author = soup.find('div', class_='author').find('a', class_='name').text
    content = soup.find('div', class_='show-content-free').text.strip()

    article_info = {
        'title': title,
        'author': author,
        'content': content
    }

    return article_info

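def spider_article(user_id, start_page, end_page):
    """
    Scrape the target user's articles from start_page up to, but not including, end_page.
    """
    for page in range(start_page, end_page):
        article_urls = get_article_urls(user_id, page)
        if not article_urls:
            # An empty page means we have run past the last page
            break

        for article_url in article_urls:
            article_info = get_article_info(article_url)
            print(article_info)

# '123456789' is a placeholder user ID; this call covers pages 1 to 4
spider_article('123456789', 1, 5)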

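The code above scrapes a Jianshu user's notes. get_article_urls collects all article links on a given page of the user's list, get_article_info fetches the details of a single article, and spider_article loops over the requested page range and prints each article's information.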
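
One practical weakness of chained find(...).text calls is that a single missing element raises AttributeError. A possible guard is sketched below; text_or_default is a hypothetical helper name, and the sample HTML is invented to stand in for an article page.

```python
from bs4 import BeautifulSoup

def text_or_default(soup, tag_name, css_class, default=""):
    """Return the stripped text of the first matching element, or default."""
    element = soup.find(tag_name, class_=css_class)
    if element is None:
        return default
    return element.get_text(strip=True)

# Invented markup: it has a title and a body, but no author link
sample_html = (
    '<h1 class="title"> Hello </h1>'
    '<div class="show-content-free">Body text</div>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

title = text_or_default(sample_soup, "h1", "title")
author = text_or_default(sample_soup, "a", "name", default="unknown")
print(title, author)  # Hello unknown
```

With a helper like this, a page whose layout differs from what the scraper expects yields a default value instead of stopping the whole run.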

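Note that when crawling real sites you have to deal with anti-scraping measures and disguise the crawler sensibly, for example by sending browser-like request headers, so that the site does not block it.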
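
As a rough sketch of what that can look like with requests, the snippet below sets a User-Agent header on a shared session and pauses before each request. make_session and polite_get are hypothetical helper names, and the header value and one-second delay are arbitrary assumptions, not values any particular site requires.

```python
import time

import requests

def make_session(user_agent):
    """Create a session that sends the given User-Agent with every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session

def polite_get(session, url, delay_seconds=1.0, timeout=10):
    """Wait briefly, then fetch url; raise on HTTP error status codes."""
    time.sleep(delay_seconds)
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response

# Example header value (an assumption; use whatever is appropriate for your case)
session = make_session("Mozilla/5.0 (compatible; example-crawler)")
# response = polite_get(session, "http://www.example.com/")
```

Headers and delays only reduce the chance of being blocked; it is also worth checking a site's robots.txt and terms of use before crawling it.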

Example 2: Crawling a whole site to build a site map

Recursive crawling can also be used to crawl an entire site. Here is a simplified implementation:

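import requests
from bs4 import BeautifulSoup
import re

already_crawled_url = set()  # URLs that have already been crawled

def get_links(url):
    """
    Return all <a> elements on the page whose href is an absolute http(s) URL.
    """
    response = requests.get(url, timeout=10)
    html_doc = response.text
    soup = BeautifulSoup(html_doc, 'html.parser')
    # Anchor the pattern so only hrefs that start with http:// or https:// match
    links = soup.find_all('a', href=re.compile(r'^https?://'))
    return links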

def dfs_crawl(url):
    """
    Recursively crawl every link reachable from url (depth-first).
    """
    already_crawled_url.add(url)
    print(url)

    links = get_links(url)
    for link in links:
        link_url = link.attrs['href']
        # Skip links that have already been crawled
        if link_url not in already_crawled_url:
            dfs_crawl(link_url)

dfs_crawl('http://www.example.com')

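Running this code prints every URL reachable from http://www.example.com to the console. It uses depth-first search (DFS), following each link recursively. Note that it follows every absolute link, including links that lead to other sites, and it has no depth limit, so on a large link graph it can run for a very long time or hit Python's recursion limit. In real use you should restrict the crawl and plan its depth sensibly to avoid runaway loops and similar problems.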
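
One way to add such limits is sketched below. This is an illustrative variant, not the code above: crawl takes a max_depth, stays on the starting host, and receives the page-fetching function as a parameter so it can be tried offline. PAGES, fake_fetch, and the site.test URLs are invented for the demonstration.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

def crawl(url, fetch, max_depth, visited=None, depth=0):
    """Depth-first crawl that stays on url's host and stops at max_depth."""
    if visited is None:
        visited = set()
    if depth > max_depth or url in visited:
        return visited
    visited.add(url)
    soup = BeautifulSoup(fetch(url), "html.parser")
    for anchor in soup.find_all("a", href=True):
        next_url = urljoin(url, anchor["href"])   # resolve relative links
        if urlparse(next_url).netloc == urlparse(url).netloc:
            crawl(next_url, fetch, max_depth, visited, depth + 1)
    return visited

# Invented pages standing in for a real site
PAGES = {
    "http://site.test/": '<a href="/a">a</a><a href="http://other.test/">x</a>',
    "http://site.test/a": '<a href="/b">b</a><a href="/">home</a>',
    "http://site.test/b": '<a href="/c">c</a>',
    "http://site.test/c": "",
}

def fake_fetch(url):
    """Return the stored HTML for url, or an empty page if unknown."""
    return PAGES.get(url, "")

found = crawl("http://site.test/", fake_fetch, max_depth=2)
print(sorted(found))
```

Here the page at /c is never visited because it sits at depth 3, and other.test is skipped because it is a different host. For a live crawl, fetch could be something like lambda u: requests.get(u, timeout=10).text.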

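5. Summary

This article showed how to use BeautifulSoup to crawl pages recursively and reinforced the idea with two practical examples. In real projects there are other issues to consider as well, such as disguising the crawler's identity and cleaning the scraped data. I hope this article is useful in your work.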

Unless otherwise stated, articles on this site are original. If you repost this article, please credit the source: Python爬虫包 BeautifulSoup 递归抓取实例详解 - Python技术站