A Small Python Crawler for Kaixin001 (开心网) and Douban Diaries

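This small crawler is a fairly simple web scraper written in Python: it fetches diary posts from a given site (Kaixin001 or Douban) and saves them to local files. This article walks through the whole process, covering the implementation steps and two worked examples.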

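Environment setup

Before writing the crawler, install Python 3.x together with the requests and BeautifulSoup libraries. requests sends the HTTP requests, and BeautifulSoup parses the content of the returned HTML pages. Run the following commands on the command line to check the Python version, install both libraries, and confirm that they import correctly:

python --version
pip install requests
pip install beautifulsoup4
python -c "import requests, bs4; print(requests.__version__, bs4.__version__)"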

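Implementation steps

Understand the target site's URL structure

Before writing any code, work out the URL structure of the target site. Taking Kaixin001 diaries as an example, the article list URL looks like http://www.kaixin001.com/!home/blog/get_articles.php?page=1&uid=123456, where page=1 is the page number of the article list and uid=123456 is the author's unique ID. To collect every list page, loop over the values of the page parameter.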
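The URL pattern above can be turned into a small helper that builds one URL per list page. This is only a sketch: the function name list_page_url and the constant BASE_URL are made up for illustration, and the uid value is the placeholder from the text.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in the text
BASE_URL = 'http://www.kaixin001.com/!home/blog/get_articles.php'


def list_page_url(uid, page):
    """Build the article list URL for one author and one page number."""
    query = urlencode({'page': page, 'uid': uid})
    return f'{BASE_URL}?{query}'


# One URL per list page; the page range here is arbitrary
urls = [list_page_url('123456', page) for page in range(1, 4)]
for url in urls:
    print(url)
```

Building the query string with urlencode rather than by hand keeps the parameters correctly escaped if an ID ever contains special characters.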

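Send an HTTP request to fetch the article list

Once the URL structure is known, use the requests library to send an HTTP GET request and fetch the HTML of the article list page. Example code: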

import requests

url = 'http://www.kaixin001.com/!home/blog/get_articles.php?page=1&uid=123456'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get(url, headers=headers)

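Note that, to reduce the chance of the target site identifying the program as a crawler, you should set a browser-like User-Agent header so that the request resembles normal browsing. To find one, open the target page in a browser, open the developer tools, and copy the User-Agent request header from an entry in the Network panel.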
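Since every request in this crawler carries the same User-Agent, one possible arrangement is to put the header on a requests.Session and wrap the GET call in a helper. The sketch below is an assumption about how that could look, not part of the original crawler: fetch_html, its timeout, and its delay are hypothetical choices.

```python
import time

import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# One Session reuses connections and sends the same headers on every request
session = requests.Session()
session.headers.update(HEADERS)


def fetch_html(url, timeout=10, delay_seconds=1.0):
    """Hypothetical helper: GET a page, fail loudly on HTTP errors, then pause."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    time.sleep(delay_seconds)    # space out consecutive requests
    return response.text
```

The timeout keeps a stalled connection from hanging the script, and the short pause between requests makes the traffic look less like a burst from a bot.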

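Parse the article list from the HTML page

BeautifulSoup makes it easy to parse the HTML. Here it is used to extract each article's title, author, publication time, and detail page URL from the list page. Example code: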

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
articles = soup.find_all('div', class_='article_title')

for article in articles:
    link = article.find('a')
    title = link.text
    url = link.get('href')
    # The author and publication time sit in the sibling div after the title
    meta = article.find_next_sibling('div', class_='article_title_author')
    author = meta.find('a').text
    create_time = meta.find('span').text

    # Next step: request the detail page URL to fetch the article content
    # ...
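The loop above assumes every title block contains a link and is followed by an author block. find() and find_next_sibling() return None when nothing matches, so a more defensive variant might look like the sketch below. The sample HTML is invented for illustration and only mirrors the class names used in the text; the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Invented markup that mirrors the class names used above
SAMPLE_HTML = """
<div class="article_title"><a href="/blog/1.html">First entry</a></div>
<div class="article_title_author"><a href="/u/1">Alice</a><span>2012-01-01</span></div>
<div class="article_title"><span>no link here</span></div>
"""


def parse_articles(html):
    """Return one dict per article, skipping title blocks without a usable link."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for block in soup.find_all('div', class_='article_title'):
        link = block.find('a')
        if link is None or not link.get('href'):
            continue  # nothing to follow, so skip this block
        meta = block.find_next_sibling('div', class_='article_title_author')
        author_tag = meta.find('a') if meta else None
        time_tag = meta.find('span') if meta else None
        results.append({
            'title': link.get_text(strip=True),
            'url': link['href'],
            'author': author_tag.get_text(strip=True) if author_tag else None,
            'create_time': time_tag.get_text(strip=True) if time_tag else None,
        })
    return results


articles = parse_articles(SAMPLE_HTML)
print(articles)
```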

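Send an HTTP request to fetch the article detail

With the detail page URL obtained in the previous step, send another HTTP GET request with requests to fetch the content of the individual article. Example code: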

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
content = soup.find('div', class_='article_body').text

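Here article_body is the class name of the div element that holds the article text on the detail page. You can confirm it by inspecting the corresponding text in the browser's developer tools.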
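The href read from the list page may be a relative path rather than a full URL, in which case requests.get() cannot use it directly. A minimal sketch of resolving it against the list page URL with the standard library follows; the helper name absolute_url and the example paths are assumptions, not values taken from the real site.

```python
from urllib.parse import urljoin

# The list page the links were scraped from (placeholder uid from the text)
LIST_URL = 'http://www.kaixin001.com/!home/blog/get_articles.php?page=1&uid=123456'


def absolute_url(href, base=LIST_URL):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, href)


# A site-relative path is resolved against the host of the list page
print(absolute_url('/blog/detail.php?id=1'))
# An href that is already absolute is returned unchanged
print(absolute_url('http://example.com/post/2'))
```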

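Save the article content to a local file

Finally, save the fetched article content to a local file. Taking a .txt file as an example, use Python's os and io modules to create a new file and write the article content into it. Example code:

import os
import io

filename = title + '.txt'
# Create the articles folder first; open() fails if the directory is missing
os.makedirs('articles', exist_ok=True)
filepath = os.path.join('.', 'articles', filename)
with io.open(filepath, 'w', encoding='utf-8') as file:
    file.write(content)

Here filename is the article title plus the .txt suffix, and filepath is the full path of the file, built with os.path.join(); articles is the name of the folder that holds the saved articles. The folder has to exist under the current working directory before the file is opened, which is why the code calls os.makedirs() first. In Python 3, io.open is the same function as the built-in open.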
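Article titles can contain characters that are not allowed in file names, such as / or ?, and using the raw title would then make open() fail or write to an unexpected path. The sketch below shows one way to sanitize the title first. The helper names safe_filename and save_article, the replacement character, and the length limit are all illustrative assumptions.

```python
import re
from pathlib import Path


def safe_filename(title, max_length=100):
    """Replace characters that are unsafe in file names and append .txt."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', '_', title).strip(' ._')
    return (cleaned[:max_length] or 'untitled') + '.txt'


def save_article(title, content, directory='articles'):
    """Write content to directory/<sanitized title>.txt and return the path."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    path = folder / safe_filename(title)
    path.write_text(content, encoding='utf-8')
    return path


print(safe_filename('a/b: c?'))
```

Two different titles can still map to the same file name after sanitizing, in which case the later article overwrites the earlier one; adding an index or article ID to the name would avoid that.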

Examples

Taking Douban diaries as an example, the complete code is as follows:

import os
import io
import requests
from bs4 import BeautifulSoup

# Douban ID of the target author
douban_user_id = 123456

# Send an HTTP request to fetch the article list
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
list_url = f'https://www.douban.com/people/{douban_user_id}/notes'
response = requests.get(list_url, headers=headers)

# Parse the article list out of the HTML page
soup = BeautifulSoup(response.text, 'html.parser')
articles = soup.find_all('div', class_='title')

# Make sure the output folder exists before writing any files
os.makedirs('articles', exist_ok=True)

for article in articles:
    link = article.find('a')
    if link is None:
        continue  # skip title blocks that contain no link
    title = link.text.strip()
    url = link.get('href')

    # Send an HTTP request to fetch the article detail page
    detail_response = requests.get(url, headers=headers)
    detail_soup = BeautifulSoup(detail_response.text, 'html.parser')
    content = detail_soup.find('div', class_='note')

    # Save the article content to a local file
    filename = title + '.txt'
    filepath = os.path.join('.', 'articles', filename)
    with io.open(filepath, 'w', encoding='utf-8') as file:
        file.write(str(content))

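In this example the Douban notes URL is built directly from the user ID; the script then reads each article's title and detail page URL from the list and saves the articles to the local articles folder. Note that str(content) writes the matched element's HTML markup, tags included, rather than its plain text, so the saved files may contain unwanted formatting. If no div with class note is found, content is None and the file contains the string None.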
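If only the readable text is wanted, BeautifulSoup's get_text() can replace str(content). The following is a minimal sketch on invented sample markup; the helper name extract_text is hypothetical and the real Douban page structure is not verified here.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the real detail page will be more complex
SAMPLE_HTML = '<div class="note"><p>First paragraph.</p><p>Second paragraph.</p></div>'


def extract_text(html, css_class):
    """Return the plain text of the first div with the given class, or ''."""
    node = BeautifulSoup(html, 'html.parser').find('div', class_=css_class)
    if node is None:
        return ''  # avoid writing the string 'None' to the file
    # Join the text fragments with newlines and trim surrounding whitespace
    return node.get_text('\n', strip=True)


text = extract_text(SAMPLE_HTML, 'note')
print(text)
```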

Taking Kaixin001 diaries as an example, several article list pages have to be crawled. The complete code is as follows:

import os
import io
import requests
from bs4 import BeautifulSoup

# Unique Kaixin001 ID of the target author
kx_uid = '123456'

# Send HTTP requests to fetch several article list pages
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Make sure the output folder exists before writing any files
os.makedirs('articles', exist_ok=True)

for page in range(1, 3):
    list_url = f'http://www.kaixin001.com/!home/blog/get_articles.php?page={page}&uid={kx_uid}'
    response = requests.get(list_url, headers=headers)

    # Parse the article list out of the HTML page
    soup = BeautifulSoup(response.text, 'html.parser')
    articles = soup.find_all('div', class_='article_title')

    for article in articles:
        link = article.find('a')
        title = link.text.strip()
        url = link.get('href')
        # The author and publication time sit in the sibling div after the title
        meta = article.find_next_sibling('div', class_='article_title_author')
        author = meta.find('a').text
        create_time = meta.find('span').text

        # Send an HTTP request to fetch the article detail page
        detail_response = requests.get(url, headers=headers)
        detail_soup = BeautifulSoup(detail_response.text, 'html.parser')
        content = detail_soup.find('div', class_='content')

        # Save the article content to a local file
        filename = title + '.txt'
        filepath = os.path.join('.', 'articles', filename)
        with io.open(filepath, 'w', encoding='utf-8') as file:
            file.write(str(content))

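This example uses Python's range() function to loop over the article list pages and fetch several of them. To keep testing simple, range(1, 3) fetches only the first 2 pages. Note that on Kaixin001 the article text is not embedded directly in the detail page; the page loads it asynchronously through JavaScript. The soup.find('div', class_='content') call may therefore find nothing, in which case you need to open an article page in the browser, inspect the tags or requests that carry the article text, and work out the selectors or regular expressions needed to extract it.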
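A fixed range() only works when the number of pages is known in advance. One alternative, sketched below under the assumption that a page past the end returns no articles, is to keep requesting pages until an empty one comes back. The function iter_list_pages and the fetch_articles callable are hypothetical; fetch_articles stands in for whatever function downloads and parses one list page.

```python
import time


def iter_list_pages(fetch_articles, max_pages=50, delay_seconds=1.0):
    """Yield (page, articles) until a page is empty or max_pages is reached."""
    for page in range(1, max_pages + 1):
        articles = fetch_articles(page)
        if not articles:
            break  # an empty page is treated as the end of the list
        yield page, articles
        time.sleep(delay_seconds)  # pause before requesting the next page


# Stand-in for a real fetch-and-parse function: only pages 1 and 2 have articles
fake_site = {1: ['post A', 'post B'], 2: ['post C']}
pages = list(iter_list_pages(lambda page: fake_site.get(page, []), delay_seconds=0))
print(pages)
```

The max_pages cap is a safety net in case the site keeps returning the last page instead of an empty one.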

Unless otherwise stated, articles on this site are original. If you repost this article, please credit the source: python 开心网和豆瓣日记爬取的小爬虫 - Python技术站
