Building a crawler for Pengfu (捧腹网) joke pages with Python 3

This is a complete walkthrough of building a crawler for the joke pages of Pengfu (捧腹网) with Python 3:

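I. Preparation

1. Install Python 3

First install Python 3. An installer can be downloaded from the official website, python.org.

2. Install the third-party libraries requests and BeautifulSoup4

A web crawler in Python is usually built on third-party libraries. This guide uses two: requests to download pages and BeautifulSoup4 to parse them. Install both first: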

pip install requests
pip install beautifulsoup4
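
After installing, it can help to confirm that both packages are importable before writing any crawler code. The following is a minimal sketch using only the standard library; the helper name missing_packages is made up for this illustration. Note that beautifulsoup4 is imported under the name bs4.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # beautifulsoup4 is installed under that name but imported as "bs4"
    missing = missing_packages(["requests", "bs4"])
    print("missing:", missing if missing else "none")
```

If either name is reported as missing, rerun the matching pip install command with the same Python interpreter you use to run the crawler.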

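II. Analyzing the page structure

Before crawling, analyze the page that holds the data you want. The main steps are:

1. Open the page

The get function of the requests module fetches a page's HTML: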

import requests

url = 'http://www.pengfu.com/xiaohua_1.html'
response = requests.get(url)
print(response.text)
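
A bare requests.get call waits indefinitely on a stalled connection and silently returns error pages. The sketch below wraps the call with a timeout and an HTTP status check. The function name fetch_html, the User-Agent string, and the 10-second timeout are assumptions chosen for illustration, not values required by the site.

```python
import requests

# Hypothetical defaults for illustration; adjust as needed.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; joke-crawler-demo)"}
DEFAULT_TIMEOUT = 10  # seconds


def fetch_html(url, session=None):
    """Download `url` and return its HTML text, raising on HTTP errors."""
    # Use the given session if there is one, else the module-level API
    client = session or requests
    response = client.get(url, headers=DEFAULT_HEADERS, timeout=DEFAULT_TIMEOUT)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

Calling fetch_html('http://www.pengfu.com/xiaohua_1.html') would then stand in for the get call and response.text above. Passing a requests.Session as session reuses one connection across many pages.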

2. Parse the HTML

BeautifulSoup4 parses the HTML into a tree structure, which makes the later data extraction easier.

import requests
from bs4 import BeautifulSoup

url = 'http://www.pengfu.com/xiaohua_1.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
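
To see what the parsed tree offers without touching the network, the same parser can be run on a small inline string. The HTML below is a hand-written stand-in, not markup from the real site.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for a real response body.
html = """
<html><head><title>Demo</title></head>
<body>
  <div class="joke"><h1>First</h1><p>Hello</p></div>
  <div class="joke"><h1>Second</h1><p>World</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                    # text of the <title> tag
first_heading = soup.find("h1").text              # first match in document order
headings = [h.text for h in soup.find_all("h1")]  # every match

print(page_title, first_heading, headings)
```

find returns the first matching tag (or None when nothing matches), while find_all returns all of them.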

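3. Locate the data

Inspect the page source to find which tags and attributes hold the data you need. On Pengfu's joke pages, for example, each joke is wrapped in a <div> tag whose class attribute is content-img clearfix pt10 relative.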

<div class="content-img clearfix pt10 relative">
    <h1>每个人都有一颗明亮的星星……</h1>
    <div class="content-txt pt10">
        <a href="/xiaohua/202003/3547640.html" target="_blank" class="link">
            <img src="/uploads/allimg/200324/1-200324142123.jpg" height="170" width="170">
            <span class="content-span"></span>
        </a>
        小兔子问母兔子：“每个人都有一颗明亮的星星，可是我没有一颗明亮的星星，这是怎么回事啊？”
    </div>
</div>
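
When a class attribute holds several class names, passing the whole string to class_ only matches tags whose attribute is written exactly that way. A CSS selector on a single class is less brittle. The sketch below contrasts the two on hand-written markup; the second and third div elements are invented for the comparison.

```python
from bs4 import BeautifulSoup

html = """
<div class="content-img clearfix pt10 relative"><h1>A</h1></div>
<div class="relative content-img pt10 clearfix"><h1>B</h1></div>
<div class="sidebar"><h1>C</h1></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match: only the div whose class attribute is written this way
exact = soup.find_all("div", class_="content-img clearfix pt10 relative")

# CSS selector: any div carrying the content-img class, in any order
by_selector = soup.select("div.content-img")

exact_titles = [d.h1.text for d in exact]
selector_titles = [d.h1.text for d in by_selector]

print(exact_titles, selector_titles)
```

Either approach works as long as the site keeps its markup unchanged; the selector simply tolerates reordered or added class names.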

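III. Extracting the data

Once you know which tags and attributes hold the data, extract it with BeautifulSoup4's search methods. To get every match rather than just the first, use find_all, which returns a list of matching tags.

For example, the following code gets the title, text, and image link of every joke on the first page: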

import requests
from bs4 import BeautifulSoup

url = 'http://www.pengfu.com/xiaohua_1.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Each joke sits in a <div> whose class attribute is exactly this string
items = soup.find_all('div', class_='content-img clearfix pt10 relative')
for item in items:
    title = item.h1.text.strip()  # joke title
    content = item.find('div', class_='content-txt').text.strip()  # joke text
    img = item.img  # None if the joke has no image
    img_url = img['src'] if img else ''  # image link, if any
    print(title, content, img_url)
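
Separating parsing from downloading makes the extraction logic easy to try on saved or hand-written HTML. The following is a minimal sketch under that assumption; the function name parse_jokes and the sample markup are illustrative, and missing tags are mapped to empty strings rather than raising.

```python
from bs4 import BeautifulSoup


def parse_jokes(html):
    """Return one dict per joke block found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    jokes = []
    for item in soup.select("div.content-img"):
        heading = item.find("h1")
        body = item.find("div", class_="content-txt")
        image = item.find("img")
        jokes.append({
            "title": heading.get_text(strip=True) if heading else "",
            "content": body.get_text(strip=True) if body else "",
            "img_url": image.get("src", "") if image else "",
        })
    return jokes


# Hand-written sample: one joke with an image, one without.
sample = """
<div class="content-img clearfix pt10 relative">
  <h1>With image</h1>
  <div class="content-txt pt10">
    <a href="/xiaohua/1.html"><img src="/uploads/a.jpg"></a>
    Joke text one.
  </div>
</div>
<div class="content-img clearfix pt10 relative">
  <h1>Text only</h1>
  <div class="content-txt pt10">Joke text two.</div>
</div>
"""
jokes = parse_jokes(sample)
for joke in jokes:
    print(joke)
```

In the crawler, parse_jokes(response.text) would take the place of the loop body that reads the title, text, and image link.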

IV. Crawling multiple pages

To crawl several pages, change the page number in the URL. The first page is http://www.pengfu.com/xiaohua_1.html, so the first 10 pages are taken to run from http://www.pengfu.com/xiaohua_1.html to http://www.pengfu.com/xiaohua_10.html.

The code is as follows:

import requests
from bs4 import BeautifulSoup

for page in range(1, 11):  # pages 1 through 10
    url = 'http://www.pengfu.com/xiaohua_{}.html'.format(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    items = soup.find_all('div', class_='content-img clearfix pt10 relative')
    for item in items:
        title = item.h1.text.strip()
        content = item.find('div', class_='content-txt').text.strip()
        img = item.img  # None if the joke has no image
        img_url = img['src'] if img else ''
        print(title, content, img_url)
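
When requesting many pages in a row, it is considerate to pause between requests. The sketch below builds the page URLs from a template and calls a fetch function with a delay in between. The names page_urls and crawl, the one-second default delay, and the URL template (which assumes only the page number changes between list pages, as in xiaohua_1.html) are assumptions for illustration.

```python
import time


def page_urls(template, first, last):
    """Build one URL per page number from `first` to `last`, inclusive."""
    return [template.format(page) for page in range(first, last + 1)]


def crawl(urls, fetch, delay=1.0):
    """Call `fetch(url)` for each URL, pausing `delay` seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index:  # no pause before the first request
            time.sleep(delay)
        results.append(fetch(url))
    return results


# Assumed pattern: only the page number changes between list pages.
urls = page_urls("http://www.pengfu.com/xiaohua_{}.html", 1, 10)
print(urls[0], urls[-1])
```

Here fetch can be any callable that takes a URL, for example a small function that wraps requests.get and returns response.text.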

That completes the walkthrough of building a crawler for Pengfu joke pages with Python 3. Hopefully it helps.

Example 1:

To get each joke's like count and comment count, follow the joke's link to its detail page, download that page's HTML, and parse the two counts out of it.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

for page in range(1, 11):  # pages 1 through 10
    url = 'http://www.pengfu.com/xiaohua_{}.html'.format(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    items = soup.find_all('div', class_='content-img clearfix pt10 relative')
    for item in items:
        title = item.h1.text.strip()
        content = item.find('div', class_='content-txt').text.strip()
        img = item.img  # None if the joke has no image
        img_url = img['src'] if img else ''

        # Fetch the detail page; the href is site-relative, so resolve it first
        detail_url = urljoin(url, item.a['href'])
        detail_response = requests.get(detail_url)
        detail_soup = BeautifulSoup(detail_response.text, 'html.parser')
        num_list = detail_soup.find_all('span', class_='stats-vote')
        vote_num = num_list[0].i.text  # like count
        comment_num = num_list[1].i.text  # comment count

        print(title, content, img_url, vote_num, comment_num)
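
The links in the sample markup are site-relative (for example /xiaohua/202003/3547640.html), and requests.get needs a full URL with a scheme and host. The standard library's urljoin resolves a link against the page it came from, as this short illustration shows:

```python
from urllib.parse import urljoin

page_url = "http://www.pengfu.com/xiaohua_1.html"

# A site-relative href, as in the sample markup
relative = urljoin(page_url, "/xiaohua/202003/3547640.html")

# An href that is already absolute is returned unchanged
absolute = urljoin(page_url, "http://www.pengfu.com/xiaohua/202003/3547640.html")

print(relative)
print(absolute)
```

Both calls print http://www.pengfu.com/xiaohua/202003/3547640.html, so the same line of code handles relative and absolute links alike.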

Example 2:

To save the data to a local file, use Python's built-in file handling.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

with open('pengfu_jokes.txt', 'w', encoding='utf-8') as f:
    for page in range(1, 11):  # pages 1 through 10
        url = 'http://www.pengfu.com/xiaohua_{}.html'.format(page)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        items = soup.find_all('div', class_='content-img clearfix pt10 relative')
        for item in items:
            title = item.h1.text.strip()
            content = item.find('div', class_='content-txt').text.strip()
            img = item.img  # None if the joke has no image
            img_url = img['src'] if img else ''

            # Fetch the detail page; the href is site-relative, so resolve it first
            detail_url = urljoin(url, item.a['href'])
            detail_response = requests.get(detail_url)
            detail_soup = BeautifulSoup(detail_response.text, 'html.parser')
            num_list = detail_soup.find_all('span', class_='stats-vote')
            vote_num = num_list[0].i.text  # like count
            comment_num = num_list[1].i.text  # comment count

            # One field per line, with a blank line between jokes
            f.write(title + '\n')
            f.write(content + '\n')
            f.write(img_url + '\n')
            f.write(vote_num + '\n')
            f.write(comment_num + '\n')
            f.write('\n')
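
A one-field-per-line text file becomes ambiguous as soon as a joke's text contains a line break. One alternative, sketched here with only the standard library, is to write each joke as a JSON object on its own line. The function names save_jsonl and load_jsonl and the record layout are assumptions for illustration.

```python
import json


def save_jsonl(records, path):
    """Write each record as one JSON object per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps Chinese text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read the file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

In the crawler, each joke would be collected as a dict such as {'title': title, 'content': content, 'img_url': img_url} and the whole list passed to save_jsonl once the loop finishes.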

Unless otherwise noted, articles on this site are original. If you repost, please credit the source: python3制作捧腹网段子页爬虫 - Python技术站