Hands-On Practice: Scraping Page Content with the requests Module in Python 3

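When it comes to scraping web data, Python's requests module is one of the most widely used tools. This walkthrough covers the basics of fetching page content with requests in Python 3, along with two problems you are likely to run into.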

1. Preparation

First, install the requests module. Run the following command in a terminal:

pip3 install requests

We also need a website to use as the scraping target.

Suppose the site we want to scrape is: https://www.jianshu.com/c/bDHhpK

2. Basic usage

2.1 Sending a GET request

The get function in the requests module sends a GET request and returns the page content.

import requests

url = 'https://www.jianshu.com/c/bDHhpK'

response = requests.get(url)

print(response.text)

The code above sends a GET request with requests and prints the response body as a string.
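
In practice it helps to set a timeout and check the HTTP status before using the body. The following is a minimal sketch under stated assumptions, not a definitive implementation: the function name fetch_page and the 10-second default timeout are choices made here for illustration.

```python
import requests


def fetch_page(url, timeout=10):
    """Fetch url and return the response body as text.

    fetch_page and the default timeout are illustrative choices,
    not part of the requests API.
    """
    # Without a timeout, requests can wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_page('https://www.jianshu.com/c/bDHhpK') would then return the page HTML, or raise an exception if the request fails.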

2.2 Sending a POST request

requests can also send POST requests to retrieve page content.

import requests

url = 'https://www.jianshu.com/search/do'

data = {
    'q': 'Python',
    'page': '1',
    'type': 'notebook'
}

response = requests.post(url, data=data)

print(response.text)

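The code above sends a POST request and prints the response body as a string. The data argument holds the form fields to send; requests form-encodes the dictionary into the request body.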
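
To see what the data argument actually sends, a request can be prepared without being sent. This sketch uses requests.Request and its prepare method to inspect the encoded body. The helper name build_search_request is hypothetical, and the form fields are simply the ones from the example above.

```python
import requests


def build_search_request(url, query, page=1):
    """Prepare (but do not send) a form-encoded POST request.

    build_search_request is a hypothetical helper for illustration.
    """
    data = {"q": query, "page": str(page), "type": "notebook"}
    # prepare() form-encodes the dict into the body, as happens
    # for requests.post(url, data=data).
    return requests.Request("POST", url, data=data).prepare()


prepared = build_search_request("https://www.jianshu.com/search/do", "Python")
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # q=Python&page=1&type=notebook
```

Nothing is sent over the network here, which makes this a convenient way to check the encoded body before running the real request.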

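3. Common problems and solutions

3.1 UnicodeDecodeError

When scraping with requests, you may hit a UnicodeDecodeError or see garbled characters in the output. This happens when the page uses an encoding other than the usual utf-8, such as gb2312 or GBK, and the bytes are decoded with the wrong codec. Request headers do not change the page's encoding; the fix is to decode the raw bytes in response.content with the page's actual encoding (or to set response.encoding before reading response.text). Sending browser-like headers, as in the example below, is a separate measure that helps the server return the same page a browser would see.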

import requests

url = 'https://www.example.com'

headers = {
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
}

response = requests.get(url, headers=headers)

# Decode the raw bytes with the page's actual encoding (gb2312 in this example)
print(response.content.decode('gb2312'))

In the code above, we pass a headers argument with the request, and we decode the page content by passing the gb2312 encoding to decode.
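
When the page's encoding is not known in advance, one option is to try a short list of candidate codecs in order. The sketch below uses only the standard library. The function name decode_body and the candidate list are assumptions for illustration; gb18030 is chosen because it covers the gb2312 and GBK character sets.

```python
def decode_body(raw, candidates=("utf-8", "gb18030")):
    """Decode raw bytes using the first candidate codec that succeeds.

    decode_body and its default candidates are illustrative choices.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # This codec does not fit the bytes; try the next one.
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return raw.decode(candidates[0], errors="replace")


print(decode_body("中文".encode("gbk")))    # 中文
print(decode_body("中文".encode("utf-8")))  # 中文
```

With requests, this would be called as decode_body(response.content). Order matters: utf-8 is tried first because it rejects most non-utf-8 input, whereas gb18030 accepts many byte sequences.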

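3.2 Dynamically generated page parameters

Some pages rely on parameters that are generated dynamically. In that case we need to inspect the page with an analysis tool to find the values of those parameters, then pass them along manually in the scraper.

For example, to scrape Weibo search result pages, we first need to request https://weibo.com and extract some key parameters from it before calling the search endpoint.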

import requests
from bs4 import BeautifulSoup

url = 'https://weibo.com/'
search_url = 'https://s.weibo.com/weibo?q={}&typeall=1&suball=1&timescope=custom:{}:{}&Refer=g'

session = requests.Session()

# First request https://weibo.com/ to obtain some key parameters
response = session.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if 'pl_login_form' in script.text:
        # Extract the st parameter
        st_index = script.text.index('STK') + 6
        ed_index = script.text.index(',', st_index)
        st = script.text[st_index:ed_index].strip().strip('\'')

        # Extract the pcid parameter
        pcid_index = script.text.index('PCID') + 7
        ed_index = script.text.index(',', pcid_index)
        pcid = script.text[pcid_index:ed_index].strip().strip('\'')

        headers = {
            'referer': url
        }
        cookies = {
            'SUB': '...',
            'SCF': '...',
            'SSOLoginState': '...',
            'SUHB': '...',
            'ALF': '...',
            'wvr': '6'
        }
        session.headers.update(headers)
        session.cookies.update(cookies)
        response = session.get(search_url.format('Python', '2021-01-01', '2021-01-02'))

        print(response.text)

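In the code above, we use requests.Session so that headers and cookies are stored on the session and reused by later requests. We use BeautifulSoup to parse the page and locate the values of the key parameters, and str.format to fill the search keyword and time range into the search URL. Note that st and pcid are extracted here but not passed to the search request; as written, the request relies on the Referer header and the login cookies, whose values are elided.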
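
The fixed offsets used with str.index in the code above (such as + 6) break as soon as the surrounding script changes, and index raises ValueError when the marker is missing. A regular expression is one more tolerant alternative. The sketch below is an assumption-laden illustration: extract_param is a hypothetical helper, and the sample script text is made up and may not match the real page markup.

```python
import re


def extract_param(script_text, name):
    """Return the value assigned to name in script_text, or None."""
    # Match `name: 'value'` or `name = "value"`, with optional quotes.
    pattern = re.escape(name) + r"""\s*[:=]\s*['"]?([^'",;\s}]+)"""
    match = re.search(pattern, script_text)
    return match.group(1) if match else None


# Hypothetical script content; the real page markup may differ.
sample = "var config = {STK: 'abc123', PCID: 'xyz789'};"
print(extract_param(sample, "STK"))      # abc123
print(extract_param(sample, "PCID"))     # xyz789
print(extract_param(sample, "MISSING"))  # None
```

Returning None for a missing parameter lets the caller decide whether to retry, skip the page, or stop, instead of failing inside the parsing loop.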
