Scraping the Weibo Hot Search Page with Python

This guide walks through scraping the Weibo hot search page with Python, step by step:

1. Prerequisites

Before scraping the Weibo hot search page, complete the following setup steps:

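1.1 Install Python

The scraper is written in Python, so a Python environment must be installed on your machine. Python 3 is recommended; you can download the installer from the official website (python.org).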

1.2 Install the requests library

The requests library sends HTTP requests and returns the response content. Install it from a terminal with:

pip install requests

1.3 Install the BeautifulSoup library

BeautifulSoup is a Python HTML parsing library that makes it easy to navigate the structure of an HTML page. Install it from a terminal with:

pip install beautifulsoup4
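
A quick way to confirm that both installs succeeded is to import the packages and print their versions. This is a minimal sanity check, not a required step; note that the beautifulsoup4 package is imported under the module name bs4.

```python
# Sanity check: both imports should succeed after the pip commands above.
import requests
import bs4  # the beautifulsoup4 package installs under the module name "bs4"

versions = {
    "requests": requests.__version__,
    "beautifulsoup4": bs4.__version__,
}
for name, version in versions.items():
    print(f"{name} {version}")
```

If either import raises ModuleNotFoundError, pip most likely installed into a different Python environment than the one running the script.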

2. Fetching the page source

First, fetch the source of the Weibo hot search page. The get function of the requests library sends a GET request and returns the page's HTML.

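import requests

url = 'https://s.weibo.com/top/summary'

# Browser-like User-Agent so the request is not rejected as a script.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# The timeout keeps the script from hanging if the server never responds.
response = requests.get(url, headers=headers, timeout=10)
html = response.content.decode('utf-8')

print(html)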

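Weibo applies anti-scraping measures, so the headers include a User-Agent string that makes the request look like it comes from a browser. A User-Agent alone may not always be enough: if the printed HTML is a login or visitor-verification page rather than the hot search list, the request may also need a Cookie header copied from a browser session.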
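
The fetch step can be wrapped in a small helper that fails loudly when the request does not succeed. This is a sketch under assumptions: the names build_headers and fetch_html, the 10-second timeout, and the optional cookie and session arguments are illustrative choices, not something the site or the code above prescribes.

```python
import requests

HOT_SEARCH_URL = "https://s.weibo.com/top/summary"
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
)


def build_headers(cookie=None):
    """Return browser-like headers; the cookie is optional (hypothetical helper)."""
    headers = {"User-Agent": USER_AGENT}
    if cookie:
        headers["Cookie"] = cookie
    return headers


def fetch_html(url=HOT_SEARCH_URL, cookie=None, timeout=10, session=None):
    """Fetch a page and return its text, raising on HTTP error statuses."""
    http = session or requests  # a requests.Session also exposes .get()
    response = http.get(url, headers=build_headers(cookie), timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    response.encoding = "utf-8"  # decode the body as UTF-8
    return response.text
```

Calling fetch_html() with no arguments requests the hot search page. Nothing is fetched at import time, so the helper can be reused from other scripts.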

3. Parsing the page content

With the page source in hand, use BeautifulSoup to parse the page structure and extract the content we need.

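from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

# The hot search list is rendered as a table: td-01 cells hold the rank,
# td-02 cells hold the title link. Adjust if the page layout changes.
cells = soup.find_all('td', class_='td-02')

for cell in cells:
    a = cell.find('a')
    title = a.get_text()
    link = 'https://s.weibo.com' + a['href']
    print(title, link)

The code above uses find_all to locate every td element whose class is td-02, the table cell that holds each entry's title link. In each cell, it takes the text of the first a element and builds the full link by prefixing the site address to the href. Finally, it prints each hot search title and its link. The link is stored in a separate variable so that the request url is not overwritten inside the loop.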
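
To see the parsing step in isolation, the same logic can be run against a small inline HTML fragment. The fragment below is made up for illustration and assumes a table-style layout (a rank cell with class td-01 and a title cell with class td-02); the real page's markup may differ and can change at any time. The function name parse_hot_search is likewise hypothetical. urljoin from the standard library stands in for manual string concatenation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://s.weibo.com"

# Hypothetical fragment for illustration; not copied from the live page.
SAMPLE_HTML = """
<table><tbody>
  <tr><td class="td-01">1</td>
      <td class="td-02"><a href="/weibo?q=topic-a">Topic A</a><span> 1200</span></td></tr>
  <tr><td class="td-01">2</td>
      <td class="td-02"><a href="/weibo?q=topic-b">Topic B</a><span> 900</span></td></tr>
</tbody></table>
"""


def parse_hot_search(html):
    """Return a list of (title, absolute_url) tuples from the title cells."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for cell in soup.find_all("td", class_="td-02"):
        a = cell.find("a")
        if a is None or not a.get("href"):
            continue  # skip cells without a usable link
        results.append((a.get_text(strip=True), urljoin(BASE_URL, a["href"])))
    return results


for title, link in parse_hot_search(SAMPLE_HTML):
    print(title, link)
```

Returning tuples instead of printing inside the loop keeps the parsing reusable, for example for the slicing and filtering shown in the examples.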

Examples

Here are two examples:

Example 1

Suppose we want the titles and links of the top ten hot search entries.

import requests
from bs4 import BeautifulSoup

url = 'https://s.weibo.com/top/summary'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

response = requests.get(url, headers=headers, timeout=10)
html = response.content.decode('utf-8')

soup = BeautifulSoup(html, 'html.parser')

# Title links sit in td-02 cells; adjust if the page layout changes.
cells = soup.find_all('td', class_='td-02')

# Slicing keeps only the first ten cells, in page order.
for cell in cells[:10]:
    a = cell.find('a')
    title = a.get_text()
    link = 'https://s.weibo.com' + a['href']
    print(title, link)

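In this example, the for loop iterates over only the first ten elements of the result list, so it prints the titles and links of the first ten hot search entries on the page.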
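
An alternative sketch of the same idea lets the parser stop after N matches rather than slicing afterward. It uses BeautifulSoup's select method with a CSS selector and its limit argument. The selector td.td-02 a, the function name top_titles, and the generated sample rows are assumptions for illustration and are not taken from the live page.

```python
from bs4 import BeautifulSoup

# Build 12 hypothetical rows so the limit has something to cut off.
rows = "".join(
    f'<tr><td class="td-01">{i}</td>'
    f'<td class="td-02"><a href="/weibo?q=t{i}">Topic {i}</a></td></tr>'
    for i in range(1, 13)
)
sample_html = f"<table><tbody>{rows}</tbody></table>"


def top_titles(html, n=10):
    """Return the text of the first n title links in document order."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("td.td-02 a", limit=n)  # stop after n matches
    return [a.get_text(strip=True) for a in links]


print(top_titles(sample_html))
```

Both approaches return fewer than n items when the page has fewer entries, so neither needs a separate length check.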

Example 2

Suppose we want the titles and links of the hot search entries whose title contains the keyword "疫情" ("epidemic").

import requests
from bs4 import BeautifulSoup

url = 'https://s.weibo.com/top/summary'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

response = requests.get(url, headers=headers, timeout=10)
html = response.content.decode('utf-8')

soup = BeautifulSoup(html, 'html.parser')

# Title links sit in td-02 cells; adjust if the page layout changes.
cells = soup.find_all('td', class_='td-02')

for cell in cells:
    a = cell.find('a')
    title = a.get_text()
    # Keep only entries whose title contains the keyword.
    if '疫情' in title:
        link = 'https://s.weibo.com' + a['href']
        print(title, link)

In this example, the for loop adds a condition so that only entries whose title contains "疫情" are printed, along with their links.
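
The keyword check generalizes to several keywords at once. The sketch below works on plain (title, link) pairs so that it runs without network access; the function name filter_by_keywords and the sample entries are hypothetical.

```python
def filter_by_keywords(entries, keywords):
    """Keep (title, link) pairs whose title contains at least one keyword."""
    return [
        (title, link)
        for title, link in entries
        if any(keyword in title for keyword in keywords)
    ]


# Hypothetical entries standing in for parsed hot search results.
entries = [
    ("疫情最新通报", "https://s.weibo.com/weibo?q=a"),
    ("Weekend weather", "https://s.weibo.com/weibo?q=b"),
    ("各地疫情防控", "https://s.weibo.com/weibo?q=c"),
]

for title, link in filter_by_keywords(entries, ["疫情"]):
    print(title, link)
```

An empty keyword list matches nothing, because any() over an empty sequence is False.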
