Scraping Tencent News with Python: A Worked Example

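Scraping Tencent News with Python can be broken into the following steps:

Define the target: decide which page URL to scrape and which content to collect.
Fetch the page source: use Python's requests library to send a GET request to the target URL and retrieve the page's HTML source.
Parse the page source: use Python's BeautifulSoup library to parse the HTML into a BeautifulSoup object for later use.
Extract the target content: inspect the HTML structure, then use BeautifulSoup's search and filter methods to pull out the content.
Save the data: write the extracted content to CSV, JSON, a database, or a similar store.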
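
The five steps above can be sketched as one small pipeline. This is a minimal sketch rather than a definitive implementation: the function names (`fetch_html`, `extract_links`, `save_csv`) and the sample HTML are assumptions made for illustration, and only the parsing and saving steps are exercised here so that the sketch runs offline.

```python
import csv
import io

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Step 2: download the page and return its HTML source."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_links(html):
    """Steps 3 and 4: parse the HTML and pull out link text and href."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    # href=True skips <a> tags that have no href attribute.
    for a in soup.find_all('a', href=True):
        rows.append({'title': a.get_text(strip=True), 'link': a['href']})
    return rows


def save_csv(rows, f):
    """Step 5: write the rows to an open text file as CSV."""
    writer = csv.DictWriter(f, fieldnames=['title', 'link'])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample markup so the sketch runs without network access.
sample_html = '<ul><li><a href="/a/1">First story</a></li><li><a>No link</a></li></ul>'
rows = extract_links(sample_html)
buffer = io.StringIO()
save_csv(rows, buffer)
```

With network access, `extract_links(fetch_html('https://news.qq.com/'))` would feed the same functions with the live page.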

Two examples follow:

Example 1: Scraping the Tencent News list page

The Tencent News list page is https://news.qq.com/. We want every news title and link on that page.

Send a GET request to the target URL with the requests library:

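import requests

url = 'https://news.qq.com/'
# A timeout keeps the script from hanging on a stalled connection.
response = requests.get(url, timeout=10)
response.raise_for_status()  # raise an exception on a 4xx/5xx response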
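
A bare `requests.get` call can return an error page or mis-decoded text without any warning. The following is a hedged sketch of a slightly more defensive fetch; the helper names (`build_session`, `fetch_html`) and the User-Agent string are assumptions for illustration, not something the article states the site requires.

```python
import requests

# Hypothetical browser-like User-Agent; any realistic value could be used.
DEFAULT_HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; news-example/0.1)'}


def build_session():
    """Create a Session that sends the same headers on every request."""
    session = requests.Session()
    session.headers.update(DEFAULT_HEADERS)
    return session


def fetch_html(session, url, timeout=10):
    """Return the decoded HTML of url, raising on HTTP errors."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    # Use the encoding detected from the body, which helps when the
    # charset in the response header is missing or wrong.
    response.encoding = response.apparent_encoding
    return response.text


session = build_session()
```

A call such as `fetch_html(session, 'https://news.qq.com/')` would then return the page source as text.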

Parse the HTML source into a BeautifulSoup object:

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

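Use the find_all method to locate the tags that hold the news titles and links (the class name "text" depends on the page's current markup):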

news_list = soup.find_all('a', class_="text")

Extract the news titles and links:

result = []
for news in news_list:
    title = news.text.strip()
    link = news.get('href')  # None when the <a> tag has no href attribute
    if link:
        result.append({'title': title, 'link': link})
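
The `href` values on a list page are often relative or scheme-relative, and the same story can appear more than once. One possible clean-up pass is sketched below using only the standard library; the function name `clean_links` and the sample rows are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def clean_links(rows, base_url):
    """Make links absolute, drop non-HTTP links and empty titles, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        title = row['title'].strip()
        # urljoin resolves relative and scheme-relative links against base_url.
        link = urljoin(base_url, row['link'])
        if not title or urlparse(link).scheme not in ('http', 'https'):
            continue
        if link in seen:
            continue
        seen.add(link)
        cleaned.append({'title': title, 'link': link})
    return cleaned


# Hypothetical rows in the same shape as the `result` list.
sample = [
    {'title': ' Story A ', 'link': '//new.qq.com/a.html'},
    {'title': 'Story A', 'link': 'https://new.qq.com/a.html'},
    {'title': 'Menu', 'link': 'javascript:void(0)'},
    {'title': '', 'link': '/empty'},
]
cleaned = clean_links(sample, 'https://news.qq.com/')
```

Keeping the first occurrence of each link is a design choice of this sketch; a different rule, such as keeping the longest title, would work equally well.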

Save the extracted titles and links as CSV:

import csv
with open('news.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'link'])
    writer.writeheader()
    writer.writerows(result)
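
To check that the file was written as expected, the rows can be read back with `csv.DictReader`. A small sketch, using a temporary file and made-up rows so that it does not depend on a live page:

```python
import csv
import os
import tempfile

# Made-up rows standing in for the scraped `result` list.
rows = [{'title': '示例标题, with a comma', 'link': 'https://news.qq.com/x'}]

path = os.path.join(tempfile.mkdtemp(), 'news.csv')
with open(path, 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'link'])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back; DictReader uses the header row as the keys.
with open(path, encoding='utf-8', newline='') as f:
    loaded = list(csv.DictReader(f))
```

The title containing a comma is quoted automatically on write and restored on read, so `loaded` should match `rows`.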

Example 2: Scraping a Tencent News detail page

We want the title, publication time, author, and body text of a Tencent News detail page (for example https://new.qq.com/omn/20210907/20210907A0GN9I00.html).

Send a GET request to the target URL with the requests library:

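import requests

url = 'https://new.qq.com/omn/20210907/20210907A0GN9I00.html'
# A timeout keeps the script from hanging on a stalled connection.
response = requests.get(url, timeout=10)
response.raise_for_status()  # raise an exception on a 4xx/5xx response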

Parse the HTML source into a BeautifulSoup object:

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

Use the find method to locate the tags for the title, publication time, author, and body:

title = soup.find('h1').text.strip()  # first <h1> on the page
pub_time = soup.find('span', class_='article-time').text.strip()
author = soup.find('span', class_='author-name').text.strip()
content = soup.find('div', class_='content-article').text.strip()
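
`soup.find` returns `None` when nothing matches, so chaining `.text` onto it raises `AttributeError` if the page layout differs from what the selectors expect. Below is a hedged sketch of a helper that returns a default value instead; `text_or_default` is a hypothetical name, and the sample HTML is invented to exercise both the found and the missing case.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, name, class_name, default=''):
    """Return the stripped text of the first matching tag, or default if none."""
    tag = soup.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else default


# Invented markup: it has an article-time span but no author-name span.
sample_html = '<div><span class="article-time"> 2021-09-07 10:00 </span></div>'
soup = BeautifulSoup(sample_html, 'html.parser')

pub_time = text_or_default(soup, 'span', 'article-time')
author = text_or_default(soup, 'span', 'author-name', default='unknown')
```

Whether a missing field should become a default value or stop the script is a judgment call; a default suits batch runs where one malformed page should not abort the rest.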

Save the extracted title, publication time, author, and body as JSON:

import json
with open('news.json', 'w', encoding='utf-8') as f:
    json.dump({'title': title, 'pub_time': pub_time, 'author': author, 'content': content}, f, ensure_ascii=False)

That completes the walkthrough of scraping Tencent News with Python.

Unless otherwise noted, articles on this site are original. If you repost this one, please credit the source: Python采集腾讯新闻实例 - Python技术站
