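Python Web Scraping Examples Explained

Basic Concepts of Web Scraping

A web scraper (crawler) is a program that automatically visits pages on the internet and extracts the information it needs. Typical applications are search engine crawling and the collection and analysis of data from websites.

The basic workflow is: send a request -> parse the content -> store the data. Real projects involve many more details and pitfalls than this outline suggests; the two examples below walk through the workflow in practice.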
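
The three stages can be sketched with the standard library alone. This is a minimal illustration, not a production crawler: the function names `fetch`, `parse`, and `store`, the `LinkParser` class, and the sample HTML are all hypothetical, and the demonstration runs on an in-memory page so that no network access is needed.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch(url, timeout=10):
    """Stage 1: send the request and return the page body as text."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class LinkParser(HTMLParser):
    """Stage 2 helper: collect (href, text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse(page):
    """Stage 2: turn raw HTML into a list of (href, text) rows."""
    parser = LinkParser()
    parser.feed(page)
    return parser.links


def store(rows, out):
    """Stage 3: write the rows as CSV to any text file-like object."""
    writer = csv.writer(out)
    writer.writerow(["href", "text"])
    writer.writerows(rows)


# Offline demonstration on a hypothetical snippet; use fetch(url) for a live page.
SAMPLE_PAGE = '<ul><li><a href="/a">First</a></li><li><a href="/b"> Second </a></li></ul>'
links = parse(SAMPLE_PAGE)
buffer = io.StringIO()
store(links, buffer)
print(buffer.getvalue())
```

Keeping the stages as separate functions makes it possible to test parsing and storage without touching the network.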

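Example 1: Scraping the Weibo Hot Search List

Implementation Steps

Import the required modules: requests, lxml, and pandas;
Send the request with requests and fetch the page source;
Parse the page with lxml and extract the required content with XPath;
Save the results with pandas.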
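
One pitfall in the extraction step is worth illustrating: if ranks and topics are collected as two independent lists, a single row that lacks one of them shifts every later pairing. The sketch below pairs the two fields row by row instead. It is an assumption-laden illustration: the table snippet is hypothetical, and it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, so that it runs without third-party packages. Real pages are rarely well-formed and would still need `lxml.html`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed table; the first row has no rank number.
SAMPLE = """
<table>
  <tr><td class="td-01"></td><td class="td-02"><a>Pinned topic</a></td></tr>
  <tr><td class="td-01 ranktop">1</td><td class="td-02"><a>Topic A</a></td></tr>
  <tr><td class="td-01 ranktop">2</td><td class="td-02"><a>Topic B</a></td></tr>
</table>
"""

table = ET.fromstring(SAMPLE)
pairs = []
for row in table.findall(".//tr"):
    # Both lookups are relative to the same row, so they cannot drift apart.
    rank = row.findtext("td[@class='td-01 ranktop']")
    title = row.findtext("td[@class='td-02']/a")
    if rank and title:
        pairs.append((rank.strip(), title.strip()))

print(pairs)
```

The row without a rank is skipped rather than silently shifting the remaining ranks onto the wrong topics.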

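Code Example

import requests
from lxml import html
import pandas as pd

url = 'https://s.weibo.com/top/summary?cate=realtimehot'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
# A timeout keeps the script from hanging; raise_for_status surfaces HTTP errors early
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
tree = html.fromstring(response.content)

# Read rank and topic from the same table row so the two lists stay aligned.
# If the site serves a login or verification page instead, no rows match
# and both lists stay empty.
hot_list, title_list = [], []
for row in tree.xpath('//tr[td[@class="td-02"]]'):
    hot = row.xpath('string(td[@class="td-01 ranktop"])').strip()
    title = row.xpath('string(td[@class="td-02"]/a)').strip()
    if not hot:
        continue  # skip rows that have no rank cell
    hot_list.append(hot)
    title_list.append(title)
    print(hot, title)

df_weibo = pd.DataFrame({
    'rank': hot_list,
    'topic': title_list,
})
# utf_8_sig writes a byte order mark so that Excel detects the encoding
df_weibo.to_csv('weibo_hot_search.csv', index=False, encoding='utf_8_sig')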

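Code Walkthrough

The import statements load requests, lxml.html, and pandas;
url and headers set the request URL and the request headers;
requests.get sends the request and returns the response;
html.fromstring parses the page source with lxml;
The XPath expressions extract the hot search ranks and topics into hot_list and title_list;
The for loop prints the results;
pd.DataFrame collects the results, and to_csv writes them to a CSV file.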
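
The example saves its CSV with the `utf_8_sig` encoding. The sketch below shows what that codec does, using only the standard library and a made-up row: it prepends the UTF-8 byte order mark, which spreadsheet programs such as Excel use to recognise the file as UTF-8 when it contains Chinese text.

```python
import codecs
import csv
import io

rows = [("1", "示例话题")]  # hypothetical row standing in for scraped data

raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding="utf_8_sig", newline="")
writer = csv.writer(text)
writer.writerow(["rank", "topic"])
writer.writerows(rows)
text.flush()

data = raw.getvalue()
# The first three bytes are the byte order mark added by utf_8_sig
print(data[:3] == codecs.BOM_UTF8)
# Decoding with the same codec strips the mark again
print(data.decode("utf_8_sig").splitlines())
```

Plain `utf-8` would produce the same bytes minus the leading mark, which is fine for most tools but can show up as garbled characters when the file is opened directly in Excel.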

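Example 2: Scraping Game Information from Steam

Implementation Steps

Import the required modules: requests, BeautifulSoup, and time;
Send the request with requests and fetch the page source;
Parse the page with BeautifulSoup and extract the required content;
Save the results, adding a suitable delay between requests.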
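
The delay in the last step matters mostly when several pages are fetched in a row. A minimal sketch of that pattern follows; the `crawl` function, its parameters, and the URLs are hypothetical, and the fetch and sleep functions are passed in so that the demonstration runs instantly and offline.

```python
import time


def crawl(urls, fetch, delay=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing `delay` seconds between requests."""
    results = {}
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay)  # no pause is needed before the very first request
        results[url] = fetch(url)
    return results


# Demonstration with stand-ins: a fake fetch and a sleep that only records pauses.
pauses = []
pages = crawl(
    ["https://example.com/1", "https://example.com/2", "https://example.com/3"],
    fetch=lambda url: "page for " + url,
    delay=0.5,
    sleep=pauses.append,
)
print(pauses)
print(len(pages))
```

In real use, `fetch` could wrap `requests.get`, and the default `time.sleep` would provide the actual pause.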

Code Example

import requests
from bs4 import BeautifulSoup
import time

# Request URL
url = 'https://store.steampowered.com/app/578650/'
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')


def text_of(tag, class_name):
    """Return the stripped text of the first match, or '' if the element is absent."""
    node = soup.find(tag, class_=class_name)
    return node.text.strip() if node else ''


# Get the game name
game_name = text_of('div', 'apphub_AppName')
print("Game name: " + game_name)

# Get the price: use the discounted price if present, otherwise the regular price
price_discount = text_of('div', 'discount_final_price')
if price_discount:
    print("Discounted price: " + price_discount)
else:
    price = text_of('div', 'game_purchase_price')
    if price:
        print("Regular price: " + price)

# Get the review information
review = text_of('span', 'game_review_summary')
rating = text_of('span', 'responsive_reviewdesc')
print("Reviews: " + review + ", " + rating)

# Pause before sending any further request
time.sleep(1)

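Code Walkthrough

The import statements load requests, BeautifulSoup, and time;
url and headers set the request URL and the request headers;
requests.get sends the request and returns the response;
BeautifulSoup parses the page source;
find locates the game name, and strip removes the surrounding whitespace;
find locates the price: the discounted price is used if the game is on sale, otherwise the regular price, and strip removes the surrounding whitespace;
find locates the review information, and strip removes the surrounding whitespace;
time.sleep adds a short delay.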

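Summary

These two examples give a more concrete picture of web scraping with Python. Real projects run into all kinds of problems, but with continued study and practice the technique can be mastered.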

Unless otherwise noted, articles on this site are original work. If you repost this article, please credit the source: python爬虫实例详解 - Python技术站
