Title: A Detailed Guide to Building a Simple Web Crawler in Python

1. Preparation

Before building a web crawler in Python, install Python itself and the third-party libraries used below: Requests and Beautiful Soup.

# Install the requests and beautifulsoup4 libraries
pip install requests
pip install beautifulsoup4
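
A quick way to confirm that both packages installed correctly is to import them and print their versions. This is only a sanity-check sketch; the version numbers printed depend on your environment.

```python
# Both packages expose a __version__ string once installed.
import bs4
import requests

print('requests', requests.__version__)
print('beautifulsoup4', bs4.__version__)
```

If either import fails with ModuleNotFoundError, the corresponding pip install most likely ran against a different Python interpreter.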

2. Fetching the page

The first step is to fetch the HTML source of the target page, which the get() function in the requests library handles.

import requests

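url = 'http://www.example.com'
response = requests.get(url, timeout=10)  # give up instead of hanging on a dead server
response.raise_for_status()               # raise an exception on 4xx/5xx responses
html = response.text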
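
For anything beyond a single request, it can help to reuse a Session and let it retry transient failures. The following is a minimal sketch, not the only way to do it: make_session and fetch_html are hypothetical helper names, and the retry count, backoff, and status codes are assumed values chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(retries=3, backoff=0.5):
    """Build a Session that retries transient failures (hypothetical helper)."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,
        # Assumed set of status codes worth retrying
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


def fetch_html(session, url, timeout=10):
    """Return the page body as text, raising on HTTP error statuses."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

Usage would look like html = fetch_html(make_session(), 'http://www.example.com'). Reusing one session also keeps the underlying connection open across requests to the same host.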

3. Parsing the HTML

With the HTML source in hand, we use the Beautiful Soup library to parse it and extract the information we need.

from bs4 import BeautifulSoup

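# Create the Beautiful Soup object
soup = BeautifulSoup(html, 'html.parser')

# Get the page title (soup.title is None when the page has no <title> tag)
title = soup.title.string if soup.title else None

# Get all links; href=True skips <a> tags that have no href attribute
links = []
for link in soup.find_all('a', href=True):
    links.append(link['href'])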
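
To see the parsing step in isolation, here is a small self-contained sketch that runs on an inline HTML string instead of a downloaded page. The sample markup and base_url are made up for illustration; urljoin is used to turn relative links into absolute ones, which the steps above do not do.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# A small inline document standing in for a downloaded page (made up for illustration)
sample_html = """
<html><head><title>Example Page</title></head>
<body>
  <a href="/about">About</a>
  <a href="https://www.example.org/docs">Docs</a>
  <a name="top">No href here</a>
</body></html>
"""
base_url = 'http://www.example.com'

soup = BeautifulSoup(sample_html, 'html.parser')

# soup.title is None when the page has no <title>, so guard before reading it
title = soup.title.get_text(strip=True) if soup.title else None

# href=True skips <a> tags without an href; urljoin makes relative links absolute
links = [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]

print(title)
print(links)
```

The third anchor has no href attribute, so it is left out of links rather than showing up as None.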

4. Storing the data

After parsing the HTML and extracting the information, we usually save the data to a local file or a database.

import json

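# Save the data to a JSON file; ensure_ascii=False keeps non-ASCII text readable
data = {'title': title, 'links': links}
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)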
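
The effect of the encoding and ensure_ascii settings is easiest to see with a round trip. This sketch writes a record with a non-ASCII title to a temporary file and reads it back; the record contents and the temporary path are assumptions for illustration only.

```python
import json
import os
import tempfile

# Sample record with a non-ASCII title (made up for illustration)
data = {'title': '示例页面', 'links': ['http://www.example.com/about']}

path = os.path.join(tempfile.mkdtemp(), 'data.json')

# ensure_ascii=False writes the characters themselves instead of \uXXXX escapes;
# an explicit encoding keeps the result independent of the platform default
with open(path, 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

with open(path, encoding='utf-8') as f:
    raw = f.read()

loaded = json.loads(raw)
print(raw)
```

Without ensure_ascii=False the file would still load back to the same data, but the title would be stored as escape sequences and be hard to read in a text editor.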

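Example 1: Crawling the Douban Movie Top 250

This example crawls the Douban Movie Top 250 list and saves each film's title, rating, and one-line introduction to a local JSON file.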

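import json
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
# Some sites, Douban included, may reject requests that lack a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0'}
data = []

def text_of(tag):
    # Return the tag's stripped text, or None when the tag is missing
    return tag.get_text(strip=True) if tag else None

def crawl(url):
    while url:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Match on the class only, so it does not matter which tag carries it
        for item in soup.find_all(class_='item'):
            data.append({
                'title': text_of(item.find('span', class_='title')),
                'rating': text_of(item.find('span', class_='rating_num')),
                # Not every entry has a one-line introduction
                'intro': text_of(item.find('span', class_='inq')),
            })

        # The last page has no "next" link, which ends the loop
        next_span = soup.find('span', class_='next')
        next_link = next_span.find('a') if next_span else None
        # The href is relative, so resolve it against the current page URL;
        # plain concatenation would pile up query strings on later pages
        url = urljoin(url, next_link.get('href')) if next_link else None

crawl(url)

with open('douban_top250.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)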
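
When a crawler follows a relative "next page" link across several pages, how the next URL is built matters. The sketch below compares urllib.parse.urljoin with plain string concatenation. The href values are assumed for illustration and are not taken from the live site.

```python
from urllib.parse import urljoin

first_page = 'https://movie.douban.com/top250'
# Assumed shape of the relative "next" hrefs (hypothetical values)
href_page2 = '?start=25&filter='
href_page3 = '?start=50&filter='

# urljoin replaces the query string of the current page with the new one
page2 = urljoin(first_page, href_page2)
page3 = urljoin(page2, href_page3)

# Concatenation keeps appending, so the second hop produces a malformed URL
naive_page3 = page2 + href_page3

print(page2)        # https://movie.douban.com/top250?start=25&filter=
print(page3)        # https://movie.douban.com/top250?start=50&filter=
print(naive_page3)  # https://movie.douban.com/top250?start=25&filter=?start=50&filter=
```

Concatenation happens to work for the first hop only, because the starting URL has no query string yet.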

Example 2: Crawling Tmall

This example crawls product listings from Tmall and saves each product's name, price, and sales volume to a local CSV file.

import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'https://list.tmall.com/search_product.htm?q=iphone'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
data = []

def text_of(tag):
    # Return the tag's stripped text, or '' when the tag is missing
    return tag.get_text(strip=True) if tag else ''

def crawl(url):
    while url:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        for item in soup.find_all('div', class_='product'):
            title = text_of(item.find('p', class_='productTitle'))
            price_tag = item.find('p', class_='productPrice')
            price = text_of(price_tag.em if price_tag else None)
            status_tag = item.find('p', class_='productStatus')
            sales = text_of(status_tag.span if status_tag else None)
            data.append([title, price, sales])

        # Resolve the next-page href against the current URL; stop when there is none
        next_link = soup.find('a', class_='ui-page-next')
        href = next_link.get('href') if next_link else None
        url = urljoin(url, href) if href else None

crawl(url)

# utf-8-sig adds a BOM so that Excel detects the encoding correctly
with open('tmall.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Product name', 'Price', 'Sales'])
    writer.writerows(data)
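
If the scraped rows are kept as dictionaries rather than lists, csv.DictWriter can write them directly and csv.DictReader can read them back. The sketch below is one possible variant: the sample rows, field names, and temporary file path are invented for illustration.

```python
import csv
import os
import tempfile

# Sample rows standing in for scraped products (made up for illustration)
rows = [
    {'title': 'Sample phone A', 'price': '5999.00', 'sales': '1200'},
    {'title': 'Sample phone B', 'price': '6999.00', 'sales': '800'},
]
fieldnames = ['title', 'price', 'sales']

path = os.path.join(tempfile.mkdtemp(), 'products.csv')

# newline='' lets the csv module control line endings itself
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Reading with utf-8-sig strips the BOM so the first header name stays clean
with open(path, encoding='utf-8-sig', newline='') as f:
    loaded = list(csv.DictReader(f))

print(loaded)
```

Naming the columns once in fieldnames keeps the header row and the data rows in the same order.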

That covers the full walkthrough of building a simple web crawler in Python. I hope you find it helpful!

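Unless otherwise noted, articles on this site are original. When reposting, please credit the source: A Detailed Guide to Building a Simple Web Crawler in Python - Python技术站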
