A Detailed Guide to Concurrent Programming in Python Web Crawlers

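Concurrent programming is commonly used to make Python crawlers more efficient. This article introduces concurrent programming in Python crawlers, covering multithreading, coroutines, and asynchronous IO, and then walks through two worked examples.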

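Multithreading

Multithreading means that several threads exist within one process, and each thread can carry out a different task independently. Because a crawler spends most of its time waiting on network IO, threads can overlap those waits. In Python, multithreaded programming is done with the threading module.

Below is a simple example that uses multiple threads to fetch several web pages: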

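import threading
import requests

urls = ['https://www.baidu.com', 'https://www.hao123.com', 'https://www.sogo.com']

def fetch(url):
    # Send a GET request and print the status code.
    # The timeout keeps a stalled server from blocking the thread forever.
    response = requests.get(url, timeout=10)
    print(url, response.status_code)

# Create one thread per URL.
threads = []
for url in urls:
    t = threading.Thread(target=fetch, args=(url,))
    threads.append(t)

# Start all threads first so the requests overlap.
for t in threads:
    t.start()

# Then wait for every thread to finish.
for t in threads:
    t.join()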

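The example above defines a urls list holding the pages to crawl and a fetch function that sends an HTTP request and prints the response status code. It then creates one threading.Thread per URL and collects the threads in the threads list. Finally, it starts all the threads and waits for each of them to finish.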
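Creating and joining threads by hand works for a handful of URLs, but a crawler usually wants to cap how many threads run at once. The following is a minimal sketch using the standard library's concurrent.futures.ThreadPoolExecutor. Note that fake_fetch is a hypothetical stand-in that sleeps instead of calling requests.get, and the example.com URLs and the max_workers value are assumptions chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url): sleep to simulate network latency.
    time.sleep(0.05)
    return url, 200

def crawl(urls, max_workers=3):
    # The pool reuses at most max_workers threads instead of one thread per URL.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input URLs.
        return list(pool.map(fake_fetch, urls))

results = crawl([
    'https://example.com/a',
    'https://example.com/b',
    'https://example.com/c',
])
for url, status in results:
    print(url, status)
```

In a real crawler, fake_fetch would be replaced by a function that performs the actual request; the pool size then bounds how many requests are in flight at once.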

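Coroutines

A coroutine is a function that can suspend itself and later resume, scheduled cooperatively inside a single thread. It is often described as a lighter-weight alternative to a thread: switching between coroutines avoids the cost of thread context switches, so a program can usually sustain far more concurrent coroutines than threads. In Python, coroutines are written with the asyncio module.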
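To make the "lightweight" claim concrete, here is a small sketch that runs a thousand coroutines on a single thread. The worker and run_many names and the count of 1000 are assumptions for illustration, and asyncio.sleep stands in for a network wait.

```python
import asyncio
import threading

async def worker(i, thread_ids):
    # Awaiting sleep() hands control back to the event loop so other coroutines can run.
    await asyncio.sleep(0.01)
    # Record which OS thread this coroutine is running on.
    thread_ids.add(threading.get_ident())
    return i

async def run_many(count):
    thread_ids = set()
    results = await asyncio.gather(*(worker(i, thread_ids) for i in range(count)))
    return results, thread_ids

results, thread_ids = asyncio.run(run_many(1000))
print(len(results), 'coroutines ran on', len(thread_ids), 'thread(s)')
```

Starting the same number of OS threads would be noticeably more expensive, which is why coroutines suit crawlers that keep many requests open at once.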

Below is a simple example that uses coroutines to fetch several web pages:

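import asyncio
import aiohttp

urls = ['https://www.baidu.com', 'https://www.hao123.com', 'https://www.sogo.com']

async def fetch(session, url):
    # Request the page and print its status code.
    async with session.get(url) as response:
        print(url, response.status)

async def main():
    # Share one ClientSession so connections can be reused across requests.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        # Run all fetch coroutines concurrently and wait for them to finish.
        await asyncio.gather(*tasks)

asyncio.run(main())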

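The example above defines a urls list holding the pages to crawl and a fetch coroutine function that uses the aiohttp library to send an HTTP request and print the response status code. The main coroutine creates one fetch coroutine per URL and passes them to asyncio.gather(), which runs them concurrently and waits until all of them have finished. asyncio.run() starts the event loop and runs main().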
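asyncio.gather() starts every coroutine it is given, so a long URL list would open that many requests at once. One common way to cap this is asyncio.Semaphore. The sketch below is an illustration under assumptions: fetch_limited and crawl are hypothetical names, the example.com URLs are placeholders, and asyncio.sleep simulates the network wait instead of a real aiohttp request.

```python
import asyncio

async def fetch_limited(url, semaphore, stats):
    # Only as many coroutines as the semaphore allows may be inside this block at once.
    async with semaphore:
        stats['active'] += 1
        stats['peak'] = max(stats['peak'], stats['active'])
        await asyncio.sleep(0.01)  # simulated network wait (assumption, no real request)
        stats['active'] -= 1
        return url

async def crawl(urls, limit=2):
    semaphore = asyncio.Semaphore(limit)
    stats = {'active': 0, 'peak': 0}
    pages = await asyncio.gather(*(fetch_limited(u, semaphore, stats) for u in urls))
    return pages, stats['peak']

urls = [f'https://example.com/page/{i}' for i in range(6)]
pages, peak = asyncio.run(crawl(urls, limit=2))
print(len(pages), 'pages fetched, at most', peak, 'requests in flight')
```

The stats dictionary exists only to show that the limit holds; a real crawler would drop it and keep just the semaphore.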

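Asynchronous IO

Asynchronous IO is an efficient IO model that lets a program carry on with other tasks while it waits for an IO operation to complete. In Python, asynchronous IO is also written with the asyncio module.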
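The idea of doing other work while an IO operation is pending can be seen in a small event log. In this sketch the download coroutine and its 'slow' and 'fast' labels are hypothetical, and asyncio.sleep stands in for waiting on a network response.

```python
import asyncio

async def download(name, delay, log):
    log.append(f'{name} start')
    # While this coroutine waits, the event loop is free to run the other one.
    await asyncio.sleep(delay)
    log.append(f'{name} done')

async def main():
    log = []
    # 'slow' is started first, but neither wait blocks the other.
    await asyncio.gather(
        download('slow', 0.2, log),
        download('fast', 0.05, log),
    )
    return log

log = asyncio.run(main())
print(log)
```

Both downloads start before either one finishes, and the shorter wait completes first even though it was started second.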

Below is a simple example that uses asynchronous IO to fetch several web pages:

import asyncio
import aiohttp

urls = ['https://www.baidu.com', 'https://www.hao123.com', 'https://www.sogo.com']

async def fetch(session, url):
    # Request the page and return the response body as text.
    async with session.get(url) as response:
        return await response.text()

async def main():
    # Share one ClientSession so connections can be reused across requests.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        # gather() returns the results in the same order as the tasks.
        responses = await asyncio.gather(*tasks)
    for url, response in zip(urls, responses):
        print(url, len(response))

asyncio.run(main())

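The example above defines a urls list holding the pages to crawl and a fetch coroutine function that uses the aiohttp library to send an HTTP request and return the response body. The main coroutine passes one fetch coroutine per URL to asyncio.gather(), which runs them concurrently and returns their results in the same order as the URLs; asyncio.run() runs main(). Finally, the URL of each page and the length of its content are printed.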
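By default, asyncio.gather() propagates the first exception raised by any of its coroutines to the caller, so one unreachable site would hide the results of the others. Passing return_exceptions=True returns failures as values instead. The sketch below assumes a hypothetical fake_fetch that fails for any URL containing 'bad'; the URLs are placeholders and no real request is made.

```python
import asyncio

async def fake_fetch(url):
    # Hypothetical stand-in for an aiohttp request; one URL fails on purpose.
    await asyncio.sleep(0.01)
    if 'bad' in url:
        raise ConnectionError(f'cannot reach {url}')
    return f'<html>{url}</html>'

async def crawl(urls):
    # With return_exceptions=True, failures come back as exception objects
    # in the result list instead of being raised out of gather().
    results = await asyncio.gather(
        *(fake_fetch(u) for u in urls), return_exceptions=True
    )
    ok = {u: r for u, r in zip(urls, results) if not isinstance(r, Exception)}
    failed = {u: r for u, r in zip(urls, results) if isinstance(r, Exception)}
    return ok, failed

urls = ['https://example.com/a', 'https://example.com/bad', 'https://example.com/c']
ok, failed = asyncio.run(crawl(urls))
print('ok:', list(ok))
print('failed:', {u: repr(e) for u, e in failed.items()})
```

Splitting the results this way lets the crawler keep the pages that succeeded and retry or log only the ones that failed.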

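Examples

Example 1: Downloading images with multiple threads

Suppose images need to be downloaded from several websites. To speed this up, the downloads can run concurrently in multiple threads: define a download function that saves one image, then start one thread per URL with download as its target.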

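import requests
import threading

urls = [
    'https://www.baidu.com/img/PCfb_5bf082d29588c07f842ccde3f97243ea.png',
    'https://www.hao123.com/static/img/newindex/logo_efe5aabd.png',
    'https://www.sogou.com/images/logo/new/sogou.png'
]

def download(url):
    # Download the image; the timeout keeps a stalled server from blocking the thread forever.
    response = requests.get(url, timeout=10)
    # Raise on 4xx/5xx responses so an error page is not saved as an image.
    response.raise_for_status()
    # Use the last path segment of the URL as the local file name.
    filename = url.split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f'{url} downloaded')

# Start one thread per image URL.
threads = []
for url in urls:
    t = threading.Thread(target=download, args=(url,))
    t.start()
    threads.append(t)

# Wait for all downloads to finish.
for t in threads:
    t.join()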

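The example above defines a urls list holding the image URLs and a download function that fetches the image at a given URL and saves it to a local file. It then starts one thread per URL to run download, and waits for all the threads to finish.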
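Taking the last segment of the URL as the file name works for the three URLs above, but it would keep any query string and yields an empty name when the URL ends in a slash. A slightly more defensive sketch using only the standard library follows; filename_from_url and the default name 'download.bin' are hypothetical choices, and the example.com URLs are placeholders.

```python
import posixpath
from urllib.parse import urlsplit

def filename_from_url(url, default='download.bin'):
    # Use only the path component so query strings and fragments are ignored.
    path = urlsplit(url).path
    name = posixpath.basename(path)
    # Fall back to a default name when the URL ends with a slash or has no path.
    return name or default

print(filename_from_url('https://www.sogou.com/images/logo/new/sogou.png'))
print(filename_from_url('https://example.com/img/logo.png?size=large'))
print(filename_from_url('https://example.com/'))
```

When several threads download files with the same base name, they would overwrite one another, so a real crawler may also want to add a prefix or counter to each name.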

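Example 2: Crawling the Douban Movie Top 250 with coroutines

Suppose the goal is to crawl the Douban Movie Top 250 pages and extract each movie's title, rating, and one-line description. After analyzing the page structure, the pages can be fetched with asynchronous IO using the aiohttp and asyncio libraries.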

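import aiohttp
import asyncio
from lxml import etree

# Assumption: the site may reject requests without a browser-like User-Agent.
# This header value is only a placeholder.
HEADERS = {'User-Agent': 'Mozilla/5.0'}

async def fetch(session, url):
    # Download one page and return its HTML text.
    async with session.get(url) as response:
        return await response.text()

def parse(html):
    # Parsing awaits nothing, so a plain function is enough.
    tree = etree.HTML(html)
    items = tree.xpath('//div[@class="info"]')
    for item in items:
        title = item.xpath('.//span[@class="title"]/text()')[0]
        rating = item.xpath('.//span[@class="rating_num"]/text()')[0]
        # Some entries have no one-line description, so guard against an empty result.
        inq = item.xpath('.//span[@class="inq"]/text()')
        desc = inq[0] if inq else ''
        print(f'{title}  {rating}  {desc}')

async def main():
    # The list is split into 10 pages of 25 movies; start is the offset of the first movie.
    page_urls = [f'https://movie.douban.com/top250?start={i}&filter=' for i in range(0, 250, 25)]
    async with aiohttp.ClientSession(headers=HEADERS) as session:
        # Fetch all pages concurrently; results keep the order of page_urls.
        pages = await asyncio.gather(*(fetch(session, url) for url in page_urls))
    for html in pages:
        parse(html)

asyncio.run(main())

The example above defines a fetch coroutine function that uses the aiohttp library to send an HTTP request and return the response body, and a parse function that uses the lxml library to parse the HTML and extract each movie's title, rating, and description. Following the pagination rule of the Douban Movie Top 250 (25 movies per page, selected by the start parameter), main builds the URL of every page, downloads all pages concurrently with asyncio.gather(), and then parses each page in order. asyncio.run() runs main().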
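An XPath query returns a list, and that list is empty when the element is absent (for instance, an entry with no one-line description), so indexing it with [0] raises IndexError. A minimal sketch of a helper that tolerates missing fields is shown below; first_text is a hypothetical name, and plain Python lists with made-up sample values stand in for the results of item.xpath(...).

```python
def first_text(nodes, default=''):
    # xpath() returns a list; take the first entry, stripped, or the default if the list is empty.
    return nodes[0].strip() if nodes else default

# Plain lists stand in for the results of item.xpath(...) (assumed sample data).
title_nodes = ['Example Movie']
quote_nodes = []  # an entry without a one-line description

title = first_text(title_nodes)
quote = first_text(quote_nodes, default='(no quote)')
print(title, quote)
```

Using one helper for every field keeps a single malformed entry from aborting the parse of a whole page.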

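This concludes the guide to concurrent programming in Python crawlers, which covered multithreading, coroutines, and asynchronous IO and worked through two examples. Understanding and mastering these techniques can make a crawler more efficient and shorten crawl times.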

Unless otherwise noted, articles on this site are original work. If you repost this article, please credit the source: Python爬虫中的并发编程详解 - Python技术站
