Advanced Python: Crawling Web Pages with Multiple Threads

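Introduction

This article explains how to crawl web pages with multiple threads. It walks through two examples that show the concrete steps and the points to watch out for.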

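Crawling Web Pages with Multiple Threads

Multithreading means starting several threads inside one process so that different tasks can run concurrently. A crawler spends most of its time waiting for network responses, so running the requests in separate threads lets those waits overlap and shortens the total crawl time. The workflow is as follows: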

Create multiple threads
Define the task each thread should run
Start the threads so they begin working
Wait for all threads to finish
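The four steps above can be sketched without any network access. In this minimal sketch, `task` and `run_all` are hypothetical names, and `time.sleep` is only an assumed stand-in for waiting on a slow request:

```python
import threading
import time

def task(task_id, results):
    """Simulate an I/O-bound job such as a page download."""
    time.sleep(0.1)  # stands in for waiting on the network
    results[task_id] = f"task {task_id} done"  # each thread writes only its own slot

def run_all(n):
    results = [None] * n
    # Create the threads and give each one its task
    threads = [threading.Thread(target=task, args=(i, results)) for i in range(n)]
    # Start the threads
    for t in threads:
        t.start()
    # Wait for all threads to finish
    for t in threads:
        t.join()
    return results

print(run_all(5))
```

Because the five sleeps overlap, the whole run should take roughly as long as one sleep rather than five in sequence.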

Example 1: Downloading Images with Multiple Threads

The following example shows how to download images with multiple threads:

import requests
import os
import threading

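def download_img(url, save_path):
    # Download one image and write its raw bytes to save_path
    response = requests.get(url, timeout=10)  # a timeout keeps a stalled request from hanging the thread
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    with open(save_path, 'wb') as f:
        f.write(response.content)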

def main():
    os.makedirs('images', exist_ok=True)
    img_urls = [
        'https://xxx.com/1.jpg',
        'https://xxx.com/2.jpg',
        'https://xxx.com/3.jpg',
        'https://xxx.com/4.jpg',
        'https://xxx.com/5.jpg'
    ]
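    threads = []
    for url in img_urls:
        # Use the last URL segment (for example 1.jpg) as the file name
        save_path = os.path.join('images', url.split('/')[-1])
        # One thread per image
        t = threading.Thread(target=download_img, args=(url, save_path))
        threads.append(t)
        t.start()
    # Wait for every download thread to finish
    for t in threads:
        t.join()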

if __name__ == '__main__':
    main()

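In the code above, the download_img function downloads a single image, and the main function uses multiple threads to download several images concurrently. The threads list holds every thread object, the start method starts a thread, and join is called on each thread so that main returns only after all downloads have finished.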
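Example 1 starts one thread per URL, which is fine for five images but may create too many threads and connections for a long list. One common alternative, not used in the original example, is the standard library's `concurrent.futures.ThreadPoolExecutor`, which caps the number of worker threads. In the sketch below, `plan_downloads`, `run_in_pool`, `fake_download`, and the value `max_workers=3` are illustrative assumptions; `fake_download` stands in for `download_img` so the sketch runs without network access.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def plan_downloads(urls, save_dir):
    """Pair each URL with a save path built from its last path segment."""
    return [(url, os.path.join(save_dir, url.split('/')[-1])) for url in urls]

def run_in_pool(download, jobs, max_workers=3):
    """Run download(url, path) for every job using at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download, url, path) for url, path in jobs]
        # result() waits for the call and re-raises any exception it raised
        return [future.result() for future in futures]

def fake_download(url, save_path):
    # Hypothetical stand-in for download_img: no network, just report the path
    return save_path

jobs = plan_downloads([f'https://xxx.com/{i}.jpg' for i in range(1, 6)], 'images')
results = run_in_pool(fake_download, jobs)
print(results)
```

To use it with the real downloader, pass `download_img` instead of `fake_download` and create the `images` directory first.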

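Example 2: Crawling and Saving Web Pages with Multiple Threads

The following example shows how to crawl web pages with multiple threads and save them: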

import requests
import os
import threading

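def save_html(url, save_path):
    # Fetch one page and save its text as an HTML file
    response = requests.get(url, timeout=10)  # a timeout keeps a stalled request from hanging the thread
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    # Write as UTF-8 explicitly; the platform default encoding may not cover every character
    with open(save_path, 'w', encoding='utf-8') as f:
        f.write(response.text)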

def main():
    os.makedirs('html', exist_ok=True)
    urls = [
        'https://xxx.com/1',
        'https://xxx.com/2',
        'https://xxx.com/3',
        'https://xxx.com/4',
        'https://xxx.com/5'
    ]
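    threads = []
    for url in urls:
        # Use the last URL segment plus .html as the file name
        save_path = os.path.join('html', url.split('/')[-1]+'.html')
        # One thread per page
        t = threading.Thread(target=save_html, args=(url, save_path))
        threads.append(t)
        t.start()
    # Wait for every thread to finish
    for t in threads:
        t.join()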

if __name__ == '__main__':
    main()

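In the code above, the save_html function downloads a single web page and saves it as an HTML file, and the main function uses multiple threads to download several pages concurrently. The threading pattern is the same as in Example 1: the threads list holds every thread object, start starts a thread, and join is called on each thread to wait for it to finish.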
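One point to watch with this pattern: if a request fails, the exception is raised inside the worker thread, and join does not re-raise it in the main thread. A minimal sketch of one way to collect failures follows. The helper names `guarded`, `run_threads`, and `fake_save` are hypothetical, and `fake_save` is an assumed stand-in for `save_html` that fails for one URL so the sketch runs without network access.

```python
import threading

def guarded(func, index, errors, *args):
    """Run func(*args); on failure, store the exception in this thread's own slot."""
    try:
        func(*args)
    except Exception as exc:  # broad on purpose: this wrapper only reports failures
        errors[index] = exc

def run_threads(func, arg_list):
    """Run func once per argument tuple, one thread each, and return the failures."""
    errors = [None] * len(arg_list)
    threads = [
        threading.Thread(target=guarded, args=(func, i, errors, *args))
        for i, args in enumerate(arg_list)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Pair each failed job's arguments with the exception it raised
    return [(arg_list[i], exc) for i, exc in enumerate(errors) if exc is not None]

def fake_save(url, save_path):
    # Hypothetical stand-in for save_html: pretend page 3 cannot be fetched
    if url.endswith('/3'):
        raise ValueError(f'cannot fetch {url}')

page_jobs = [(f'https://xxx.com/{i}', f'html/{i}.html') for i in range(1, 6)]
failed = run_threads(fake_save, page_jobs)
print(failed)
```

Each thread writes only to its own index in `errors`, so no lock is needed here.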

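Summary

Multithreading can improve a program's efficiency for I/O-bound tasks such as crawling web pages, where threads spend most of their time waiting on the network. It is much less helpful for compute-heavy tasks in CPython, because the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time. When using multiple threads, pay attention to thread safety: concurrent reads and writes to a shared resource must be protected, for example with a lock. In this article I used two examples to show how to crawl and save images and web pages with multiple threads, and I hope it is helpful to you.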
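As a small illustration of the thread-safety point, the sketch below protects a shared counter with `threading.Lock`. The `Counter` class and `count_with_threads` function are hypothetical names chosen for this example; a crawler might use the same pattern to count finished downloads.

```python
import threading

class Counter:
    """A counter that several threads can increment safely."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may run the read-modify-write below
        with self._lock:
            self.value += 1

def count_with_threads(n_threads, n_increments):
    counter = Counter()

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

print(count_with_threads(5, 10000))  # 50000
```

Without the lock, `self.value += 1` is a read followed by a write, so two threads could read the same old value and one increment could be lost.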

Unless otherwise noted, articles on this site are original. If you repost this article, please credit the source: Python进阶篇之多线程爬取网页 - Python技术站
