A Practical Guide to tomorrow, a Python Concurrency Utility for Crawlers

Introduction

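tomorrow is a small Python library for running code concurrently, and it is a convenient fit for crawlers, which spend most of their time waiting on the network. Its key feature is a decorator that turns an ordinary function into one that runs in a thread pool, with the pool size passed as an argument: wrapping the crawler function with `@threads(n)` is all that is needed. Each call then returns immediately, and the program waits for the result only when it is actually used. The released package offers only this thread-based decorator; there is no `@tomorrow.thread` or `@tomorrow.process`, so multi-process crawling goes through the standard library's `concurrent.futures.ProcessPoolExecutor` instead.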
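
To make the mechanism concrete, here is a minimal sketch of how a decorator of this kind can be built on the standard library's `ThreadPoolExecutor`. It is not tomorrow's actual source: `threads_sketch` and `slow_square` are hypothetical names, and the wrapped call returns a plain `Future` whose `.result()` has to be called explicitly, whereas tomorrow hands back a placeholder object.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from functools import wraps


def threads_sketch(n):
    """Hypothetical decorator: run each call in a shared pool of n threads."""
    def decorator(func):
        pool = ThreadPoolExecutor(max_workers=n)  # one pool per decorated function

        @wraps(func)
        def wrapper(*args, **kwargs):
            # submit() returns at once with a Future; the work runs in the pool.
            return pool.submit(func, *args, **kwargs)

        return wrapper
    return decorator


@threads_sketch(3)
def slow_square(x):
    time.sleep(0.1)  # stand-in for a slow network request
    return x * x


start = time.perf_counter()
futures = [slow_square(i) for i in range(3)]  # all three calls start immediately
squares = [f.result() for f in futures]       # block until each one finishes
elapsed = time.perf_counter() - start
print(squares, round(elapsed, 2))
```

Because the three calls overlap, the elapsed time should be close to one sleep rather than three.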

Steps

Installing tomorrow

First, install tomorrow with pip:

  ```
  pip install tomorrow
  ```

Importing the tomorrow module

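Import the `threads` decorator, the only decorator the released package exports (`from tomorrow import process` raises an `ImportError`):

  ```python
  from tomorrow import threads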
  ```

Writing the crawler and wrapping it with the tomorrow decorator

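Suppose we have a simple crawler function `crawl(url)` that requests `url` with the requests library and returns the response. Decorating it with `@threads(5)` makes every call run in a pool of up to five threads:

  ```python
  import requests
  from tomorrow import threads

  @threads(5)
  def crawl(url):
      # Return the response object itself: the placeholder that tomorrow
      # hands back forwards attribute access such as `.text` to it.
      return requests.get(url)
  ```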

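The released tomorrow package has no `process` decorator, so a function cannot be moved into a separate process by decorating it. To crawl with multiple processes, leave the function undecorated and submit it to `concurrent.futures.ProcessPoolExecutor` from the standard library:

  ```python
  import requests

  # Plain, undecorated function. It must live at module level so that
  # worker processes can import it by name.
  def crawl(url):
      res = requests.get(url)
      return res.text
  ```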

Calling the function

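Now we can call the function to start crawling. The decorated `crawl` is called directly, because the decorator already supplies the thread pool. To manage the pool yourself with `ThreadPoolExecutor` instead, pass it a plain, undecorated function; the decorated version would only yield tomorrow's placeholder objects:

  ```python
  import concurrent.futures
  import requests

  def crawl(url):
      # Plain function: the executor, not a decorator, provides the threads.
      return requests.get(url).text

  urls = ['http://www.example.com/1', 'http://www.example.com/2', 'http://www.example.com/3']

  with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
      results = executor.map(crawl, urls)

  for result in results:
      print(result)
  ```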

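To crawl with multiple processes instead, use `ProcessPoolExecutor`. The function must be undecorated and defined at module level, and the pool should be created under an `if __name__ == "__main__":` guard:

  ```python
  import concurrent.futures
  import requests

  def crawl(url):
      # Undecorated and at module level so worker processes can import it.
      return requests.get(url).text

  urls = ['http://www.example.com/1', 'http://www.example.com/2', 'http://www.example.com/3']

  if __name__ == "__main__":
      # The guard keeps child processes from re-creating the pool on import.
      with concurrent.futures.ProcessPoolExecutor(max_workers=3) as executor:
          results = executor.map(crawl, urls)

      for result in results:
          print(result)
  ```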

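Here `executor.map()` calls `crawl` once for each URL in `urls`. The `results` it returns is an iterator that yields the return values of `crawl` in the same order as the input URLs, so looping over it shows what each request fetched.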
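
The ordering behaviour of `executor.map()` is easy to check without touching the network. In this sketch, `fake_crawl` is a hypothetical stand-in for `crawl` that sleeps instead of sending a request, and the delays are made up; the first URL is the slowest, yet its result still comes out first.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical delays: the first URL takes the longest to "download".
DELAYS = {
    "http://www.example.com/1": 0.3,
    "http://www.example.com/2": 0.1,
    "http://www.example.com/3": 0.2,
}
finish_order = []  # records the order in which the workers actually finish


def fake_crawl(url):
    time.sleep(DELAYS[url])   # simulate network latency
    finish_order.append(url)  # list.append is thread-safe in CPython
    return "content of " + url


urls = list(DELAYS)
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(fake_crawl, urls))

print(results)       # results follow the order of urls
print(finish_order)  # the order in which the workers finished
```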

Examples

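Two examples follow. The first downloads images, using tomorrow's thread pool to speed up the downloads; the second sends search queries to Baidu and fetches the result pages so that their links can be extracted: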

Downloading images

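```python
import requests
from tomorrow import threads

# Placeholder request headers (not defined in the original snippet); fill in your own.
headers = {"User-Agent": "xxxxx"}

@threads(5)
def download_img(url):
    # Assumption: save under the last path segment of the URL, e.g. "1.jpg".
    name = url.rsplit("/", 1)[-1]
    try:
        print("Downloading:", url)
        response = requests.get(url, timeout=30, headers=headers)
        if response.status_code == 200:
            with open(name, "wb") as f:
                f.write(response.content)  # write the image bytes
        else:
            print("Download failed:", url)
        response.close()
    except requests.exceptions.RequestException as e:
        print("Download failed:", url, "reason:", str(e))

if __name__ == "__main__":
    urls = [
        "https://XXXXXX/1.jpg",
        "https://YYYYYY/2.jpg",
        "https://ZZZZZZ/3.jpg"
    ]
    # Each call returns immediately; the downloads run in the thread pool.
    for url in urls:
        download_img(url)
```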
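
The file name used when saving each image has to come from somewhere. One option, an assumption for illustration rather than something the original example specifies, is to derive it from the URL path with the standard library. `filename_from_url` and its `default` fallback are hypothetical names.

```python
import os
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of url, ignoring query string and fragment."""
    path = urlparse(url).path               # e.g. "/images/1.jpg"
    name = os.path.basename(unquote(path))  # decode %20 and similar escapes
    return name or default                  # fall back when the path ends in "/"


print(filename_from_url("https://www.example.com/images/1.jpg?size=large"))
print(filename_from_url("https://www.example.com/"))
```

Parsing the URL first keeps query strings such as `?size=large` out of the saved file name.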

In a test run, downloading the images sequentially in a single thread took 19.0 s; with the `@threads(2)` decorator, the same downloads took 10.4 s.
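
A comparison of this kind can be reproduced with a small timing harness. The sketch below replaces the real download with a hypothetical `fake_download` that sleeps for a fixed 0.2 s, so the numbers it prints only illustrate the effect of overlapping I/O waits and are not the measurements quoted above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    time.sleep(0.2)  # hypothetical fixed latency standing in for a real download
    return url


def timed(run):
    """Return how many seconds run() takes."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


urls = ["img-{}".format(i) for i in range(4)]  # placeholder names, not real URLs


def sequential():
    for url in urls:
        fake_download(url)


def threaded():
    # Two workers, so the four simulated downloads overlap in pairs.
    with ThreadPoolExecutor(max_workers=2) as pool:
        list(pool.map(fake_download, urls))


t_seq = timed(sequential)
t_thr = timed(threaded)
print("sequential: {:.2f} s, 2 threads: {:.2f} s".format(t_seq, t_thr))
```

With two workers the waits overlap in pairs, so the threaded run should take roughly half as long as the sequential one.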

Simulating queries and crawling Baidu search result pages

```python
import requests
from tomorrow import threads

headers = {
    'User-agent': 'xxxxx',
    'Cookie': 'yyyyy'
}

@threads(10)
def get_search(url):
    try:
        response = requests.get(url, timeout=10, headers=headers)
        if response.status_code != 200:
            print("Request failed:", url)
            return
        html = response.text
        # parse the page
        ...
        response.close()
    except requests.exceptions.RequestException as e:
        print("Request failed:", url, "reason:", str(e))

if __name__ == "__main__":
    queries = [
        "python tomorrow",
        "python 并发",
        "爬虫新手"
    ]
    # Each call returns immediately; the requests run in the thread pool.
    for query in queries:
        url = "https://www.baidu.com/s?wd={}".format(query)
        get_search(url)
```
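
The search URL above is built by formatting the raw query into the string. Queries that contain spaces or Chinese characters are safer to send when percent-encoded first. A minimal sketch using the standard library's `urllib.parse.urlencode` follows; the helper name `build_search_url` is hypothetical, and only the `wd` parameter from the example above is assumed.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_search_url(query):
    """Return a Baidu search URL with the query percent-encoded as UTF-8."""
    return "https://www.baidu.com/s?" + urlencode({"wd": query})


for query in ["python tomorrow", "python 并发", "爬虫新手"]:
    url = build_search_url(query)
    # Decoding the query string again recovers the original text.
    decoded = parse_qs(urlparse(url).query)["wd"][0]
    print(url, "->", decoded)
```

With requests, passing `params={"wd": query}` to `requests.get` applies the same kind of encoding without building the URL by hand.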