Seven Small Python Web-Scraping Cases Explained: A Guide to Part 2

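Introduction

This article is the middle installment of "Seven Small Python Web-Scraping Cases Explained". The seven cases in the series are: scraping jokes from Qiushibaike, scraping Meizitu, scraping Dangdang books, scraping Baidu Baike, scraping Lianjia rental listings, scraping Hong Kong Observatory weather forecasts, and scraping Douyu live streams. This installment works through the first two in detail, with source code for reference.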

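Contents

Scraping Qiushibaike jokes

This case mainly uses the requests library, XPath, and regular expressions. The Qiushibaike page is fetched with requests, the needed content is extracted with XPath, the results can be cleaned further with regular expressions, and the output is then saved. The sample below covers the fetch and parse steps and prints each joke.

Example code: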

```python
import requests
from lxml import etree

# Send a browser-like User-Agent so the request looks like a normal page visit
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

response = requests.get('https://www.qiushibaike.com/text/', headers=headers)
html = response.content.decode('utf-8')
selector = etree.HTML(html)

# Each joke sits in its own "article" div
content_list = selector.xpath('//div[@class="article block untagged mb15 typs_hot"]')

for content in content_list:
    texts = content.xpath('./div[@class="content"]/span/text()')
    # Skip posts whose content span has no text node
    if not texts:
        continue
    print(texts[0].strip())
```
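The walkthrough mentions a regular-expression cleanup pass and saving the results, which the sample above does not show. The following is a minimal sketch of those two steps, assuming the jokes have already been extracted as a list of strings. The helper names `clean_text` and `save_lines` are hypothetical and do not come from the original source.

```python
import re
from pathlib import Path


def clean_text(raw):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r'\s+', ' ', raw).strip()


def save_lines(lines, path):
    """Clean each string, drop empty ones, and write one per line as UTF-8."""
    cleaned = [clean_text(line) for line in lines]
    cleaned = [line for line in cleaned if line]
    Path(path).write_text('\n'.join(cleaned), encoding='utf-8')
    return cleaned
```

In the scraper, the strings returned by the XPath query could be collected into a list and passed to `save_lines(texts, 'jokes.txt')` instead of being printed.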

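Scraping Meizitu

This case mainly uses the requests library, XPath, regular expressions, and multithreading. The Meizitu list pages are fetched with requests, the image URLs are extracted with XPath, a regular expression pulls the file name out of each URL, and the images are saved to disk. Because this means downloading a large number of images, the case uses multiple threads to speed up the crawl.

Example code: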

```python
import os
import re
from queue import Empty, Queue
from threading import Thread

import requests
from lxml import etree


class Producer(Thread):
    """Fetches list pages and queues the image URLs found on each one."""

    def __init__(self, url_queue, img_queue, headers):
        super().__init__()
        self.url_queue = url_queue
        self.img_queue = img_queue
        self.headers = headers

    def run(self):
        while True:
            try:
                # get_nowait() avoids the race between empty() and get()
                url = self.url_queue.get_nowait()
            except Empty:
                break
            response = requests.get(url, headers=self.headers)
            html = response.content.decode('utf-8')
            selector = etree.HTML(html)
            img_list = selector.xpath('//div[@class="pic"]/a/img/@src')
            self.img_queue.put(img_list)


class Consumer(Thread):
    """Downloads the queued image URLs into the target directory."""

    def __init__(self, img_queue, headers, path):
        super().__init__()
        self.img_queue = img_queue
        self.headers = headers
        self.path = path

    def run(self):
        while True:
            img_list = self.img_queue.get()
            # None is the stop signal sent once all producers have finished
            if img_list is None:
                break
            for img_url in img_list:
                # Use the trailing "<name>.jpg" as the file name; skip other URLs
                match = re.search(r'/(\w+\.jpg)$', img_url)
                if not match:
                    continue
                response = requests.get(img_url, headers=self.headers)
                img_path = os.path.join(self.path, match.group(1))
                with open(img_path, 'wb') as f:
                    f.write(response.content)


headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Queue list pages 1 through 9
url_queue = Queue()
for i in range(1, 10):
    url = 'http://www.meizitu.com/a/list_1_{}.html'.format(i)
    url_queue.put(url)

img_queue = Queue()

path = './images'
os.makedirs(path, exist_ok=True)

producers = [Producer(url_queue, img_queue, headers) for _ in range(10)]
consumers = [Consumer(img_queue, headers, path) for _ in range(10)]

for p in producers:
    p.start()

for c in consumers:
    c.start()

for p in producers:
    p.join()

# All pages are parsed; send one stop signal per consumer
for _ in consumers:
    img_queue.put(None)

for c in consumers:
    c.join()
```
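The producer and consumer handoff can be tried out without touching the network. The sketch below is one way to structure it, not necessarily the original author's method: `run_pipeline`, `fetch`, and `handle` are hypothetical names, and `fetch` and `handle` stand in for the page-parsing and image-saving steps. It assumes `handle` is safe to call from several threads. Consumers stop when they receive a `None` sentinel, which is put on the queue only after every producer has finished.

```python
from queue import Empty, Queue
from threading import Thread

SENTINEL = None  # hypothetical stop marker, one is sent per consumer


def producer(url_queue, item_queue, fetch):
    """Drain url_queue and push each fetch(url) result onto item_queue."""
    while True:
        try:
            url = url_queue.get_nowait()
        except Empty:
            break
        item_queue.put(fetch(url))


def consumer(item_queue, handle):
    """Process batches from item_queue until the sentinel arrives."""
    while True:
        items = item_queue.get()  # blocks until a batch or the sentinel
        if items is SENTINEL:
            break
        for item in items:
            handle(item)


def run_pipeline(urls, fetch, handle, n_producers=3, n_consumers=3):
    url_queue, item_queue = Queue(), Queue()
    for url in urls:
        url_queue.put(url)

    producers = [Thread(target=producer, args=(url_queue, item_queue, fetch))
                 for _ in range(n_producers)]
    consumers = [Thread(target=consumer, args=(item_queue, handle))
                 for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()

    for t in producers:
        t.join()
    # Every batch is queued by now, so the sentinels end up behind all of them
    for _ in consumers:
        item_queue.put(SENTINEL)
    for t in consumers:
        t.join()
```

For a dry run, `fetch` can be a function that returns a fixed list of fake image URLs and `handle` can be `results.append` on an ordinary list.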

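Summary

This article covered the middle installment of "Seven Small Python Web-Scraping Cases Explained", with worked examples for scraping Qiushibaike jokes and scraping Meizitu. In practice the code will need to be adapted and optimized for your own requirements, but the examples here can serve as a basic framework to start from.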

Unless otherwise noted, articles on this site are original. If you repost this article, please credit the source: Python7个爬虫小案例详解(附源码)中篇 - Python技术站
