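Python: 7 Small Web Scraping Examples Explained, Part 2 (Guide)
Introduction
This article is the middle installment of "7 Small Python Web Scraping Examples Explained". The seven examples in the series are: scraping Qiushibaike (糗事百科) jokes, Meizitu (妹子图) images, Dangdang books, Baidu Baike, Lianjia rental listings, Hong Kong Observatory weather forecasts, and Douyu live streams. This installment walks through the first two in detail, with source code for reference.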
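Contents
- Scraping Qiushibaike jokes
This example mainly uses the requests library, XPath, and regular expressions. requests fetches the Qiushibaike page, XPath extracts the joke text, regular expressions can then clean up the extracted strings, and the result is written out. The sample code covers the fetch and parse steps and prints each joke.
Example code: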
```python
import requests
from lxml import etree

# Browser-like User-Agent so the request looks like an ordinary page visit
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

response = requests.get('https://www.qiushibaike.com/text/', headers=headers)
html = response.content.decode('utf-8')
selector = etree.HTML(html)

# Each joke lives in its own article container
content_list = selector.xpath('//div[@class="article block untagged mb15 typs_hot"]')
for content in content_list:
    # xpath() returns a list; skip containers that have no text node
    texts = content.xpath('./div[@class="content"]/span/text()')
    if texts:
        print(texts[0].strip())
```
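The description above also mentions regex post-processing and saving the output, which the sample leaves out. A minimal sketch of those two steps, using only the standard library: the helper names `clean_text` and `save_lines` and the whitespace-collapsing rule are assumptions made for illustration, not part of the original source.

```python
import re
from pathlib import Path

# Assumed cleanup rule: any run of whitespace (spaces, tabs, newlines) becomes one space.
_WHITESPACE = re.compile(r"\s+")


def clean_text(raw):
    """Collapse whitespace runs and trim both ends."""
    return _WHITESPACE.sub(" ", raw).strip()


def save_lines(lines, path):
    """Write the non-empty cleaned lines to a UTF-8 file, one per line.

    Returns the number of lines written.
    """
    cleaned = [clean_text(line) for line in lines]
    cleaned = [line for line in cleaned if line]
    Path(path).write_text("".join(line + "\n" for line in cleaned), encoding="utf-8")
    return len(cleaned)
```

In the scraper, the strings returned by the XPath query could be collected into a list and passed to `save_lines(jokes, "jokes.txt")` instead of being printed.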
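- Scraping Meizitu images
This example mainly uses the requests library, XPath, regular expressions, and multithreading. requests fetches the Meizitu list pages, XPath extracts the image URLs, a regular expression derives a file name from each URL, and the images are saved to disk. Because the site requires downloading a large number of images, the example uses multiple threads to speed up the crawl.
Example code: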
```python
import os
import re
from queue import Empty, Queue
from threading import Thread

import requests
from lxml import etree


class Producer(Thread):
    """Fetches list pages and queues the image URLs found on each page."""

    def __init__(self, url_queue, img_queue, headers):
        super().__init__()
        self.url_queue = url_queue
        self.img_queue = img_queue
        self.headers = headers

    def run(self):
        while True:
            try:
                # get_nowait() avoids blocking forever when another
                # thread has just taken the last URL
                url = self.url_queue.get_nowait()
            except Empty:
                break
            response = requests.get(url, headers=self.headers)
            html = response.content.decode('utf-8')
            selector = etree.HTML(html)
            img_list = selector.xpath('//div[@class="pic"]/a/img/@src')
            self.img_queue.put(img_list)


class Consumer(Thread):
    """Downloads every image URL taken from img_queue into path."""

    def __init__(self, img_queue, headers, path):
        super().__init__()
        self.img_queue = img_queue
        self.headers = headers
        self.path = path

    def run(self):
        while True:
            img_list = self.img_queue.get()
            if img_list is None:  # sentinel: all producers have finished
                break
            for img_url in img_list:
                names = re.findall(r'/(\w+\.jpg$)', img_url)
                if not names:  # skip URLs that do not end in a plain .jpg name
                    continue
                response = requests.get(img_url, headers=self.headers)
                img_path = os.path.join(self.path, names[0])
                with open(img_path, 'wb') as f:
                    f.write(response.content)


headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Queue the list pages 1 through 9
url_queue = Queue()
for i in range(1, 10):
    url_queue.put('http://www.meizitu.com/a/list_1_{}.html'.format(i))

img_queue = Queue()
path = './images'
os.makedirs(path, exist_ok=True)

producers = [Producer(url_queue, img_queue, headers) for _ in range(10)]
consumers = [Consumer(img_queue, headers, path) for _ in range(10)]
for t in producers + consumers:
    t.start()
for p in producers:
    p.join()
# All pages are parsed: send one sentinel per consumer so each one exits
for _ in consumers:
    img_queue.put(None)
for c in consumers:
    c.join()
```
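Shutting the worker threads down cleanly is the subtle part of a producer/consumer crawler: a thread that calls a blocking `get()` on a queue nobody will fill again never exits. As a standalone illustration that needs no network access, here is a rough sketch of sentinel-based shutdown. The function `run_workers` and its `handle` parameter are hypothetical names introduced for this example; in the crawler, `handle` would be the function that downloads one image.

```python
from queue import Queue
from threading import Thread

_STOP = object()  # sentinel: tells a worker there is no more work


def run_workers(items, handle, n_workers=4):
    """Apply handle() to every item using n_workers threads.

    Returns the results as a list; their order is not guaranteed.
    """
    tasks = Queue()
    results = Queue()

    def worker():
        while True:
            item = tasks.get()  # blocks until an item or a sentinel arrives
            if item is _STOP:
                break
            results.put(handle(item))

    threads = [Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:  # one sentinel per worker so every worker exits
        tasks.put(_STOP)
    for t in threads:
        t.join()

    out = []
    while not results.empty():  # safe here: all workers have finished
        out.append(results.get())
    return out
```

Because each worker consumes exactly one sentinel, no thread depends on checking `empty()` before `get()`, a check that can go stale as soon as another thread touches the queue.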
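Summary
This article walked through the middle installment of "7 Small Python Web Scraping Examples Explained", covering the Qiushibaike joke scraper and the Meizitu image scraper. In practice you will need to adapt and optimize the code for your own needs, for example by updating the XPath selectors when a site changes its markup, but the examples here can serve as a starting framework.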