When collecting data with Scrapy in Python, a site may identify the crawler and block it, so it is common practice to rotate the user-agent field in the request headers so that requests look like they come from real browsers. This article shows how to assign a random user-agent to every request in Scrapy.
Prerequisites
Before diving in, you should be familiar with Scrapy basics, including its general usage, the role of pipelines and downloader middlewares, and Scrapy's support for asynchronous requests.
Steps
- In settings.py, register the random user-agent downloader middleware from the scrapy-fake-useragent package and configure its providers and fallback, as shown below:
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
# Enable the fake user-agent providers, tried in the order listed
FAKEUSERAGENT_PROVIDERS = [
    'scrapy_fake_useragent.providers.FakeUserAgentProvider',
    'scrapy_fake_useragent.providers.FakerProvider',
    'scrapy_fake_useragent.providers.FixedUserAgentProvider',
]
FAKEUSERAGENT_FALLBACK = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
```
- Install the third-party libraries fake_useragent and scrapy-fake-useragent:
```bash
pip install fake_useragent scrapy-fake-useragent
```
- Create a custom downloader middleware (for example in myproject/middlewares.py) that picks a random user-agent:
```python
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from fake_useragent import UserAgent

class RandomUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        super().__init__(user_agent)
        # Build the user-agent pool once at startup
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # Attach a random user-agent to every outgoing request
        request.headers.setdefault('User-Agent', self.ua.random)
```
- Optionally, the same logic can also be kept in pipelines.py. Note that Scrapy item pipelines operate on scraped items rather than requests, so the header is actually set by the downloader middleware above:
```python
from fake_useragent import UserAgent

class RandomUserAgentPipeline(object):
    def __init__(self):
        # Build the user-agent pool once
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # Assign a random user-agent if one has not been set yet
        request.headers.setdefault('User-Agent', self.ua.random)
```
- Enable the custom components in settings.py; registering the custom middleware here replaces the scrapy_fake_useragent entry from the first step:
```python
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
'myproject.middlewares.RandomUserAgentMiddleware': 400,
}
ITEM_PIPELINES = {
'myproject.pipelines.RandomUserAgentPipeline': 300,
}
```
Examples
The two examples below show how random user-agents are used when fetching data in practice.
Example 1
```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        urls = [
            "http://example.com/page1",
            "http://example.com/page2",
            "http://example.com/page3",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Save the page body to a local file named after the last URL segment
        page = response.url.split("/")[-1]
        filename = f'page-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
```
In this example the spider itself needs no changes: with RandomUserAgentMiddleware (and optionally RandomUserAgentPipeline) enabled in settings.py, the middleware injects a random user-agent field into the headers of every request.
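To confirm that the rotation is active, one quick check is to log the user-agent that each request actually carried. The sketch below is only illustrative; the UACheckSpider name, the repeated requests to example.com, and the parse_check callback are assumptions, not part of the original example:
```python
import scrapy

class UACheckSpider(scrapy.Spider):
    name = 'ua_check'

    def start_requests(self):
        # Issue several requests to the same URL; dont_filter avoids the duplicate filter
        for _ in range(3):
            yield scrapy.Request('http://example.com', callback=self.parse_check, dont_filter=True)

    def parse_check(self, response):
        # response.request is the request object that produced this response
        self.logger.info('User-Agent sent: %s', response.request.headers.get('User-Agent'))
```
If the middleware is working, the logged browser strings should differ between requests.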
Example 2
```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from myproject.items import Product

class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']
    rules = (Rule(LinkExtractor(allow=(r'music\.aspx',)), callback='parse_page'),)

    def parse_page(self, response):
        # Load product fields from the page into a Product item
        loader = ItemLoader(item=Product(), response=response)
        loader.add_xpath('name', '//div[@class="product_name"]/text()')
        loader.add_xpath('price', '//div[@class="product_price"]/text()')
        yield loader.load_item()
```
In this example the random user-agent handling again stays out of the spider code: with the components above enabled, the crawler fetches the site's product pages under varying user-agents and collects the product information for later analysis and processing.
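The example imports Product from myproject.items, which is not defined above. A minimal sketch of what such an item could look like, assuming it only carries the two fields the loader populates (name and price):
```python
import scrapy

class Product(scrapy.Item):
    # Fields filled by the ItemLoader in parse_page
    name = scrapy.Field()
    price = scrapy.Field()
```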