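Scrapy is a powerful Python crawling framework whose middlewares can intercept and modify requests and responses at different stages of a crawl. Scrapy ships with a set of built-in middlewares, each with a default order value. That default order is not obvious, and for newcomers it can lead to confusing, hard-to-diagnose problems. This article walks through the order of Scrapy's built-in middlewares and shows how to change that order when needed.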
Order of Scrapy's built-in middlewares
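By default, Scrapy's built-in middlewares run in the order below. The number after each name is its default order value in recent Scrapy releases, as defined in DOWNLOADER_MIDDLEWARES_BASE and SPIDER_MIDDLEWARES_BASE. The exact set varies between versions, so check the settings reference for the release you use.
- Downloader Middleware:
  - RobotsTxtMiddleware (100)
  - HttpAuthMiddleware (300)
  - DownloadTimeoutMiddleware (350)
  - DefaultHeadersMiddleware (400)
  - UserAgentMiddleware (500)
  - RetryMiddleware (550)
  - MetaRefreshMiddleware (580)
  - HttpCompressionMiddleware (590)
  - RedirectMiddleware (600)
  - CookiesMiddleware (700)
  - HttpProxyMiddleware (750)
  - DownloaderStats (850)
  - HttpCacheMiddleware (900)
- Spider Middleware:
  - HttpErrorMiddleware (50)
  - RefererMiddleware (700)
  - UrlLengthMiddleware (800)
  - DepthMiddleware (900)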
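The lists above give the default order of Scrapy's built-in middlewares. Downloader middlewares sit between the engine and the downloader: they process requests on their way out and responses on their way back. Spider middlewares sit between the engine and the spider: they process the responses passed into the spider and the items and requests it yields. For details on what each middleware does, see the official Scrapy documentation.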
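To make the ordering rule concrete, here is a minimal sketch of how a final middleware chain could be derived from default values plus project overrides. It is plain Python, not Scrapy's own code: `resolve_order` is a hypothetical helper, `CustomUserAgentMiddleware` is an invented class name, and the `BASE` values are a small excerpt of the defaults that may differ between Scrapy versions.

```python
# Excerpt of default order values (assumed; check your Scrapy version).
BASE = {
    "scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
}


def resolve_order(base, overrides):
    """Merge overrides into base, drop entries set to None, sort by value."""
    merged = {**base, **overrides}
    enabled = {path: order for path, order in merged.items() if order is not None}
    return sorted(enabled, key=enabled.get)


chain = resolve_order(BASE, {
    # Hypothetical project middleware, placed just before UserAgentMiddleware.
    "myproject.middlewares.CustomUserAgentMiddleware": 490,
    # Mapping a built-in middleware to None disables it.
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
})
for path in chain:
    print(path)
```

The point of the sketch is that only the numeric values decide the order; where an entry appears in the dictionary plays no role.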
How to change the order of Scrapy's built-in middlewares
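In some cases we need to change the order of the built-in middlewares. For example, if a custom middleware builds a header from the User-Agent that is already on the request, UserAgentMiddleware must run before it. We therefore give UserAgentMiddleware the smaller order value, so that the header is in place before the custom middleware sees the request and before the request is sent.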
The following shows how to set the middleware order in a Scrapy project:
# settings.py
# Smaller values sit closer to the engine; larger values sit closer to the downloader (or spider).
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'myproject.middlewares.CustomHttpAuthMiddleware': 290,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'myproject.middlewares.CustomUserAgentMiddleware': 510,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    # MetaRefreshMiddleware lives in the redirect module.
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'myproject.middlewares.CustomHttpCompressionMiddleware': 585,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
# HttpErrorMiddleware and RefererMiddleware are spider middlewares, so they belong here.
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'myproject.middlewares.CustomRefererMiddleware': 690,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    'myproject.middlewares.CustomDepthMiddleware': 910,
}
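The code above shows how to set the order of downloader middlewares and spider middlewares: add each middleware's import path and order value to the project's settings file (settings.py). These dictionaries are merged with Scrapy's built-in defaults, so only the entries you want to add or move actually need to be listed.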
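A related case is replacing a built-in middleware rather than moving it. The fragment below is a sketch that assumes a hypothetical `myproject.middlewares.CustomUserAgentMiddleware`; mapping a built-in entry to `None` disables it.

```python
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in middleware by mapping it to None.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Hypothetical replacement, placed at the built-in's usual position.
    "myproject.middlewares.CustomUserAgentMiddleware": 500,
}
```

Disabling the built-in class avoids having two middlewares compete over the same header.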
For example, to place CustomUserAgentMiddleware before UserAgentMiddleware, change the settings to:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'myproject.middlewares.CustomHttpAuthMiddleware': 290,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    # The custom middleware now has the smaller value, so it runs first.
    'myproject.middlewares.CustomUserAgentMiddleware': 490,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'myproject.middlewares.CustomHttpCompressionMiddleware': 585,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
In this configuration CustomUserAgentMiddleware has the order value 490, which is lower than UserAgentMiddleware's 500, so its process_request runs before UserAgentMiddleware's. The two values must differ: the relative order of middlewares that share a value is not something to rely on.
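The order value affects the two directions differently: process_request hooks run in increasing order of the value, while process_response hooks run in decreasing order. The following toy simulation in plain Python illustrates this. It is not Scrapy's implementation; the `Tracer` class, the `run_chain` function, and the order values are invented for illustration.

```python
class Tracer:
    """Stand-in middleware that records when its hooks are called."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def process_request(self, request):
        self.log.append(f"request:{self.name}")

    def process_response(self, response):
        self.log.append(f"response:{self.name}")


def run_chain(ordered):
    """Call request hooks low-to-high, then response hooks high-to-low."""
    for mw in ordered:
        mw.process_request("req")
    for mw in reversed(ordered):
        mw.process_response("resp")


log = []
# Hypothetical order values; the dictionary's own ordering is irrelevant.
priorities = {"Retry": 550, "CustomUserAgent": 490, "UserAgent": 500}
ordered = [Tracer(name, log) for name in sorted(priorities, key=priorities.get)]
run_chain(ordered)
print(log)
```

A middleware that must see the response last therefore needs the smallest value, even though that also makes it see the request first.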
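Examples
Example 1:
A common need is setting request headers. We normally rely on scrapy.downloadermiddlewares.useragent.UserAgentMiddleware to set the User-Agent header. Some sites, however, expect other headers as well, such as Cookie or Referer. In that case we can write a custom middleware that sets User-Agent together with Cookie, Referer, and similar headers, as in the following code: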
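class CustomHeadersMiddleware:
    def process_request(self, request, spider):
        # Headers to apply to every outgoing request.
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
            'Referer': 'https://www.google.com/',
            'Cookie': 'sessionid=123456789'
        }
        # update() overwrites any existing values for these header names.
        request.headers.update(headers)
        # Returning None lets the request continue down the middleware chain.
        return None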
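Here the process_request(self, request, spider) method is called for every request before it is sent, and copies the key-value pairs defined in headers onto the request's own headers.
When using custom headers, place the custom middleware before scrapy.downloadermiddlewares.useragent.UserAgentMiddleware so that the headers are applied in the intended order. Note that CookiesMiddleware (700 by default) manages the Cookie header itself and may replace a hand-written one unless cookies are disabled or the request sets dont_merge_cookies in its meta. Add the following to settings.py:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 401,
}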
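How the custom middleware interacts with UserAgentMiddleware can be shown with plain dictionaries. The built-in middleware only fills in User-Agent when the header is missing (it uses setdefault), whereas the custom middleware overwrites with update. The two functions below are simplified stand-ins, not Scrapy code, and the default agent string is a placeholder.

```python
def builtin_user_agent(headers, default="Scrapy/x.y (+https://scrapy.org)"):
    # Stand-in for the built-in behaviour: set the header only when absent.
    headers.setdefault("User-Agent", default)


def custom_headers(headers):
    # Stand-in for CustomHeadersMiddleware: overwrite unconditionally.
    headers.update({"User-Agent": "Mozilla/5.0", "Referer": "https://www.google.com/"})


# Custom middleware first (smaller order value), built-in second.
custom_first = {}
custom_headers(custom_first)
builtin_user_agent(custom_first)

# Built-in first, custom middleware second.
builtin_first = {}
builtin_user_agent(builtin_first)
custom_headers(builtin_first)

print(custom_first["User-Agent"], builtin_first["User-Agent"])
```

With these semantics the custom value wins in both orders. Running the custom middleware first simply means the built-in one never writes its default; the order would start to matter if the custom middleware used setdefault as well.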
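Example 2:
Some sites limit how fast and how often they can be requested, so the crawler needs to pause between requests. Scrapy's built-in scrapy.downloadermiddlewares.retry.RetryMiddleware retries requests that fail, for example on a timeout, but it only retries; it adds no delay or waiting time. To add a delay inside a middleware, we can write our own, as in the following code: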
import random
import time


class CustomThrottleMiddleware:
    def __init__(self, delay):
        self.delay = delay

    @classmethod
    def from_crawler(cls, crawler):
        # DOWNLOAD_DELAY may be fractional (e.g. 0.5), so read it as a float.
        delay = crawler.settings.getfloat('DOWNLOAD_DELAY')
        return cls(delay)

    def process_request(self, request, spider):
        delay = random.uniform(0, self.delay)
        spider.logger.info("Delaying request for %.2f seconds...", delay)
        # Caution: time.sleep() blocks Scrapy's event loop, pausing all requests, not just this one.
        time.sleep(delay)
        return None
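This middleware holds each request for a random time of up to delay seconds before letting it continue. For the delay to run early, the custom middleware should sit near the front of the downloader middleware chain. Add the following to settings.py:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomThrottleMiddleware': 150,
}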
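Here 150 is the order value of the custom downloader middleware. A smaller value places the middleware closer to the engine, so its process_request runs earlier and its process_response runs later. Middlewares run in the order of these values, not in the order the entries are written in the dictionary. With 150, the delay logic runs ahead of every built-in downloader middleware except those with a smaller default value, such as RobotsTxtMiddleware (100).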
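Because time.sleep pauses the whole crawler, it may be worth comparing the custom middleware with Scrapy's own throttling settings, which delay requests without blocking. The fragment below is a sketch for settings.py; the values are illustrative only.

```python
# settings.py (sketch; values are illustrative)
DOWNLOAD_DELAY = 2.0             # base delay in seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # wait between 0.5x and 1.5x of DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True      # adapt the delay to the observed response latency
AUTOTHROTTLE_START_DELAY = 1.0   # initial delay used by AutoThrottle
AUTOTHROTTLE_MAX_DELAY = 10.0    # upper bound on the adapted delay
```

If these settings cover the need, no custom throttling middleware has to be ordered at all.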