Background
A crawler issues many requests per unit of time, which puts pressure on both the local machine and the remote server, and all the more so if every request opens a new connection. If the server supports keep-alive, the crawler can share one connection across multiple requests and "do more with less": fewer connections are opened and closed per unit of time, yet more effective requests get through, and the load placed on the target server drops noticeably.
The benefits of keep-alive (HTTP persistent connection); a minimal timing sketch follows this list:
- Lower CPU and memory usage (because fewer connections are open simultaneously).
- Enables HTTP pipelining of requests and responses.
- Reduced network congestion (fewer TCP connections).
- Reduced latency in subsequent requests (no handshaking).
- Errors can be reported without the penalty of closing the TCP connection.
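To make the latency point concrete, here is a minimal sketch (my illustration, not from the original post) that times repeated requests with and without a pooled session. https://www.example.com/ is a stand-in URL, and the exact numbers will depend on your network:

import time
import requests

def timed_get(get, url):
    # issue one GET and return the elapsed wall-clock time
    t0 = time.time()
    get(url)
    return time.time() - t0

url = "https://www.example.com/"  # stand-in URL; replace with your target

# Without keep-alive: each top-level requests.get() opens a fresh
# connection (TCP + TLS handshake) and discards it when done.
cold = [timed_get(requests.get, url) for _ in range(3)]

# With keep-alive: a Session pools the connection, so requests after
# the first one skip the handshake entirely.
s = requests.Session()
warm = [timed_get(s.get, url) for _ in range(3)]

print("new connection each time:", ["%.3fs" % t for t in cold])
print("pooled session:          ", ["%.3fs" % t for t in warm])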
Implementation
HTTP clients differ in how completely they implement the HTTP protocol, and some do not support keep-alive at all. As far as Python goes:
- urllib2 does not support it (see the post on why urllib2 cannot add the "Connection: Keep-Alive" HTTP header), but keep-alive can be added by loading a third-party handler; see "Python urllib2 with keep alive"
- urllib3 supports it; see urllib3 1.20 on PyPI (a short sketch follows this list)
- httplib2 supports it; see httplib2 0.9.2 on PyPI
- requests supports it; see the "Session Objects" section of the requests advanced-usage docs
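As an aside, here is a minimal urllib3 sketch (my illustration, not from the post) of the same idea; the URL is a stand-in:

import urllib3

# A PoolManager keeps a connection pool per host; repeated requests to
# the same host reuse the pooled, kept-alive connection.
http = urllib3.PoolManager(num_pools=1, maxsize=1)

for _ in range(3):
    r = http.request("GET", "https://www.example.com/")
    print(r.status, r.headers.get("Connection"))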
Below is an implementation using requests (taken from one of my own projects, trimmed and abstracted).
import re
import sys
import time

import requests

# placeholder work queue; in the real project this held the crawl task ids
tasks = ["task-1", "task-2"]

def getSession():
    # one pooled connection is enough here: every request targets the same
    # host; mount the adapter for both schemes so https:// uses the pool too
    a = requests.adapters.HTTPAdapter(pool_connections=1, pool_maxsize=1,
                                      max_retries=0, pool_block=False)
    s = requests.Session()
    s.mount('http://', a)
    s.mount('https://', a)
    return s

def main():
    # start time of the current session
    st = time.time()
    # init the first session
    s = getSession()
    # init the keep-alive timeout value (seconds)
    kato = 5
    # serial number of the handled request
    sn = 0
    # work the queue; failed or throttled tasks are pushed back to the front
    while tasks:
        task = tasks.pop(0)
        # use time of the current session
        ut = time.time() - st
        # rebuild the session if the server has likely closed the idle connection
        if ut >= kato:
            s = getSession()
            # reset the start time of the current session
            st = time.time()
        url = "https://www.example.com/%s" % task
        # to bypass the anti-spider solutions
        headers = {'user-agent': "a new ua", "Cookie": "a new cookie id"}
        # get response
        try:
            r = s.get(url, headers=headers, allow_redirects=False)
            # adopt the server's keep-alive timeout minus a safety margin;
            # the header looks like "timeout=5, max=100" and may be absent
            m = re.search(r"timeout=(\d+)", r.headers.get("Keep-Alive", ""))
            if m:
                kato = int(m.group(1)) - 3
        except Exception as e:
            tasks.insert(0, task)
            print(str(e))
            continue
        # handle the response according to the status_code, etc.
        info = ""
        if r.status_code == 404:
            pass
        elif r.status_code == 301:
            pass
        elif r.status_code == 200:
            info = "your info"
        elif r.status_code == 403:
            # anti-spider triggered: requeue the task and back off
            tasks.insert(0, task)
            print("as triggered, will sleep for 5 minutes")
            time.sleep(300)
            s = getSession()
            st = time.time()
            continue
        else:
            print(task, r.status_code)
            sys.exit()
        sn += 1
        print("%s, %s, %s" % (sn, task, info))

if __name__ == "__main__":
    main()