Background

A crawler issues many requests per unit of time, which puts pressure on both its own machine and the target server, and all the more so if every request opens a new connection. If the server supports keep-alive, the crawler can share a single connection across multiple requests and do more with less: fewer connections are created and torn down per unit of time, more effective requests get through, and the load placed on the target server drops as well.

The benefits of keep-alive (HTTP persistent connection):

  • Lower CPU and memory usage (because fewer connections are open simultaneously).
  • Enables HTTP pipelining of requests and responses.
  • Reduced network congestion (fewer TCP connections).
  • Reduced latency in subsequent requests (no handshaking).
  • Errors can be reported without the penalty of closing the TCP connection.
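The reuse behind these benefits is easy to verify. Below is a minimal, self-contained sketch (the throwaway local server and handler are mine, not from the project code above): a requests.Session sends successive requests to an HTTP/1.1 server over the same TCP connection, so the client-side port the server sees never changes.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
import requests

client_ports = []

class Handler(BaseHTTPRequestHandler):
	protocol_version = "HTTP/1.1"  # HTTP/1.1 defaults to keep-alive

	def do_GET(self):
		# record which client port this request arrived from
		client_ports.append(self.client_address[1])
		body = b"ok"
		self.send_response(200)
		# Content-Length lets the client know where the body ends,
		# which is required to keep the connection open
		self.send_header("Content-Length", str(len(body)))
		self.end_headers()
		self.wfile.write(body)

	def log_message(self, *args):
		pass  # silence per-request logging

server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/" % server.server_address[1]
s = requests.Session()
for _ in range(3):
	assert s.get(url).status_code == 200
server.shutdown()

# all three requests rode the same TCP connection,
# so the client port is identical each time
assert len(set(client_ports)) == 1
```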

Implementation

HTTP clients implement the HTTP protocol to varying degrees, and some do not support keep-alive at all. As far as Python goes:
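Whether a given client/server pair actually kept the connection open can be read off the response: under HTTP/1.1 the connection is persistent unless the server sends "Connection: close", while under HTTP/1.0 it is persistent only when "Connection: keep-alive" is explicitly granted. A small library-agnostic sketch of that rule (the function name and signature are my own):

```python
def connection_is_persistent(http_version, headers):
	"""Decide whether the server will keep the connection open.

	`http_version` is 11 for HTTP/1.1 and 10 for HTTP/1.0 (the
	integer convention used by http.client and urllib3);
	`headers` is a dict of response headers.
	"""
	conn = headers.get("Connection", "").lower()
	if http_version >= 11:
		# HTTP/1.1: persistent unless the server says "close"
		return "close" not in conn
	# HTTP/1.0: persistent only when explicitly granted
	return "keep-alive" in conn
```

With requests, the version of the underlying response should be available as r.raw.version, so a call would look like connection_is_persistent(r.raw.version, r.headers).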

Below is an implementation with requests (taken from my own project code, reduced and abstracted).

import sys
import time
import requests

# task queue; fill with your own task identifiers
tasks = ["task-1", "task-2"]

def getSession():
	s = requests.Session()
	s.mount('http://', requests.adapters.HTTPAdapter(
		pool_connections=1, pool_maxsize=1, max_retries=0, pool_block=False))
	return s

def main():
	# start time of the current session
	st = time.time()
	# init the first session
	s = getSession()
	# init the keep-alive timeout value (seconds)
	kato = 5
	# number of tasks handled so far
	sn = 0
	# loop over the work queue
	for task in tasks:
		# use time of the current session
		ut = time.time() - st
		# rebuild the session once the keep-alive timeout has passed
		if ut >= kato:
			s = getSession()
			# reset the start time of the current session
			st = time.time()
		url = "https://www.example.com/%s" % task
		# rotate identity to bypass anti-spider solutions
		headers = {'user-agent': "a new ua", "Cookie": "a new cookie id"}
		# get response
		try:
			r = s.get(url, headers=headers, allow_redirects=False)
		except Exception as e:
			tasks.insert(0, task)
			print(str(e))
			continue
		# parse the server's keep-alive timeout, keeping a 3-second
		# safety margin; this still assumes the "timeout=N" form
		# and needs a more robust fix
		ka = r.headers.get("Keep-Alive", "")
		if ka.startswith("timeout="):
			kato = int(ka.replace("timeout=", "").split(",")[0]) - 3
		# handle the response according to the status_code, etc.
		info = ""
		if r.status_code == 404:
			pass
		elif r.status_code == 301:
			pass
		elif r.status_code == 200:
			info = "your info"
		elif r.status_code == 403:
			tasks.insert(0, task)
			print("anti-spider triggered, will sleep for 5 minutes")
			time.sleep(300)
			s = getSession()
			continue
		else:
			print(task, r.status_code)
			sys.exit()
		sn += 1
		print("%s, %s, %s" % (sn, task, info))

if __name__ == "__main__":
	main()
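The header parsing above is the fragile part: a real Keep-Alive header can look like "timeout=5, max=100", the parameters can arrive in any order, and the header may be absent entirely. A more robust sketch of that step (the helper name, `default`, and `margin` parameters are my own, mirroring the 5-second fallback and 3-second margin used above):

```python
def keepalive_timeout(headers, default=5, margin=3):
	"""Parse `Keep-Alive: timeout=N[, max=M]` from a headers dict
	and return a safe session lifetime (N minus a safety margin),
	falling back to `default` when the header is missing or
	malformed."""
	value = headers.get("Keep-Alive", "")
	for part in value.split(","):
		name, _, num = part.strip().partition("=")
		if name.strip().lower() == "timeout" and num.strip().isdigit():
			# never return less than one second
			return max(int(num.strip()) - margin, 1)
	return default
```

Inside the loop, the two parsing lines would then collapse to a single call such as kato = keepalive_timeout(r.headers), and a missing or malformed header no longer requeues a request that actually succeeded.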