Web Crawlers (2): Exception Handling

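The previous post briefly covered the groundwork for learning web crawling and walked through fetching a simple web page as an example. The network is messy, though, and a request to a website will not always succeed. A crawler therefore has to handle the failures that come up while fetching pages; otherwise it will hit an error and stop running the first time something goes wrong.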

Let's look at the exceptions that urlopen can raise:

html = urlopen("http://www.heibanke.com/lesson/crawler_ex00/")

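Two main things can go wrong on this line:

1. The page does not exist on the server (or an error occurs while retrieving it)
2. The server does not exist

In the first case the server returns an HTTP error status, and urlopen raises an HTTPError.
In the second case there is no server to respond, and urlopen raises a URLError (it does not return None). HTTPError is a subclass of URLError, so its except clause has to come first.
With handling for both cases added, the code from the previous post becomes:

__author__ = 'f403'
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://www.heibanke.com/lesson/crawler_ex00/")
except HTTPError as e:
    # The server responded, but with an error status (e.g. 404 or 500)
    print(e)
except URLError as e:
    # The server could not be reached at all
    print("Server not found:", e.reason)
else:
    # Only parse the page if the request succeeded
    bsobj = BeautifulSoup(html, "html.parser")
    print(bsobj.h1)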
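The fetch-and-handle pattern can also be wrapped in a small function so that callers only have to check for None. This is a minimal sketch, not the code from the previous post: safe_open and its opener parameter are hypothetical names introduced here (opener defaults to urlopen and exists only so the function can be exercised without a network connection). It relies on the fact that, in Python 3's urllib, an unreachable server surfaces as URLError and that HTTPError is a subclass of URLError.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def safe_open(url, opener=urlopen):
    """Return the response for url, or None if the request fails.

    Hypothetical helper: `opener` is any callable that takes a URL and
    returns a response; it defaults to urllib's urlopen.
    """
    try:
        return opener(url)
    except HTTPError as e:
        # Must be listed before URLError, because HTTPError subclasses it
        print("HTTP error:", e.code)
    except URLError as e:
        # No response at all: unknown host, refused connection, etc.
        print("Could not reach server:", e.reason)
    return None
```

A caller would then write something like html = safe_open(url) and skip parsing when html is None.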

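With exception handling in place, the program can deal with errors raised while requesting the page, so by the time parsing starts we know the page was actually retrieved from the server. That still does not guarantee the page contains what we expect. In the program above, for example, nothing guarantees that an h1 tag exists, so this kind of problem needs handling too.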

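These problems also fall into two cases:

1. Accessing a tag that does not exist

2. Accessing a child of a tag that does not exist

In the first case BeautifulSoup returns None. In the second case an AttributeError is raised, because the code is trying to read an attribute of that None object.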

With this handling added, the code becomes:

__author__ = 'f403'
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://www.heibanke.com/lesson/crawler_ex00/")
except HTTPError as e:
    # The server responded, but with an error status
    print(e)
except URLError as e:
    # The server could not be reached at all
    print("Server not found:", e.reason)
else:
    bsobj = BeautifulSoup(html, "html.parser")
    try:
        t = bsobj.h1
        if t is None:
            print("Tag does not exist")
        else:
            print(t)
    except AttributeError as e:
        # Raised when accessing a child of a tag that turned out to be None
        print(e)
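The None check and the AttributeError case can also be folded into one small helper instead of a nested try/except. The sketch below is only an illustration: find_nested is a hypothetical name, not part of BeautifulSoup, and it assumes that looking up a missing tag as an attribute yields None, as described above. It walks a chain of tag names and stops at the first missing one instead of raising.

```python
def find_nested(root, *names):
    """Follow root.<name1>.<name2>... and return None at the first missing step.

    Hypothetical helper: it relies only on attribute access, so it works with
    any object whose missing children come back as None.
    """
    node = root
    for name in names:
        if node is None:
            # A previous step was missing, so there is nothing to descend into
            return None
        node = getattr(node, name, None)
    return node
```

With the bsobj from the program above, find_nested(bsobj, "h1", "span") would return None when either tag is missing, so the caller needs a single None check rather than a try/except.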