Python Web Crawler Study Notes (Part 1)

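1. Introduction to urllib2

urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface in the form of the urlopen function, which can fetch URLs using a variety of protocols.
It also offers a slightly more complex interface for handling common situations such as basic authentication, cookies, and proxies.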

2. Fetching URLs

The simplest way to use urllib2 looks like this:

import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
print html

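The output is the content of the fetched page.

urllib2 can fetch URLs in several schemes: 'http:' can be replaced with 'ftp:', 'file:', and so on. HTTP is based on requests and responses, and urllib2 represents the HTTP request with a Request object. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the requested URL. The response is a file-like object, which means you can call .read() on it: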

import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page

urllib2 handles various URL schemes in the same way. For example, an ftp URL:

req = urllib2.Request('ftp://example.com/')
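The 'file:' scheme mentioned above is handy for trying urlopen without a network connection. The following is a minimal sketch rather than part of the original notes: the temporary file and its contents are made up for illustration, and the try/except import is an assumption added so the snippet also runs on Python 3, where the urllib2 functions live in urllib.request.

```python
import os
import tempfile

try:  # Python 2, as used throughout these notes
    from urllib2 import urlopen
    from urllib import pathname2url
except ImportError:  # Python 3 equivalent (assumption for portability)
    from urllib.request import urlopen, pathname2url

# Create a throwaway local file to fetch (hypothetical content).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'hello from a local file')

# Build a file: URL from the local path and fetch it like any other URL.
url = 'file:' + pathname2url(path)
response = urlopen(url)
content = response.read()
response.close()
os.remove(path)

print(content)
```

The response behaves the same way as an HTTP response: it is a file-like object and .read() returns the raw bytes.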

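3. Data

Sometimes you want to send data to a URL (often the URL refers to a CGI (Common Gateway Interface) script or another web application).

With HTTP, this is usually done with a POST request. This is what your browser typically does when you submit an HTML form that you have filled in.

Not all POSTs come from forms: you can use a POST to send arbitrary data to your own application.

In the common case of HTML forms, the data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is done with a function from the urllib library, not from urllib2.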

import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()

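If you do not pass the data argument, urllib2 uses a GET request. One way GET and POST requests differ is that POST requests often have "side effects": they change the state of the system in some way.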
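You can check which method a Request will use without sending anything. This is a small sketch, assuming only that Request.get_method() reports the HTTP verb; the try/except import is added so it also runs on Python 3, and the .encode('ascii') call is there because Python 3 expects the POST body as bytes.

```python
try:  # Python 2
    from urllib2 import Request
    from urllib import urlencode
except ImportError:  # Python 3 equivalent (assumption for portability)
    from urllib.request import Request
    from urllib.parse import urlencode

url = 'http://www.someserver.com/cgi-bin/register.cgi'
data = urlencode({'name': 'Michael Foord'}).encode('ascii')

get_req = Request(url)         # no data argument: GET
post_req = Request(url, data)  # data argument supplied: POST

print(get_req.get_method())
print(post_req.get_method())
```

Nothing is sent over the network here; the request only goes out when the object is passed to urlopen.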

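Although the HTTP standard makes clear that POSTs are intended to cause side effects and GET requests are never intended to, data can also be passed in an HTTP GET request by encoding it in the URL itself.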

>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values # The order may differ.
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)

The full URL is built by appending a ? to the URL, followed by the encoded values.
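If you need the parameters in a predictable order, urlencode also accepts a sequence of (key, value) pairs instead of a dictionary. The sketch below shows this, along with how reserved characters are escaped. The 'q' parameter is a made-up example, and the try/except import is only there so the snippet also runs on Python 3.

```python
try:  # Python 2
    from urllib import urlencode
except ImportError:  # Python 3 equivalent (assumption for portability)
    from urllib.parse import urlencode

# A list of pairs keeps the order fixed, unlike a plain dict in Python 2.
params = [
    ('name', 'Somebody Here'),
    ('location', 'Northampton'),
    ('language', 'Python'),
    ('q', 'a&b=c'),  # hypothetical value containing reserved characters
]
query = urlencode(params)
full_url = 'http://www.example.com/example.cgi' + '?' + query

print(query)
print(full_url)
```

Spaces become + and characters such as & and = inside a value are percent-encoded, so they cannot be confused with the separators between parameters.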

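4. Headers

We will discuss one particular HTTP header here to illustrate how to add headers to your HTTP request. Some websites dislike being browsed by programs, or send different versions of their content to different browsers.

By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site or simply not work.

A browser identifies itself through the User-Agent header. When you create a Request object, you can pass in a dictionary of headers.

The following example makes the same request as above, but identifies itself as a browser by sending a Mozilla/5.0 User-Agent string.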

import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
headers = { 'User-Agent' : user_agent }
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
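To confirm which headers a Request carries before sending it, you can inspect the object directly. This is a minimal sketch: the Accept-Language header is a made-up addition for illustration, and the try/except import is only there so it also runs on Python 3. One detail to watch for is that Request normalizes header names with str.capitalize(), so the stored key is 'User-agent', not 'User-Agent'.

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3 equivalent (assumption for portability)
    from urllib.request import Request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'

# Headers can be passed at construction time...
req = Request(url, headers={'User-Agent': user_agent})
# ...or added afterwards (hypothetical extra header).
req.add_header('Accept-Language', 'en')

# Look headers up using the capitalized form that Request stores.
print(req.get_header('User-agent'))
print(sorted(req.header_items()))
```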

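5. URLError

urlopen raises URLError when it cannot handle a response (although, as usual with Python APIs, built-in exceptions such as ValueError and TypeError may also be raised).

HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.

Often, URLError is raised because there is no network connection (no route to the specified server), or because the specified server does not exist. In this case, the exception raised has a "reason" attribute, which is a tuple containing an error code and a text error message.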

import urllib2
req = urllib2.Request('http://www.pretend_server.org')
try:
  urllib2.urlopen(req)
except urllib2.URLError as e:
  print e.reason

The output is:

[Errno -2] Name or service not known
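A URLError can also be triggered without any network access, which makes the exception easier to experiment with. The sketch below uses a deliberately invalid scheme ('nosuchscheme' is made up for illustration). The try/except import is only there so it also runs on Python 3. Note that in this case reason is a plain string, so it is safest not to assume a particular type for it.

```python
try:  # Python 2
    from urllib2 import urlopen, URLError, HTTPError
except ImportError:  # Python 3 equivalent (assumption for portability)
    from urllib.request import urlopen
    from urllib.error import URLError, HTTPError

reason = None
try:
    # No handler exists for this scheme, so urlopen fails before
    # any connection is attempted.
    urlopen('nosuchscheme://example.com/')
except URLError as e:
    reason = e.reason

print(reason)
# HTTPError is a subclass of URLError, as described above.
print(issubclass(HTTPError, URLError))
```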

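6. HTTPError

Every HTTP response from the server contains a numeric "status code".

Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers deal with some of these responses for you (for example, if the response is a "redirection" that asks the client to fetch the document from a different URL, urllib2 handles that).

For those it cannot handle, urlopen raises an HTTPError. Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).

The error codes are listed below: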

# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
100: ('Continue', 'Request received, please continue'),
101: ('Switching Protocols',
'Switching to new protocol; obey Upgrade header'),
200: ('OK', 'Request fulfilled, document follows'),
201: ('Created', 'Document created, URL follows'),
202: ('Accepted',
'Request accepted, processing continues off-line'),
203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
204: ('No Content', 'Request fulfilled, nothing follows'),
205: ('Reset Content', 'Clear input form for further input.'),
206: ('Partial Content', 'Partial content follows.'),
300: ('Multiple Choices',
'Object has several resources -- see URI list'),
301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
302: ('Found', 'Object moved temporarily -- see URI list'),
303: ('See Other', 'Object moved -- see Method and URL list'),
304: ('Not Modified',
'Document has not changed since given time'),
305: ('Use Proxy',
'You must use proxy specified in Location to access this '
'resource.'),
307: ('Temporary Redirect',
'Object moved temporarily -- see URI list'),
400: ('Bad Request',
'Bad request syntax or unsupported method'),
401: ('Unauthorized',
'No permission -- see authorization schemes'),
402: ('Payment Required',
'No payment -- see charging schemes'),
403: ('Forbidden',
'Request forbidden -- authorization will not help'),
404: ('Not Found', 'Nothing matches the given URI'),
405: ('Method Not Allowed',
'Specified method is invalid for this server.'),
406: ('Not Acceptable', 'URI not available in preferred format.'),
407: ('Proxy Authentication Required', 'You must authenticate with '
'this proxy before proceeding.'),
408: ('Request Timeout', 'Request timed out; try again later.'),
409: ('Conflict', 'Request conflict.'),
410: ('Gone',
'URI no longer exists and has been permanently removed.'),
411: ('Length Required', 'Client must specify Content-Length.'),
412: ('Precondition Failed', 'Precondition in headers is false.'),
413: ('Request Entity Too Large', 'Entity is too large.'),
414: ('Request-URI Too Long', 'URI is too long.'),
415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
416: ('Requested Range Not Satisfiable',
'Cannot satisfy request range.'),
417: ('Expectation Failed',
'Expect condition could not be satisfied.'),
500: ('Internal Server Error', 'Server got itself in trouble'),
501: ('Not Implemented',
'Server does not support this operation'),
502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
503: ('Service Unavailable',
'The server cannot process the request due to a high load'),
504: ('Gateway Timeout',
'The gateway server did not receive a timely response'),
505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
}
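A table of this shape ships with the standard library as BaseHTTPRequestHandler.responses, so you do not need to copy it into your own code. The following sketch looks up the short message for a status code; the describe helper is a hypothetical name introduced here, and the try/except import is only there so it also runs on Python 3, where the class lives in http.server.

```python
try:  # Python 2
    from BaseHTTPServer import BaseHTTPRequestHandler
except ImportError:  # Python 3 equivalent (assumption for portability)
    from http.server import BaseHTTPRequestHandler

# Maps each status code to a (shortmessage, longmessage) pair.
responses = BaseHTTPRequestHandler.responses


def describe(code):
    """Return 'code shortmessage', with a fallback for unknown codes."""
    short, _long = responses.get(code, ('Unknown', ''))
    return '%d %s' % (code, short)


print(describe(404))
print(describe(503))
print(describe(999))
```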

When an error is raised, the server responds by returning an HTTP error code and an error page.
You can use the HTTPError instance as the response object for the returned page.
This means that, in addition to the code attribute, it also has the read, geturl, and info methods.

import urllib2
req = urllib2.Request('http://www.python.org/fish.html')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError as e:
    print e.code
    print e.read()

Running this prints:

404
<!doctype html>



<html class="no-js" lang="en" dir="ltr">
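Because HTTPError is a subclass of URLError, a handler that wants to treat the two differently has to catch HTTPError first. The sketch below shows that ordering without touching the network: classify, fake_404, and fake_dns_failure are hypothetical helpers that raise hand-built exceptions in place of real urlopen calls, and the try/except import is only there so it also runs on Python 3.

```python
import io

try:  # Python 2
    from urllib2 import HTTPError, URLError
except ImportError:  # Python 3 equivalent (assumption for portability)
    from urllib.error import HTTPError, URLError


def classify(action):
    """Call action() and report which kind of error it raised."""
    try:
        action()
    except HTTPError as e:  # must come first: HTTPError subclasses URLError
        return ('http', e.code, e.read())
    except URLError as e:
        return ('url', e.reason)
    return ('ok',)


def fake_404():
    # Stand-in for urlopen() hitting a missing page; the body is a stub.
    body = io.BytesIO(b'<!doctype html>')
    raise HTTPError('http://www.python.org/fish.html', 404, 'Not Found', {}, body)


def fake_dns_failure():
    # Stand-in for urlopen() failing to resolve a host name.
    raise URLError('Name or service not known')


print(classify(fake_404))
print(classify(fake_dns_failure))
```

If the two except clauses were swapped, the URLError branch would catch the 404 as well and the error page would never be read.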

Unless otherwise stated, all articles on this site are original. If you repost, please credit the source: Python爬虫学习笔记（一） - Python技术站