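A complete guide to building a web crawler with the urllib module in Python 3 follows.

1. Import the urllib library

urllib is a package made up of several submodules. Before using the functions in one of them, import that submodule explicitly; a bare "import urllib" is not guaranteed to load urllib.request.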

import urllib.request

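2. Open a web page

The urlopen() function in urllib.request opens a web page and returns a file-like object. Calling read() on that object returns the page content as bytes, not as str.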

response = urllib.request.urlopen('http://www.example.com/')
html = response.read()
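
Because read() returns bytes and a request can fail, a slightly more defensive version of this step may be useful. The following is only a minimal sketch: the function name fetch_html, the 10-second timeout, and the UTF-8 fallback are illustrative assumptions, not part of the original example.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10, default_charset="utf-8"):
    """Return the body at `url` as str, or None if the request fails.

    `fetch_html` and its defaults are hypothetical helpers for illustration.
    """
    try:
        # The response object is a context manager, so it is closed on exit.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            raw = response.read()  # bytes
            # Use the charset declared in the Content-Type header, if any.
            charset = response.headers.get_content_charset() or default_charset
    except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
        print(f"Request failed: {exc}")
        return None
    return raw.decode(charset, errors="replace")
```

Calling fetch_html('http://www.example.com/') would then return the page as text, or None when the server cannot be reached.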

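3. Parse the page

Once the page content has been read, a parsing library such as BeautifulSoup or lxml can parse it and extract the data you need.

The example below uses BeautifulSoup, which is installed separately as the beautifulsoup4 package; the 'lxml' parser it selects also requires the lxml package: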

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
titles = soup.find_all('h1', class_='title')
for title in titles:
    print(title.text)

This code parses the page with BeautifulSoup, finds every h1 tag whose class is 'title', and prints the text inside each one.
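
Where installing third-party packages is not an option, the same extraction can be approximated with the standard library's html.parser module. This is a swapped-in technique rather than the BeautifulSoup approach above, and the names TitleCollector and extract_titles are hypothetical. The sketch expects str input, so bytes returned by read() need to be decoded first.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text of every <h1> whose class list contains 'title'."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # text chunks gathered while inside a matching <h1>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and "title" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None


def extract_titles(html_text):
    """Return the text of all matching <h1> tags in document order."""
    parser = TitleCollector()
    parser.feed(html_text)
    parser.close()
    return parser.titles
```

For example, extract_titles(html.decode('utf-8')) would return a list of title strings. This sketch does not handle nested h1 tags or malformed markup as gracefully as a full parsing library.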

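4. Use a proxy

urllib.request can route requests through a proxy server. The target site then sees the proxy's address instead of the crawler's, which lowers the risk of the crawler's own IP address being blocked.

Example code: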

proxy_handler = urllib.request.ProxyHandler({"http": "http://user:password@proxy_ip:port"})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)

response = urllib.request.urlopen('http://www.example.com/')
html = response.read()

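This code configures a proxy server for the http protocol, where 'user' is the username, 'password' is the password, 'proxy_ip' is the proxy server address, and 'port' is the port number; replace these placeholders with real values. install_opener() makes the opener the global default, so the following urlopen() call to the target site 'http://www.example.com/' goes through the proxy.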
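
If changing the global default is not desirable, the opener can be used directly instead of being installed. The sketch below also maps https to the same proxy, which is an assumption about the proxy's capabilities; the helper name make_proxy_opener and the address in the usage note are hypothetical.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that sends http and https requests through proxy_url.

    Hypothetical helper: it assumes one proxy serves both schemes.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    # Passing our own ProxyHandler replaces the default one in the opener.
    return urllib.request.build_opener(handler)
```

A call such as make_proxy_opener('http://127.0.0.1:8080').open('http://www.example.com/', timeout=10) would then send only that request through the proxy, leaving plain urlopen() calls elsewhere in the program unaffected.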

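5. Receive cookies

Some sites require users to log in before certain pages can be accessed. In that case the crawler has to accept the cookies the site sets and send them back on later requests in order to simulate a logged-in session.

Example code: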

import http.cookiejar
import urllib.parse
import urllib.request

cookiejar = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookiejar)
opener = urllib.request.build_opener(handler)

response = opener.open('http://www.example.com/login')
data = {'username': 'username', 'password': '123456'}
post_data = urllib.parse.urlencode(data).encode('utf-8')

response = opener.open('http://www.example.com/login', data=post_data)
html = response.read()

response = opener.open('http://www.example.com/user')
html = response.read()

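This code first creates a CookieJar object, passes it to an HTTPCookieProcessor handler, and then passes that handler to build_opener() to construct an opener. The opener first requests the login page and then submits the login form as a POST request. Any cookies the server sets in its responses are stored in the CookieJar, while the html variable holds only the response body. Finally, the same opener requests the protected page and automatically sends the stored cookies along.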
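
To check whether a login actually set any cookies, the CookieJar can be iterated directly. The following sketch wraps the setup in small helpers; make_session, encode_form, and cookie_names are hypothetical names introduced only for illustration.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return (opener, jar); responses handled by opener update jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_form(fields):
    """URL-encode a dict into the bytes body that opener.open(data=...) expects."""
    return urllib.parse.urlencode(fields).encode("utf-8")


def cookie_names(jar):
    """Return the names of all cookies currently stored, sorted for stable output."""
    return sorted(cookie.name for cookie in jar)
```

After opener.open('http://www.example.com/login', data=encode_form(data)), calling cookie_names(jar) shows which cookies the server set. An empty list suggests the login did not establish a session, for instance because the site expects additional form fields.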

That concludes the guide to building a web crawler with the urllib module in Python 3.

Unless otherwise noted, articles on this site are original. If you reproduce this article, please credit the source: python3使用urllib模块制作网络爬虫 - Python技术站
