Building a Simple Web Crawler in Python

This guide walks through building a simple web crawler in Python, step by step.

Step 1: Preparation

Before writing the crawler, a few things need to be in place.

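Install Python: set up a Python environment first; Python 3 or later is recommended.
Install the crawler libraries: Python has many libraries for crawling, such as requests, BeautifulSoup, and Scrapy; pick whichever suits the task. The most commonly used are requests and BeautifulSoup, both of which can be installed with pip:

```bash
pip install requests
pip install beautifulsoup4
```

Understand the page structure: before scraping a page, study how it is built, including its HTML tags and the CSS selectors that identify the elements of interest.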
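To make "page structure" concrete, here is a minimal sketch that parses a small HTML fragment and picks out elements with CSS selectors through BeautifulSoup's `select` and `select_one`. The HTML, the class names, and the variable names are all made up for illustration and do not come from any real site.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a real page (an assumption for this sketch).
html = """
<div class="movie">
  <span class="title">First Film</span>
  <span class="score">9.1</span>
</div>
<div class="movie">
  <span class="title">Second Film</span>
  <span class="score">8.7</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.movie" matches every <div> whose class list contains "movie".
movies = []
for block in soup.select("div.movie"):
    # select_one returns the first match inside this block, or None if there is none.
    title = block.select_one("span.title").get_text(strip=True)
    score = block.select_one("span.score").get_text(strip=True)
    movies.append((title, score))

print(movies)
```

Inspecting a page in the browser's developer tools is usually the quickest way to find which tags and classes to put into selectors like these.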

步骤二：编写代码

Use the requests library to request the page and get its content.

```python
import requests

# Target page and a browser-like User-Agent header
url = "http://www.example.com"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}

# Send the GET request and read the response body as text
response = requests.get(url, headers=headers)
content = response.text
```
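When debugging, it can help to see exactly what requests would send before any network traffic happens. The sketch below builds and prepares a request without sending it; the URL, the `page` parameter, and the User-Agent value are hypothetical and chosen only for illustration.

```python
import requests

# Hypothetical values for illustration; nothing is sent over the network here.
url = "http://www.example.com/list"
headers = {"User-Agent": "my-crawler/0.1"}

# Build the request and prepare it without sending it.
prepared = requests.Request("GET", url, headers=headers, params={"page": 2}).prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

The prepared URL shows how the `params` dictionary is encoded into the query string, which is a quick way to check that a request targets the page intended.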

Use the BeautifulSoup library to parse the page and extract the information needed.

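```python
from bs4 import BeautifulSoup

# `content` is the HTML text fetched with requests in the previous snippet
soup = BeautifulSoup(content, "html.parser")
title = soup.title.string  # text of the <title> tag

# Print the href attribute of every <a> tag
links = soup.find_all("a")
for link in links:
    print(link.get("href"))
```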

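The code above requests the page, parses the response with BeautifulSoup, reads the page title, and prints the href of every link. For a page that links to the following URLs, the output would be: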

http://www.example.com/about
http://www.example.com/contact
http://www.example.com/faq
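On real pages, `href` values are often relative (for example `/about`) rather than full URLs like those above. A small sketch of resolving them with the standard library's `urllib.parse.urljoin` follows; the base URL and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

# Assumed values for illustration only.
base_url = "http://www.example.com/docs/index.html"
hrefs = ["/about", "contact", "../faq", "http://other.example.org/x", None]

absolute_links = []
for href in hrefs:
    # link.get("href") returns None for <a> tags without an href, so skip those.
    if not href:
        continue
    # urljoin resolves relative paths against the base and leaves absolute URLs alone.
    absolute_links.append(urljoin(base_url, href))

for link in absolute_links:
    print(link)
```

Passing the URL that was actually requested as the base keeps the resolved links consistent with what a browser would follow.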

The following examples apply the same approach to real sites.

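Example 1: Scraping the Douban movie chart

The following code scrapes the Douban movie Top 250 chart (as written, it requests a single page):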

```python
import requests
from bs4 import BeautifulSoup

url = "https://movie.douban.com/top250"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
response = requests.get(url, headers=headers)
content = response.text

# Each film sits in a <div class="item">
soup = BeautifulSoup(content, "html.parser")
items = soup.find_all("div", class_="item")
for item in items:
    title = item.find("span", class_="title").string
    rating = item.find("span", class_="rating_num").string
    print(title + " " + rating)
```

The code above requests the Douban movie Top 250 page, parses it with BeautifulSoup, and prints each film's title and rating. Sample output:

肖申克的救赎 9.7
霸王别姬 9.6
...
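The script above reads one page only. The chart appears to be split into pages of 25 entries selected by a `start` query parameter; that paging scheme is an assumption, not something the code above confirms, so it is worth checking in a browser first. Under that assumption, the page URLs could be generated as follows (`page_urls` is a hypothetical helper introduced for this sketch):

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"

def page_urls(total=250, page_size=25):
    """Return one URL per page, assuming a `start` offset parameter (unverified)."""
    urls = []
    for start in range(0, total, page_size):
        urls.append(BASE_URL + "?" + urlencode({"start": start}))
    return urls

urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Each generated URL could then be passed through the same request and parsing code, ideally with a pause between requests.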

Example 2: Scraping the Tencent News list page

The following is the complete code for scraping the Tencent News list page:

```python
import requests
from bs4 import BeautifulSoup

url = "http://news.qq.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
response = requests.get(url, headers=headers)
content = response.text

# Each news entry sits in a <div class="Q-tpList">
soup = BeautifulSoup(content, "html.parser")
items = soup.find_all("div", class_="Q-tpList")
for item in items:
    title = item.find("a").string       # headline text of the first link
    link = item.find("a").get("href")   # URL the headline points to
    print(title + " " + link)
```

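The code above requests the Tencent News page, parses it with BeautifulSoup, and prints each headline with its link. Sample output (the links date from August 2017, so the page markup may have changed since then):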

教育部：专业设置要有常识 功能定位要准确 http://news.qq.com/a/20170824/038413.htm
两万多元的大盘鸡 “高峰”开路费突破3千元 http://news.qq.com/a/20170824/005948.htm
...
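One detail to keep in mind when adapting this example: `.string` returns `None` when a tag has more than one child node, and `title + " " + link` then raises a `TypeError`. The sketch below uses made-up HTML to show the difference between `.string` and `get_text()`; the markup is an assumption and is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; only the class name is borrowed from the example above.
html = """
<div class="Q-tpList"><a href="/a/1.htm">Plain headline</a></div>
<div class="Q-tpList"><a href="/a/2.htm"><em>Breaking</em> nested headline</a></div>
<div class="Q-tpList"><p>No link here</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
results = []
for item in soup.find_all("div", class_="Q-tpList"):
    link_tag = item.find("a")
    if link_tag is None:
        # Entry without a link: skip it instead of failing.
        continue
    # get_text joins all nested text pieces; .string only works for a single child.
    title = link_tag.get_text(" ", strip=True)
    results.append((link_tag.string, title, link_tag.get("href")))

for raw_string, title, href in results:
    print(repr(raw_string), title, href)
```

Checking for a missing tag and using `get_text()` keeps one oddly structured entry from stopping the whole run.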

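Step 3: Optimizing the code

A crawler usually sends many requests, so the code needs to stay simple and stable. Two improvements help:

Set a timeout on calls made with requests
Use exception handling to deal with network and parsing errors

The optimized Douban movie chart crawler looks like this: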

```python
import requests
from bs4 import BeautifulSoup

def get_content(url, headers):
    """Return the page HTML, or None if the request fails."""
    try:
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code == 200:
            return response.text
        return None
    except requests.RequestException:
        # Covers connection errors, timeouts, and other request failures
        return None

def parse_content(content):
    """Print the title and rating of every film found in the HTML."""
    soup = BeautifulSoup(content, "html.parser")
    items = soup.find_all("div", class_="item")
    for item in items:
        try:
            title = item.find("span", class_="title").string
            rating = item.find("span", class_="rating_num").string
            print(title + " " + rating)
        except (AttributeError, TypeError):
            # Skip entries that are missing a title or rating
            continue

url = "https://movie.douban.com/top250"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}

content = get_content(url, headers)
if content is not None:
    parse_content(content)
else:
    print("Request failed")
```

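Compared with the original, this version adds exception handling and sets a timeout on the requests call, which makes the crawler more stable: a slow or failed request no longer hangs or crashes the script.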
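Because a crawler usually issues many requests, it can also help to retry transient failures and pause between attempts. The sketch below is one possible way to do that and is not part of the original script: `fetch_with_retry` is a hypothetical helper that wraps any fetch function returning text or `None` (such as `get_content` above), and the retry count and delays are arbitrary choices.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times, doubling the wait after each failure.

    `fetch` must return the page text, or None on failure.
    `sleep` is injectable so the waiting can be skipped in tests.
    """
    for attempt in range(retries):
        content = fetch(url)
        if content is not None:
            return content
        if attempt < retries - 1:
            # Wait 1x, 2x, 4x ... the base delay before the next attempt.
            sleep(delay * (2 ** attempt))
    return None

# Demonstration with a fake fetch function that fails twice, then succeeds.
calls = []

def flaky_fetch(url):
    calls.append(url)
    return "<html>ok</html>" if len(calls) >= 3 else None

waits = []  # records the requested waits instead of actually sleeping
result = fetch_with_retry(flaky_fetch, "http://www.example.com", sleep=waits.append)
print(result, waits)
```

With the script above, the call could look like `fetch_with_retry(lambda u: get_content(u, headers), url)`.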

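Summary

That covers the complete walkthrough for building a simple web crawler in Python. An effective crawler depends on understanding the page structure, choosing libraries that suit the task, and making the code robust.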
