A Detailed Guide to Fetching, Parsing, and Storing Data with a Python Web Scraper

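Preparation

Before writing a scraper, make sure the following two libraries are installed:

requests: sends HTTP requests and retrieves the response data
BeautifulSoup4: parses HTML/XML data

Both can be installed with pip: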

pip install requests
pip install beautifulsoup4

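Fetching Data

Before scraping with Python, decide which site to scrape and which data you need from it. The requests library then handles sending the request and retrieving the response.

The following example requests question data from a Zhihu API endpoint. The intent is to fetch questions under the python tag, although the parameters shown do not themselves name a tag: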

import requests

url = 'https://www.zhihu.com/api/v4/questions'
params = {'include': 'data[*].answer_count,created,updated,upvoted_followees,status', 'limit': '20', 'offset': '0',
          'sort_by': 'created'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
response = requests.get(url, params=params, headers=headers)

if response.status_code != 200:
    raise ValueError('Failed to get information from Zhihu')

data = response.json()
print(data)

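In this example, requests sends a GET request to a Zhihu API endpoint. The query parameters are passed through the params variable, and the headers variable sets the request headers (here, a browser-like User-Agent). If the status code is not 200, the code raises an error; otherwise response.json() decodes the JSON body into Python objects, which are then printed.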
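To see how the params dictionary ends up in the request, the following minimal sketch builds the request without sending it. It relies on requests.Request and its prepare() method; the shortened parameter set and the short User-Agent string are chosen here purely for illustration.

```python
import requests

# Illustrative values: the same endpoint as above, with a shortened parameter set.
url = 'https://www.zhihu.com/api/v4/questions'
params = {'limit': '20', 'offset': '0', 'sort_by': 'created'}
headers = {'User-Agent': 'Mozilla/5.0'}

# Build the request without sending it, so no network access is needed.
prepared = requests.Request('GET', url, params=params, headers=headers).prepare()

# The params dictionary is URL-encoded and appended as the query string.
print(prepared.url)
print(prepared.headers['User-Agent'])
```

When the request is actually sent, passing a timeout to requests.get (for example timeout=10) avoids waiting indefinitely on an unresponsive server, since requests applies no timeout by default.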

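Parsing Data

Once the response has been retrieved, the next step is to extract the useful information from it. BeautifulSoup4 is a Python library for parsing HTML and XML. It applies to HTML pages: a JSON body decoded with response.json() is already structured data and does not need an HTML parser. The following example shows how to parse question data from a Zhihu HTML page: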

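from bs4 import BeautifulSoup

# Assumes `response` holds an HTML page containing Feed-item blocks
# (a JSON API response would yield no matches here).
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('div', class_='Feed-item')

for item in items:
    title_tag = item.find('div', class_='ContentItem-title')
    content_tag = item.find('div', class_='RichContent-inner')
    # find() returns None when nothing matches, so skip incomplete items
    if title_tag is None or content_tag is None:
        continue
    print(title_tag.get_text(strip=True))
    print(content_tag.get_text(strip=True))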

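In this example, the response body is passed to the BeautifulSoup constructor. find_all extracts every element that holds one question's information, and find then pulls the title and content out of each element. Because find returns None when nothing matches, accessing an attribute on a missing element raises AttributeError.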
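The same find_all and find pattern can be tried offline on a small HTML string. The sketch below is only an illustration: the HTML fragment is made up to mimic the class names used above, and extract_items is a hypothetical helper name, not something from Zhihu or BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment that mimics the structure the example expects.
html = """
<div class="Feed-item">
  <div class="ContentItem-title"><a href="/question/1">First question</a></div>
  <div class="RichContent-inner"> First answer text </div>
</div>
<div class="Feed-item">
  <div class="ContentItem-title"><a href="/question/2">Second question</a></div>
</div>
"""

def extract_items(html_text):
    """Return (title, content) pairs, skipping items that lack either part."""
    soup = BeautifulSoup(html_text, 'html.parser')
    results = []
    for item in soup.find_all('div', class_='Feed-item'):
        title_tag = item.find('div', class_='ContentItem-title')
        content_tag = item.find('div', class_='RichContent-inner')
        if title_tag is None or content_tag is None:
            continue  # find() returned None: this item is incomplete
        results.append((title_tag.get_text(strip=True),
                        content_tag.get_text(strip=True)))
    return results

pairs = extract_items(html)
print(pairs)
```

The second Feed-item has no RichContent-inner block, so it is skipped rather than causing an AttributeError.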

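Storing Data

Storing the scraped data makes it easy to browse and analyze later. Storage can be as simple as writing to a CSV file, or more involved, such as saving to a relational database.

The following example shows how to write the scraped Zhihu question information to a CSV file: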

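import csv

# Assumes `items` is the list of Feed-item tags found in the parsing step.
with open('zhihu_python_questions.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['title', 'content'])  # header row
    for item in items:
        title_tag = item.find('div', class_='ContentItem-title')
        content_tag = item.find('div', class_='RichContent-inner')
        if title_tag is None or content_tag is None:
            continue  # skip items missing a title or content
        writer.writerow([title_tag.get_text(strip=True), content_tag.get_text(strip=True)])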

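In this example, Python's built-in csv module writes the question information to a CSV file, one row at a time via writerow. Opening the file with newline='' lets the csv module control line endings itself.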
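Scraped text often contains commas, quotes, and line breaks, so it can help to confirm that the csv module round-trips such values. The following sketch uses made-up rows and an in-memory buffer instead of a file; csv.DictWriter is used here as an alternative to csv.writer so that each row is keyed by column name.

```python
import csv
import io

# Made-up rows standing in for scraped results.
rows = [
    {'title': 'First question', 'content': 'Text with a comma, and "quotes"'},
    {'title': 'Second question', 'content': 'Line one\nline two'},
]

# Write to an in-memory buffer; a real file opened with newline='' behaves the same way.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['title', 'content'])
writer.writeheader()
writer.writerows(rows)

# Read the data back to check that commas, quotes, and newlines survive.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored == rows)
```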

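Examples

Here are two examples that together cover the three operations: fetching, parsing, and storing.

Example 1 - Fetching and Parsing Web Page Data

Suppose we want to collect the information for every field listed on an engineering company's home page. The following code does this: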

import requests
from bs4 import BeautifulSoup

url = 'http://www.example.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Find every h3 tag with the class 'field-title'
h3_items = soup.find_all('h3', class_='field-title')

for item in h3_items:
    print(item.text.strip())

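In this example, requests sends a GET request to fetch the company's home page (http://www.example.com is a placeholder address), and BeautifulSoup parses the result. find_all locates every h3 tag with the class field-title, and the loop prints the text of each tag.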
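The same lookup can also be written with a CSS selector through BeautifulSoup's select method. The sketch below is an offline illustration: the HTML fragment and the extract_texts helper are hypothetical, and only the field-title class name comes from the example above.

```python
from bs4 import BeautifulSoup

def extract_texts(html_text, selector):
    """Return the stripped text of every element matching a CSS selector."""
    soup = BeautifulSoup(html_text, 'html.parser')
    return [tag.get_text(strip=True) for tag in soup.select(selector)]

# Hypothetical page fragment; the class name matches the example above.
sample_html = """
<h3 class="field-title"> Civil Engineering </h3>
<h3 class="other">Not a field</h3>
<h3 class="field-title">Structural Design</h3>
"""

fields = extract_texts(sample_html, 'h3.field-title')
print(fields)
```

A selector such as 'h3.field-title' expresses the tag and class in one string, which can be convenient when the target elements are nested more deeply.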

Example 2 - Fetching and Storing Web Data

Suppose we want to scrape the questions under the Python tag on Zhihu and save them to a CSV file. The following code does this:

import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/api/v4/questions'
params = {'include': 'data[*].answer_count,created,updated,upvoted_followees,status', 'limit': '20', 'offset': '0',
          'sort_by': 'created'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
response = requests.get(url, params=params, headers=headers)

if response.status_code != 200:
    raise ValueError('Failed to get information from Zhihu')

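# Assumes the response body is HTML containing Feed-item blocks; if the
# endpoint returns JSON instead, read it with response.json().
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('div', class_='Feed-item')

with open('zhihu_python_questions.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['title', 'content'])  # header row
    for item in items:
        title_tag = item.find('div', class_='ContentItem-title')
        content_tag = item.find('div', class_='RichContent-inner')
        if title_tag is None or content_tag is None:
            continue  # skip items missing a title or content
        writer.writerow([title_tag.get_text(strip=True), content_tag.get_text(strip=True)])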

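In this example, requests again sends a GET request to Zhihu, BeautifulSoup parses the response, and Python's built-in csv module writes the question information to a CSV file. With limit set to 20, a single request presumably returns at most one page of results, so collecting every question would mean repeating the request with an increasing offset.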
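Collecting more than one page with limit and offset can be sketched as a loop that keeps requesting until a page comes back empty. Everything specific in the sketch below is an assumption made for illustration: the 'data' key, the 'title' and 'answer_count' fields, and the fake_fetch function, which stands in for a real call such as requests.get(...).json() so that the code runs offline.

```python
import csv
import io

def iter_pages(fetch_page, limit=20):
    """Yield items page by page until a page comes back empty."""
    offset = 0
    while True:
        payload = fetch_page(offset, limit)
        items = payload.get('data', [])  # assumed payload shape: {'data': [...]}
        if not items:
            break
        yield from items
        offset += limit

# Stand-in for a network call: serves 45 fake questions in slices.
FAKE_QUESTIONS = [{'title': f'Question {i}', 'answer_count': i % 3} for i in range(45)]

def fake_fetch(offset, limit):
    return {'data': FAKE_QUESTIONS[offset:offset + limit]}

# Write every collected item to an in-memory CSV buffer.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['title', 'answer_count'])
count = 0
for question in iter_pages(fake_fetch):
    writer.writerow([question['title'], question['answer_count']])
    count += 1

print(count)
```

Passing the fetch function in as an argument keeps the paging logic separate from the network code, so the loop can be checked without contacting the site.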

Unless otherwise noted, all articles on this site are original. If you repost, please credit the source: Python爬虫,获取,解析,存储详解 - Python技术站
