Tutorial: Fetching the Baidu Homepage with a Python Crawler

To fetch the contents of the Baidu homepage, you can write a small crawler in Python. You will need the following tools:

Python 3
The requests library
The BeautifulSoup library

Step 1: Install Python 3

Download and install the latest Python 3 release from the official website (https://www.python.org/downloads/).

Step 2: Install the requests library

Run the following command on the command line:

pip install requests

Step 3: Install the BeautifulSoup library

Run the following command on the command line:

pip install beautifulsoup4
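
Before moving on, it can help to confirm that both packages are visible to the interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name check_installed is hypothetical and not part of either library. Note that the beautifulsoup4 package is imported under the module name bs4.

```python
from importlib.util import find_spec


def check_installed(modules=("requests", "bs4")):
    """Return a dict mapping each module name to True if it can be found."""
    # find_spec returns None when a top-level module is not installed.
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, ok in check_installed().items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```

If either line reports "missing", the pip command probably ran against a different Python installation than the one executing the script.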

Step 4: Write the crawler code

Create a new file in a Python IDE, paste in the following code, and save it. The code is commented in detail.

import requests
from bs4 import BeautifulSoup

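# Set a browser-like User-Agent header so the request is less likely to be flagged as a bot
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

# Send the GET request (the timeout keeps the script from hanging indefinitely)
response = requests.get("https://www.baidu.com/", headers=headers, timeout=10)

# Parse the returned HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Print the text of the Baidu homepage's <title> tag
print(soup.title.string)

# Print the href attribute of every hyperlink on the Baidu homepage
for link in soup.find_all('a'):
    print(link.get('href'))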

Example 1: Get the contents of the Baidu homepage's title tag

import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

# Fetch the homepage
response = requests.get("https://www.baidu.com/", headers=headers, timeout=10)

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Print the text of the <title> tag
print(soup.title.string)

The code first sends a GET request to fetch the Baidu homepage, then parses the returned HTML with BeautifulSoup, and finally prints the contents of the title tag.
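
To see what the title-extraction step does without any network access, here is a rough, dependency-free sketch. It deliberately swaps BeautifulSoup for the standard library's html.parser module, so it is an illustration of the idea rather than the method used in this tutorial. The class TitleParser, the function extract_title, and the sample HTML are all made up for the demonstration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the stripped title text, or None if the page has no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.parts).strip()
    return title or None


# Hypothetical page used only for this demonstration.
sample_html = (
    "<html><head><title>Example Page</title></head>"
    "<body><a href='/news'>News</a></body></html>"
)
print(extract_title(sample_html))
```

Returning None for a page without a title mirrors a case worth guarding against in the BeautifulSoup version as well: soup.title is None when the document has no title tag, so soup.title.string would raise an AttributeError.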

Example 2: Get the href attribute of every hyperlink on the Baidu homepage

import requests
from bs4 import BeautifulSoup

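# Browser-like User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

# Fetch the homepage
response = requests.get("https://www.baidu.com/", headers=headers, timeout=10)

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Print the href of every <a> tag (None is printed for tags without an href)
for link in soup.find_all('a'):
    print(link.get('href'))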

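As before, the code sends a GET request for the Baidu homepage and parses the returned HTML. It then iterates over every a tag and prints the value of its href attribute.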
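
The href values printed by this loop are raw attribute values: some may be None (an a tag with no href), some may be relative or protocol-relative, and some may be javascript: pseudo-links. As an optional follow-up, here is a small sketch that cleans such a list using only the standard library. The function name absolute_links and the sample values are hypothetical, chosen to illustrate the cases rather than taken from the real page.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(hrefs, base_url="https://www.baidu.com/"):
    """Turn raw href values into a de-duplicated list of absolute http(s) URLs."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # skips None (no href attribute) and empty strings
            continue
        url = urljoin(base_url, href.strip())
        if urlparse(url).scheme not in ("http", "https"):
            continue  # drops javascript:, mailto:, and similar schemes
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Made-up href values covering the common cases.
sample = ["/more/", None, "//news.baidu.com", "javascript:;", "https://www.baidu.com/more/"]
print(absolute_links(sample))
```

In the crawler above, the same helper could be fed with [link.get('href') for link in soup.find_all('a')] to obtain a list of URLs that can be requested directly.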

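That is the complete walkthrough for fetching the Baidu homepage with a Python crawler: installing Python and the required libraries, writing the code, and two worked examples with explanations.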

