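How to Use BeautifulSoup for Python Web Scraping

Introduction

BeautifulSoup is a Python library for parsing HTML. It turns a complex HTML document into a simpler tree structure that a program can work with, for example to extract data or to search the document. It is one of the most commonly used tools for scraping web pages.

Installation

Before using BeautifulSoup, you need to install it. You can do this with pip: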

pip install beautifulsoup4
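
To confirm that the installation worked, a quick check along these lines may help. It is only a sketch, and the version it prints depends on your environment.

```python
# The package is installed as "beautifulsoup4" but imported as "bs4".
import bs4

# Printing the version confirms that the import succeeded.
print(bs4.__version__)
```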

Basic Usage

Using BeautifulSoup involves three steps:

Fetch the HTML document
Construct a BeautifulSoup object
Work with the BeautifulSoup object
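
The three steps can be sketched end to end as follows. To keep the example runnable without a network connection, a hard-coded HTML string (an assumption made for illustration) stands in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Step 1: obtain the HTML. A hard-coded string stands in for a downloaded page.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Step 2: build the BeautifulSoup object with Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

# Step 3: work with the parsed tree.
page_title = soup.title.string
first_paragraph = soup.find("p").get_text()
print(page_title, first_paragraph)
```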

Fetching the HTML document

You can fetch the HTML document with urllib from Python's standard library. For example, to download a web page:

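from urllib.request import urlopen

# Read the response body once and keep the bytes; a response can only be read a single time.
response = urlopen("http://www.example.com")
html = response.read()
print(html)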
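
For anything beyond a quick experiment, it may be worth wrapping the download in a small helper that sets a timeout, decodes the bytes, and handles failures. The sketch below is one possible shape, not the only one. The helper name `fetch_html`, the 10-second default timeout, and the UTF-8 fallback are all assumptions made for illustration.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10.0):
    """Return the decoded body at `url`, or None if the request fails.

    Hypothetical helper for illustration; it is not part of urllib or bs4.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            # Use the charset from the Content-Type header, else assume UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, and socket timeouts;
        # ValueError covers malformed URLs such as a missing scheme.
        return None
```

Calling `fetch_html("http://www.example.com")` would then return a string that can be passed straight to BeautifulSoup, or None if the page could not be downloaded.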

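Constructing a BeautifulSoup object

Constructing a BeautifulSoup object is simple: pass the HTML document you fetched as the first argument and name the parser to use.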

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, features="html.parser")
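
The constructor accepts markup as text or as raw bytes, so the result of `response.read()` can be passed in directly. The sketch below uses a hard-coded snippet (an assumption, so that it runs offline) to show both forms. Other parsers, such as lxml or html5lib, can be named instead of "html.parser" if they are installed.

```python
from bs4 import BeautifulSoup

# The same markup as text and as raw bytes (the form returned by response.read()).
markup_text = "<html><head><title>Example</title></head><body></body></html>"
markup_bytes = markup_text.encode("utf-8")

# "html.parser" ships with Python, so no extra install is needed.
soup_from_text = BeautifulSoup(markup_text, "html.parser")
soup_from_bytes = BeautifulSoup(markup_bytes, "html.parser")

print(soup_from_text.title.string)
print(soup_from_bytes.title.string)
```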

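Working with the BeautifulSoup object

Once you have a BeautifulSoup object, you can search the document, extract content, and so on. There are several ways to search; the two most common are:

Find the first element that matches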

soup.find('tag', attrs={'attr': 'value'})

Find all elements that match

soup.find_all('tag', attrs={'attr': 'value'})

Here, 'tag' is the name of an HTML tag, and attrs is a dictionary that maps attribute names to the values to match. For example:

soup.find('h1', attrs={'class': 'header'})
soup.find_all('a', attrs={'href': 'http://www.example.com'})
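
To see what these calls return, the following sketch runs them against a small hard-coded snippet. The markup is made up for illustration. Note that find() returns None when nothing matches, while find_all() returns an empty result.

```python
from bs4 import BeautifulSoup

html = """
<h1 class="header">Site title</h1>
<a href="http://www.example.com">Home</a>
<a href="http://www.example.com/about">About</a>
<a href="http://www.example.com">Home again</a>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag, or None when nothing matches.
heading = soup.find("h1", attrs={"class": "header"})
missing = soup.find("h2", attrs={"class": "header"})

# class_ is a keyword shortcut for filtering on the class attribute.
same_heading = soup.find("h1", class_="header")

# find_all() returns a list-like ResultSet, which is empty when nothing matches.
home_links = soup.find_all("a", attrs={"href": "http://www.example.com"})
home_link_texts = [link.get_text() for link in home_links]

print(heading.get_text())
print(missing)
print(home_link_texts)
```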

You can also reach parts of the document through attributes of the BeautifulSoup object. For example:

soup.title.string
soup.a['href']
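
A few details of attribute access are easy to trip over, so here is a small sketch on made-up markup. It shows that .string is None when a tag has more than one child, and that a missing attribute raises KeyError with square brackets but returns None with get().

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo</title></head>
<body>
<a href="/first">First <b>link</b></a>
<a>No href here</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.title and soup.a are shortcuts for the first <title> and first <a> tag.
title_text = soup.title.string
first_href = soup.a["href"]

# .string is None when a tag has more than one child; get_text() joins all text.
first_link_string = soup.a.string
first_link_text = soup.a.get_text()

# tag["attr"] raises KeyError if the attribute is missing; tag.get() returns None.
second_link = soup.find_all("a")[1]
second_href = second_link.get("href")

print(title_text, first_href, first_link_string, first_link_text, second_href)
```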

Examples

Example 1

Let's scrape today's weather from China Weather (http://www.weather.com.cn/).

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.weather.com.cn/weather/101010100.shtml")
soup = BeautifulSoup(html, features="html.parser")

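# find() returns the first matching tag; .string is the text inside that tag.
today_weather = soup.find('p', attrs={'class': 'wea'}).string
today_temp = soup.find('p', attrs={'class': 'tem'}).find('span').string

print("Today's weather is {}, temperature {}".format(today_weather, today_temp))

Sample output (晴 means "sunny"; the actual values depend on the day):

Today's weather is 晴, temperature 2℃/12℃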
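
Because find() returns None when a tag is missing, the script above raises AttributeError if the page layout changes. A more defensive variant might separate parsing from downloading and check for missing tags, as sketched below. The function name `parse_today_weather` and the sample markup are hypothetical; the markup only mirrors the class names used above, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names used in the example above;
# the real page may be structured differently.
SAMPLE_HTML = """
<ul>
  <li>
    <p class="wea">Sunny</p>
    <p class="tem"><span>12℃</span>/<i>2℃</i></p>
  </li>
</ul>
"""


def parse_today_weather(html):
    """Return (weather, temperature), or None if the expected tags are missing."""
    soup = BeautifulSoup(html, "html.parser")
    weather_tag = soup.find("p", attrs={"class": "wea"})
    temp_tag = soup.find("p", attrs={"class": "tem"})
    if weather_tag is None or temp_tag is None:
        return None
    # get_text() collects text from all child tags; strip=True trims whitespace.
    return weather_tag.get_text(strip=True), temp_tag.get_text(strip=True)


print(parse_today_weather(SAMPLE_HTML))
print(parse_today_weather("<p>no weather here</p>"))
```

Keeping the parser as a function that takes a string also makes it possible to test against a saved copy of the page instead of hitting the site each time.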

Example 2

Next, let's scrape the text of the popular jokes on Qiushibaike.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.qiushibaike.com/")
soup = BeautifulSoup(html, features="html.parser")

items = soup.find_all('div', attrs={'class': 'article'})

for item in items:
    joke = item.find('div', attrs={'class': 'content'}).find('span').get_text()
    print(joke)

Output (the text is in Chinese, as served by the site):

今晚在很高的地方看星星，一个警察前来：“你在这等什么？”
“等朋友。”
“朋友什么时候来？”
“当你妈生我的时候。”
...
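
The loop above assumes that every matched item contains a content block with a span inside it. If some items (advertisements, for instance) lack that structure, the chained find() calls fail. The sketch below shows one way to skip such items. The function name `extract_jokes` and the sample markup are hypothetical and only mirror the class names used above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the class names used in the example above.
SAMPLE_HTML = """
<div class="article"><div class="content"><span>First joke.</span></div></div>
<div class="article"><div class="ad">No content block here.</div></div>
<div class="article"><div class="content"><span>  Second joke. </span></div></div>
"""


def extract_jokes(html):
    """Return the text of every joke, skipping items without the expected tags."""
    soup = BeautifulSoup(html, "html.parser")
    jokes = []
    for item in soup.find_all("div", attrs={"class": "article"}):
        content = item.find("div", attrs={"class": "content"})
        span = content.find("span") if content is not None else None
        if span is None:
            continue  # Skip ads or items with an unexpected structure.
        jokes.append(span.get_text(strip=True))
    return jokes


print(extract_jokes(SAMPLE_HTML))
```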

Unless otherwise noted, articles on this site are original. If you repost this article, please credit the source: python爬虫beautiful soup的使用方式 - Python技术站
