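Installing, Configuring, and Using the Python 3 Parsing Library BeautifulSoup4
What is BeautifulSoup4
BeautifulSoup4 is a parsing library for HTML and XML. It converts a complex HTML or XML document into a tree and provides a simple, Pythonic API for traversing it. Besides parsing, it supports searching for tags, modifying tag attributes, and adding new tags, and it can handle HTML5 documents (most faithfully through the optional html5lib parser). It is one of the most commonly used libraries in web scraping.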
Installing BeautifulSoup4
Install BeautifulSoup4 with pip. Open a command line and run:
pip install beautifulsoup4
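If you get a permissions error, run the command prompt as administrator, or install for the current user only with pip install --user beautifulsoup4.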
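To check that the installation worked, a quick import test along the following lines can help. This is a minimal sketch: the sample markup is made up for illustration. Note that the package is installed as beautifulsoup4 but imported as bs4.

```python
# Confirm that the package installed by pip can be imported.
# The distribution is named "beautifulsoup4"; the import name is "bs4".
import bs4
from bs4 import BeautifulSoup

print(bs4.__version__)  # prints the installed version string

# Parse a trivial, made-up fragment with Python's built-in parser.
soup = BeautifulSoup("<p>Hello</p>", "html.parser")
print(soup.p.string)  # Hello
```

If the import fails with ModuleNotFoundError, pip most likely installed the package into a different Python environment than the one running the script.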
Basic usage
First, import the BeautifulSoup class:
from bs4 import BeautifulSoup
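Parsing a static HTML page
To parse a static HTML page, pass the markup to the BeautifulSoup constructor: soup = BeautifulSoup(html_doc, 'html.parser'), where html_doc is the HTML to parse and 'html.parser' selects the parser built into Python's standard library.
For example: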
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
Output:
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
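prettify() returns the parsed document as a string with one tag or text node per line, which makes the structure of the page much easier to read.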
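Once the document is parsed, the tree can also be walked directly through attribute access. The sketch below uses a shortened version of the sample document (the shortening is an assumption made to keep the example small) and shows how a tag's name, text, parent, and direct children can be reached.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample document used above.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# soup.<tagname> returns the first tag with that name.
title_tag = soup.title
print(title_tag.name)         # title
print(title_tag.string)       # The Dormouse's story
print(title_tag.parent.name)  # head

# Attribute access can be chained to move down the tree.
print(soup.p.b.string)        # The Dormouse's story

# recursive=False restricts the search to direct children of <body>.
paragraph_classes = [p['class'] for p in soup.body.find_all('p', recursive=False)]
print(paragraph_classes)      # [['title'], ['story']]
```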
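Parsing a page fetched from the web
To parse a page that lives on a web server rather than in a local string, first download its HTML with a third-party library such as Requests. Keep in mind that Requests returns only the HTML the server sends; content that JavaScript renders in the browser afterwards is not included. For example: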
import requests
from bs4 import BeautifulSoup
url = 'https://www.zhihu.com/explore'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())
Output:
<!DOCTYPE doctype html>
<html data-theme="light" lang="zh">
<head>
<title>
发现 - 知乎
</title>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="default" name="apple-mobile-web-app-status-bar-style"/>
<meta charset="utf-8"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
...
As shown, once the page's HTML has been fetched, it can be parsed with BeautifulSoup just like a local string.
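In practice the download step can fail or hang, so it is usually wrapped with a timeout and a status check. The following is a sketch only: fetch_soup is a hypothetical helper name, and the User-Agent value and the 10-second default timeout are assumptions, not requirements of either library.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download url and return the parsed document (hypothetical helper)."""
    # Assumption: some sites reject the default Requests User-Agent,
    # so a browser-like value is sent instead.
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; bs4-tutorial)'}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    # Passing bytes (response.content) lets Beautiful Soup detect the encoding itself.
    return BeautifulSoup(response.content, 'html.parser')
```

It could be called as soup = fetch_soup('https://www.zhihu.com/explore'), after which soup.title.string gives the page title if the request succeeds.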
Common BeautifulSoup4 methods
find and find_all
find() searches the document tree and returns the first element that matches, or None if nothing matches. For example:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find('a'))
Output:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
find_all() returns a list of every element in the document that matches. For example:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
Output:
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
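Both methods also accept filters beyond the tag name. The sketch below, using the same sample document, narrows a search by CSS class, by id, and by a result limit, and shows what find() returns when nothing matches.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# class_ (with a trailing underscore) filters by CSS class,
# because "class" is a reserved word in Python.
story_paragraphs = soup.find_all('p', class_='story')
print(len(story_paragraphs))  # 2

# limit stops the search after the given number of matches.
first_two = soup.find_all('a', limit=2)
print([a['id'] for a in first_two])  # ['link1', 'link2']

# Keyword arguments such as id match against tag attributes.
third_link = soup.find(id='link3')
print(third_link.string)  # Tillie

# find() returns None when nothing matches; find_all() returns an empty list.
print(soup.find('table'))      # None
print(soup.find_all('table'))  # []
```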
select
select() finds elements with a CSS selector and returns them as a list. For example:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.select('title'))
Output:
[<title>The Dormouse's story</title>]
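A bare tag name is the simplest selector. The sketch below tries a few others on a shortened copy of the sample document (shortened here for brevity): a child combinator, an id selector, and an attribute-suffix selector. select_one() is the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# "p.story > a": <a> tags that are direct children of <p class="story">.
child_links = soup.select('p.story > a')
print([a['id'] for a in child_links])  # ['link1', 'link2', 'link3']

# select_one() returns the first match (or None) instead of a list.
second_link = soup.select_one('a#link2')
print(second_link.get_text())  # Lacie

# [href$="tillie"]: the href attribute ends with "tillie".
tillie_links = soup.select('a[href$="tillie"]')
print([a['id'] for a in tillie_links])  # ['link3']

print(soup.select_one('table'))  # None
```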
Getting tag attributes
A tag object can be indexed like a dictionary: tag['attribute'] returns the value of that attribute. For example:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.a
print(tag['href'])
Output:
http://example.com/elsie
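Indexing with square brackets raises KeyError when the attribute is missing, so get() is often the safer choice. The sketch below, on a single made-up link taken from the sample document, also shows the attrs dictionary and the fact that class is returned as a list.

```python
from bs4 import BeautifulSoup

html_doc = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.a

# attrs holds all attributes of the tag as a dictionary.
print(sorted(tag.attrs))  # ['class', 'href', 'id']

# class can hold several values, so it is returned as a list.
print(tag['class'])  # ['sister']

# get() returns None (or a supplied default) for a missing attribute.
print(tag.get('title'))              # None
print(tag.get('title', 'no title'))  # no title

# Square-bracket access raises KeyError for a missing attribute.
try:
    tag['title']
except KeyError:
    missing = True
    print('title attribute is not set')
```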
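Modifying tag attributes and strings
Assigning to tag['attribute'] changes an attribute, and assigning to tag.string replaces the tag's text. For example: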
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
soup.a['href'] = 'http://new-link.com'
soup.a.string = 'New Link'
print(soup.prettify())
Output:
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://new-link.com" id="link1">
New Link
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
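Adding a new tag and removing an attribute work in a similar way. The sketch below uses new_tag() and append() on a small made-up fragment; the URL http://example.com/new is a placeholder chosen for illustration.

```python
from bs4 import BeautifulSoup

html_doc = '<p class="story">Links: <a href="http://example.com/elsie" id="link1">Elsie</a></p>'
soup = BeautifulSoup(html_doc, 'html.parser')

# new_tag() creates a tag that is not yet attached to the tree.
new_link = soup.new_tag('a', href='http://example.com/new')  # placeholder URL
new_link.string = 'New'

# append() adds the new tag as the last child of <p>.
soup.p.append(new_link)

# del removes an attribute from an existing tag.
del soup.a['id']

links = soup.p.find_all('a')
print(len(links))               # 2
print(links[1]['href'])         # http://example.com/new
print('id' in soup.a.attrs)     # False
```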
That covers installing BeautifulSoup4 and its basic usage.