Using Beautiful Soup for Python Web Scraping

April 11, 2023, 1:56 AM • Web Scraping

I. Introduction to Beautiful Soup

Put simply, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:

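Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code.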

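Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to specify the original encoding.
Beautiful Soup sits on top of Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.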
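To make the encoding behavior concrete, here is a minimal sketch. The sample bytes and the choice of Latin-1 are assumptions made for illustration; `from_encoding` and `original_encoding` are the documented hooks for stating and inspecting the source encoding. The standard-library `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Illustrative input: bytes encoded as Latin-1 with no encoding declaration.
raw = "<p>caf\u00e9</p>".encode("latin-1")

# Tell Beautiful Soup the original encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

# Inside the tree, everything is Unicode text.
text = soup.p.string
print(text)

# The encoding Beautiful Soup ended up using for the input.
print(soup.original_encoding)

# Output is encoded as UTF-8 bytes.
utf8_bytes = soup.encode("utf-8")
print(utf8_bytes)
```

If `from_encoding` is omitted, Beautiful Soup tries to detect the encoding on its own, which may or may not succeed for short or ambiguous inputs.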

II. Downloading and Installing Beautiful Soup

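# Install Beautiful Soup
pip install beautifulsoup4

# Install a parser
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. One of them is lxml. Depending on your operating system, you can install lxml with one of the following commands:

$ apt-get install python3-lxml

$ easy_install lxml

$ pip install lxml

Another option is html5lib, a pure-Python parser that parses HTML the same way a browser does. You can install html5lib with one of the following commands:

$ apt-get install python3-html5lib

$ easy_install html5lib

$ pip install html5lib

On current systems pip is the recommended route; easy_install is deprecated and is listed only for older setups.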
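Since the available parsers depend on what is installed, one way to cope is to try them in order of preference. The sketch below is an illustration, not part of Beautiful Soup: `make_soup` and its `preferred` tuple are hypothetical names. It relies on `bs4.FeatureNotFound`, the exception Beautiful Soup raises when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first installed parser.

    Returns a (soup, parser_name) pair.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately sloppy markup: neither tag is closed.
soup, used = make_soup("<p>Hello<b>world")
print(used)
combined = soup.p.get_text()
print(combined)
```

Because `html.parser` ships with Python, the loop always has a fallback. Keep in mind that different parsers can build slightly different trees from invalid HTML, so pinning one parser is safer for reproducible results.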

III. Basic Usage of Beautiful Soup

'''
pip3 install beautifulsoup4  # install bs4
pip3 install lxml  # install the lxml parser
'''
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="sister"><b>$37</b></p>
<p class="story" >Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" >Elsie</a>,
<a href="http://example.com/lacie" class="sister" >Lacie</a> and
<a href="http://example.com/tillie" class="sister" >Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Import BeautifulSoup from bs4
from bs4 import BeautifulSoup

# Instantiate BeautifulSoup to get a soup object
# Argument 1: the markup to parse
# Argument 2: the parser to use (html.parser, lxml, ...)
soup = BeautifulSoup(html_doc, 'lxml')

print(soup)
print('*' * 100)
print(type(soup))
print('*' * 100)
# Pretty-print the document
html = soup.prettify()
print(html)
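Beyond printing the whole tree, a typical first step is to pull out specific pieces. The following sketch uses a trimmed copy of the sample document (trimmed here for brevity, which is an assumption of this example) and the standard-library `html.parser`, and collects the page title plus the text and URL of every link.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the sample document, used only for this illustration.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister">Elsie</a>,
<a href="http://example.com/lacie" class="sister">Lacie</a> and
<a href="http://example.com/tillie" class="sister">Tillie</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# The text inside <title>.
title = soup.title.string
print(title)

# (link text, href) for every <a> tag; .get() returns None if href is absent.
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
for text, href in links:
    print(text, href)
```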

IV. Traversing the Document Tree with Beautiful Soup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="sister"><b>$37</b></p>
<p class="story" >Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" >Elsie</a>,
<a href="http://example.com/lacie" class="sister" >Lacie</a> and
<a href="http://example.com/tillie" class="sister" >Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

'''
Traversing the document tree:
    1. Direct access by tag name
    2. Getting a tag's name
    3. Getting a tag's attributes
    4. Getting a tag's text
    5. Nested selection
    6. Children and descendants
    7. Parent and ancestors
    8. Siblings
'''

# 1. Direct access by tag name
print(soup.p)  # the first <p> tag
print(soup.a)  # the first <a> tag

# 2. Getting a tag's name
print(soup.head.name)  # the name of the <head> tag: 'head'

# 3. Getting a tag's attributes
print(soup.a.attrs)  # all attributes of the first <a> tag, as a dict
print(soup.a.attrs['href'])  # the href attribute of the first <a> tag

# 4. Getting a tag's text
print(soup.p.text)  # $37

# 5. Nested selection
print(soup.html.head)

# 6. Children and descendants
print(soup.body.children)  # direct children of <body>; returns an iterator
print(list(soup.body.children))  # convert to a list to see the nodes

print(soup.body.descendants)  # all descendants; returns a generator
print(list(soup.body.descendants))

# 7. Parent and ancestors
print(soup.p.parent)  # the parent of the first <p> tag
# .parents returns a generator
print(soup.p.parents)  # all ancestors of the first <p> tag
print(list(soup.p.parents))

# 8. Siblings
# The next sibling (here the whitespace text node '\n', not the next <p>)
print(soup.p.next_sibling)
# All following siblings; returns a generator
print(soup.p.next_siblings)
print(list(soup.p.next_siblings))

# The previous sibling of the first <a> tag (here the text that precedes it)
print(soup.a.previous_sibling)
# All preceding siblings of the first <a> tag; returns a generator
print(soup.a.previous_siblings)
print(list(soup.a.previous_siblings))
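One detail that often surprises newcomers is that whitespace between tags is itself a node, so `.next_sibling` and `.children` return text nodes as well as tags. The sketch below, on a small made-up snippet parsed with `html.parser`, contrasts the raw sibling with `find_next_sibling()` and shows two ways to skip whitespace.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Small illustrative snippet; the newlines between the <p> tags matter.
html = "<div><p>one</p>\n<p>two</p>\n<p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
first = soup.p

# The raw next sibling is the newline text node, not a tag.
raw_next = first.next_sibling
print(repr(raw_next))

# find_next_sibling() skips text nodes and returns the next matching tag.
tag_next = first.find_next_sibling("p")
print(tag_next)

# Keep only Tag objects when iterating over children.
child_tags = [child.name for child in soup.div.children if isinstance(child, Tag)]
print(child_tags)

# Ancestors run from the direct parent up to the BeautifulSoup object itself.
ancestor_names = [parent.name for parent in first.parents]
print(ancestor_names)

# stripped_strings yields text with surrounding whitespace removed,
# skipping strings that are whitespace only.
texts = list(soup.div.stripped_strings)
print(texts)
```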

V. Searching the Document Tree with Beautiful Soup

# The id attributes on the <a> tags are needed by the id-based searches below
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="sister"><b>$37</b></p>
<p class="story" >Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
'''
Searching the document tree:
    find()      returns the first match
    find_all()  returns all matches

Searching by tag and by attribute:
    Tags:
        name    match on tag name
        attrs   match on attributes
        string  match on text (called text before Beautiful Soup 4.4.0)

        - String filter
            exact match against the whole string

        - Regular-expression filter
            match with a compiled re pattern

        - List filter
            match any item in the list

        - Boolean filter
            True matches anything

        - Function filter
            for custom conditions, such as attributes that must or must not be present

    Attributes:
        - class_
        - id
'''

import re

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

# String filter
# name
p_tag = soup.find(name='p')
print(p_tag)  # the first tag named p
# All tags named p
tag_s1 = soup.find_all(name='p')
print(tag_s1)

# attrs
# The first node whose class is sister
p = soup.find(attrs={"class": "sister"})
print(p)
# All nodes whose class is sister
tag_s2 = soup.find_all(attrs={"class": "sister"})
print(tag_s2)

# string (the older name text= still works but is deprecated)
text = soup.find(string="$37")
print(text)

# Combining filters:
# find an <a> tag whose id is link2 and whose text is Lacie
a_tag = soup.find(name="a", attrs={"id": "link2"}, string="Lacie")
print(a_tag)

# Regular-expression filter
# name: matches any tag whose name contains "p"
p_tag = soup.find(name=re.compile('p'))
print(p_tag)

# List filter
# name: matches tags named p or a, or whose name matches the pattern html
tags = soup.find_all(name=['p', 'a', re.compile('html')])
print(tags)

# Boolean filter
# True matches any value: find a <p> tag that has an id attribute
p = soup.find(name='p', attrs={"id": True})
print(p)  # None here, because no <p> in html_doc has an id

# Function filter
# Match <a> tags that have both an id and a class attribute
def have_id_class(tag):
    return tag.name == 'a' and tag.has_attr('id') and tag.has_attr('class')

tag = soup.find(name=have_id_class)
print(tag)
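The `class_` and `id` keyword arguments listed in the summary above are not exercised in the listing, so here is a short sketch of them, along with `limit` and the `None` result of a failed `find()`. The snippet is a trimmed copy of the sample links and uses `html.parser`; both are choices made for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed sample markup, for illustration only.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ (with a trailing underscore, since class is a Python keyword).
sisters = soup.find_all("a", class_="sister")
print(len(sisters))

# id as a keyword argument.
lacie = soup.find(id="link2")
print(lacie.get_text())

# limit stops the search after the given number of matches.
first_two = soup.find_all("a", limit=2)
print([a["id"] for a in first_two])

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("a", id="link9")
if missing is None:
    print("no such link")
```

Checking for `None` before touching attributes avoids the `AttributeError` that otherwise appears when a page's structure differs from what the scraper expects.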

Unless otherwise noted, articles on this site are original. If you repost, please credit the source: python爬虫之beautifulsoup的使用 - Python技术站
