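BeautifulSoup is a module that takes an HTML or XML string and parses it into a structured tree. You can then use the methods it provides to quickly locate specific elements, which makes searching HTML or XML documents simple.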
Simple example:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
<div class="title">
<b>The Dormouse's story总共</b>
<h1>f</h1>
</div>
<div class="story">Once upon a time there were three little sisters; and their names were
<a class="sister0" >Els<span>f</span>ie</a>,
<a href="http://example.com/lacie" class="sister" >Lacie</a> and
<a href="http://example.com/tillie" class="sister" >Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""

# features="lxml" requires the lxml package to be installed
soup = BeautifulSoup(html_doc, features="lxml")
# Find the first <a> tag (returns a single Tag, or None if there is no match)
tag1 = soup.find(name='a')
# Find all <a> tags (returns a list-like ResultSet)
tag2 = soup.find_all(name='a')
# Find the tag with id=link2 using a CSS selector; select() returns a list,
# which is empty here because html_doc contains no element with that id
tag3 = soup.select('#link2')
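Once an element has been found, the usual next step is to read its text or attributes. The following is a minimal sketch of that step, not part of the original example. It assumes a trimmed-down, hypothetical document (`sample_html`) in which an `id="link2"` attribute has been added for illustration, and it uses Python's built-in `html.parser` instead of `lxml` so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened document; id="link2" is added here for illustration only.
sample_html = """
<div class="story">
<a class="sister0">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister">Tillie</a>
</div>
"""

# html.parser ships with Python, so no third-party parser is required.
soup = BeautifulSoup(sample_html, features="html.parser")

first_link = soup.find(name="a")      # a single Tag, or None if nothing matches
all_links = soup.find_all(name="a")   # a list-like ResultSet of Tags
by_id = soup.select("#link2")         # CSS selector; always returns a list

# Text content of the first <a> tag
first_text = first_link.get_text()

# .get() returns None when the attribute is missing, instead of raising KeyError
hrefs = [a.get("href") for a in all_links]

# class_ matches whole class names, so "sister0" is not matched by "sister"
sister_names = [a.get_text() for a in soup.find_all("a", class_="sister")]

# Guard against an empty result before indexing into it
link2_href = by_id[0]["href"] if by_id else None

print(first_text, hrefs, sister_names, link2_href)
```

Because `find` may return `None` and `select` may return an empty list, checking the result before using it (as done for `by_id`) is generally a safer habit when scraping pages whose structure can change.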
Unless otherwise noted, all articles on this site are original. If you repost this article, please credit the source: 爬虫必备—BeautifulSoup - Python技术站