Python Web Scraping: The Beautiful Soup Module from Installation to Detailed Usage, with Examples

Below is a complete guide to "Python Web Scraping: The BeautifulSoup Module from Installation to Detailed Usage, with Examples":

Step 1: Install the BeautifulSoup module

BeautifulSoup must be installed before it can be used. Install it with pip:

pip install beautifulsoup4

This command uses pip to install the beautifulsoup4 package, which provides the bs4 module.

Step 2: Import the module

Once BeautifulSoup is installed, we need to import it:

from bs4 import BeautifulSoup

Here the from statement imports the BeautifulSoup class from the bs4 module.
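A quick way to confirm that the installation and import both worked is to print the installed version. This is only a minimal check; the variable name version is chosen here for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# bs4 exposes its version as a string; printing it confirms the package is importable.
version = bs4.__version__
print(version)
```

If this raises ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.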

Step 3: Parse HTML with BeautifulSoup

After importing BeautifulSoup, we can use it to parse HTML:

html = '<html><head><title>Example</title></head><body><p>This is an example.</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

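In this example, we define a string named html that contains HTML markup. We then pass it to the BeautifulSoup class together with the name of the parser to use ('html.parser', the parser built into Python's standard library) and store the parsed result in a variable named soup.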
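To see what the parsed soup object contains, it can help to print the tree and pull out its text. The sketch below reuses the same html string; the variable names page_text, body_text and tag_names are introduced here for illustration only.

```python
from bs4 import BeautifulSoup

html = '<html><head><title>Example</title></head><body><p>This is an example.</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# prettify() returns the parsed tree as an indented string, one tag per line.
print(soup.prettify())

# get_text() concatenates all text in the document (or in one subtree).
page_text = soup.get_text()
body_text = soup.body.get_text()

# find_all(True) matches every tag, in document order.
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)
```

Printing the prettified tree is a convenient first step when a selector returns nothing, because it shows the structure BeautifulSoup actually built.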

Step 4: Find elements with BeautifulSoup

Once the HTML is parsed, we can search it for elements:

html = '<html><head><title>Example</title></head><body><p>This is an example.</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')
title = soup.title
print(title.text)

In this example, the soup.title attribute returns the first title tag in the document, and print() outputs its text.
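Attribute access such as soup.title only reaches the first tag with that name. For other lookups, find(), find_all() and select() are the usual tools. The sketch below runs on a small made-up HTML snippet (the markup, URLs and class name ext are invented for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only to demonstrate the lookup methods.
html = """
<html><body>
  <h1 id="main">Links</h1>
  <ul>
    <li><a class="ext" href="https://example.com/a">First</a></li>
    <li><a class="ext" href="https://example.com/b">Second</a></li>
    <li><a href="/local">Third</a></li>
  </ul>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first match, or None when nothing matches.
heading = soup.find('h1', id='main')
missing = soup.find('table')

# find_all() returns every matching tag as a list.
all_links = soup.find_all('a')

# select() takes a CSS selector; here: <a> tags with class "ext".
ext_links = soup.select('a.ext')

# Tag attributes are read with dictionary-style access.
hrefs = [a['href'] for a in all_links]

print(heading.get_text(strip=True))
print(hrefs)
print([a.get_text() for a in ext_links])
```

Because find() returns None for a missing element, it is worth checking the result before calling .text on it.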

Example 1: Scraping the Douban Movie Top 250 with BeautifulSoup

The following code shows how to scrape the Douban Movie Top 250 with BeautifulSoup:

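import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
# Send a browser-like User-Agent; the default requests one may be rejected by the site.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop on an HTTP error instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')

# Each movie entry is expected inside an element with class "item".
movies = soup.select('.item')
for movie in movies:
    title = movie.select('.title')[0].text
    rating = movie.select('.rating_num')[0].text
    print(f'{title} {rating}')

In this example, we send a GET request with the requests library, passing a browser-like User-Agent header and a timeout, and call raise_for_status() so that an HTTP error stops the script instead of being parsed as if it were the movie list. We then parse the returned HTML of the Douban Movie Top 250 page with BeautifulSoup and use a CSS selector to find all .item elements. Finally, a for loop goes through each movie element, uses select() to find its title and rating, and prints them.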
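Indexing with [0] raises IndexError as soon as one entry lacks the expected element. A more defensive variant, sketched below, uses select_one(), which returns None when nothing matches. The function name parse_movies and the SAMPLE_HTML snippet are hypothetical; the snippet only mimics the class names used above so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names from the example above.
SAMPLE_HTML = """
<div class="item"><span class="title">Movie A</span><span class="rating_num">9.0</span></div>
<div class="item"><span class="title">Movie B</span></div>
"""

def parse_movies(html):
    """Return a list of (title, rating) tuples; rating is None when absent."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for item in soup.select('.item'):
        title_tag = item.select_one('.title')
        rating_tag = item.select_one('.rating_num')
        if title_tag is None:
            continue  # skip entries that have no title at all
        rating = rating_tag.get_text(strip=True) if rating_tag else None
        results.append((title_tag.get_text(strip=True), rating))
    return results

movies = parse_movies(SAMPLE_HTML)
for title, rating in movies:
    print(title, rating)
```

Keeping the parsing in its own function also makes it possible to test the selectors against saved HTML instead of hitting the site on every run.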

Example 2: Scraping jokes from Qiushibaike with BeautifulSoup

The following code shows how to scrape text jokes from Qiushibaike with BeautifulSoup:

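import requests
from bs4 import BeautifulSoup

url = 'https://www.qiushibaike.com/text/'
# Browser-like User-Agent so the request looks like it comes from a normal browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Each joke is expected inside an element with class "article".
articles = soup.select('.article')
for article in articles:
    content = article.select('.content')[0].text.strip()
    print(content)

In this example, we send a GET request with the requests library, including a User-Agent header and a timeout, and retrieve the HTML of the Qiushibaike text-jokes page. We then parse the HTML with BeautifulSoup and use a CSS selector to find all .article elements. Finally, a for loop goes through each joke element, uses select() to find its .content element, strips the surrounding whitespace, and prints the text of each joke.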
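Text blocks often contain line breaks and nested tags, so .text.strip() can leave stray whitespace in the middle. One option is get_text() with a separator and strip=True, sketched below on made-up markup (SAMPLE_HTML and extract_contents are hypothetical names, and the snippet only mimics the class names used above).

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names from the example above.
SAMPLE_HTML = """
<div class="article"><div class="content"><span>Line one<br/>Line two</span></div></div>
<div class="article"><div class="content">
    Hello
</div></div>
<div class="article"><div class="author">no content here</div></div>
"""

def extract_contents(html):
    """Return the cleaned text of every .article that has a .content element."""
    soup = BeautifulSoup(html, 'html.parser')
    contents = []
    for article in soup.select('.article'):
        content_tag = article.select_one('.content')
        if content_tag is None:
            continue  # skip articles without a content block
        # strip=True trims each text fragment; the separator joins the fragments.
        contents.append(content_tag.get_text('\n', strip=True))
    return contents

contents = extract_contents(SAMPLE_HTML)
for text in contents:
    print(text)
    print('---')
```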

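That concludes the guide to "Python Web Scraping: The BeautifulSoup Module from Installation to Detailed Usage, with Examples". It covered installing BeautifulSoup, importing the module, parsing HTML, finding elements, and two example scripts that scrape the Douban Movie Top 250 and Qiushibaike jokes.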
