Scraping Dangdang Book Data with Python and Visualizing It

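This article walks through how to scrape book data from Dangdang (dangdang.com) with a Python crawler and visualize it, covering four steps: scraping, cleaning, analysis, and visualization. We will use Python's requests, BeautifulSoup, pandas, and matplotlib libraries.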

Scraping the data

First, we need to scrape book data from Dangdang. We can do this with Python's requests and BeautifulSoup libraries. Here is a simple Python example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.dangdang.com/cp01.00.00.00.00.00.html'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

books = soup.find_all('div', {'class': 'con'})

data = []
for book in books:
    title = book.find('a', {'class': 'pic'}).get('title')
    author = book.find('div', {'class': 'publisher_info'}).find_all('a')[0].text
    price = book.find('span', {'class': 'price_n'}).text
    data.append([title, author, price])

df = pd.DataFrame(data, columns=['Title', 'Author', 'Price'])
print(df.head())

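In the example above, we first define a url variable that points to Dangdang's books page. We then send an HTTP request with requests and parse the HTML response with BeautifulSoup. The find_all method locates every book element in the HTML, and the find method pulls the title, author, and price out of each one. We collect these values in a list and convert it to a DataFrame with pandas. Finally, we print the first few rows of the DataFrame to check that the data looks right.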
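The loop above assumes every matched element contains a title link, an author link, and a price; if any of them is missing, find returns None and the chained call raises an exception. The following is a minimal sketch of a more defensive version. The SAMPLE_HTML markup and the parse_books helper are made up for illustration: they reuse the class names from the example above, but the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used above; the real page may differ.
SAMPLE_HTML = """
<div class="con">
  <a class="pic" title="Book A"></a>
  <div class="publisher_info"><a>Author A</a></div>
  <span class="price_n">¥35.50</span>
</div>
<div class="con">
  <a class="pic" title="Book B"></a>
  <span class="price_n">¥20.00</span>
</div>
<div class="con"><p>banner, not a book</p></div>
"""


def parse_books(html):
    """Return [title, author, price] rows, skipping blocks that are not books."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for book in soup.find_all('div', {'class': 'con'}):
        link = book.find('a', {'class': 'pic'})
        price = book.find('span', {'class': 'price_n'})
        if link is None or price is None:
            continue  # no title link or no price: treat it as a non-book block
        info = book.find('div', {'class': 'publisher_info'})
        author_link = info.find('a') if info is not None else None
        # A missing author becomes None instead of raising an exception
        author = author_link.get_text(strip=True) if author_link is not None else None
        rows.append([link.get('title'), author, price.get_text(strip=True)])
    return rows


rows = parse_books(SAMPLE_HTML)
print(rows)
```

Rows with a missing author are kept with None so that pandas can treat the value as missing later, while blocks without a title or price are dropped entirely.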

Cleaning the data

Next, we need to clean the scraped data, which we can do with pandas. Here is a simple Python example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.dangdang.com/cp01.00.00.00.00.00.html'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

books = soup.find_all('div', {'class': 'con'})

data = []
for book in books:
    title = book.find('a', {'class': 'pic'}).get('title')
    author = book.find('div', {'class': 'publisher_info'}).find_all('a')[0].text
    price = book.find('span', {'class': 'price_n'}).text
    data.append([title, author, price])

df = pd.DataFrame(data, columns=['Title', 'Author', 'Price'])

# Clean the price data
df['Price'] = df['Price'].str.replace('¥', '').astype(float)

print(df.head())

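In the example above, we first scrape the data with the earlier code and convert it to a DataFrame. We then use str.replace to strip the '¥' symbol from the price strings and astype to convert them to floats. Finally, we print the first few rows of the DataFrame to check that the data looks right.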
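The astype(float) call raises an error as soon as one price string contains anything other than a number, for example a thousands separator, a different currency glyph, or an empty value. A possible alternative, sketched below on made-up price strings, is to strip everything except digits and the decimal point and let pd.to_numeric turn whatever is still unparseable into NaN.

```python
import pandas as pd

# Made-up price strings covering cases a plain replace() + astype(float) would choke on
raw = pd.Series(['¥35.50', '￥1,280.00', ' ¥20 ', '', None, 'N/A'])

# Keep only digits and the decimal point
digits = raw.str.replace(r'[^0-9.]', '', regex=True)

# errors='coerce' turns anything that still is not a number into NaN
prices = pd.to_numeric(digits, errors='coerce')
print(prices.tolist())
```

The NaN rows can then be dropped with dropna or inspected separately, depending on how much missing data is acceptable.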

Analyzing the data

Next, we need to analyze the cleaned data, which we can also do with pandas. Here is a simple Python example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.dangdang.com/cp01.00.00.00.00.00.html'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

books = soup.find_all('div', {'class': 'con'})

data = []
for book in books:
    title = book.find('a', {'class': 'pic'}).get('title')
    author = book.find('div', {'class': 'publisher_info'}).find_all('a')[0].text
    price = book.find('span', {'class': 'price_n'}).text
    data.append([title, author, price])

df = pd.DataFrame(data, columns=['Title', 'Author', 'Price'])

# Clean the price data
df['Price'] = df['Price'].str.replace('¥', '').astype(float)

# Count the books and compute the average price for each author
author_count = df.groupby('Author').agg({'Title': 'count', 'Price': 'mean'})
print(author_count)

# Count the books in each price range
price_bins = [0, 20, 40, 60, 80, 100, 200, 500]
price_labels = ['0-20', '20-40', '40-60', '60-80', '80-100', '100-200', '200-500']
df['Price Range'] = pd.cut(df['Price'], bins=price_bins, labels=price_labels)
# observed=False keeps empty price ranges in the result and avoids a pandas FutureWarning
price_count = df.groupby('Price Range', observed=False).agg({'Title': 'count'})
print(price_count)

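In the example above, we first scrape the data with the earlier code and convert it to a DataFrame. We then use str.replace to strip the '¥' symbol from the price strings and astype to convert them to floats. Next, we group the rows by author with groupby and use agg to compute each author's book count and average price. Finally, we use cut to split the prices into price ranges and groupby to count the books in each range.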
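To see what the grouping and binning steps produce without hitting the network, the sketch below runs them on a small made-up DataFrame. It uses named aggregation so the result columns get descriptive names (book_count and mean_price are illustrative names, not part of the original code). Note that pd.cut uses right-closed intervals by default, so a price of exactly 20 lands in '0-20', and prices outside the bins (0 or above 500) become NaN.

```python
import pandas as pd

# Made-up data standing in for the scraped and cleaned DataFrame
df = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Author': ['X', 'X', 'Y', 'Z'],
    'Price': [20.0, 30.0, 50.0, 600.0],  # 600 lies outside the bins below
})

# Named aggregation: one row per author with a count and a mean
author_stats = df.groupby('Author').agg(
    book_count=('Title', 'count'),
    mean_price=('Price', 'mean'),
)
print(author_stats)

price_bins = [0, 20, 40, 60, 80, 100, 200, 500]
price_labels = ['0-20', '20-40', '40-60', '60-80', '80-100', '100-200', '200-500']
df['Price Range'] = pd.cut(df['Price'], bins=price_bins, labels=price_labels)

# value_counts on a categorical column also reports the empty ranges
range_counts = df['Price Range'].value_counts(sort=False)
print(range_counts)

# Prices that fell outside every bin show up as NaN
unbinned = int(df['Price Range'].isna().sum())
print('unbinned rows:', unbinned)
```

Checking the number of unbinned rows is a cheap way to notice when the bin edges do not cover the scraped price range.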

Visualizing the data

Finally, we can visualize the data with matplotlib. Here is a simple Python example:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt

url = 'https://www.dangdang.com/cp01.00.00.00.00.00.html'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

books = soup.find_all('div', {'class': 'con'})

data = []
for book in books:
    title = book.find('a', {'class': 'pic'}).get('title')
    author = book.find('div', {'class': 'publisher_info'}).find_all('a')[0].text
    price = book.find('span', {'class': 'price_n'}).text
    data.append([title, author, price])

df = pd.DataFrame(data, columns=['Title', 'Author', 'Price'])

# Clean the price data
df['Price'] = df['Price'].str.replace('¥', '').astype(float)

# Count the books in each price range
price_bins = [0, 20, 40, 60, 80, 100, 200, 500]
price_labels = ['0-20', '20-40', '40-60', '60-80', '80-100', '100-200', '200-500']
df['Price Range'] = pd.cut(df['Price'], bins=price_bins, labels=price_labels)
# observed=False keeps empty price ranges in the result and avoids a pandas FutureWarning
price_count = df.groupby('Price Range', observed=False).agg({'Title': 'count'})

# Draw a pie chart
plt.pie(price_count['Title'], labels=price_count.index, autopct='%1.1f%%')
plt.title('Number of books by price range')
plt.show()

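In the example above, we first scrape the data with the earlier code and convert it to a DataFrame. We then use str.replace to strip the '¥' symbol from the price strings and astype to convert them to floats. Next, we use cut to split the prices into price ranges and groupby to count the books in each range. We then draw a pie chart with matplotlib's pie function and set the chart title with title. Finally, we call show to display the chart.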
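When some price ranges contain no books, the pie chart still draws a label for each of them, and the zero-width labels tend to overlap. One possible refinement, sketched here with made-up counts, is to drop the empty ranges before plotting. The Agg backend is selected only so the sketch runs on a machine without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Made-up book counts per price range, including two empty ranges
counts = pd.Series(
    [12, 30, 0, 8, 0],
    index=['0-20', '20-40', '40-60', '60-80', '80-100'],
)

# Keep only the ranges that actually contain books
nonzero = counts[counts > 0]

fig, ax = plt.subplots()
ax.pie(nonzero, labels=list(nonzero.index), autopct='%1.1f%%')
ax.set_title('Number of books by price range')
# Call fig.savefig(...) or plt.show() here to render the chart
```

Using the fig and ax objects instead of the plt.pie shortcut makes it easier to put several charts in one figure later.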

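Example 2: Scraping multiple pages and visualizing the data

The following Python example scrapes several pages of search results and visualizes the data: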

import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt

url_template = 'https://search.dangdang.com/?key=python&act=input&page_index={}'

data = []
for page in range(1, 6):
    url = url_template.format(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    books = soup.find_all('li', {'class': 'line'})

    for book in books:
        title = book.find('a', {'class': 'pic'}).get('title')
        author = book.find('div', {'class': 'publisher_info'}).find_all('a')[0].text
        price = book.find('span', {'class': 'price_n'}).text
        data.append([title, author, price])

df = pd.DataFrame(data, columns=['Title', 'Author', 'Price'])

# Clean the price data
df['Price'] = df['Price'].str.replace('¥', '').astype(float)

# Count the books in each price range
price_bins = [0, 20, 40, 60, 80, 100, 200, 500]
price_labels = ['0-20', '20-40', '40-60', '60-80', '80-100', '100-200', '200-500']
df['Price Range'] = pd.cut(df['Price'], bins=price_bins, labels=price_labels)
# observed=False keeps empty price ranges in the result and avoids a pandas FutureWarning
price_count = df.groupby('Price Range', observed=False).agg({'Title': 'count'})

# Draw a bar chart
plt.bar(price_count.index, price_count['Title'])
plt.title('Number of books by price range')
plt.xlabel('Price Range')
plt.ylabel('Number of books')
plt.show()

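In the example above, we first define a url_template variable that contains a {} placeholder for the page number. We then loop over pages 1 through 5 and use format to insert each page number into url_template. For every page, we fetch and parse the HTML with requests and BeautifulSoup and append the extracted rows to a shared list, which we convert to a DataFrame after the loop. Next, we use cut to split the prices into price ranges and groupby to count the books in each range. We then draw a bar chart with matplotlib's bar function and set the title and axis labels with title, xlabel, and ylabel. Finally, we call show to display the chart.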
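The loop above sends five requests back to back and stops at the first failure. A possible way to make it more forgiving is sketched below using only the standard library: the URLs are built with urlencode so that keywords containing spaces or non-ASCII characters are escaped, and pages are fetched with a pause in between. The helper names build_page_urls and fetch_pages, the delay_seconds parameter, and the injected fetch callable are all assumptions made for this illustration; the query parameters are the ones from the url_template above.

```python
import time
from urllib.parse import urlencode

BASE_URL = 'https://search.dangdang.com/'


def build_page_urls(keyword, first_page, last_page):
    """Build one search URL per page, letting urlencode escape the keyword."""
    return [
        BASE_URL + '?' + urlencode({'key': keyword, 'act': 'input', 'page_index': page})
        for page in range(first_page, last_page + 1)
    ]


def fetch_pages(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between requests and skipping failures."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # be polite: do not hammer the server
        try:
            pages.append(fetch(url))
        except Exception as exc:  # broad on purpose: the fetcher is caller-supplied
            print(f'skipping {url}: {exc}')
    return pages


urls = build_page_urls('python', 1, 5)
print(urls[0])
```

With requests, the fetch argument could be something like lambda u: requests.get(u, headers={'User-Agent': '...'}, timeout=10).text, and each returned HTML string can then be passed to BeautifulSoup as in the example above.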

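Summary

This article covered how to scrape book data from Dangdang with a Python crawler and visualize it, including data scraping, data cleaning, data analysis, and data visualization. We worked through two examples to show how the pieces fit together. In practice, you can pick whichever of these techniques suits your needs for scraping and visualizing data.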
