Pandas Basics in Python Using the Iris Dataset

March 27, 2023, 2:34 PM • python-answer

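First, a brief introduction to the Iris dataset. It is a classic multivariate dataset, widely used to test and demonstrate classification and clustering algorithms. It was introduced by Fisher in 1936 and is also known as the Iris flower dataset. It contains 150 observations covering three iris species, with 50 samples per species. Each sample records four features: sepal length, sepal width, petal length, and petal width.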

Next, we will walk through how to work with the Iris dataset using the Pandas library.

First, let's import the required libraries and load the dataset:

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()  # load the dataset

Then, let's create a Pandas DataFrame to hold the dataset:

df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add the target column (integer class codes) and a readable species-name column
df['target'] = iris.target
df['target_names'] = iris.target_names[df['target']]

# Show the first five rows
df.head()

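In the code above, we store the dataset in a Pandas DataFrame, use the feature_names attribute to name the feature columns, and add a target column and a target_names column to make later analysis and visualization easier. Finally, the DataFrame head method displays the first five rows, as shown below: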

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target	target_names
0	5.1	3.5	1.4	0.2	0	setosa
1	4.9	3.0	1.4	0.2	0	setosa
2	4.7	3.2	1.3	0.2	0	setosa
3	4.6	3.1	1.5	0.2	0	setosa
4	5.0	3.6	1.4	0.2	0	setosa
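The target_names column above is built by indexing the array of species names with the integer codes. As a minimal sketch of an alternative, the same mapping can be stored as a Pandas categorical, which keeps both the codes and the labels in one column. The `codes` and `names` lists here are hand-written stand-ins for `iris.target` and `iris.target_names`, not the real arrays, and the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical stand-ins for iris.target and iris.target_names
codes = [0, 0, 1, 2, 1]
names = ["setosa", "versicolor", "virginica"]

# Build a categorical column directly from integer codes and their labels
species = pd.Categorical.from_codes(codes, categories=names)
labels = pd.DataFrame({"target": codes, "target_names": species})

print(labels)
# The integer codes can be recovered from the categorical column at any time
print(labels["target_names"].cat.codes.tolist())
```

A categorical column is usually more compact than a column of repeated strings, and it can still be passed to groupby like the string column used in this article.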

Next, we will use a few Pandas methods for a first look at the dataset:

df.info()  # show structural information about the dataset

df.describe()  # show summary statistics for the dataset

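The DataFrame info method prints basic information such as the column names, the number of columns, the non-null count per column, and each column's data type. For example (the integer dtype of target may appear as int32 or int64 depending on the platform):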

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   sepal length (cm) 150 non-null    float64
 1   sepal width (cm)  150 non-null    float64
 2   petal length (cm) 150 non-null    float64
 3   petal width (cm)  150 non-null    float64
 4   target            150 non-null    int32  
 5   target_names      150 non-null    object 
dtypes: float64(4), int32(1), object(1)
memory usage: 6.7+ KB
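The info output confirms 150 non-null rows, but it does not show how those rows are split across species. A minimal sketch of how one might check the 50-per-species balance described in the introduction, and confirm there are no missing values, is shown here. It rebuilds the DataFrame so the snippet runs on its own; the names `counts` and `missing` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target_names"] = iris.target_names[iris.target]

# Number of rows for each species
counts = df["target_names"].value_counts()
print(counts)

# Number of missing values in each column
missing = df.isna().sum()
print(missing)
```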

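The DataFrame describe method, in turn, shows summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns. When run on the full DataFrame it also reports the numeric target column, which is omitted from this excerpt: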

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.057333	3.758000	1.199333
std	0.828066	0.435866	1.765298	0.762238
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000
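The describe output summarizes all 150 rows together, which hides the differences between species. As a rough sketch, combining groupby with agg gives the same kind of statistics per species. The choice of the petal length column and of these four statistics is only an example, and `petal_stats` is an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target_names"] = iris.target_names[iris.target]

# Per-species summary of a single feature
petal_stats = (
    df.groupby("target_names")["petal length (cm)"]
      .agg(["mean", "std", "min", "max"])
)
print(petal_stats)
```

In this table the largest setosa petal length is smaller than the smallest versicolor petal length, which is one reason petal measurements are a popular choice for plotting this dataset.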

Finally, we can use the Pandas groupby method together with Matplotlib's scatter method to group and visualize the dataset, as shown below:

import matplotlib.pyplot as plt

grouped = df.groupby('target_names')

fig, ax = plt.subplots()
for name, group in grouped:
    ax.scatter(group['petal length (cm)'], group['petal width (cm)'], alpha=.5, label=name)
ax.legend()
ax.set_xlabel('Petal Length (cm)')
ax.set_ylabel('Petal Width (cm)')
plt.show()

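In the code above, we first group the dataset by target name with the groupby method. We then use Matplotlib's scatter method to plot petal length against petal width for each species. The result is shown below:

Figure: Iris dataset visualization result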
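When the script runs without a display (for example on a server), plt.show() may not open a window. A minimal sketch of the same scatter plot written to an image file instead is given here. It assumes Matplotlib's non-interactive Agg backend is acceptable, and the file name iris_petals.png is a hypothetical choice.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is required
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target_names"] = iris.target_names[iris.target]

fig, ax = plt.subplots()
# One scatter call per species so each gets its own color and legend entry
for name, group in df.groupby("target_names"):
    ax.scatter(group["petal length (cm)"], group["petal width (cm)"],
               alpha=0.5, label=name)
ax.legend(title="Species")
ax.set_xlabel("Petal Length (cm)")
ax.set_ylabel("Petal Width (cm)")

out_path = Path("iris_petals.png")  # hypothetical output file name
fig.savefig(out_path, dpi=150)
plt.close(fig)  # release the figure once it has been written
```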