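Categorical data is generally converted to numeric data in one of two ways: label encoding or one-hot encoding. The sections below walk through the steps for each method and show an example.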
Label Encoding
1. Import the library
from sklearn.preprocessing import LabelEncoder
2. Create a LabelEncoder object
le = LabelEncoder()
3. Label-encode the feature column
data['column_name'] = le.fit_transform(data['column_name'])
Here, column_name is the name of the feature column to be label-encoded.
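The three steps above can be tried on a small list before touching a real dataset. The following is a minimal sketch; the `ports` values are hypothetical and only stand in for a categorical column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values standing in for a categorical column
ports = ["S", "C", "S", "Q", "C"]

le = LabelEncoder()
codes = le.fit_transform(ports)          # learn the classes and encode in one step
restored = le.inverse_transform(codes)   # map the integer codes back to the labels

print(list(codes))
print(list(restored))
```

Because the encoder keeps the fitted classes, inverse_transform can recover the original labels from the integer codes.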
4. Example
The following uses the Titanic dataset as an example:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Read the data
data = pd.read_csv('titanic.csv')
# Inspect the data
print(data.head())
# Label-encode Embarked (this assumes the column has no missing values)
le = LabelEncoder()
data['Embarked'] = le.fit_transform(data['Embarked'])
# Inspect the encoded data
print(data.head())
The output is:
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked
0 1 0 3 male 22.0 1 0 7.2500 S
1 2 1 1 female 38.0 1 0 71.2833 C
2 3 1 3 female 26.0 0 0 7.9250 S
3 4 1 1 female 35.0 1 0 53.1000 S
4 5 0 3 male 35.0 0 0 8.0500 S
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked
0 1 0 3 male 22.0 1 0 7.2500 2
1 2 1 1 female 38.0 1 0 71.2833 0
2 3 1 3 female 26.0 0 0 7.9250 2
3 4 1 1 female 35.0 1 0 53.1000 2
4 5 0 3 male 35.0 0 0 8.0500 2
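As the output shows, the Embarked column originally contains three distinct category labels (S, C, Q). LabelEncoder sorts the labels before assigning codes, so after encoding, C, Q, and S are mapped to 0, 1, and 2 respectively.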
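The label-to-code mapping does not have to be read off the printed rows; it can be taken from the fitted encoder. The sketch below assumes a small hypothetical sample of Embarked values rather than the full CSV file.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["S", "C", "Q", "S"])   # hypothetical sample of Embarked values

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
```

Here the codes follow the sorted order of the labels, not the order in which they first appear in the data.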
One-Hot Encoding
1. Import the library
from sklearn.preprocessing import OneHotEncoder
2. Create a OneHotEncoder object
onehot = OneHotEncoder()
3. One-hot encode the feature column
onehot.fit_transform(data['column_name'].values.reshape(-1,1)).toarray()
Here, column_name is the name of the feature column to be one-hot encoded.
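The reshape(-1, 1) and toarray() calls in step 3 are easier to follow on a tiny input. This is a minimal sketch with hypothetical values: the encoder expects a 2-D input of shape (n_samples, 1), and by default it returns a SciPy sparse matrix that toarray() converts to a dense NumPy array.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column, reshaped to (n_samples, 1) as the encoder expects
values = np.array(["S", "C", "S", "Q"]).reshape(-1, 1)

onehot = OneHotEncoder()
encoded = onehot.fit_transform(values)   # sparse matrix by default
dense = encoded.toarray()                # dense array, one column per category

print(onehot.categories_[0])
print(dense)
```

The columns of the dense array follow the order of onehot.categories_[0], which is the sorted list of labels.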
4. Example
The Titanic dataset is used again as the example:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
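# Read the data
data = pd.read_csv('titanic.csv')
# Inspect the data
print(data.head())
# One-hot encode Embarked
onehot = OneHotEncoder()
Embarked_onehot = onehot.fit_transform(data['Embarked'].values.reshape(-1,1)).toarray()
# Convert the result to a DataFrame and append it to the original dataset
# (the three column names assume Embarked has no missing values; otherwise
# the encoder may produce an extra column and the names will not match)
Embarked_onehot = pd.DataFrame(Embarked_onehot,columns=['Embarked_C','Embarked_Q','Embarked_S'])
data = pd.concat([data,Embarked_onehot],axis=1)
# Inspect the processed data
print(data.head())
The output is: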
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked
0 1 0 3 male 22.0 1 0 7.2500 S
1 2 1 1 female 38.0 1 0 71.2833 C
2 3 1 3 female 26.0 0 0 7.9250 S
3 4 1 1 female 35.0 1 0 53.1000 S
4 5 0 3 male 35.0 0 0 8.0500 S
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked Embarked_C Embarked_Q Embarked_S
0 1 0 3 male 22.0 1 0 7.2500 S 0.0 0.0 1.0
1 2 1 1 female 38.0 1 0 71.2833 C 1.0 0.0 0.0
2 3 1 3 female 26.0 0 0 7.9250 S 0.0 0.0 1.0
3 4 1 1 female 35.0 1 0 53.1000 S 0.0 0.0 1.0
4 5 0 3 male 35.0 0 0 8.0500 S 0.0 0.0 1.0
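The output shows that after one-hot encoding, three new columns are added alongside the original Embarked column, one for each category label: C, Q, and S. Each value is 1 or 0, indicating whether the row belongs to that category.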
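Hard-coding the three column names works only when the column holds exactly those categories. As a hedged alternative, the names can be built from the categories the encoder learned. The sketch below uses a small hypothetical DataFrame in place of titanic.csv.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Titanic data
data = pd.DataFrame({"PassengerId": [1, 2, 3, 4],
                     "Embarked": ["S", "C", "S", "Q"]})

onehot = OneHotEncoder()
encoded = onehot.fit_transform(data[["Embarked"]]).toarray()

# Build the column names from the categories the encoder actually learned
columns = [f"Embarked_{cat}" for cat in onehot.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=columns, index=data.index)

# Reusing data.index keeps the rows aligned when concatenating
data = pd.concat([data, encoded_df], axis=1)
print(data)
```

Passing index=data.index matters when the original DataFrame does not have the default 0..n-1 index, because pd.concat with axis=1 aligns rows by index.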