A Detailed Guide to Scikit-learn's feature_selection.SelectPercentile: Selecting the Top Percentage of Features

March 30, 2023, 7:32 PM • sklearn-function

1. What SelectPercentile does

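SelectPercentile is a feature-selection transformer in the Scikit-learn library (strictly speaking a class rather than a function). It picks a subset of the original features for use in training or prediction. It is a univariate method based on statistical tests: each feature is scored against the target with a scoring function, and only the features whose scores fall in the top given percentage are kept (a percentage of the features, not a fixed count n). Discarding weakly related features reduces dimensionality and can, depending on the data and the model, improve training and prediction performance.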
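To make the "top percentage" idea concrete, here is a minimal sketch that compares SelectPercentile with a hand-written score cutoff. The synthetic dataset from make_classification and the names `cutoff` and `manual_mask` are assumptions made for illustration, and the manual version ignores the tie handling that the library performs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic data (an assumption for illustration): 200 samples, 20 features.
X, y = make_classification(n_samples=200, n_features=20, n_informative=4,
                           random_state=0)

percentile = 25
selector = SelectPercentile(f_classif, percentile=percentile).fit(X, y)

# Manual version: score every feature, then keep those above the cutoff score.
scores, _ = f_classif(X, y)
cutoff = np.percentile(scores, 100 - percentile)
manual_mask = scores > cutoff

print("Kept by SelectPercentile:", selector.get_support(indices=True))
print("Kept by manual cutoff:   ", np.flatnonzero(manual_mask))
```

When no two features share the cutoff score, both approaches should keep the same columns.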

2. How to use SelectPercentile

The basic workflow for using SelectPercentile in Scikit-learn is as follows:

2.1 Import the required libraries and dataset

import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2

2.2 Load the data and preprocess it

digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

2.3 Create the feature selection model

select = SelectPercentile(chi2, percentile=10)

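Here, chi2 is the chi-squared scoring function, a statistical test suited to classification targets and non-negative features (such as the pixel intensities used here). The percentile parameter sets the percentage of the original features to keep.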
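The scoring function can also be called on its own, which helps show what SelectPercentile ranks. The sketch below is illustrative only: it calls chi2 directly on the digits data, and the variable names `best` and `rejected_negative` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import chi2

X, y = load_digits(return_X_y=True)

# chi2 returns one score and one p-value per feature.
scores, p_values = chi2(X, y)
print("Number of features scored:", scores.shape[0])

# Constant columns (for example always-blank pixels) may produce NaN scores,
# so a NaN-aware helper is used when looking for the best feature.
best = int(np.nanargmax(scores))
print("Highest-scoring pixel index:", best)

# chi2 rejects negative inputs, so mean-centred data will not work with it.
try:
    chi2(X - X.mean(axis=0), y)
    rejected_negative = False
except ValueError:
    rejected_negative = True
print("Negative input rejected:", rejected_negative)
```

For data that contains negative values, a scoring function such as f_classif is a common alternative.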

2.4 Fit the selector and obtain the new dataset

X_new = select.fit_transform(X, y)

fit_transform fits the selector on the data and returns the dataset reduced to the selected features.

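2.5 Print the results

print("Original feature dimensions:", X.shape)
print("New feature dimensions:", X_new.shape)

Here we print the shapes of the original dataset and of the dataset after selection.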
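Beyond the shapes, it is often useful to see which features survived. The following sketch repeats the steps of section 2 and then uses get_support to list the kept pixels; mapping the flat indices back to 8x8 image coordinates is an addition made here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2

digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

select = SelectPercentile(chi2, percentile=10).fit(X, y)

# Column indices of the features that survive selection.
kept = select.get_support(indices=True)

# Map flat indices back to (row, column) positions in the 8x8 image.
rows, cols = np.unravel_index(kept, digits.images.shape[1:])
for r, c in zip(rows, cols):
    print(f"pixel (row={r}, col={c})")
```

The number of indices returned matches the second dimension of the transformed data.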

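3. Worked examples

Two examples follow: one demonstrates SelectPercentile on a binary classification dataset, the other on a regression dataset.

3.1 Example 1: Selecting features from a binary classification dataset with SelectPercentile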

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Load the breast cancer dataset
cancer = load_breast_cancer()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Create the selector and the classifier
# (with no score_func given, SelectPercentile uses the default f_classif)
select = SelectPercentile(percentile=50)
LR = LogisticRegression(max_iter=10000)

# Chain both steps in a pipeline
pipe = Pipeline([('select', select), ('logistic', LR)])
pipe.fit(X_train, y_train)

# Print the accuracy on the test set
print("LogisticRegression score: {:.3f}".format(pipe.score(X_test, y_test)))

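In this example, we use SelectPercentile to keep the 50% of features in the breast cancer dataset that are most related to the target, then chain it with LogisticRegression in a pipeline, fit it on the training set, and use it for classification. The resulting classification accuracy on the test set is 0.951.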
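A single score says little without a reference point, so a baseline that uses all features is a useful companion. The sketch below is one possible comparison on the same split; the names `baseline`, `selected`, `acc_all`, and `acc_selected` are introduced here for illustration, and no particular outcome is implied.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Baseline: logistic regression trained on all 30 features.
baseline = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# Same model, but trained only on the top 50% of features.
selected = Pipeline([
    ('select', SelectPercentile(percentile=50)),
    ('logistic', LogisticRegression(max_iter=10000)),
]).fit(X_train, y_train)

acc_all = baseline.score(X_test, y_test)
acc_selected = selected.score(X_test, y_test)
print("All features:     {:.3f}".format(acc_all))
print("Top 50% features: {:.3f}".format(acc_selected))
```

If the two scores are close, the selection has mainly simplified the model rather than made it more accurate.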

3.2 Example 2: Selecting features from a regression dataset with SelectPercentile

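from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Load the diabetes dataset
diabetes = load_diabetes()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, random_state=0)

# Create the selector and the regressor
# (f_regression suits a continuous target; the default f_classif is meant for classification)
select = SelectPercentile(f_regression, percentile=50)
LR = LinearRegression()

# Chain both steps in a pipeline
pipe = Pipeline([('select', select), ('regression', LR)])
pipe.fit(X_train, y_train)

# Print the R^2 score on the test set
print("LinearRegression score: {:.3f}".format(pipe.score(X_test, y_test)))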

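In this example, we use SelectPercentile to keep the 50% of features in the diabetes dataset that are most related to the target, then chain it with a linear regression model, fit it on the training set, and predict the disease progression measure. Note that for a regressor, score returns the coefficient of determination (R²), not an accuracy. The original run reported a score of 0.472; the exact value depends on the scoring function and the scikit-learn version used.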
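Because the target here is continuous, a regression-oriented scoring function such as f_regression is a natural fit. The sketch below, offered as an illustration rather than the original author's code, lists the feature names that such a selector keeps and compares its R² with an all-features baseline; `kept_names`, `r2_selected`, and `r2_all` are names invented for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, random_state=0)

pipe = Pipeline([
    ('select', SelectPercentile(f_regression, percentile=50)),
    ('regression', LinearRegression()),
]).fit(X_train, y_train)

# Translate the boolean support mask into readable feature names.
mask = pipe.named_steps['select'].get_support()
kept_names = [name for name, keep in zip(diabetes.feature_names, mask) if keep]
print("Features kept:", kept_names)

# Baseline: linear regression on all ten features.
baseline = LinearRegression().fit(X_train, y_train)
r2_selected = pipe.score(X_test, y_test)
r2_all = baseline.score(X_test, y_test)
print("R^2 with top 50% features: {:.3f}".format(r2_selected))
print("R^2 with all features:     {:.3f}".format(r2_all))
```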

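As these examples show, SelectPercentile keeps only the features most related to the target, which reduces dimensionality and simplifies the model. Whether it also improves accuracy depends on the data and the estimator, so it is worth comparing the result against a baseline that uses all features.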
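Since the best percentage is rarely known in advance, one option is to treat percentile as a hyperparameter and tune it with cross-validation. The sketch below assumes a grid of five candidate values and 5-fold cross-validation on the diabetes data; both choices are arbitrary and only meant to show the mechanics.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_diabetes(return_X_y=True)

pipe = Pipeline([
    ('select', SelectPercentile(f_regression)),
    ('regression', LinearRegression()),
])

# Candidate percentages (an arbitrary grid chosen for illustration).
percentiles = [20, 40, 60, 80, 100]

# The "step__parameter" syntax addresses a parameter of a pipeline step.
search = GridSearchCV(pipe, {'select__percentile': percentiles}, cv=5)
search.fit(X, y)

print("Best percentile:", search.best_params_['select__percentile'])
print("Mean cross-validated R^2: {:.3f}".format(search.best_score_))
```

Keeping the selector inside the pipeline means it is refit on each training fold, so the held-out fold does not influence which features are chosen.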
