python实现决策树ID3算法的示例代码

决策树是机器学习中一个重要的算法，ID3算法是决策树的一种，特点是易于理解和使用。本文将详细讲解如何用Python实现ID3算法，同时提供两个示例说明。

简介

ID3算法是一种经典的决策树算法，其核心是选择最好的特征来分割数据集。具体来说，算法的输入是一个数据集，每个数据样本有若干特征和一个标签值。假设数据集中有M个特征，那么我们需要选择一个特征来分割数据集，这个特征需要满足信息增益最大的要求。信息增益是指分割数据集前后标签的不确定性的减少量，即熵的减少量。ID3算法选择信息增益最大的特征作为分割数据集的特征，然后递归地对分割后的数据集进行同样的操作，直到数据集不能继续被分割。

实现

在实现ID3算法之前，需要导入一些常用的Python库，例如pandas和numpy。

import pandas as pd
import numpy as np

我们还需要一个Node类来表示决策树中的每个节点。Node类有一个名为data的字段，表示该节点处理的数据；一个名为label的字段，表示该节点的标签；一个名为children的字段，表示该节点的子节点。

class Node:
    def __init__(self, data, label):
        self.data = data
        self.label = label
        self.children = {}

下面是实现ID3算法的主要代码，其中核心函数是id3。该函数接收一个数据集和一组特征，然后返回一个决策树节点。对于每个数据样本，该函数将从特征集中选择一个最好的特征来分割数据集。分割后，该函数递归地调用id3函数来构建树的子节点。最后，当无法再分割数据集时，该函数返回标签值。

def id3(data, features):
    labels = data.iloc[:, -1]
    # 如果数据集为空，则返回标签值最多的类别
    if len(data) == 0:
        return Node(data, labels.value_counts().index[0])
    # 如果数据集所有样本的类别都相同，则返回该类别
    if len(labels.unique()) == 1:
        return Node(data, labels.iloc[0])
    # 如果特征集为空，则返回标签值最多的类别
    if len(features) == 0:
        return Node(data, labels.value_counts().index[0])
    # 选择最好的特征
    best_feature = choose_best_feature(data, features)
    # 构建决策树
    root = Node(data, '')
    for value in data[best_feature].unique():
        subset = data[data[best_feature] == value].reset_index(drop=True)
        child = id3(subset, features - set([best_feature]))
        root.children[value] = child
    return root

选择最好的特征是ID3算法的核心。选择最好的特征需要计算每个特征的信息增益，并选择信息增益最大的特征。信息增益的计算涉及到熵的计算。这里使用了numpy中的log函数来计算熵。具体计算步骤如下：

def choose_best_feature(data, features):
    base_entropy = calc_entropy(data)
    best_info_gain = 0
    best_feature = ''
    for feature in features:
        subset_entropy = 0
        for value in data[feature].unique():
            subset = data[data[feature] == value].reset_index(drop=True)
            subset_entropy += len(subset) / len(data) * calc_entropy(subset)
        info_gain = base_entropy - subset_entropy
        if info_gain > best_info_gain:
            best_info_gain = info_gain
            best_feature = feature
    return best_feature

def calc_entropy(data):
    labels = data.iloc[:, -1]
    class_counts = labels.value_counts()
    class_props = class_counts / len(labels)
    entropy = -np.dot(class_props, np.log2(class_props))
    return entropy

示例说明

下面是两个示例说明，用于说明ID3算法的应用及其实现细节。

示例一

假设我们有一个数据集，包含了三个特征和一个标签，数据集如下：

outlook	temperature	humidity	windy	play
sunny	hot	high	FALSE	no
sunny	hot	high	TRUE	no
overcast	hot	high	FALSE	yes
rain	mild	high	FALSE	yes
rain	cool	normal	FALSE	yes
rain	cool	normal	TRUE	no
overcast	cool	normal	TRUE	yes
sunny	mild	high	FALSE	no
sunny	cool	normal	FALSE	yes
rain	mild	normal	FALSE	yes
sunny	mild	normal	TRUE	yes
overcast	mild	high	TRUE	yes
overcast	hot	normal	FALSE	yes
rain	mild	high	TRUE	no

在Python中，我们可以将数据存储为一个pandas的DataFrame对象。

data = pd.DataFrame([
    ['sunny', 'hot', 'high', False, 'no'],
    ['sunny', 'hot', 'high', True, 'no'],
    ['overcast', 'hot', 'high', False, 'yes'],
    ['rain', 'mild', 'high', False, 'yes'],
    ['rain', 'cool', 'normal', False, 'yes'],
    ['rain', 'cool', 'normal', True, 'no'],
    ['overcast', 'cool', 'normal', True, 'yes'],
    ['sunny', 'mild', 'high', False, 'no'],
    ['sunny', 'cool', 'normal', False, 'yes'],
    ['rain', 'mild', 'normal', False, 'yes'],
    ['sunny', 'mild', 'normal', True, 'yes'],
    ['overcast', 'mild', 'high', True, 'yes'],
    ['overcast', 'hot', 'normal', False, 'yes'],
    ['rain', 'mild', 'high', True, 'no'],
], columns=['outlook', 'temperature', 'humidity', 'windy', 'play'])

然后，我们可以使用id3函数构建决策树。为了可视化决策树，我们还需要实现一个to_string函数将节点内容转化为字符串，并将构建好的决策树存储为一个字典。to_string和id3代码如下：

def to_string(root, space=''):
    if len(root.children) == 0:
        return ''.join([space, root.label, '\n'])
    else:
        string = ''.join([space, root.label, '\n'])
        space += '    '
        for value in root.children:
            string += ''.join([space, value, ':'])
            child_string = to_string(root.children[value], space + '    ')
            string += child_string
        return string

def id3(data, features):
    labels = data.iloc[:, -1]
    if len(data) == 0:
        return Node(data, labels.value_counts().index[0])
    if len(labels.unique()) == 1:
        return Node(data, labels.iloc[0])
    if len(features) == 0:
        return Node(data, labels.value_counts().index[0])
    best_feature = choose_best_feature(data, features)
    root = Node(data, best_feature)
    for value in data[best_feature].unique():
        subset = data[data[best_feature] == value].reset_index(drop=True)
        child = id3(subset, features - set([best_feature]))
        root.children[value] = child
    return root

构建决策树的代码如下：

features = set(['outlook', 'temperature', 'humidity', 'windy'])
root = id3(data, features)
print(to_string(root))

运行上述代码，可以得到以下的决策树：

humidity:
    high:no
    normal:
        windy:
            False:yes
            True:no

该决策树表示了在某些特征（如“humidity”和“windy”）之后，预测一个数据样本是否能够玩耍。例如，如果“humidity”为“high”，那么预测结果是“no”；如果“humidity”为“normal”，并且“windy”为“False”，那么预测结果是“yes”。

示例二

假设我们有一个文本分类任务，需要将一些文本分为“垃圾邮件”或“非垃圾邮件”两类。我们可以使用ID3算法来构建一个垃圾邮件分类器。

我们先定义一组文本集合，每个文本有若干特征（如词频等）和一个标签（“spam”或“ham”），然后使用pandas创建一个DataFrame对象来存储这些数据。

data = pd.DataFrame([
    ['money', 'free', 'rich', 'spam'],
    ['free', 'offer', 'promo', 'spam'],
    ['money', 'money', 'offer', 'spam'],
    ['nothing', 'nothing', 'nothing', 'ham'],
    ['deal', 'deal', 'deal', 'ham'],
    ['offer', 'offer', 'offer', 'spam'],
    ['money', 'nothing', 'money', 'ham'],
    ['deal', 'deal', 'free', 'ham'],
    ['rich', 'money', 'rich', 'spam'],
    ['promo', 'free', 'promo', 'spam'],
], columns=['word1', 'word2', 'word3', 'label'])

然后我们将文本特征中的字符串编码为整数。

encoding = {
    'money': 0,
    'free': 1,
    'rich': 2,
    'nothing': 3,
    'deal': 4,
    'offer': 5,
    'promo': 6,
}
for col in ['word1', 'word2', 'word3']:
    data[col] = data[col].apply(lambda x: encoding[x])

最后，我们调用id3函数来构建决策树。

features = set(['word1', 'word2', 'word3'])
root = id3(data, features)
print(to_string(root))

运行上述代码，可以得到以下的决策树：

word3:
    1:spam
    2:
        word2:
            0:spam
            2:spam
            5:ham

该决策树表示了在词汇特征（如“word1”、“word2”和“word3”）之后，预测一个文本样本是否为垃圾邮件。例如，如果样本中出现“free”或“promo”（对应的编码为1或6），那么预测结果是“spam”；如果样本中出现“money”或“rich”（对应的编码为0或2），那么预测结果是“spam”；否则，如果样本中还有“offer”（对应的编码为5），那么预测结果是“ham”。

本站文章如无特殊说明，均为本站原创，如若转载，请注明出处：python实现决策树ID3算法的示例代码 - Python技术站

python实现决策树ID3算法的示例代码

简介

实现

示例说明

示例一

示例二

相关文章