Statistical Learning Methods (《统计学习方法》) Study Notes 4: Learning from a Python Decision Tree Implementation

License: CC 4.0 BY-SA

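This post walks through an implementation of the decision tree algorithm, covering the computation of empirical entropy and empirical conditional entropy as well as the definition of the decision tree class and its prediction method. It also discusses the difference between xrange() and range() and points to references on the lambda expressions used in the code.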

Decision tree implementation

Blog post: https://blog.youkuaiyun.com/wds2006sdo/article/details/52849400

lambda

https://blog.youkuaiyun.com/u010602026/article/details/67662004
https://blog.youkuaiyun.com/qq_38526635/article/details/81326004

xrange() vs. range()

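In Python 2, range returns a list, while xrange returns an xrange object: a lazy sequence (often described as a generator, although strictly it is not one).
xrange never builds the full list; it computes each value on demand, so it needs very little memory and performs well on long loops.
Python 3.x removed xrange, and its range behaves lazily the way xrange did.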
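
As a quick check of the memory claim in Python 3, where range plays the role of xrange, the sketch below compares a range object with the list built from it. The exact byte counts depend on the interpreter, so the example only compares them; the size n is an arbitrary choice.

```python
import sys

n = 100_000  # arbitrary size chosen for illustration

lazy = range(n)         # Python 3 range: values are computed on demand
eager = list(range(n))  # materializes all n integers, like Python 2's range

# The range object keeps only start, stop and step, so its size does not grow with n.
lazy_size = sys.getsizeof(lazy)
eager_size = sys.getsizeof(eager)
print(lazy_size < eager_size)

# Unlike a generator, a range supports len() and indexing.
print(len(lazy), lazy[10])
```

Note that sys.getsizeof on the list counts only the list's own pointer array, not the integer objects it refers to, so the real gap is larger than the comparison suggests.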

shape()

https://blog.youkuaiyun.com/Mr_Cat123/article/details/78841747

Code walkthrough

(max_class,max_len) = max([(i,len(filter(lambda x:x==i,train_label))) for i in xrange(total_class)],key = lambda x:x[1])

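This line finds the class that makes up the largest share of the data set, together with its sample count.
For each of the 10 classes, len(filter(lambda x:x==i,train_label)) counts the samples labeled i (this relies on Python 2, where filter returns a list), producing a list of (class, count) tuples. key = lambda x:x[1] tells max to compare those tuples by their second element, the count.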
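
Since len(filter(...)) fails in Python 3, where filter returns an iterator, here is a minimal Python 3 sketch of the same step. It swaps in collections.Counter, which is not what the original code uses, and also shows a direct port for comparison. The labels and total_class value are made up for illustration.

```python
from collections import Counter

# Toy labels, made up for illustration; the original uses the training labels.
train_label = [2, 0, 2, 1, 2, 0]
total_class = 3  # assumed number of classes for this toy example

# Counter tallies each label; most_common(1) returns [(label, count)] for the top one.
max_class, max_len = Counter(train_label).most_common(1)[0]
print(max_class, max_len)  # 2 3

# Direct Python 3 port of the original expression, kept for comparison.
ported = max(
    [(i, sum(1 for x in train_label if x == i)) for i in range(total_class)],
    key=lambda pair: pair[1],
)
print(ported)  # (2, 3)
```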

class Tree(object):
    def __init__(self,node_type,Class = None, feature = None):
        self.node_type = node_type
        self.dict = {}
        self.Class = Class
        self.feature = feature

    def add_tree(self,val,tree):
        self.dict[val] = tree

    def predict(self,features):
        if self.node_type == 'leaf':
            return self.Class

        tree = self.dict[features[self.feature]]
        return tree.predict(features)

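This code defines the decision tree node class, including its data fields and methods.
The data fields are dict, feature, Class, and node_type.
predict() uses the constructed tree to predict the class of a test sample. Starting at the root, it follows the branch matching the sample's value for each node's feature until it reaches a leaf node, then returns that leaf's Class.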
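
To make the traversal concrete, the sketch below hand-builds a two-level tree and calls predict() on it. The class is repeated in condensed form so the block runs on its own. The value 'internal' for the INTERNAL constant and the toy features and labels are assumptions; the notes only show that leaf nodes are marked 'leaf'.

```python
class Tree(object):
    """Condensed copy of the Tree class from the notes."""

    def __init__(self, node_type, Class=None, feature=None):
        self.node_type = node_type
        self.dict = {}          # feature value -> child Tree
        self.Class = Class      # label, set on leaf nodes
        self.feature = feature  # feature index, set on internal nodes

    def add_tree(self, val, tree):
        self.dict[val] = tree

    def predict(self, features):
        if self.node_type == 'leaf':
            return self.Class
        return self.dict[features[self.feature]].predict(features)


LEAF, INTERNAL = 'leaf', 'internal'  # 'internal' is an assumed value

# Root tests feature 0; when it equals 1, a second node tests feature 1.
root = Tree(INTERNAL, feature=0)
root.add_tree(0, Tree(LEAF, Class='no'))
child = Tree(INTERNAL, feature=1)
child.add_tree(0, Tree(LEAF, Class='no'))
child.add_tree(1, Tree(LEAF, Class='yes'))
root.add_tree(1, child)

print(root.predict([0, 1]))  # no
print(root.predict([1, 1]))  # yes
```

One consequence of the dict lookup: a feature value that never appeared during training has no entry in dict, so predict() raises KeyError for such a sample.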

def calc_ent(x):
    """
        calculate Shannon entropy of x
    """

    x_value_list = set([x[i] for i in range(x.shape[0])])
    ent = 0.0
    for x_value in x_value_list:
        p = float(x[x == x_value].shape[0]) / x.shape[0]
        logp = np.log2(p)
        ent -= p * logp

    return ent

def calc_condition_ent(x, y):
    """
        calculate ent H(y|x)
    """

    # calc ent(y|x)
    x_value_list = set([x[i] for i in range(x.shape[0])])
    ent = 0.0
    for x_value in x_value_list:
        sub_y = y[x == x_value]
        temp_ent = calc_ent(sub_y)
        ent += (float(sub_y.shape[0]) / y.shape[0]) * temp_ent

    return ent

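This code computes the empirical entropy and the empirical conditional entropy.
Empirical entropy: x_value_list collects the distinct classes that occur in the labels. For each x_value, p = float(x[x == x_value].shape[0]) / x.shape[0] is the fraction of samples in that class, and accumulating -p * log2(p) over all classes gives H(D).
Empirical conditional entropy: for each subset Di (the samples of D that share one value of feature A), H(Di) is computed by calling calc_ent. The weighted sum of these terms, with weights |Di|/|D|, gives H(D|A).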
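
A small worked example may help check these two quantities by hand. The sketch below is an alternative formulation that counts classes with np.unique rather than the set-based loop above; the function names and the toy arrays are made up for illustration.

```python
import numpy as np


def entropy(labels):
    """Empirical entropy H(D) in bits, using np.unique to count each class."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def conditional_entropy(feature, labels):
    """Empirical conditional entropy H(D|A): weighted sum of H(Di)."""
    total = 0.0
    for value in np.unique(feature):
        subset = labels[feature == value]
        total += subset.shape[0] / labels.shape[0] * entropy(subset)
    return total


# Toy data, made up for illustration.
A = np.array([0, 0, 1, 1])  # feature column
D = np.array([0, 1, 1, 1])  # class labels

h_d = entropy(D)                         # H(D)
h_d_given_a = conditional_entropy(A, D)  # H(D|A)
gain = h_d - h_d_given_a                 # information gain g(D, A)
print(round(h_d, 4), round(h_d_given_a, 4), round(gain, 4))  # 0.8113 0.5 0.3113
```

Here the subset with A = 0 has labels [0, 1] (entropy 1 bit) and the subset with A = 1 has labels [1, 1] (entropy 0), so H(D|A) = 0.5 * 1 + 0.5 * 0 = 0.5.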

return Tree(LEAF,Class = max_class)

This is how a leaf node is added: it stores the majority class max_class.

sub_features = filter(lambda x:x!=max_feature,features)
tree = Tree(INTERNAL,feature=max_feature)

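This is how an internal (feature) node is added: the node splits on max_feature, and sub_features holds the remaining features.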
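
When porting this step to Python 3, note that filter returns a one-shot iterator rather than a list, so a sub_features value passed to several recursive calls would be empty after the first pass. A minimal sketch, with made-up feature indices:

```python
features = [0, 1, 2, 3]  # made-up feature indices
max_feature = 2          # made-up index of the selected feature

# Python 3: filter() yields a lazy iterator that can be consumed only once.
lazy = filter(lambda x: x != max_feature, features)
first_pass = list(lazy)
second_pass = list(lazy)  # already exhausted
print(first_pass, second_pass)  # [0, 1, 3] []

# A list comprehension gives a reusable list, matching Python 2's filter().
sub_features = [f for f in features if f != max_feature]
print(sub_features)  # [0, 1, 3]
```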

for feature_value in feature_value_list:

    index = []
    for i in xrange(len(train_label)):
        if train_set[i][max_feature] == feature_value:
            index.append(i)

    sub_train_set = train_set[index]
    sub_train_label = train_label[index]

    sub_tree = recurse_train(sub_train_set,sub_train_label,sub_features,epsilon)
    tree.add_tree(feature_value,sub_tree)

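For each value of the selected feature, the matching samples are collected, a subtree is built from them recursively, and that subtree is attached to the current node under the feature value.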
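
Because train_set and train_label are indexed with a list of positions, they appear to be NumPy arrays (an assumption, since the notes do not show how they are created). Under that assumption, the index-collecting loop can be replaced with a boolean mask, as sketched below on made-up toy data.

```python
import numpy as np

# Toy data, made up for illustration: 4 samples, 2 features.
train_set = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
train_label = np.array([0, 1, 0, 1])
max_feature = 0
feature_value = 1

# Index-list version, equivalent to the loop in the notes (range replaces xrange).
index = [i for i in range(len(train_label))
         if train_set[i][max_feature] == feature_value]

# Boolean-mask version: compare the whole feature column at once.
mask = train_set[:, max_feature] == feature_value
sub_train_set = train_set[mask]
sub_train_label = train_label[mask]

print(np.array_equal(sub_train_set, train_set[index]))  # True
print(sub_train_label.tolist())  # [1, 1]
```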