DataWhale

Information Theory Basics

  • Entropy
    The entropy $H(X)$ of a discrete random variable $X$ with distribution $p(x)$ is defined by $H(X) = -\sum_{x \in \mathcal{X}} p(x)\log p(x)$

  • Joint Entropy
    The joint entropy $H(X,Y)$ of a pair of discrete random variables $(X,Y)$ with a joint distribution $p(x,y)$ is defined as $H(X,Y) = -\sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y)\log p(x,y)$

  • Conditional Entropy
    The conditional entropy $H(Y|X)$ of a pair of discrete random variables $(X,Y)$ with a joint distribution $p(x,y)$ is defined as $H(Y|X) = -\sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y)\log p(y|x)$

  • Information Gain
    The information gain $G(X)$ is defined as $G(X) = H(Y) - H(Y|X)$, which measures how much condition $X$ reduces the entropy $H(Y)$.

  • Gini impurity
    The Gini impurity is an error rate that measures whether a set of data belongs to a single category; it is defined as $I_G(f) = \sum_{i=1}^{m} f_i(1-f_i)$, where $f_i$ is the fraction of items labeled with class $i$.
    The smaller it is, the more likely the set of data belongs to the same category. (A numeric sketch of these quantities follows this list.)
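To make the definitions above concrete, here is a small numeric sketch that estimates entropy, conditional entropy, information gain, and Gini impurity from counts; the helper functions and the `toy_x`/`toy_y` arrays are illustrative, not from any library:

```python
from collections import Counter
import math

def entropy(labels):
    # H(Y) = -sum_y p(y) log2 p(y), with p estimated from counts.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(features, labels):
    # H(Y|X) = sum_x p(x) H(Y|X=x), which equals the joint-sum
    # definition -sum_{x,y} p(x,y) log p(y|x) given above.
    n = len(labels)
    h = 0.0
    for x in set(features):
        subset = [y for f, y in zip(features, labels) if f == x]
        h += (len(subset) / n) * entropy(subset)
    return h

def gini(labels):
    # I_G(f) = sum_i f_i (1 - f_i) over the class frequencies f_i.
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

# Toy data: a binary feature and a binary class label.
toy_x = ['a', 'a', 'a', 'b', 'b', 'b']
toy_y = [1, 1, 0, 0, 0, 0]

print(entropy(toy_y))                                      # H(Y)   ~ 0.918
print(conditional_entropy(toy_x, toy_y))                   # H(Y|X) ~ 0.459
print(entropy(toy_y) - conditional_entropy(toy_x, toy_y))  # G(X)   ~ 0.459
print(gini(toy_y))                                         # Gini   ~ 0.444
```

On this toy data the gain is about 0.459 bits: knowing the feature value halves the uncertainty about the label.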

Decision Tree

Different classification algorithms

  • ID3 algorithm
    For the current set of samples with output classes $D_i$, compute the information gain $A_g$ of every feature and select the feature with the maximum gain. Then, for each value of the selected feature, split the samples into subsets and repeat the procedure on each subset, provided the maximum $A_g$ exceeds the threshold $\epsilon$.
    This algorithm only works with discrete features.

  • C4.5 algorithm
    It follows the same principle as ID3, and overcomes four disadvantages of ID3:

    1. no support for continuous features
    2. features with more distinct values get larger gains, which biases the result (C4.5 uses the gain ratio to correct this)
    3. no handling of missing values in a feature
    4. no mechanism against overfitting

      Its remedy for overfitting is regularization-based pre-pruning.

  • CART algorithm (Classification And Regression Tree)

    1. the Gini impurity is the basis of node splitting for a classification tree
    2. the minimum sample variance is the basis of node splitting for a regression tree (see the sketch after this list)
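As a concrete illustration of the regression criterion, the sketch below scans a single feature for the threshold that minimizes the weighted variance of the two child nodes; `best_split` and the toy arrays are illustrative names, not a library API:

```python
import numpy as np

def best_split(x, y):
    # Try every midpoint between consecutive distinct feature values and
    # keep the threshold with the smallest total child variance
    # (len * var equals the within-child sum of squared errors).
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_thr, best_score = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold fits between equal feature values
        thr = (x[i] + x[i - 1]) / 2
        left, right = y[:i], y[i:]
        score = len(left) * left.var() + len(right) * right.var()
        if score < best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))  # threshold 6.5, separating the two target clusters
```

A classification tree runs the same scan, but scores each candidate split with the Gini impurity (CART) or information gain (ID3) instead of the variance.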

Model evaluation

There are classification and regression evaluation metrics for classification trees and regression trees, respectively:

  • AUC and the ROC curve
  • RMSE & quantiles of errors

note: for more metrics, refer to the previous article; a brief sketch of the two above follows.
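The sketch below evaluates made-up predictions with both metric families, using standard sklearn.metrics calls (the toy arrays are purely illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, mean_squared_error

# Classification: AUC computed from true labels and predicted scores.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print("AUC:", roc_auc_score(y_true, y_score))  # 0.75

# Regression: RMSE plus quantiles of the absolute errors.
y_reg = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
errors = np.abs(y_reg - y_pred)
print("RMSE:", np.sqrt(mean_squared_error(y_reg, y_pred)))
print("error quantiles:", np.quantile(errors, [0.5, 0.9]))
```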

The sklearn parameters

sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
                                    min_samples_split=2, min_samples_leaf=1, max_features=None,
                                    random_state=None, min_density=None, compute_importances=None,
                                    max_leaf_nodes=None)
(note: min_density and compute_importances have been removed from recent scikit-learn versions)
  • criterion: the feature-splitting criterion ('gini' or 'entropy')
  • max_depth: the maximum depth of the decision tree; limiting it helps prevent overfitting
  • min_samples_leaf: the minimum number of samples required in a leaf node

Code to train and draw a decision tree:

# coding=utf-8

# The iris dataset shipped with sklearn.
# It contains three species of iris: setosa, versicolor and virginica.
# Each sample has 4 features:
# sepal length, sepal width, petal length, petal width
from sklearn.datasets import load_iris
from sklearn import tree
import numpy as np

# Load the iris data.
iris = load_iris()

# Indices of the samples held out for testing (one per class).
test_idx = [0, 50, 100]

# Training data: everything except the held-out samples.
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

# Test data: the held-out samples.
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]


# Fit the decision tree.
clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

# Print the true test labels and the tree's predictions;
# they should match (i.e. the tree predicts correctly).
print("test_target:")
print(test_target)
print("predict:")
print(clf.predict(test_data))


# Visualize the decision tree.
# Requires pydotplus (a maintained fork of pydot)
# and Graphviz (download from www.graphviz.org).
# Note: sklearn.externals.six was removed in scikit-learn 0.23;
# the standard-library io.StringIO works the same way here.
from io import StringIO
import pydotplus

dot_data = StringIO()
tree.export_graphviz(clf,
                     out_file=dot_data,
                     feature_names=iris.feature_names,
                     class_names=iris.target_names,
                     filled=True, rounded=True,
                     impurity=False)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
# Write a PDF showing the full decision process of the tree.
graph.write_pdf("viz.pdf")

The resulting decision tree (the visualization exported to viz.pdf):

[figure: the trained iris decision tree]
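If installing Graphviz is inconvenient, recent scikit-learn versions can render the same tree directly with matplotlib via tree.plot_tree; a minimal sketch reusing the fitted clf from above:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
tree.plot_tree(clf,
               feature_names=iris.feature_names,
               class_names=list(iris.target_names),
               filled=True, rounded=True)
plt.savefig("viz.png")
```

plot_tree was added in scikit-learn 0.21, so this route avoids the pydotplus and Graphviz dependencies entirely.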

