Notes : <Hands-on ML with Sklearn & TF> Chapter 6

  1. Decision Trees can perform both classification and regression tasks, and even multioutput tasks
  2. how to train, visualize, and make predictions with a Decision Tree
  3. the CART training algorithm
  4. limitations of Decision Trees
 

Training and Visualizing a Decision Tree

In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, 2:]
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)
Out[2]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
In [9]:
from sklearn.tree import export_graphviz
import os

PROJECT_ROOT_DIR = "."
CHAPTER_ID = "decision_trees"
def image_path(fig_id):
    return os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id)

export_graphviz(
    tree_clf,
    out_file=image_path('iris_tree.dot'),
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)
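
The generated .dot file can be rendered to an image with Graphviz's dot command-line tool, e.g. `dot -Tpng images/decision_trees/iris_tree.dot -o images/decision_trees/iris_tree.png` (assuming Graphviz is installed).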
 

Iris Decision Tree

 

Making Predictions

  1. $$ Gini\ impurity:\ G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2 $$ (a worked example follows this list)
  2. $p_{i,k}$ is the ratio of class-k instances among the training instances in node $i$
  3. Decision Trees are intuitive "white box" models: their predictions are easy to interpret, in contrast to "black box" models such as Random Forests or neural networks
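
As a quick check of the formula, here is the Gini impurity of the depth-2 node used by the predict_proba call below; its class counts [0, 49, 5] are implied by the returned probabilities 0, 49/54 and 5/54 (this hand computation is an illustration, not part of the original notes):

counts = [0, 49, 5]
total = sum(counts)
gini = 1 - sum((c / total) ** 2 for c in counts)
print(gini)  # approximately 0.168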
 

Estimating Class Probabilities

The output is the ratio of training instances of each class in the leaf node that the instance falls into.

In [10]:
tree_clf.predict_proba([[5, 1.5]])
Out[10]:
array([[ 0.        ,  0.90740741,  0.09259259]])
In [11]:
tree_clf.predict([[5,1.5]])
Out[11]:
array([1])
 

The CART Training Algorithm

 
  1. Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to train Decision Trees
  2. it splits the training set into the two purest possible subsets by searching for the (feature $k$, threshold $t_k$) pair that minimizes the cost function (see the sketch after this list):$$ J(k,t_k) = \frac{m_{left}}{m} G_{left} + \frac{m_{right}}{m} G_{right} \\ where \left\{\begin{matrix} G_{left/right}\ measures\ the\ impurity\ of\ the\ left/right\ subset\\ m_{left/right}\ is\ the\ number\ of\ instances\ in\ the\ left/right\ subset \end{matrix}\right. $$
  3. CART is a greedy algorithm: it produces a reasonably good solution, but it is not guaranteed to be optimal
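
A minimal sketch of the split search CART performs at a single node, assuming Gini impurity and a brute-force scan over candidate thresholds (illustrative only, not Scikit-Learn's actual implementation):

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_split(X, y):
    # scan every (feature k, threshold t_k) pair and keep the one minimizing
    # J(k, t_k) = (m_left / m) * G_left + (m_right / m) * G_right
    m, n = X.shape
    best_k, best_t, best_cost = None, None, np.inf
    for k in range(n):
        for t in np.unique(X[:, k]):
            left = X[:, k] <= t
            right = ~left
            if left.sum() == 0 or right.sum() == 0:
                continue
            cost = left.sum() / m * gini(y[left]) + right.sum() / m * gini(y[right])
            if cost < best_cost:
                best_k, best_t, best_cost = k, t, cost
    return best_k, best_t, best_cost

# root split on the iris petal data (X, y) loaded above
print(best_split(X, y))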
 

Computational Complexity

  1. making a prediction only traverses the tree, roughly $O(log_2(m))$ and independent of the number of features; training compares all $n$ features on all $m$ samples at each level, roughly $O(n \times m\ log(m))$
 

Gini Impurity or Entropy

  1. entropy originated in thermodynamics as a measure of molecular disorder
  2. in ML, it is frequently used as an impurity measure
  3. $$ H_i=-\sum_{k=1,\ p_{i,k}\neq 0}^{n}p_{i,k}log(p_{i,k}) $$
  4. Gini impurity is slightly faster to compute; most of the time Gini and entropy lead to similar trees (see the example after this list)
  5. when they do differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees
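
Switching the impurity measure in Scikit-Learn is just a constructor argument; a minimal comparison on the iris petal data loaded above (this cell is not in the original notes):

tree_gini = DecisionTreeClassifier(max_depth=2, criterion="gini", random_state=42)
tree_entropy = DecisionTreeClassifier(max_depth=2, criterion="entropy", random_state=42)
tree_gini.fit(X, y)
tree_entropy.fit(X, y)
# check whether the two criteria lead to the same predictions on the training set (they usually agree)
print((tree_gini.predict(X) == tree_entropy.predict(X)).all())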
 

Regularization Hyperparameters

  1. a Decision Tree is a nonparametric model: the number of parameters is not determined prior to training, so the model structure is free to stick closely to the data
  2. this freedom makes it likely to overfit, so its freedom must be restricted during training; this is called regularization
  3. generally this is controlled by max_depth
  4. the DecisionTreeClassifier class also has min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_leaf_nodes, and max_features
  5. alternatively, grow the tree without restrictions first and then prune it: remove nodes whose purity improvement is not statistically significant (e.g. according to a chi-squared test)
In [17]:
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np

def plot_decision_boundary(clf, X, y, axes=[0, 7.5, 0, 3], iris=True, legend=False, plot_training=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if not iris:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    if plot_training:
        plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", label="Iris-Setosa")
        plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", label="Iris-Versicolor")
        plt.plot(X[:, 0][y==2], X[:, 1][y==2], "g^", label="Iris-Virginica")
        plt.axis(axes)
    if iris:
        plt.xlabel("Petal length", fontsize=14)
        plt.ylabel("Petal width", fontsize=14)
    else:
        plt.xlabel(r"$x_1$", fontsize=18)
        plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
    if legend:
        plt.legend(loc="lower right", fontsize=14)
        
Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)

deep_tree_clf1 = DecisionTreeClassifier(random_state=42)
deep_tree_clf2 = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
deep_tree_clf1.fit(Xm, ym)
deep_tree_clf2.fit(Xm, ym)

plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_decision_boundary(deep_tree_clf1, Xm, ym, axes=[-1.5, 2.5, -1, 1.5], iris=False)
plt.title("No restrictions", fontsize=16)
plt.subplot(122)
plot_decision_boundary(deep_tree_clf2, Xm, ym, axes=[-1.5, 2.5, -1, 1.5], iris=False)
plt.title("min_samples_leaf = {}".format(deep_tree_clf2.min_samples_leaf), fontsize=14)

plt.show()
 
 

Regression

In [18]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)  # reuses the iris petal features; the class index is treated as a numeric target here
Out[18]:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
In [22]:
export_graphviz(
    tree_reg,
    out_file=image_path('moons_tree1.dot'),
    rounded=True,
    filled=True
)
 

Moons Tree

 
$$ cost\ function:\ J(k,t_k)=\frac{m_{left}}{m}MSE_{left}+\frac{m_{right}}{m}MSE_{right}\\ where\left\{\begin{matrix} MSE_{node}=\sum_{i \in node}(\widehat{y}_{node}-y^{(i)})^2 \\ \widehat{y}_{node}=\frac{1}{m_{node}}\sum_{i \in node}y^{(i)} \end{matrix}\right. $$
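
The book's regression demo actually fits the tree to a noisy quadratic dataset rather than to the iris labels; a minimal sketch of that setup (the dataset parameters here are assumptions, not taken from the original notes):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# noisy quadratic dataset: y = 4 * (x - 0.5)^2 + Gaussian noise
np.random.seed(42)
m = 200
X_quad = np.random.rand(m, 1)
y_quad = 4 * (X_quad[:, 0] - 0.5) ** 2 + np.random.randn(m) / 10

quad_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
quad_reg.fit(X_quad, y_quad)
# each leaf predicts the mean target value of its training instances
print(quad_reg.predict([[0.6]]))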
 

Instability

 
  1. Decision Trees love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation; one way to limit this problem is to use PCA, which often gives the data a better orientation (see the sketch after this list)
  2. more generally, they are very sensitive to small variations in the training set; Random Forests can limit this instability by averaging predictions over many trees
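
A minimal sketch of rotating the data with PCA before training, using a Scikit-Learn Pipeline (this cell is not in the original notes; the pipeline is just one way to apply the idea):

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

# PCA rotates the axes toward the directions of largest variance, so the tree's
# axis-aligned splits depend less on the original orientation of the data
pca_tree_clf = Pipeline([
    ("pca", PCA(n_components=2)),
    ("tree", DecisionTreeClassifier(max_depth=2, random_state=42)),
])
pca_tree_clf.fit(X, y)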

Reposted from: https://www.cnblogs.com/yaoz/p/6964180.html
