Scikit-learn Cookbook (Part 2) --- Classifying Data with scikit-learn

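This article covers basic classification with decision trees and random forests: creating datasets, fitting models, evaluating accuracy, and tuning model parameters, with worked examples of each step.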

I. Doing Basic Classifications with Decision Trees

1. Create a classification dataset

from sklearn import datasets
X,y =datasets.make_classification(n_samples=1000,n_features=3,n_redundant=0)

2. Import the object and then fit the model

from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier()
dt.fit(X,y)
# Out[47]: 
# DecisionTreeClassifier(class_weight=None, criterion='gini', 
# max_depth=None,max_features=None, max_leaf_nodes=None, 
# min_samples_leaf=1,min_samples_split=2, 
# min_weight_fraction_leaf=0.0,presort=False, random_state=None, 
# splitter='best')
preds = dt.predict(X)
(y==preds).mean()
# Out[49]: 1.0
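
The accuracy of 1.0 above is measured on the same rows the tree was fitted on, so it mainly shows that an unconstrained tree can memorize its training set. A minimal sketch of a held-out check follows; `train_test_split`, the 25% test size, and `random_state=0` are assumptions added here and do not appear in the original.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same kind of dataset as above; random_state is an assumption added for repeatability
X, y = datasets.make_classification(
    n_samples=1000, n_features=3, n_redundant=0, random_state=0
)

# Hold out 25% of the rows so the score reflects unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

dt = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = dt.score(X_train, y_train)  # accuracy on rows the tree has seen
test_acc = dt.score(X_test, y_test)     # accuracy on held-out rows
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

A gap between the two numbers is the usual sign that the tree is overfitting.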

3. Look at the accuracy at different values of max_depth

import numpy as np

n_features = 200
X, y = datasets.make_classification(750, n_features, n_informative=5)
# Random boolean mask: roughly 75% of rows for training, the rest held out
training = np.random.choice([True, False], p=[.75, .25], size=len(y))

accuracies = []  # held-out accuracy for each max_depth
for x in np.arange(1, n_features + 1):
    dt = DecisionTreeClassifier(max_depth=x)
    dt.fit(X[training], y[training])
    preds = dt.predict(X[~training])
    accuracies.append((preds == y[~training]).mean())

import matplotlib.pyplot as plt
f,ax = plt.subplots(figsize=(7,5))
ax.plot(range(1,n_features+1),accuracies,color='k')
ax.set_title("Decision Tree Accuracy")
ax.set_ylabel("% Correct")
ax.set_xlabel("Max Depth")

From the above graph, we can see that we can actually get pretty good accuracy at a low max depth. Let’s take a closer look at the accuracy at low levels, say the first 15:

N=15
import matplotlib.pyplot as plt
f,ax = plt.subplots(figsize=(7,5))
ax.plot(range(1,n_features+1)[:N],accuracies[:N],color='k')
ax.set_title("Decision Tree Accuracy")
ax.set_ylabel("% Correct")
ax.set_xlabel("Max Depth")
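
Reading the best depth off a plot built from a single random split can be noisy. As a hedged alternative, the sketch below scores each of the first 15 depths with 5-fold cross-validation and picks the depth with the highest mean score. `cross_val_score`, `cv=5`, and `random_state=0` are assumptions introduced for illustration, not part of the original recipe.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Same shape of data as above; random_state is an added assumption
X, y = datasets.make_classification(
    n_samples=750, n_features=200, n_informative=5, random_state=0
)

depths = range(1, 16)
mean_scores = []
for d in depths:
    model = DecisionTreeClassifier(max_depth=d, random_state=0)
    # Mean accuracy over 5 folds for this depth
    mean_scores.append(cross_val_score(model, X, y, cv=5).mean())

best_depth = depths[int(np.argmax(mean_scores))]
print("best max_depth:", best_depth)
```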

4. Tuning a Decision Tree Model

from sklearn import datasets,tree
X, y = datasets.make_classification(1000,20,n_informative =3)
dt = tree.DecisionTreeClassifier()
dt.fit(X,y)

from io import StringIO
import pydotplus

str_buffer = StringIO()
tree.export_graphviz(dt,out_file=str_buffer)
graph = pydotplus.graph_from_dot_data(str_buffer.getvalue())
graph.write_jpeg("myfile.jpg")
graph.write_pdf("myfile.pdf")

As a function:

dt = tree.DecisionTreeClassifier(max_depth=5).fit(X,y)
def plot_dt(model, filename):
    str_buffer = StringIO()
    tree.export_graphviz(model,out_file=str_buffer)
    graph = pydotplus.graph_from_dot_data(str_buffer.getvalue())
    graph.write_png(filename)  # write PNG so the output matches the .png filename used below
plot_dt(dt,"myfile.png")   
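
When Graphviz or pydotplus is not installed, the fitted tree can still be inspected as plain text. The sketch below uses scikit-learn's `tree.export_text`; the shallow `max_depth=2`, the `f0`..`f19` feature names, and `random_state=0` are hypothetical choices made only to keep the printout short.

```python
from sklearn import datasets, tree

X, y = datasets.make_classification(
    n_samples=1000, n_features=20, n_informative=3, random_state=0
)
dt = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Hypothetical feature names, used only to make the rules readable
feature_names = [f"f{i}" for i in range(X.shape[1])]

# Render the split rules as an indented text tree
rules = tree.export_text(dt, feature_names=feature_names)
print(rules)
```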

5. Using many Decision Trees – random forests

from sklearn import datasets
X,y = datasets.make_classification(1000)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X, y)

# To see how well we fit the training data
print("Accuracy:\t",(y == rf.predict(X)).mean())
print("Total Correct:\t",(y == rf.predict(X)).sum())
#Accuracy:        0.996
#Total Correct:   996
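
The figures above describe the fit to the training data only. Random forests offer a cheap estimate of generalization without a separate test set: each tree can be scored on the rows left out of its bootstrap sample. A minimal sketch, assuming `n_estimators=100`, `oob_score=True`, and `random_state=0` (none of which are set in the original):

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

X, y = datasets.make_classification(n_samples=1000, random_state=0)

# oob_score=True asks the forest to score each row using only the trees
# whose bootstrap sample did not contain that row
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

train_acc = rf.score(X, y)  # accuracy on the training rows
oob_acc = rf.oob_score_     # out-of-bag accuracy estimate
print("Training accuracy:\t", train_acc)
print("Out-of-bag accuracy:\t", oob_acc)
```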

Predicting class probabilities

probs = rf.predict_proba(X)
import pandas as pd
probs_df = pd.DataFrame(probs,columns=['0','1'])
probs_df['was_correct'] = rf.predict(X) == y

import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(7,5))
probs_df.groupby('0').was_correct.mean().plot(kind='bar',ax=ax)
ax.set_title("Accuracy at 0 class probability")
ax.set_ylabel("% Correct")
ax.set_xlabel("% trees for 0")
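
The "% trees for 0" label reflects how the forest arrives at its probabilities: scikit-learn averages the class probabilities reported by the individual trees, and with fully grown trees each tree typically reports 0 or 1, so the average behaves like a vote share. The sketch below checks that relationship directly; `n_estimators=10` and `random_state=0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

X, y = datasets.make_classification(n_samples=1000, random_state=0)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

forest_probs = rf.predict_proba(X)

# Average the class probabilities reported by each individual tree
tree_probs = np.mean([t.predict_proba(X) for t in rf.estimators_], axis=0)

print("forest matches tree average:", np.allclose(forest_probs, tree_probs))
```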

Viewing the distribution of feature importances

rf = RandomForestClassifier()
rf.fit(X, y)
f, ax = plt.subplots(figsize=(7,5))
ax.bar(range(len(rf.feature_importances_)),rf.feature_importances_)
ax.set_title("Feature Importances")
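
A bar chart shows the shape of the importances, but it is often handy to list the top features by index as well. The sketch below ranks them with `np.argsort`; `shuffle=False` (which keeps the informative and redundant columns first), `n_estimators=100`, and `random_state=0` are assumptions added so the result is easier to interpret.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

# shuffle=False places the informative and redundant columns first (an added assumption)
X, y = datasets.make_classification(
    n_samples=1000, n_features=20, n_informative=2, n_redundant=2,
    shuffle=False, random_state=0,
)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = rf.feature_importances_   # normalized so that they sum to 1
ranked = np.argsort(importances)[::-1]  # feature indices, most important first

for idx in ranked[:5]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

With this column layout one would expect the first four columns to rank near the top, although that is not guaranteed for every random seed.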

6. Tuning a random forest model

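# Choices for max_features (the maximum number of features considered per split)
import numpy as np
from sklearn.metrics import confusion_matrix

# Rebuild the train/test mask so it matches the current 1000-row dataset
training = np.random.choice([True, False], p=[.75, .25], size=len(y))

# 'auto' is omitted: recent scikit-learn releases no longer accept it
# (for classifiers it behaved like 'sqrt')
max_feature_params = ['sqrt', 'log2', .01, .5, .99]
confusion_matrixes = {}
for max_feature in max_feature_params:
    rf = RandomForestClassifier(max_features=max_feature)
    rf.fit(X[training], y[training])
    y_pred = rf.predict(X[~training])
    confusion_matrixes[max_feature] = confusion_matrix(y[~training], y_pred).ravel()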

import pandas as pd
confusion_df = pd.DataFrame(confusion_matrixes)
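
Each column of `confusion_df` holds one flattened 2x2 confusion matrix, so it helps to know the order `ravel()` produces. In scikit-learn, rows of the matrix are the actual classes and columns are the predicted classes, which for binary labels gives the order TN, FP, FN, TP. A small sketch with made-up labels (the `y_true` and `y_pred` lists are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen only to make the counts easy to verify by hand
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class
tn, fp, fn, tp = cm.ravel()            # row-major flattening of the 2x2 matrix

print(cm)
print("TN, FP, FN, TP =", tn, fp, fn, tp)  # TN, FP, FN, TP = 1 1 1 2
```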

import itertools
from matplotlib import pyplot as plt
f,ax = plt.subplots(figsize = (7,5))
confusion_df.plot(kind='bar',ax=ax)
ax.legend(loc='best')
ax.set_title("Actual vs Guessed (i,j) where i is the actual and j is the guess")
ax.grid()
ax.set_xticklabels([str((i,j)) for i,j in itertools.product(range(2),range(2))])
ax.set_xlabel("Actual vs Guessed")
ax.set_ylabel("Count")

# Choices of n_estimators  -- The number of the trees in the forest
accuracy = lambda x : np.trace(x)/np.sum(x,dtype=float)

n_estimator_params = range(1,20)
confusion_matrixes = {}
for n_estimator in n_estimator_params:
    rf = RandomForestClassifier(n_estimators = n_estimator)
    rf.fit(X[training],y[training])
    confusion_matrixes[n_estimator] = confusion_matrix(y[~training],rf.predict(X[~training]))
    confusion_matrixes[n_estimator] = accuracy(confusion_matrixes[n_estimator])

accuracy_series = pd.Series(confusion_matrixes)

from matplotlib import pyplot as plt
f, ax = plt.subplots(figsize=(7,5))
accuracy_series.plot(kind='bar',ax=ax,color='k',alpha=.75)

ax.grid()
ax.set_title("Accuracy by Number of Estimators")
ax.set_ylim(0,1)
ax.set_ylabel("Accuracy")
ax.set_xlabel("Number of estimators")
plt.show()
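
The two loops above tune `max_features` and `n_estimators` one at a time against a single random split. A hedged alternative is to search both together with cross-validation. The sketch below uses scikit-learn's `GridSearchCV`; the small parameter grid, `cv=3`, and `random_state=0` are assumptions picked to keep the run short, not recommended values.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = datasets.make_classification(n_samples=1000, random_state=0)

# Illustrative grid only; widen it for a real search
param_grid = {
    "n_estimators": [5, 10, 20],
    "max_features": ["sqrt", "log2", 0.5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print("best parameters:", best_params)
print("best cross-validated accuracy:", best_score)
```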
