Scikit-learn Cookbook (2) --- Classifying Data with scikit-learn

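This article covers basic classification with decision trees and random forests: creating a dataset, fitting a model, evaluating accuracy, and tuning model parameters, with worked examples showing how to apply these algorithms to classification tasks.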


I. Doing Basic Classifications with Decision Trees

1. Create the dataset (make classification datasets)

from sklearn import datasets
X,y =datasets.make_classification(n_samples=1000,n_features=3,n_redundant=0)

2. Import the object and then fit the model

from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier()
dt.fit(X,y)
# Out[47]: 
# DecisionTreeClassifier(class_weight=None, criterion='gini', 
# max_depth=None,max_features=None, max_leaf_nodes=None, 
# min_samples_leaf=1,min_samples_split=2, 
# min_weight_fraction_leaf=0.0,presort=False, random_state=None, 
# splitter='best')
preds = dt.predict(X)
(y==preds).mean()
# Out[49]: 1.0
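
A training accuracy of 1.0 mostly shows that an unconstrained tree can memorize the data it was fit on. A minimal sketch of measuring accuracy on held-out data instead, assuming `train_test_split` and a fixed `random_state=0` (neither appears in the original recipe and both are only illustrative choices):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same kind of synthetic dataset as above; random_state is an assumption for repeatability.
X, y = datasets.make_classification(
    n_samples=1000, n_features=3, n_redundant=0, random_state=0
)

# Hold out 25% of the rows so the tree is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

dt = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = (dt.predict(X_train) == y_train).mean()
test_acc = (dt.predict(X_test) == y_test).mean()
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The gap between the two numbers gives a rough idea of how much the tree overfits.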

3. Look at different accuracies with different max_depth values

import numpy as np
n_features = 200
X,y = datasets.make_classification(750,n_features,n_informative=5)
# Random boolean mask: roughly 75% of the rows go to training, the rest to testing
training = np.random.choice([True,False],p=[.75,.25],size=len(y))
accuracies = []  # test-set accuracy for each max_depth
for x in np.arange(1,n_features+1):
    dt = DecisionTreeClassifier(max_depth=x)
    dt.fit(X[training],y[training])
    preds = dt.predict(X[~training])
    accuracies.append((preds == y[~training]).mean())

import matplotlib.pyplot as plt
f,ax = plt.subplots(figsize=(7,5))
ax.plot(range(1,n_features+1),accuracies,color='k')
ax.set_title("Decision Tree Accuracy")
ax.set_ylabel("% Correct")
ax.set_xlabel("Max Depth")

[Figure: Decision Tree accuracy versus max depth]

From the above graph, we can see that we can actually get pretty good accuracy at a low max depth. Let's take a closer look at the accuracy at low levels, say the first 15:

N=15
import matplotlib.pyplot as plt
f,ax = plt.subplots(figsize=(7,5))
ax.plot(range(1,n_features+1)[:N],accuracies[:N],color='k')
ax.set_title("Decision Tree Accuracy")
ax.set_ylabel("% Correct")
ax.set_xlabel("Max Depth")

[Figure: Decision Tree accuracy for the first 15 max depth values]
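
Picking a depth by eye from the plot can also be done programmatically. The sketch below is one way to do that, assuming 5-fold cross-validation through `cross_val_score` in place of the single random split used above, and a fixed `random_state=0`; both are illustrative assumptions rather than part of the original recipe.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Same dataset shape as the recipe; random_state is an assumption for repeatability.
X, y = datasets.make_classification(750, 200, n_informative=5, random_state=0)

depths = np.arange(1, 16)  # only the low depths examined in the plot
mean_scores = []
for depth in depths:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Mean accuracy over 5 folds is less noisy than a single train/test split.
    mean_scores.append(cross_val_score(dt, X, y, cv=5).mean())

best_depth = int(depths[int(np.argmax(mean_scores))])
print("best max_depth:", best_depth)
print("cross-validated accuracy:", max(mean_scores))
```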

4. Tuning a Decision Tree Model

from sklearn import datasets,tree
X, y = datasets.make_classification(1000,20,n_informative =3)
dt = tree.DecisionTreeClassifier()
dt.fit(X,y)

from io import StringIO
import pydotplus

str_buffer = StringIO()
tree.export_graphviz(dt,out_file=str_buffer)
graph = pydotplus.graph_from_dot_data(str_buffer.getvalue())
graph.write_jpeg("myfile.jpg")
graph.write_pdf("myfile.pdf")

[Figure: the exported decision tree]

As a function:

dt = tree.DecisionTreeClassifier(max_depth=5).fit(X,y)
def plot_dt(model, filename):
    str_buffer = StringIO()
    tree.export_graphviz(model,out_file=str_buffer)
    graph = pydotplus.graph_from_dot_data(str_buffer.getvalue())
    graph.write_png(filename)  # write PNG so the output matches the .png filename
plot_dt(dt,"myfile.png")

[Figure: the exported decision tree with max_depth=5]

5. Using many Decision Trees – random forests

from sklearn import datasets
X,y = datasets.make_classification(1000)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X, y)

# To see how well we fit the training data
print("Accuracy:\t",(y == rf.predict(X)).mean())
print("Total Correct:\t",(y == rf.predict(X)).sum())
#Accuracy:        0.996
#Total Correct:   996

Predicting class probabilities

probs = rf.predict_proba(X)
import pandas as pd
probs_df = pd.DataFrame(probs,columns=['0','1'])
probs_df['was_correct'] = rf.predict(X) == y

import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(7,5))
probs_df.groupby('0').was_correct.mean().plot(kind='bar',ax=ax)
ax.set_title("Accuracy at 0 class probability")
ax.set_ylabel("% Correct")
ax.set_xlabel("% trees for 0")

[Figure: accuracy at each class 0 probability]
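
The link between `predict_proba` and `predict` can be checked directly. The sketch below relies on scikit-learn's documented behavior, where a forest's `predict` returns the class with the highest mean predicted probability across the trees; the fixed `random_state=0` is an assumption added for repeatability.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

X, y = datasets.make_classification(1000, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

# One row per sample, one column per class; columns follow rf.classes_.
probs = rf.predict_proba(X)

# Taking the most probable class by hand should reproduce rf.predict.
manual_preds = rf.classes_[np.argmax(probs, axis=1)]
agree = bool((manual_preds == rf.predict(X)).all())

print("rows sum to 1:", np.allclose(probs.sum(axis=1), 1.0))
print("argmax matches predict:", agree)
```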
Examining the feature importance distribution

rf = RandomForestClassifier()
rf.fit(X, y)
f, ax = plt.subplots(figsize=(7,5))
ax.bar(range(len(rf.feature_importances_)),rf.feature_importances_)
ax.set_title("Feature Importances")

[Figure: feature importances]
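
A bar chart shows the overall shape, but it is often handy to list the top features by name or index. A small sketch, assuming a fixed `random_state=0` and an arbitrary cutoff of five features (both are illustrative choices):

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

X, y = datasets.make_classification(1000, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

importances = rf.feature_importances_
# Feature indices sorted from most to least important.
order = np.argsort(importances)[::-1]

for idx in order[:5]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

With the default `make_classification` settings only a few of the 20 features are informative, so the importances typically concentrate on a small number of columns.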

6. Tuning a random forest model

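# Choices for max_features -- the maximum number of features considered at each split
from sklearn.metrics import confusion_matrix
# Rebuild the train/test mask: X now has 1000 rows, so the earlier 750-row mask no longer fits
training = np.random.choice([True,False],p=[.75,.25],size=len(y))
# 'auto' is omitted: recent scikit-learn releases no longer accept it (for classifiers it meant 'sqrt')
max_feature_params = ['sqrt','log2',.01,.5,.99]
confusion_matrixes = {}
for max_feature in max_feature_params:
    rf = RandomForestClassifier(max_features = max_feature)
    rf.fit(X[training],y[training])
    y_pred = rf.predict(X[~training])
    confusion_matrixes[max_feature]=confusion_matrix(y[~training],y_pred).ravel()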

import pandas as pd
confusion_df = pd.DataFrame(confusion_matrixes)

[Output: the confusion_df table]

import itertools
from matplotlib import pyplot as plt
f,ax = plt.subplots(figsize = (7,5))
confusion_df.plot(kind='bar',ax=ax)
ax.legend(loc='best')
# confusion_matrix(y_true, y_pred) puts actual classes on rows and guesses on columns
ax.set_title("Actual vs Guessed (i,j) where i is the actual and j is the guess")
ax.grid()
ax.set_xticklabels([str((i,j)) for i,j in list(itertools.product(range(2),range(2)))])
ax.set_xlabel("Actual vs Guessed")
ax.set_ylabel("Count")

[Figure: confusion matrix counts for each max_features choice]

# Choices of n_estimators -- the number of trees in the forest
# Accuracy = correct predictions (the diagonal) divided by all predictions
accuracy = lambda x : np.trace(x)/np.sum(x,dtype=float)

n_estimator_params = range(1,20)
accuracy_by_n = {}
for n_estimator in n_estimator_params:
    rf = RandomForestClassifier(n_estimators = n_estimator)
    rf.fit(X[training],y[training])
    cm = confusion_matrix(y[~training],rf.predict(X[~training]))
    accuracy_by_n[n_estimator] = accuracy(cm)

accuracy_series = pd.Series(accuracy_by_n)

[Output: the accuracy_series values]

from matplotlib import pyplot as plt
f, ax = plt.subplots(figsize=(7,5))
accuracy_series.plot(kind='bar',ax=ax,color='k',alpha=.75)

ax.grid()
ax.set_title("Accuracy by Number of Estimators")
ax.set_ylim(0,1)
ax.set_ylabel("Accuracy")
ax.set_xlabel("Number of estimators")
plt.show()

[Figure: accuracy by number of estimators]
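
The two manual loops above tune one parameter at a time on a single split. As an alternative sketch, the same search can be expressed with `GridSearchCV`, which swaps the hand-written loops for cross-validated search over both parameters at once. The grid values, `cv=3`, and `random_state=0` are assumptions chosen to keep the example quick, not values from the original recipe.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = datasets.make_classification(1000, random_state=0)

# Small illustrative grid: 3 x 3 = 9 parameter combinations.
param_grid = {
    "max_features": ["sqrt", "log2", 0.5],
    "n_estimators": [5, 10, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

Because every combination is scored on several folds, the result is less sensitive to one lucky or unlucky split than the single-mask loops.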
