The open-source materials for this, the 23rd DataWhale team learning session, are available at: https://github.com/datawhalechina/team-learning-data-mining/tree/master/EnsembleLearning
After a brief look at bagging in the previous task, this task takes a first look at boosting. The tutorial provided by the organizers appears to draw on Chapter 8, "Boosting Methods", of Li Hang's *Statistical Learning Methods*. The book first introduces the Probably Approximately Correct (PAC) learning framework and states a key result: strong learnability and weak learnability are equivalent. This result motivates methods that boost a discovered "weak learning algorithm" into a "strong learning algorithm". AdaBoost cleverly combines two ideas: raising the weights of the samples misclassified by the previous round's weak learner, and combining the learners by weighted majority voting. The book also walks through a small worked example of the algorithm. Here, sklearn is used to build an AdaBoost model.
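Before turning to sklearn, the two ideas can be made concrete with a minimal NumPy sketch of discrete AdaBoost, following the update rules in Li Hang's Chapter 8. It assumes binary labels coded as -1/+1; `adaboost_fit` and `adaboost_predict` are illustrative names, not sklearn APIs:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Discrete AdaBoost sketch; y must be a NumPy array of -1/+1 labels."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # D_1: start from uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = max(w[pred != y].sum(), 1e-10)   # weighted error e_m (floored to avoid /0)
        if err >= 0.5:                         # no better than chance: stop boosting
            break
        alpha = 0.5 * np.log((1 - err) / err)  # classifier weight alpha_m
        w = w * np.exp(-alpha * y * pred)      # raise the weights of misclassified samples
        w = w / w.sum()                        # renormalize to a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # weighted majority vote: sign of the alpha-weighted sum of stump votes
    agg = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(agg)
```

Correctly classified samples have `y * pred = 1` and are down-weighted by `exp(-alpha)`, while misclassified ones are up-weighted by `exp(alpha)`, so the next round's stump must focus on the hard cases.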
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
import seaborn as sns
```
```python
# Load the Wine dataset; the example is adapted from "Python Machine Learning, 2nd Edition"
wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", header=None)
wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
                'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
print("Class labels", set(wine["Class label"]))
```
```
Class labels {1, 2, 3}
```
```python
wine.head()
```
| | Class label | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
```python
# Simplify the example: keep only class 2 and class 3 wines
wine = wine[wine['Class label'] != 1]
y = wine['Class label'].values
X = wine[['Alcohol', 'OD280/OD315 of diluted wines']].values

# Encode the class labels as 0/1
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
y
```
```
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
```
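The fitted encoder records the original labels, so the mapping can be checked directly; here class 2 should map to 0 and class 3 to 1:

```python
# le.classes_ lists the original labels in encoding order
print(le.classes_)   # expected: [2 3]
```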
```python
# Split into training and test sets at an 80/20 ratio
from sklearn.model_selection import train_test_split
# stratify=y samples proportionally within each class of y
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1, stratify=y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```
```
((95, 2), (24, 2), (95,), (24,))
```
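A quick sanity check (an addition to the original walkthrough) confirms that stratification kept the class proportions roughly equal across the two splits:

```python
# np.bincount counts the 0s and 1s in each split; the two ratios should be close
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))
```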
```python
# Fit a single decision tree (a depth-1 stump) as the baseline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
tree = DecisionTreeClassifier(criterion='entropy', random_state=1, max_depth=1)
tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f' % (tree_train, tree_test))
```
```
Decision tree train/test accuracies 0.916/0.875
```
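Because `max_depth=1`, the stump makes exactly one split. If you are curious which split it learned, the fitted tree's `tree_` attribute exposes it (node 0 is the root; feature index 0 is 'Alcohol', 1 is 'OD280/OD315 of diluted wines'):

```python
# Inspect the stump's single split: which feature index, and at what threshold
print(tree.tree_.feature[0], tree.tree_.threshold[0])
```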
```python
# Implement AdaBoost with sklearn, using the decision stump as the base classifier
'''
Relevant AdaBoostClassifier parameters:
base_estimator: base classifier, default DecisionTreeClassifier(max_depth=1)
                (note: renamed to `estimator` in scikit-learn 1.2)
n_estimators: number of boosting iterations
learning_rate: learning rate
algorithm: boosting algorithm, one of {'SAMME', 'SAMME.R'}, default 'SAMME.R'
random_state: random seed
'''
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(base_estimator=tree, n_estimators=500, learning_rate=0.1, random_state=1)
ada = ada.fit(X_train, y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)
ada_train = accuracy_score(y_train, y_train_pred)
ada_test = accuracy_score(y_test, y_test_pred)
print('Adaboost train/test accuracies %.3f/%.3f' % (ada_train, ada_test))
```
```
Adaboost train/test accuracies 1.000/0.917
```
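AdaBoost is built round by round, and sklearn's `staged_predict` exposes each intermediate ensemble. A short loop (an addition to the original walkthrough) shows how test accuracy evolves as weak learners are added:

```python
# staged_predict yields the ensemble's predictions after each boosting round
staged_acc = [accuracy_score(y_test, y_pred)
              for y_pred in ada.staged_predict(X_test)]
plt.plot(range(1, len(staged_acc) + 1), staged_acc)
plt.xlabel('Number of boosting rounds')
plt.ylabel('Test accuracy')
plt.show()
```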
```python
# Visualize the decision boundaries
x_min = X_train[:, 0].min() - 1
x_max = X_train[:, 0].max() + 1
y_min = X_train[:, 1].min() - 1
y_max = X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
f, axarr = plt.subplots(nrows=1, ncols=2, sharex='col', sharey='row', figsize=(12, 6))
for idx, clf, tt in zip([0, 1], [tree, ada], ['Decision tree', 'Adaboost']):
    clf.fit(X_train, y_train)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    axarr[idx].contourf(xx, yy, Z, alpha=0.3)
    axarr[idx].scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], c='blue', marker='^')
    axarr[idx].scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], c='red', marker='o')
    axarr[idx].set_title(tt)
# X[:, 0] ('Alcohol') is plotted on the x-axis and X[:, 1] ('OD280/OD315 of
# diluted wines') on the y-axis, so the labels are assigned accordingly
# (the original code had them swapped)
axarr[0].set_ylabel('OD280/OD315 of diluted wines', fontsize=12)
plt.tight_layout()
plt.text(0, -0.2, s='Alcohol', ha='center', va='center', fontsize=12, transform=axarr[1].transAxes)
plt.show()
```
![Decision boundaries of the decision stump (left) and AdaBoost (right)](https://i-blog.csdnimg.cn/blog_migrate/452bed9939148b773b0b190c8bb36e37.png#pic_center)
The plots show that AdaBoost's decision boundary is noticeably more complex. It is worth noting that, compared with a single classifier, boosting models such as AdaBoost increase computational cost; moreover, boosting cannot be trained in the now-popular parallel fashion, because each iteration must build on the base classifier from the previous step.
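This sequential dependence is visible in sklearn's API: bagging trains its base learners independently and therefore exposes an `n_jobs` parameter for parallel fitting, while `AdaBoostClassifier` has no such parameter. A small contrast, for illustration:

```python
# BaggingClassifier can fit its base learners in parallel (n_jobs=-1 uses all
# cores); AdaBoost cannot, since round m needs the sample weights from round m-1.
from sklearn.ensemble import BaggingClassifier
bag = BaggingClassifier(n_estimators=500, n_jobs=-1, random_state=1)
bag = bag.fit(X_train, y_train)
```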
This post walked through a hands-on application of the AdaBoost algorithm, comparing a single decision tree with AdaBoost on the Wine dataset and showing how AdaBoost improves on a weak learner through iteration.