Chapter 8: Ensemble Learning

GeekDengshuo

License: CC 4.0 BY-SA

Category: Machine Learning. Tags: ensemble learning

Article link: https://blog.youkuaiyun.com/qq_37904945/article/details/80319641

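This post introduces the basic concepts of ensemble learning, including its two main approaches, Boosting and Bagging, and uses examples to show how to implement a random forest, a voting classifier, and AdaBoost with Python's scikit-learn library. Comparing different ensemble methods on the same dataset helps readers understand the advantages of ensemble learning.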

Ensemble learning

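Depending on how the individual learners are generated, ensemble methods fall into two broad categories:

Boosting: the individual learners depend strongly on one another and must be generated serially (a sequential method)
Bagging & Random Forest: the individual learners have no strong dependencies and can be generated simultaneously (a parallel method)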

Note: the definitions of, and the difference between, the loss function and the cost function
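As a reference for the note above, one common convention (an assumption here, since the post does not spell it out and usage varies between textbooks) is that the loss function scores a single prediction, while the cost function averages the loss over the whole training set. The symbols L, J, f_θ, and N are chosen purely for illustration:

```latex
L\bigl(y_i, f_\theta(x_i)\bigr) \quad \text{(loss on one sample)}, \qquad
J(\theta) = \frac{1}{N}\sum_{i=1}^{N} L\bigl(y_i, f_\theta(x_i)\bigr) \quad \text{(cost over the training set)}
```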

# How to use ensemble learning
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Generate the dataset and split it into training and test sets
X,y=make_moons(n_samples=500,noise=0.30,random_state=42)
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=42)

log_clf=LogisticRegression()
rnd_clf=RandomForestClassifier()
svm_clf=SVC()

voting_clf=VotingClassifier(estimators=[('lr',log_clf),('rf',rnd_clf),('svc',svm_clf)],
                           voting='hard')
voting_clf.fit(X_train,y_train)

from sklearn.metrics import accuracy_score
for clf in (log_clf,rnd_clf,svm_clf,voting_clf):
    clf.fit(X_train,y_train)
    y_pred=clf.predict(X_test)
    print(clf.__class__.__name__ ,accuracy_score(y_test,y_pred))
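To make explicit what `voting='hard'` does, here is a minimal sketch that reproduces the majority vote by hand with NumPy. It assumes binary 0/1 labels, as `make_moons` produces, and fixes `random_state=42` on the random forest only for repeatability; the names `all_preds` and `majority_pred` are illustrative and not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

classifiers = [
    LogisticRegression(),
    RandomForestClassifier(random_state=42),  # seed is an assumption for repeatability
    SVC(),
]

# Each row holds one classifier's 0/1 predictions on the test set.
all_preds = np.array(
    [clf.fit(X_train, y_train).predict(X_test) for clf in classifiers]
)

# Hard voting for binary labels: predict 1 when more than half of the
# classifiers predict 1, otherwise predict 0.
majority_pred = (all_preds.sum(axis=0) > len(classifiers) / 2).astype(int)

manual_vote_acc = accuracy_score(y_test, majority_pred)
print("manual hard vote", manual_vote_acc)
```

With an odd number of classifiers and two classes there are no ties, so no tie-breaking rule is needed in this sketch.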

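Exercise: implement the AdaBoost algorithm with unpruned decision trees as base learners, and train an AdaBoost ensemble on the watermelon dataset.

1. First, consult another reference (Hands-On Machine Learning with Scikit-Learn and TensorFlow)

Start with a ready-made bagging implementation

# Decision trees as base learners: full trees for bagging here, decision stumps (max_depth=1) for AdaBoost further down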

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf=BaggingClassifier(DecisionTreeClassifier(),n_estimators=500,max_samples=100,
                         bootstrap=True,n_jobs=-1)
bag_clf.fit(X_train,y_train)
y_pred=bag_clf.predict(X_test)
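The bagging snippet above stops at `predict`. As a hedged sketch of how the result could be checked, the following refits the same kind of ensemble with `oob_score=True` and compares the out-of-bag estimate with the test accuracy. The `random_state=42` seed and the omission of `n_jobs` are choices made for this sketch, not part of the original code.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Same settings as the snippet above, plus oob_score=True so that each tree
# is also evaluated on the training samples it did not draw.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    oob_score=True,
    random_state=42,  # assumption: fixed seed for repeatability
)
bag_clf.fit(X_train, y_train)

oob_acc = bag_clf.oob_score_  # accuracy estimated from out-of-bag samples
test_acc = accuracy_score(y_test, bag_clf.predict(X_test))
print("oob accuracy:", oob_acc, "test accuracy:", test_acc)
```

The out-of-bag estimate needs no separate validation set, which is why it is a convenient sanity check for bagging.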

from sklearn.ensemble import AdaBoostClassifier
# Note: newer scikit-learn releases deprecate algorithm="SAMME.R"; omit the argument there
ada_clf=AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),n_estimators=200,
                          algorithm="SAMME.R",learning_rate=0.5)
ada_clf.fit(X_train,y_train)
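To see how the AdaBoost ensemble improves as stumps are added, a sketch along these lines can track the test accuracy after every boosting round with `staged_score`. The `algorithm` argument is deliberately left at its default here, which is an assumption made so the sketch does not depend on the installed scikit-learn version; `random_state=42` is likewise added only for repeatability.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Decision stumps (max_depth=1) as base learners, as in the snippet above.
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)
ada_clf.fit(X_train, y_train)

# staged_score yields the test accuracy after each boosting round.
staged_acc = list(ada_clf.staged_score(X_test, y_test))
final_acc = ada_clf.score(X_test, y_test)
print("rounds:", len(staged_acc), "final accuracy:", final_acc)
```

Plotting `staged_acc` against the round index is one way to choose `n_estimators` without refitting the model for each candidate value.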

Scikit-Learn actually uses a multiclass version of AdaBoost called SAMME (which
stands for Stagewise Additive Modeling using a Multiclass Exponential loss function).
When there are just two classes, SAMME is equivalent to AdaBoost. Moreover, if the
predictors can estimate class probabilities (i.e., if they have a predict_proba()
method), Scikit-Learn can use a variant of SAMME called SAMME.R (the R stands
for “Real”), which relies on class probabilities rather than predictions and generally
performs better.

| algorithm : {'SAMME', 'SAMME.R'}, optional (default='SAMME.R')
| If 'SAMME.R' then use the SAMME.R real boosting algorithm.
| base_estimator must support calculation of class probabilities.
| If 'SAMME' then use the SAMME discrete boosting algorithm.
| The SAMME.R algorithm typically converges faster than SAMME,
| achieving a lower test error with fewer boosting iterations.
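For the exercise stated earlier, a from-scratch version of discrete AdaBoost for two classes might look like the sketch below. Several assumptions apply: the watermelon dataset is not included in this post, so `make_moons` stands in for it; decision stumps replace unpruned trees, because an unpruned tree typically fits the training set perfectly and leaves no weighted error to boost on; and `adaboost_fit` and `adaboost_predict` are hypothetical helper names, not scikit-learn functions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


def adaboost_fit(X, y, n_rounds=50):
    """Discrete AdaBoost for labels in {0, 1}; returns (stumps, alphas)."""
    y_signed = np.where(y == 1, 1, -1)       # work with labels in {-1, +1}
    weights = np.full(len(y), 1.0 / len(y))  # start from uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y_signed, sample_weight=weights)
        pred = stump.predict(X)
        err = weights[pred != y_signed].sum()  # weighted error (weights sum to 1)
        if err >= 0.5:                         # no better than chance: stop
            break
        err = max(err, 1e-10)                  # guard against log of infinity
        alpha = 0.5 * np.log((1.0 - err) / err)
        # Increase the weight of misclassified samples, decrease the rest.
        weights = weights * np.exp(-alpha * y_signed * pred)
        weights = weights / weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas


def adaboost_predict(stumps, alphas, X):
    """Sign of the alpha-weighted sum of stump votes, mapped back to {0, 1}."""
    scores = np.zeros(len(X))
    for stump, alpha in zip(stumps, alphas):
        scores += alpha * stump.predict(X)
    return (scores > 0).astype(int)


stumps, alphas = adaboost_fit(X_train, y_train, n_rounds=50)
scratch_acc = accuracy_score(y_test, adaboost_predict(stumps, alphas, X_test))
print("from-scratch AdaBoost accuracy:", scratch_acc)
```

Each round reweights the training samples so that the next stump concentrates on the points the current ensemble gets wrong, which is the serial dependency that distinguishes Boosting from Bagging.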

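Ensemble learning:
    An ensemble is essentially a collection of learners, so using one requires understanding both the base learners and how to combine them.
    For now I can only carry out ensemble learning through sklearn's packages, and my understanding of those packages is still shallow, so I will keep learning step by step.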