I. ensemble
1. Overview:
This module implements "ensemble learning" methods.
2. Classification:
"自适应提升分类器"(AdaBoost classifier):class sklearn.ensemble.AdaBoostClassifier([base_estimator=None,n_estimators=50,learning_rate=1.0,algorithm='SAMME.R',random_state=None])
#参数说明:
base_estimator:指定基本估计器;为object
n_estimators:指定提升中使用的最大估计器数;为int
learning_rate:指定学习率;为float
#在n_estimators和learning_rate间存在1个权衡
algorithm:指定使用的提升算法;为"SAMME"/"SAMME.R"
random_state:指定使用的随机数;为int/RandomState instance/None
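#Example: a minimal usage sketch, not from the original notes (the toy dataset and parameter values are illustrative assumptions):
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Toy data; in practice X, y come from your own dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
# A lower learning_rate usually needs more estimators (the trade-off noted above)
clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))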
######################################################################################################################
"装袋分类器"(Bagging classifier)/"自举汇聚分类器"(Bootstrap aggregating classifier):class sklearn.ensemble.BaggingClassifier([base_estimator=None,n_estimators=10,max_samples=1.0,max_features=1.0,bootstrap=True,bootstrap_features=False,oob_score=False,warm_start=False,n_jobs=None,random_state=None,verbose=0])
#参数说明:其他参数同class sklearn.ensemble.AdaBoostClassifier()
max_samples,max_features:分别指定用于训练每个基本估计器的样本/特征数;均为int/float
bootstrap,bootstrap_features:分别指定是否通过替换来提取样本/特征;均为bool
oob_score:指定是否使用"包外样本"(out-of-bag samples)来估计泛化误差;为bool
warm_start:指定是否启用热启动;为bool
n_jobs:指定用于并行计算的任务数;为int
verbose:指定输出信息的冗余度;为int/bool
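#Example: a minimal sketch (toy data and values are assumptions; base_estimator follows the signature listed above, which newer scikit-learn releases rename to estimator):
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)
# Each KNN is fitted on a random 50% of the samples (drawn with replacement)
# and a random 50% of the features; out-of-bag samples estimate generalization
clf = BaggingClassifier(base_estimator=KNeighborsClassifier(),
                        n_estimators=20, max_samples=0.5, max_features=0.5,
                        oob_score=True, random_state=0)
clf.fit(X, y)
print(clf.oob_score_)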
######################################################################################################################
"极端随机树分类器"(Extremely Randomized Trees classifier/ExtRa-trees classifier):class sklearn.ensemble.ExtraTreesClassifier([n_estimators=100,criterion='gini',max_depth=None,min_samples_split=2,min_samples_leaf=1,min_weight_fraction_leaf=0.0,max_features='auto',max_leaf_nodes=None,min_impurity_decrease=0.0,min_impurity_split=None,bootstrap=False,oob_score=False,n_jobs=None,random_state=None,verbose=0,warm_start=False,class_weight=None,ccp_alpha=0.0,max_samples=None])
#参数说明:其他参数同class sklearn.ensemble.BaggingClassifier()
criterion:指定用于衡量拆分质量的标准;为"gini"/"entropy"
max_depth:指定树的最大深度;为int
min_samples_split:指定拆分内部节点所需的最小样本数;为int
#若属于某内部节点的样本少于该值,则停止拆分
min_samples_leaf:指定叶节点中的最小样本数;为int
#若属于某叶节点的样本少于该值,则不进行该拆分
min_weight_fraction_leaf:The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node;为float
max_features:指定使用的最大特征数;为"auto"/"sqrt"/"log2"/int/float
max_leaf_nodes:指定最大叶节点数;为int
min_impurity_decrease:指定继续拆分所需的最小改进;为float
#若改进小于该值,则停止拆分
min_impurity_split:指定继续拆分所需的最小损失;为float
#若当前节点的损失小于该值,则停止拆分
class_weight:指定各个特征的权重;为dict/dict list/"balanced"/"balanced_subsample"
ccp_alpha:指定用于"最小成本-复杂性剪枝"(Minimal Cost-Complexity Pruning)的"复杂性参数"(Complexity parameter);为float>=0
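#Example: a minimal sketch (toy data and parameter values are illustrative assumptions):
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=300, n_informative=5, weights=[0.8, 0.2], random_state=0)
# Depth/leaf constraints regularize the trees; class_weight="balanced" offsets the 80/20 class imbalance
clf = ExtraTreesClassifier(n_estimators=200, max_depth=8, min_samples_leaf=2,
                           class_weight="balanced", n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.feature_importances_)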
######################################################################################################################
"用于分类的梯度提升"(Gradient Boosting for classification):class sklearn.ensemble.GradientBoostingClassifier([loss='deviance',learning_rate=0.1,n_estimators=100,subsample=1.0,criterion='friedman_mse',min_samples_split=2,min_samples_leaf=1,min_weight_fraction_leaf=0.0,max_depth=3,min_impurity_decrease=0.0,min_impurity_split=None,init=None,random_state=None,max_features=None,verbose=0,max_leaf_nodes=None,warm_start=False,validation_fraction=0.1,n_iter_no_change=None,tol=0.0001,ccp_alpha=0.0])
#参数说明:其他参数同class sklearn.ensemble.ExtraTreesClassifier()
loss:指定损失函数;为"deviance"/"exponential"
learning_rate:指定学习率;为float
subsample:指定用于拟合每个基本估计器的样本比例;为float
criterion:指定用于衡量拆分质量的标准;为"friedman_mse"/"mse"/"mae"
init:指定初始预测;为estimator/"zero"
validation_fraction:指定作为提前终止时的验证集的比例;为0≤float≤1
n_iter_no_change:指定改进小于tol时的最大迭代次数;为int
tol:指定最小改进;为float
#若2次迭代间的损失改进小于该值,则停止
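#Example: a minimal sketch of early stopping (toy data and parameter values are assumptions):
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)
# 20% of the training data is held out; boosting stops once the held-out loss fails to
# improve by at least tol for 5 consecutive iterations
clf = GradientBoostingClassifier(n_estimators=500, learning_rate=0.1, subsample=0.8,
                                 validation_fraction=0.2, n_iter_no_change=5, tol=1e-4,
                                 random_state=0)
clf.fit(X, y)
print(clf.n_estimators_)  # boosting stages actually fitted (<= 500 due to early stopping)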
######################################################################################################################
"随机森林分类器"(random forest classifier):class sklearn.ensemble.RandomForestClassifier([n_estimators=100,criterion='gini',max_depth=None,min_samples_split=2,min_samples_leaf=1,min_weight_fraction_leaf=0.0,max_features='auto',max_leaf_nodes=None,min_impurity_decrease=0.0,min_impurity_split=None,bootstrap=True,oob_score=False,n_jobs=None,random_state=None,verbose=0,warm_start=False,class_weight=None,ccp_alpha=0.0,max_samples=None])
#参数说明:同class sklearn.ensemble.ExtraTreesClassifier()
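#Example: a minimal sketch (toy data and parameter values are assumptions):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)
# With bootstrap=True (the default), oob_score=True gives a built-in validation estimate
clf = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.oob_score_)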
######################################################################################################################
"带有最终分类器的估计器堆栈"(Stack of estimators with a final classifier):class sklearn.ensemble.StackingClassifier(<estimators>[,final_estimator=None,cv=None,stack_method='auto',n_jobs=None,passthrough=False,verbose=0])
#参数说明:其他参数同class sklearn.ensemble.BaggingClassifier()
estimators:指定基本估计器;为list,格式为(str,estimator)
final_estimator:指定用于结合基本估计器的分类器;为estimator
cv:指定交叉验证的拆分策略;为int/cross-validation generator/iterable/None
stack_method:指定为每个基本估计器调用的方法;为"auto"/"predict_proba"/"decision_function"/"predict"
passthrough:为False时,只使用估计器得到的预测值作为final_estimator的训练数据
为True时,也使用原始训练数据作为final_estimator的训练数据
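#Example: a minimal sketch (the choice of base/final estimators is an illustrative assumption):
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, random_state=0)
estimators = [("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
              ("svc", LinearSVC(random_state=0))]
# Cross-validated predictions of the base estimators (stack_method="auto") become the
# training data of the final LogisticRegression
clf = StackingClassifier(estimators=estimators,
                         final_estimator=LogisticRegression(), cv=5)
clf.fit(X, y)
print(clf.predict(X[:5]))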
######################################################################################################################
用于"unfitted estimators"的"软投票"(Soft Voting)/"多数规则分类器"(Majority Rule classifier):class sklearn.ensemble.VotingClassifier(<estimators>[,voting='hard',weights=None,n_jobs=None,flatten_transform=True,verbose=False])
#参数说明:其他参数同class sklearn.ensemble.StackingClassifier()
voting:指定投票策略;为"hard"/"soft"
#If "hard",uses predicted class labels for majority rule voting
#If "soft",predicts the class label based on the argmax of the sums of the predicted probabilities,which is recommended for an ensemble of well-calibrated classifiers
weights:指定各个类别的权重;为1×n_classifiers array-like
flatten_transform:指定是否压扁输出;为True(n_samples×n_classifiers*n_classes)/False(n_classifiers×n_samples×n_classes)
#仅当voting="soft"时生效
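#Example: a minimal sketch (the estimator mix and weights are illustrative assumptions):
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, random_state=0)
# voting="soft" averages predicted probabilities; the weights count the forest twice
clf = VotingClassifier(estimators=[("lr", LogisticRegression(max_iter=1000)),
                                   ("rf", RandomForestClassifier(random_state=0)),
                                   ("nb", GaussianNB())],
                       voting="soft", weights=[1, 2, 1])
clf.fit(X, y)
print(clf.predict(X[:5]))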
######################################################################################################################
"基于直方图的梯度提升分类树"(Histogram-based Gradient Boosting Classification Tree):class sklearn.ensemble.HistGradientBoostingClassifier([loss='auto',learning_rate=0.1,max_iter=100,max_leaf_nodes=31,max_depth=None,min_samples_leaf=20,l2_regularization=0.0,max_bins=255,categorical_features=None,monotonic_cst=None,warm_start=False,early_stopping='auto',scoring='loss',validation_fraction=0.1,n_iter_no_change=10,tol=1e-07,verbose=0,random_state=None])
#参数说明:其他参数同class sklearn.ensemble.GradientBoostingClassifier()
loss:指定损失函数;为"auto"/"binary_crossentropy"/"categorical_crossentropy"
max_iter:指定提升过程的最大迭代次数;为int
l2_regularization:指定L2正则化惩罚项的系数;为float
max_bins:指定用于非缺失值的最大bin数;为int
categorical_features:指定所有类别型的特征;为int/bool 1×n_features/1×n_categorical_features array-like
monotonic_cst:指定对各个特征的"单调约束"(monotonic constraint);为1×n_features array-like,仅包含0(无约束)/1(正约束)/-1(负约束)
early_stopping:指定是否启用"早停法"(early stopping);为"auto"/bool
scoring:指定用于早停法的评分策略;为str/callable/None
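#Example: a minimal sketch (synthetic data and the constraint values are assumptions; on older scikit-learn releases this estimator is experimental and needs "from sklearn.experimental import enable_hist_gradient_boosting" first):
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 4))
X[::10, 0] = np.nan                       # missing values are handled natively
y = (X[:, 1] + X[:, 2] > 0).astype(int)
# monotonic_cst: prediction forced to be non-decreasing in feature 1, unconstrained elsewhere
clf = HistGradientBoostingClassifier(max_iter=200, l2_regularization=1.0,
                                     monotonic_cst=[0, 1, 0, 0],
                                     early_stopping=True)
clf.fit(X, y)
print(clf.n_iter_)                        # iterations actually run (early stopping may cut it short)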
3. Regression:
AdaBoost regressor: class sklearn.ensemble.AdaBoostRegressor([base_estimator=None,n_estimators=50,learning_rate=1.0,loss='linear',random_state=None])
#Parameters: the other parameters are the same as in class sklearn.ensemble.AdaBoostClassifier()
loss: the loss function used when updating the sample weights; "linear"/"square"/"exponential"
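#Example: a minimal sketch (toy data and parameter values are assumptions):
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
# loss controls how the prediction errors are weighted when the sample weights are updated
reg = AdaBoostRegressor(n_estimators=100, learning_rate=0.5, loss="square", random_state=0)
reg.fit(X, y)
print(reg.predict(X[:3]))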
######################################################################################################################
"装袋回归器"(Bagging regressor)/"自举汇聚回归器"(Bootstrap aggregating regressor):class sklearn.ensemble.BaggingRegressor([base_estimator=None,n_estimators=10,max_samples=1.0,max_features=1.0,bootstrap=True,bootstrap_features=False,oob_score=False,warm_start=False,n_jobs=None,random_state=None,verbose=0])
#参数说明:同class sklearn.ensemble.BaggingClassifier()
######################################################################################################################
"极端随机树回归器"(Extremely Randomized Trees regressor/ExtRa-trees regressor):class sklearn.ensemble.ExtraTreesRegressor([n_estimators=100,criterion='mse',max_depth=None,min_samples_split=2,min_samples_leaf=1,min_weight_fraction_leaf=0.0,max_features='auto',max_leaf_nodes=None,min_impurity_decrease=0.0,min_impurity_split=None,bootstrap=False,oob_score=False,n_jobs=None,random_state=None,verbose=0,warm_start=False,ccp_alpha=0.0,max_samples=None])
#参数说明:其他参数同class sklearn.ensemble.ExtraTreesClassifier()
criterion:指定用于衡量拆分质量的标准;为"mse"/"mae"
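#Example: a minimal sketch (toy data and values are assumptions; the criterion names follow the signature above, which newer scikit-learn releases rename to "squared_error"/"absolute_error"):
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
# "mae" is slower than the default "mse" but less sensitive to outliers
reg = ExtraTreesRegressor(n_estimators=100, criterion="mae", n_jobs=-1, random_state=0)
reg.fit(X, y)
print(reg.score(X, y))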
######################################################################################################################
"用于回归的梯度提升"(Gradient Boosting for regression):class sklearn.ensemble.GradientBoostingRegressor([loss='ls',learning_rate=0.1,n_estimators=100,subsample=1.0,criterion='friedman_mse',min_samples_split=2,min_samples_leaf=1,min_weight_fraction_leaf=0.0,max_depth=3,min_impurity_decrease=0.0,min_impurity_split=None,init=None,random_state=None,max_features=None,alpha=0.9,verbose=0,max_leaf_nodes=None,warm_start=False,validation_fraction=0.1,n_iter_no_change=None,tol=0.0001,ccp_alpha=0.0])
#参数说明:其他参数同class sklearn.ensemble.GradientBoostingClassifier()
loss:指定损失函数;为"ls"/"lad"/"huber"/"quantile"
alpha:指定huber/quantile loss function的分位数;为float
#仅当loss="huber"/"quantile"时有效
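#Example: a minimal sketch of quantile regression (toy data and values are assumptions):
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
# loss="quantile" with alpha=0.9 fits an estimate of the 90th percentile of y given X
reg = GradientBoostingRegressor(loss="quantile", alpha=0.9, n_estimators=200, random_state=0)
reg.fit(X, y)
print(reg.predict(X[:3]))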
######################################################################################################################
"随机森林回归器"(random forest regressor):class sklearn.ensemble.RandomForestRegressor([n_estimators=100,criterion='mse',max_depth=None,min_samples_split=2,min_samples_leaf=1,min_weight_fraction_leaf=0.0,max_features='auto',max_leaf_nodes=None,min_impurity_decrease=0.0,min_impurity_split=None,bootstrap=True,oob_score=False,n_jobs=None,random_state=None,verbose=0,warm_start=False,ccp_alpha=0.0,max_samples=None])
#参数说明:同class sklearn.ensemble.ExtraTreesRegressor()
######################################################################################################################
"带有最终回归器的估计器堆栈"(Stack of estimators with a final regressor):class sklearn.ensemble.StackingRegressor(<estimators>[,final_estimator=None,cv=None,n_jobs=None,passthrough=False,verbose=0])
#参数说明:其他参数同class sklearn.ensemble.StackingClassifier()
final_estimator:指定用于结合基本估计器的回归器;为estimator
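#Example: a minimal sketch (the choice of base/final regressors is an illustrative assumption):
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge, LinearRegression

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
estimators = [("ridge", Ridge()),
              ("rf", RandomForestRegressor(n_estimators=50, random_state=0))]
# Cross-validated predictions of the base regressors train the final LinearRegression
reg = StackingRegressor(estimators=estimators, final_estimator=LinearRegression(), cv=5)
reg.fit(X, y)
print(reg.score(X, y))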
######################################################################################################################
用于"unfitted estimators"的"预测投票回归器"(Prediction voting regressor):class sklearn.ensemble.VotingRegressor(<estimators>[,weights=None,n_jobs=None,verbose=False])
#参数说明:其他参数同class sklearn.ensemble.VotingClassifier()
weights:指定各预测值的权重;为1×n_regressors array-like
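#Example: a minimal sketch (the regressor mix and weights are illustrative assumptions):
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
# The final prediction is the weighted average of the individual regressors' predictions
reg = VotingRegressor(estimators=[("gb", GradientBoostingRegressor(random_state=0)),
                                  ("rf", RandomForestRegressor(random_state=0)),
                                  ("lr", LinearRegression())],
                      weights=[2, 2, 1])
reg.fit(X, y)
print(reg.predict(X[:3]))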
######################################################################################################################
"基于直方图的梯度提升回归树"(Histogram-based Gradient Boosting Regression Tree):class sklearn.ensemble.HistGradientBoostingRegressor([loss='least_squares',learning_rate=0.1,max_iter=100,max_leaf_nodes=31,max_depth=None,min_samples_leaf=20,l2_regularization=0.0,max_bins=255,categorical_features=None,monotonic_cst=None,warm_start=False,early_stopping='auto',scoring='loss',validation_fraction=0.1,n_iter_no_change=10,tol=1e-07,verbose=0,random_state=None])
#参数说明:其他参数同class sklearn.ensemble.HistGradientBoostingClassifier()
loss:指定损失函数;为"least_squares"/"least_absolute_deviation"/"poisson"
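#Example: a minimal sketch (toy data and values are assumptions; older scikit-learn releases need "from sklearn.experimental import enable_hist_gradient_boosting" first):
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)
# early_stopping=True holds out validation_fraction of the data and stops once the score
# stops improving by more than tol for n_iter_no_change consecutive iterations
reg = HistGradientBoostingRegressor(max_iter=300, learning_rate=0.05,
                                    l2_regularization=0.5, early_stopping=True,
                                    random_state=0)
reg.fit(X, y)
print(reg.n_iter_)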
4. Other
(1) Anomaly detection:
Isolation Forest algorithm: class sklearn.ensemble.IsolationForest([n_estimators=100,max_samples='auto',contamination='auto',max_features=1.0,bootstrap=False,n_jobs=None,random_state=None,verbose=0,warm_start=False])
#Parameters: the other parameters are the same as in class sklearn.ensemble.BaggingClassifier()
contamination: the expected proportion of outliers in the data set; "auto"/float in (0,0.5]
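#Example: a minimal sketch (the synthetic inliers/outliers and the contamination value are assumptions):
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),     # inliers
               rng.uniform(-6, 6, size=(10, 2))])   # a few scattered outliers
# contamination sets the expected share of outliers and hence the decision threshold
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = iso.fit_predict(X)                          # +1 for inliers, -1 for outliers
print(int((labels == -1).sum()))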
(2) Data transformation:
An ensemble of totally random trees: class sklearn.ensemble.RandomTreesEmbedding([n_estimators=100,max_depth=5,min_samples_split=2,min_samples_leaf=1,min_weight_fraction_leaf=0.0,max_leaf_nodes=None,min_impurity_decrease=0.0,min_impurity_split=None,sparse_output=True,n_jobs=None,random_state=None,verbose=0,warm_start=False])
#An unsupervised transformation of a dataset to a high-dimensional sparse representation
#Parameters: the other parameters are the same as in class sklearn.ensemble.ExtraTreesClassifier()
sparse_output: whether to return a sparse CSR matrix (instead of a dense array); bool
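#Example: a minimal sketch (toy data and parameter values are assumptions):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomTreesEmbedding

X, _ = make_classification(n_samples=200, n_features=6, random_state=0)
# Each sample is encoded by the leaves it falls into, giving a sparse one-hot representation
embedder = RandomTreesEmbedding(n_estimators=50, max_depth=3, sparse_output=True, random_state=0)
X_transformed = embedder.fit_transform(X)
print(X_transformed.shape)   # (n_samples, total number of leaves across all trees)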