Ensemble learning
In engineering practice, the following approaches are commonly used to make a trained model more reliable:
- Train several classifiers and let them vote, with the majority deciding the class; alternatively, have each classifier output a score, average the scores, and compare the average against a threshold to decide the class.
- When the features are distinctive, extract different features, train a separate model on each, and combine their outputs into the final result.
- Use an ensemble algorithm. For example, in face recognition we first train a classifier C1 on frontal faces, then build a new training set from the samples C1 got wrong (this resampling is effectively a reweighting: samples misclassified earlier get larger weights so that the next classifier pays more attention to them) and train C2 on it. Finally we combine C1 and C2 for the overall decision. (Strictly speaking, "bootstrap" refers to sampling with replacement rather than to this reweighting.) If C1 and C2 are built independently of each other we have a Bagging model; if building C2 depends on C1, as here, we have a boosting model.
1 Bagging (bootstrap aggregation): train several classifiers and average their outputs to obtain the final result.
The typical representative is the random forest: a random forest trains n trees in parallel and combines the results of all n trees into the final model.
"Random" refers to two things: 1. the data are sampled randomly (e.g. out of 100 samples each tree draws 60, sampling with replacement); 2. the features are selected randomly (e.g. out of 10 features each tree randomly picks a subset of 6).
Advantages: 1 It handles high-dimensional features and needs no explicit feature selection.
2 After training it can report which features are important (typically implemented by corrupting/permuting the feature in question, training and testing again with the corrupted feature, and comparing the results before and after the corruption; see the sketch after this list).
3 The trees can be trained in parallel, so it is fast.
4 It keeps the usual advantages of tree models, such as easy visualization.
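The "corrupt a feature and compare" procedure in point 2 is essentially what newer scikit-learn versions (0.22+) expose as permutation_importance, which shuffles one feature at a time and measures how much the score drops (without retraining). A minimal sketch on synthetic data; the dataset and parameter values here are only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle ("corrupt") each feature in turn and record how much the test score drops.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance {imp:.3f}")
```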
Note: stacking many trees on top of one another amounts to slicing the data extremely finely, and such fine slicing tends to overfit; the random sampling of data and features is precisely what we use to counter this overfitting.
2 Boosting: sequential training, where a strong model is obtained by chaining several weak learners one after another.
Typical representatives: AdaBoost (related notes here: http://blog.youkuaiyun.com/stranger_man/article/details/78374865 ) and XGBoost.
3 Stacking: training in stages. For example, in the first stage several classifiers each produce a prediction from the features; in the second stage another classifier is trained that takes the first-stage predictions as its input features, and its output is the final result.
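As a rough sketch of this two-stage idea, newer scikit-learn versions (0.22+) ship a StackingClassifier that trains the second-stage model on the out-of-fold predictions of the first-stage classifiers; the base estimators and data below are only an example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=1)

# Stage 1: several base classifiers; Stage 2: a logistic regression trained on their predictions.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
                ("svc", SVC(probability=True, random_state=1))],
    final_estimator=LogisticRegression(),
    cv=3)
print(cross_val_score(stack, X, y, cv=3).mean())
```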
——————————— Below is a reposted article that I found quite good ———————————
1. Boosting
Boosting is an ensemble of several different classifiers. (Note: "different" here only means that the training sets differ, or that the sample weights on the same training set differ; the type of the base classifier stays the same!) The classifiers are obtained by sequential training: each new classifier is trained on top of the performance of the classifiers already built, by concentrating on the data that the existing classifiers misclassified. The focus is achieved as follows: after each round, every sample's weight is updated according to whether it was classified correctly in this round and the overall accuracy of the previous round. The Boosting prediction is a weighted sum over all classifiers; the weights are not equal, and each weight reflects how successful its classifier was in the previous iteration. There are many Boosting algorithms, of which AdaBoost (Adaptive Boosting) is the most popular. Each weak classifier can be any machine learning algorithm, e.g. logistic regression, SVM, or a decision tree.
Specifically: every sample has a weight, initialized to 1/m and increased when the sample is misclassified.
Every classifier also has a weight: the lower its error rate, the larger its weight in the final combination.
Training: set up n weak classifiers and train them one after another on the same sample set, determining a from the error rate each round. Training ends with n classifiers and the n corresponding values of a.
Prediction: result = sum_i (a_i * prediction of classifier i).
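A minimal numpy sketch of the scheme just described, with labels in {-1, +1} and the standard AdaBoost classifier weight a = 0.5 * ln((1 - err) / err); the toy predictions are made up purely for illustration:

```python
import numpy as np

# Toy predictions of n = 3 weak classifiers on 5 samples, labels in {-1, +1}.
y_true = np.array([ 1, -1,  1,  1, -1])
weak_preds = np.array([[ 1, -1,  1, -1, -1],   # classifier 1
                       [ 1,  1,  1,  1, -1],   # classifier 2
                       [-1, -1,  1,  1, -1]])  # classifier 3

sample_w = np.full(len(y_true), 1 / len(y_true))    # every sample starts with weight 1/m
alphas = []
for preds in weak_preds:
    err = sample_w[preds != y_true].sum()           # weighted error rate of this classifier
    a = 0.5 * np.log((1 - err) / err)               # classifier weight: smaller error -> larger a
    alphas.append(a)
    sample_w *= np.exp(-a * y_true * preds)         # misclassified samples get larger weights
    sample_w /= sample_w.sum()

# Final decision = sign of the a-weighted sum of the weak classifiers' predictions.
print(alphas, np.sign(np.dot(alphas, weak_preds)))
```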
【How to use】
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME", n_estimators=200)
Notes: the parameter n_estimators sets the number of weak classifiers, and learning_rate scales the contribution each weak classifier makes to the final decision (roughly the a described above). The default weak learner is a decision tree; other weak learners can be specified via the base_estimator parameter. The important knobs are the number of weak classifiers (n_estimators) and the complexity of the base classifier, both of which strongly affect the quality of the result.
Caveat: using LR or SVC as the weak classifier can run into problems.
Our initial idea was to plug the previously trained LR and SVC in as weak classifiers, but the call to fit raised an error. The reason is that although sklearn already implements AdaBoost, every fit call has to pass a sample_weight to the weak learner, and the fit functions of the LR and SVC we used did not support sample_weight.
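Whether an estimator's fit accepts sample_weight (which AdaBoostClassifier relies on) depends on the scikit-learn version; in recent releases both LogisticRegression and SVC do accept it. A quick, hedged way to check the version actually installed:

```python
import inspect
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Inspect the fit signature to see whether the estimator can receive per-sample weights.
for est in (LogisticRegression(), SVC()):
    params = inspect.signature(est.fit).parameters
    print(type(est).__name__, "supports sample_weight:", "sample_weight" in params)
```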
2. Bagging
【What it is】
Like Boosting, Bagging is a way of combining base classifiers, i.e. using several base classifiers to obtain a stronger one; its core idea is sampling with replacement (a hand-rolled sketch follows the two flows below).
Training flow of the Bagging algorithm:
1. Draw M samples from the sample set, with replacement.
2. Train a base classifier C on these M samples.
3. Repeat this process X times to obtain a collection of base classifiers.
Prediction flow of the Bagging algorithm:
1. For a new instance A, collect a list of predictions from these X classifiers.
2. If the target attribute is numeric (regression), return the arithmetic mean of the list.
3. If the target attribute is categorical (classification), take a vote over the list and return the class with the most votes.
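A hand-rolled sketch of the two flows above, using a decision tree as the base classifier C; the synthetic data and the choice of X = 11 classifiers are only for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data, purely for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.RandomState(0)

# Training flow: X bootstrap samples, one base classifier each.
classifiers = []
for _ in range(11):                                                   # repeat X = 11 times
    idx = rng.choice(len(X), size=len(X), replace=True)               # step 1: draw M samples with replacement
    clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])  # step 2: train base classifier C
    classifiers.append(clf)

# Prediction flow (classification): majority vote over the base classifiers.
votes = np.array([clf.predict(X) for clf in classifiers])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("training-set vote accuracy:", (majority == y).mean())
```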
【How to use】
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(),
                            max_samples=0.5, max_features=0.5)
Notes: the user supplies the base classifier, here KNeighborsClassifier, and tunes the parameters that determine how the random subsets are drawn; in particular, max_samples and max_features control the size of the subsets.
## Titanic case study
import pandas as pd
taitanic = pd.read_csv("titanic_train.csv")
#taitanic.head()
print(taitanic.describe())
print(taitanic["Age"].median())
taitanic["Age"] = taitanic["Age"].fillna(taitanic["Age"].median())
print(taitanic.describe())
# Note this use of .loc for conditional assignment; it is worth remembering.
taitanic.loc[taitanic["Sex"] == "female","Sex"] = 0
taitanic.loc[taitanic["Sex"] == "male","Sex"] = 1
print(taitanic.head())
print(taitanic["Embarked"].unique())
taitanic["Embarked"] = taitanic["Embarked"].fillna('S')
taitanic.loc[taitanic["Embarked"] == "S","Embarked"] = 0
taitanic.loc[taitanic["Embarked"] == "C","Embarked"] = 1
taitanic.loc[taitanic["Embarked"] == "Q","Embarked"] = 2
taitanic.head()
['S' 'C' 'Q' nan]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 0 |
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import KFold
predictor = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
fold = KFold(taitanic.shape[0],n_folds=3,random_state=1)
predictions = []
lg = LinearRegression()
for train, test in fold:
    # Predictor columns and targets for the rows in this training fold.
    train_predictors = taitanic[predictor].iloc[train,:]
    train_targets = taitanic["Survived"].iloc[train]
    lg.fit(train_predictors, train_targets)
    predict = lg.predict(taitanic[predictor].iloc[test,:])
    print(predict.shape)
    # Note: the three fold predictions are appended in order, as a list!
    predictions.append(predict)
    print(len(predictions))
(297,)
1
(297,)
2
(297,)
3
import numpy as np
#print(predictions.shape)
predictions = np.concatenate(predictions,axis=0)
print(predictions.shape)
predictions[predictions <=.5] = 0
predictions[predictions > .5] = 1
# Fraction of predictions that match the true labels. (The sum(predictions[predictions == ...])
# form used originally counts only correctly predicted survivors, which is where the ~0.26 figure below comes from.)
accuracy = sum(predictions == taitanic["Survived"]) / len(predictions)
print(accuracy)
(891,)
0.261503928171
Logistic regression implementation
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, taitanic[predictor], taitanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
0.794612794613
Random forest implementation
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
predictor = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
alg = RandomForestClassifier(random_state=1,n_estimators=10,min_samples_split=2,min_samples_leaf=1)
kf = cross_validation.KFold(taitanic.shape[0], n_folds=3, random_state=1)
print(kf)
scores = cross_validation.cross_val_score(alg, taitanic[predictor], taitanic["Survived"], cv=kf)
print(scores.mean())
sklearn.cross_validation.KFold(n=891, n_folds=3, shuffle=False, random_state=1)
0.792368125701
predictor = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
alg = RandomForestClassifier(random_state=1,n_estimators=100,min_samples_split=4,min_samples_leaf=2)
kf = cross_validation.KFold(taitanic.shape[0], n_folds=3, random_state=1)
#print(kf)
scores = cross_validation.cross_val_score(alg, taitanic[predictor], taitanic["Survived"], cv=kf)
print(scores.mean())
0.819304152637
Tuning the parameters of the bagging-style random forest gave the model a noticeable boost. Next we construct a few combined features and keep improving the model.
taitanic["Families"] = taitanic["SibSp"] + taitanic["Parch"]
taitanic["NameLength"] = taitanic["Name"].apply(lambda x: len(x))
import re
import pandas
# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title. Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
# Get all the titles and print how often each one occurs.
titles = taitanic["Name"].apply(get_title)
#print(titles.dtype)
# For statistics and table manipulation, pandas should be the first choice; numpy is better suited to matrix computation.
print(pandas.value_counts(titles))
# Map each title to an integer. Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k,v in title_mapping.items():
    titles[titles == k] = v
# Verify that we converted everything.
print(pandas.value_counts(titles))
# Add in the title column.
taitanic["Title"] = titles
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Major 2
Col 2
Mlle 2
Mme 1
Don 1
Sir 1
Jonkheer 1
Capt 1
Ms 1
Countess 1
Lady 1
Name: Name, dtype: int64
1 517
2 183
3 125
4 40
5 7
6 6
7 5
10 3
8 3
9 2
Name: Name, dtype: int64
Feature selection
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn import model_selection
import matplotlib.pyplot as plt
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Families", "Title", "NameLength"]
# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(taitanic[predictors], taitanic["Survived"])
#print(selector.pvalues_)
# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)
# Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
score = model_selection.cross_val_score(alg,taitanic[predictors],taitanic["Survived"],cv=3)
print(score.mean())
[ 2.53704739e-25 1.40606613e-69 5.27606885e-02 2.92243929e-01
1.47992454e-02 6.12018934e-15 1.40831242e-03 6.19891122e-01
1.03899613e-27 2.02679507e-24]
This is more or less an instance of the stacking idea (here, blending two models by averaging their predictions).
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
algorithms = [
[GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Age", "Fare", "Embarked", "Families", "Title",]],
[LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "Families", "Title", "Age", "Embarked"]]
]
kf = KFold(taitanic.shape[0], n_folds=3, random_state=1)
predictions = []
for train, test in kf:
    train_target = taitanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(taitanic[predictors].iloc[train,:], train_target)
        # Select and predict on the test fold.
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(taitanic[predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)
# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)
# Compute accuracy by comparing to the training data.
accuracy = sum(predictions == taitanic["Survived"]) / len(predictions)  # fraction of matching labels (see the note above)
print(accuracy)
0.279461279461
titles = taitanic["Name"].apply(get_title)
# We're adding the Dona title to the mapping, because it's in the test set, but not the training set
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 10}
for k,v in title_mapping.items():
    titles[titles == k] = v
taitanic["Title"] = titles
# Check the counts of each unique title.
print(pandas.value_counts(taitanic["Title"]))
# Now, we add the family size column.
taitanic["Families"] = taitanic["SibSp"] + taitanic["Parch"]
1 517
2 183
3 125
4 40
5 7
6 6
7 5
10 3
8 3
9 2
Name: Title, dtype: int64
predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked", "Families", "Title"]
algorithms = [
[GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), predictors],
[LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "Families", "Title", "Age", "Embarked"]]
]
full_predictions = []
for alg, predictors in algorithms:
    # Fit the algorithm using the full training data.
    alg.fit(taitanic[predictors], taitanic["Survived"])
    # Predict on the full training data. We have to convert all the columns to floats to avoid an sklearn error.
    predictions = alg.predict_proba(taitanic[predictors].astype(float))[:,1]
    full_predictions.append(predictions)
# The gradient boosting classifier generates better predictions, so we weight it higher.
predictions = (full_predictions[0] * 3 + full_predictions[1]) / 4
predictions