Sklearn (1): Probability calibration

Original post · last modified 2022-05-24 18:13:58

License: CC 4.0 BY-SA

Tags: #sklearn #machine-learning #algorithms

First published 2018-09-19 15:21:25

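This post explains why probability calibration matters in machine learning and describes how two methods (isotonic regression and Platt scaling) calibrate the probability outputs of different models (SVM, random forest, and others) so that the predicted probabilities become more reliable.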

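Today I read through sklearn's Probability calibration documentation and finally understood why it is a good idea to run “probability calibration” when fitting a model with a machine learning algorithm.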

“Probability Calibration”

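Logistic regression fits its parameters by maximum likelihood, which directly optimizes log-loss, so the logistic function already returns well-calibrated probabilities.
GaussianNB (Gaussian naive Bayes) rests on the assumption that the features are conditionally independent given the class. In practice the feature set usually contains redundant, correlated features, so a GaussianNB model tends to be over-confident and pushes its probabilities toward 0 or 1.
RandomForest is the opposite of GaussianNB. Because it averages the predictions of all its trees (or takes a majority vote), a probability of exactly 0 or 1 requires every tree to agree, which rarely happens. RandomForest therefore tends to be under-confident, with probabilities pulled away from 0 and 1 toward the interior of (0, 1).
Support vector classifiers are max-margin methods that focus on the hard samples close to the decision boundary (the support vectors), so their predicted probabilities also concentrate in the interior of (0, 1). Like RandomForest, they are under-confident.
The probabilistic prediction plots (calibration curves) of these models are shown below: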

As Figure 1 shows, only logistic regression comes close to perfect calibration. The other curves are sigmoid or transposed-sigmoid in shape, which means those models are either under-confident or over-confident.
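The data behind such a plot can be reproduced with a short sketch. The synthetic dataset, the model settings, and the use of LinearSVC (with min-max scaled decision scores standing in for probabilities) are assumptions made for illustration, not the setup behind the figure in this post.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Synthetic data with many redundant features (assumed, to provoke miscalibration).
X, y = make_classification(n_samples=4000, n_features=20, n_informative=2,
                           n_redundant=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "LinearSVC": LinearSVC(),
}

curves = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    if hasattr(model, "predict_proba"):
        prob_pos = model.predict_proba(X_test)[:, 1]
    else:
        # LinearSVC has no predict_proba: min-max scale its scores into [0, 1].
        score = model.decision_function(X_test)
        prob_pos = (score - score.min()) / (score.max() - score.min())
    # prob_true: fraction of positives per bin, prob_pred: mean prediction per bin.
    prob_true, prob_pred = calibration_curve(y_test, prob_pos, n_bins=10)
    curves[name] = (prob_true, prob_pred)
    print(name, np.round(prob_true, 2))
```

Plotting `prob_pred` on the x axis against `prob_true` on the y axis for each model gives the calibration curves; a perfectly calibrated model lies on the diagonal.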

To fix this over-confidence or under-confidence, we can apply “probability calibration” to correct the probabilities a model attaches to its predicted labels. There are two main calibration methods: 1) non-parametric isotonic regression; 2) Platt's scaling (sigmoid function). Each suits a different situation:
1) non-parametric isotonic regression: isotonic calibration is preferable for non-sigmoid calibration curves and in situations where large amounts of data are available for calibration.
2) Platt's scaling (sigmoid function): sigmoid calibration is preferable in cases where the calibration curve is sigmoid and where there is limited calibration data.
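To make the difference concrete, here is a minimal sketch of both mappings fitted on the same scores. The scores and labels are synthetic (hypothetical), and the sigmoid fit uses a plain `LogisticRegression` on the raw score, which is a simplification: Platt's original method also smooths the targets.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical uncalibrated scores; the true positive rate is a sigmoid of the score.
scores = rng.uniform(-3, 3, size=2000)
labels = (rng.uniform(size=2000) < 1 / (1 + np.exp(-2 * scores))).astype(int)

# Sigmoid (Platt-style) calibration: a two-parameter logistic curve in the score.
platt = LogisticRegression(C=1e6)  # large C means almost no regularization
platt.fit(scores.reshape(-1, 1), labels)
p_sigmoid = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic calibration: a non-parametric, non-decreasing step function.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
p_isotonic = iso.fit_transform(scores, labels)

print("sigmoid slope and intercept:", platt.coef_[0, 0], platt.intercept_[0])
print("distinct isotonic levels:", len(np.unique(p_isotonic)))
```

The sigmoid fit has only two parameters, so it stays stable with little data but can only produce a sigmoid-shaped correction. The isotonic fit can follow any monotone shape, at the price of needing more data to avoid overfitting.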

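“Probability calibration” works as follows:
1. Split the dataset into train and a held-out set, test (use model_selection.train_test_split, which replaced the older cross_validation.train_test_split).
2. Fit the machine learning model on train.
3. Fit the calibration model on the held-out set, using the fitted model's predictions for it. The calibration data must not overlap with the data used to fit the model, otherwise the calibration is biased.
4. Apply the calibration model on top of the fitted machine learning model to adjust its predictions.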
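The procedure above can be sketched as follows. The dataset and the choice of GaussianNB are assumptions for illustration. The `try`/`except` exists because recent scikit-learn releases (1.6 and later, as far as I know) wrap an already fitted model in `FrozenEstimator`, while older releases use `cv="prefit"`.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=3000, n_features=20, n_informative=2,
                           n_redundant=10, random_state=0)

# Hold out part of the data for fitting the calibrator.
X_train, X_calib, y_train, y_calib = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the machine learning model on the training part only.
clf = GaussianNB().fit(X_train, y_train)

# Fit the calibrator on the held-out part without refitting clf.
try:
    from sklearn.frozen import FrozenEstimator  # newer scikit-learn
    calibrated = CalibratedClassifierCV(FrozenEstimator(clf), method="sigmoid")
except ImportError:
    calibrated = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
calibrated.fit(X_calib, y_calib)

# The calibrated model adjusts the probabilities of the fitted model.
raw = clf.predict_proba(X_calib[:5])[:, 1]
adjusted = calibrated.predict_proba(X_calib[:5])[:, 1]
print("raw:     ", raw.round(3))
print("adjusted:", adjusted.round(3))
```

For an honest evaluation, a third split that neither the model nor the calibrator has seen would be needed; it is left out here to keep the sketch short.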

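How much each calibration method improves the model's predictions can be measured with the Brier score, the mean squared difference between the predicted probability and the actual outcome. The lower the score, the better the calibration worked. In Python, sklearn.metrics.brier_score_loss returns the Brier score.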
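As a small worked example, the Brier score of N predictions is (1/N) Σ (p_i − y_i)², where p_i is the predicted probability of the positive class and y_i the 0/1 label. The four labels and probabilities below are made-up values.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.9, 0.8, 0.3])

# Squared errors: 0.01, 0.01, 0.04, 0.09, so the mean is 0.0375.
manual = np.mean((y_prob - y_true) ** 2)
library = brier_score_loss(y_true, y_prob)

print(manual, library)
```

Both values agree, which confirms that `brier_score_loss` is just this mean squared error on probabilities.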

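Taking SVM as an example, the figure below shows how the predictions improve after “probability calibration”:

As Figure 1 shows, the curve of the uncalibrated SVC is sigmoid-shaped, while the curve of the calibrated SVC is close to the perfectly calibrated line. The figure legend gives the Brier score obtained with each calibration method.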
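A comparison of this kind can be sketched as below. The synthetic data, the use of LinearSVC, and the min-max scaling of the uncalibrated scores are assumptions; the exact Brier scores will differ from those in the figure.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=4000, n_features=20, n_informative=2,
                           n_redundant=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=7)

# Uncalibrated baseline: squash the decision scores into [0, 1].
svc = LinearSVC().fit(X_train, y_train)
score = svc.decision_function(X_test)
prob_raw = (score - score.min()) / (score.max() - score.min())
brier = {"uncalibrated": brier_score_loss(y_test, prob_raw)}

# Calibrate with each method, using 3-fold cross-validation on the training data.
for method in ("sigmoid", "isotonic"):
    calibrated = CalibratedClassifierCV(LinearSVC(), method=method, cv=3)
    calibrated.fit(X_train, y_train)
    prob_cal = calibrated.predict_proba(X_test)[:, 1]
    brier[method] = brier_score_loss(y_test, prob_cal)

for name, value in brier.items():
    print(f"{name}: {value:.3f}")
```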

Function

#Probability Calibration

from sklearn.calibration import CalibratedClassifierCV
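CalibratedClassifierCV(base_estimator=None, method='sigmoid', cv=3)
# base_estimator: the machine learning model to calibrate (renamed to `estimator` in scikit-learn 1.2)
# method: the calibration method, one of {'sigmoid', 'isotonic'}
# cv: cv='prefit' means base_estimator has already been fitted; cv=integer sets the number of cross-validation folds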

# attributes:
classes_   # the class labels
calibrated_classifiers_  # the list of calibrated classifiers, one for each cross-validation fold
# methods:
fit(X, y, sample_weight)  # params: (training data, target values, sample weights); returns the calibrated model
get_params([deep])  # returns the model parameters
predict(X)  # returns the predicted classes
predict_proba(X)  # returns the posterior probabilities of classification
score(X, y, sample_weight)  # params: (test data, true labels, sample weights); returns the mean accuracy
set_params(**params)  # sets the parameters of this estimator; returns self
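A short usage sketch of the attributes and methods listed above. The dataset and the RandomForest settings are assumptions; the estimator is passed positionally so the call does not depend on whether the keyword is `base_estimator` or `estimator` in the installed version.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=1)

forest = RandomForestClassifier(n_estimators=50, random_state=1)
model = CalibratedClassifierCV(forest, method="isotonic", cv=3)
model.fit(X, y)

print(model.classes_)                       # the class labels
print(len(model.calibrated_classifiers_))   # one calibrated classifier per fold
proba = model.predict_proba(X[:3])          # one row per sample, one column per class
labels = model.predict(X[:3])               # predicted class labels
accuracy = model.score(X, y)                # mean accuracy on the given data
print(proba, labels, accuracy)
print(model.get_params()["method"])
```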

from sklearn.calibration import calibration_curve
calibration_curve(y_true, y_prob, normalize=False, n_bins=5)
# y_true: true targets
# y_prob: probabilities of the positive class
# normalize=True: rescale y_prob so that the smallest value maps to 0 and the largest to 1 (deprecated since scikit-learn 1.1)
# n_bins: the number of bins

# returns (prob_true, prob_pred):
#   y axis: prob_true, the true probability in each bin (fraction of positives)
#   x axis: prob_pred, the mean predicted probability in each bin
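A tiny worked example of `calibration_curve`, with made-up labels and probabilities, shows how the bins are formed. With `n_bins=3` the bin edges are 0, 1/3, 2/3 and 1.

```python
import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.65, 0.7, 0.8, 0.9, 1.0])

prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=3)

# Bin 1 holds 0.1, 0.2, 0.3        -> fraction of positives 0.0, mean prediction 0.2
# Bin 2 holds 0.4, 0.65            -> fraction of positives 0.5, mean prediction 0.525
# Bin 3 holds 0.7, 0.8, 0.9, 1.0   -> fraction of positives 1.0, mean prediction 0.85
print(prob_true)
print(prob_pred)
```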