Understanding the average parameter in sklearn.metrics: binary/micro/macro/weighted/samples

This article works through the evaluation metrics in sklearn.metrics, in particular how precision is computed under the different average settings (binary, micro, macro, weighted, samples), with concrete examples for both the multiclass and the multilabel case.


Preface

I've recently been working on a multilabel classification project that involved choosing metrics, so I took the opportunity to sort out the average options that come up all the time in sklearn.metrics. The explanations in the official API documentation still leave some key points unclear, and I had to work the calculations out and verify them myself. While I'm at it, a small rant: search the web and most posts simply restate the API documentation and publish that as a blog post. What's the point? It only wastes the reader's time. So I decided to write a concrete analysis, both to deepen my own understanding and to help anyone puzzled by the same questions.

Below I'll go through the meaning of each of the options None, binary, micro, macro, weighted, and samples in turn. The concrete score calculations use precision_score as the running example.

precision=TP/(TP+FP)

None

The official explanation for None is already fairly clear: the score is computed for every class and an array is returned containing one score per class. An example follows.

In the example there are two classes, 0 and 1. For class 0 (taking 0 as positive), 2 predictions are correct (TP=2) and 1 is wrong (FP=1), so precision = 2/3, which is array[0]; for class 1 (taking 1 as positive), 1 prediction is correct (TP=1) and 1 is wrong (FP=1), so precision = 1/2, which is array[1].
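The original screenshot is not reproduced here; the following is a minimal sketch with a made-up y_true/y_pred pair chosen to reproduce exactly the counts described above.

```python
from sklearn.metrics import precision_score

# Hypothetical data reproducing the counts above:
# class 0: TP=2, FP=1 -> precision 2/3;  class 1: TP=1, FP=1 -> precision 1/2
y_true = [0, 1, 0, 0, 1]
y_pred = [0, 0, 0, 1, 1]

print(precision_score(y_true, y_pred, average=None))
# [0.66666667 0.5]  -> array[0] is class 0's precision, array[1] is class 1's
```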

binary

With average='binary', y_true and y_pred are required to be binary (containing only 0 and 1), and another parameter of the score function comes into play: pos_label, the value to be treated as the positive label, which defaults to 1. average='binary' simply computes the precision of the class given by pos_label. A calculation example follows.

With pos_label left at its default of 1, class 1 is treated as positive: 1 prediction is correct (TP=1) and 1 is wrong (FP=1), so precision = 1/2. With pos_label=0, class 0 is treated as positive: 2 predictions are correct (TP=2) and 1 is wrong (FP=1), so precision = 2/3. In other words, average='binary' just picks one entry out of the average=None result (in the binary case).
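A sketch on the same made-up data, showing how pos_label selects which class's precision is reported:

```python
from sklearn.metrics import precision_score

# Same hypothetical binary data as above
y_true = [0, 1, 0, 0, 1]
y_pred = [0, 0, 0, 1, 1]

# Default pos_label=1: report only class 1's precision
print(precision_score(y_true, y_pred, average='binary'))               # 0.5
# pos_label=0: report only class 0's precision
print(precision_score(y_true, y_pred, average='binary', pos_label=0))  # 0.666...
```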

micro

With average='micro', the TPs obtained by treating each class in turn as positive are summed and divided by the sum of TP+FP over all classes, i.e. sum(TP for positive = 0, 1, 2, ...) / sum((TP + FP) for positive = 0, 1, 2, ...). Note that the denominator, sum((TP + FP) over all label values), is simply the total number of predictions, len(y_pred). One thing to be careful about: the micro-averaged precision here works out to the same value as accuracy, but that does not make average='micro' synonymous with accuracy in general. Every score function accepts average='micro'; it just happens that for precision_score the micro-averaged value coincides with accuracy. A calculation example follows.

For the first example, taking 1 as positive, 1 prediction is correct (TP=1) out of 2 predictions of class 1 (TP+FP=2); taking 0 as positive, 2 predictions are correct (TP=2) out of 3 predictions of class 0 (TP+FP=3). So precision = (TP_1 + TP_0) / (TP_1 + FP_1 + TP_0 + FP_0). The denominator is clearly the number of predictions, len(y_pred), since every class is taken as positive in turn and the counts are summed. The second example follows exactly the same procedure and can be verified the same way.
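A sketch on the same made-up data as before; it also checks the observation that the micro-averaged precision coincides with accuracy in this single-label case.

```python
from sklearn.metrics import precision_score, accuracy_score

# Same hypothetical data: micro = (TP_1 + TP_0) / (TP_1 + FP_1 + TP_0 + FP_0) = (1 + 2) / 5
y_true = [0, 1, 0, 0, 1]
y_pred = [0, 0, 0, 1, 1]

print(precision_score(y_true, y_pred, average='micro'))  # 0.6
print(accuracy_score(y_true, y_pred))                    # 0.6 -- same value here
```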

macro

average='macro' is the counterpart of average='micro': first compute the score with each class taken as positive in turn, then take the plain average, i.e. mean(TP / (TP + FP) for positive = 0, 1, 2, ...). A calculation example follows.

Taking 1 as positive, 1 prediction is correct (TP=1) out of 2 predictions of class 1 (TP+FP=2), so $precision_{1\text{ as positive}} = 1/2$; taking 0 as positive, 2 predictions are correct (TP=2) out of 3 predictions of class 0 (TP+FP=3), so $precision_{0\text{ as positive}} = 2/3$. Hence $precision = (precision_{1\text{ as positive}} + precision_{0\text{ as positive}}) / 2 = (1/2 + 2/3) / 2 = 7/12$.
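The same made-up data again, assuming it matches the counts in the original screenshot:

```python
from sklearn.metrics import precision_score

# macro: unweighted mean of the per-class precisions, (1/2 + 2/3) / 2 = 7/12
y_true = [0, 1, 0, 0, 1]
y_pred = [0, 0, 0, 1, 1]

print(precision_score(y_true, y_pred, average='macro'))  # 0.58333... == 7/12
```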

weighted

A problem with average='macro' from the previous section is that it ignores class imbalance: dividing by the number of classes treats every class as equally important, yet some classes may account for far more predictions than others, e.g. y_pred = [1,1,0,1,1,1,1,1,1]. average='weighted' addresses this: the weight of each class's score is no longer 1 / (number of classes) but the proportion that class takes up in y_true. A calculation example follows.

Taking 1 as positive, 1 prediction is correct (TP=1) out of 2 predictions of class 1 (TP+FP=2), so $precision_{1\text{ as positive}} = 1/2$; taking 0 as positive, 1 prediction is correct (TP=1) out of 3 predictions of class 0 (TP+FP=3), so $precision_{0\text{ as positive}} = 1/3$. In y_true, class 1 accounts for 3/5 of the samples and class 0 for 2/5, so $precision = 3/5 \cdot precision_{1\text{ as positive}} + 2/5 \cdot precision_{0\text{ as positive}} = 3/5 \cdot 1/2 + 2/5 \cdot 1/3 = 3/10 + 2/15 = 13/30$.
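Here a different made-up pair is needed to reproduce the counts above (three 1s and two 0s in y_true); macro is printed alongside weighted to show how the weighting changes the result.

```python
from sklearn.metrics import precision_score

# Hypothetical data: precision_1 = 1/2, precision_0 = 1/3;
# class 1 makes up 3/5 of y_true, class 0 makes up 2/5
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1]

print(precision_score(y_true, y_pred, average='macro'))     # (1/2 + 1/3) / 2        = 0.41666...
print(precision_score(y_true, y_pred, average='weighted'))  # 3/5 * 1/2 + 2/5 * 1/3  = 0.43333... == 13/30
```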

samples

The official API says 'samples' is meant for multilabel data, but it was hard to understand; the source code also involves some csr_matrix conversion and places requirements on the input format that I hadn't figured out at the time. In practical multilabel classification, y_true and y_pred are usually represented as binary indicator vectors anyway, e.g. [0,1,0,0,1] for labels 1 and 4, and with that representation the averages above already cover the various scores, so I didn't see the point of 'samples' and left it at that.


Update 2022-05-15: understanding samples

Note that sklearn states explicitly that 'samples' targets the multilabel case, whose very representation differs from the others: here the input is two-dimensional (e.g. [[1, 4], [2], ...]), whereas for all the averages introduced earlier the input is one-dimensional (e.g. [1, 2, ...]). It is precisely because a sample can carry multiple labels that the extra dimension is needed.

Once this difference in input dimensionality is clear, the official API description of average='samples' becomes straightforward. As before, an example makes the procedure clearest:

true = [[1],[2,3],[1],[4]]

pred = [[0],[2],[1,3],[4]]

For sample 0 (true[0]=[1], pred[0]=[0]): precision = 0/1 = 0

For sample 1 (true[1]=[2,3], pred[1]=[2]): precision = 1/1 = 1

For sample 2 (true[2]=[1], pred[2]=[1,3]): precision = 1/2 = 0.5

For sample 3 (true[3]=[4], pred[3]=[4]): precision = 1/1 = 1

So with average='samples', precision_score gives (0 + 1 + 0.5 + 1) / 4 = 0.625.

Pay particular attention to the denominator: it is the second way average='samples' differs from the other settings (the first, as noted above, being the input dimensionality). In other words, the averaging here is done over the number of samples, whereas all the other settings average over labels. This is an important conceptual distinction.

You may wonder why the calculation above is written out by hand rather than pasted in as a Python screenshot like before. The reason is that sklearn requires multilabel input to be converted into 0/1 indicator vectors: for sample 1, for instance, true is [2, 3], but it must be passed in as something like [0, 0, 1, 1, 0]. To keep the hand calculation readable, that conversion step was skipped above; the code below shows the actual call.
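The original screenshot is not reproduced; below is a sketch that uses MultiLabelBinarizer for the 0/1 conversion (the explicit class list [0, 1, 2, 3, 4] is an assumption made for this example).

```python
from sklearn.metrics import precision_score
from sklearn.preprocessing import MultiLabelBinarizer

true = [[1], [2, 3], [1], [4]]
pred = [[0], [2], [1, 3], [4]]

# sklearn expects multilabel input as 0/1 indicator matrices, e.g. [2, 3] -> [0, 0, 1, 1, 0]
mlb = MultiLabelBinarizer(classes=[0, 1, 2, 3, 4])
y_true = mlb.fit_transform(true)
y_pred = mlb.transform(pred)

# Per-sample precisions 0, 1, 0.5, 1 averaged over the 4 samples
print(precision_score(y_true, y_pred, average='samples'))  # 0.625
```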

 

At this point, samples should be reasonably clear.

Summary

Looking back, what does the average parameter actually do? The basic formula of a score is determined by the chosen metric — every example in this article computes precision_score, i.e. TP / (TP + FP) — while average specifies how the per-class scores are combined (samples being the exception). None lists the score of every class; binary reports only the score of the pos_label class; micro and macro both average across classes, the former summing the raw counts (TP, FP, TN, FN) first and then dividing, the latter computing each class's score first and then averaging those scores; weighted goes one step beyond macro by computing a weighted average, with weights equal to each class's proportion in y_true. samples, specific to the multilabel case, instead specifies how the per-sample scores are combined.
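As a recap, a single run over the first made-up data prints every (non-samples) average side by side:

```python
from sklearn.metrics import precision_score

y_true = [0, 1, 0, 0, 1]
y_pred = [0, 0, 0, 1, 1]

for avg in (None, 'binary', 'micro', 'macro', 'weighted'):
    print(avg, precision_score(y_true, y_pred, average=avg))
# None     [0.66666667 0.5]
# binary   0.5
# micro    0.6
# macro    0.58333...
# weighted 0.6   (class 0 is 3/5 of y_true, class 1 is 2/5: 3/5 * 2/3 + 2/5 * 1/2)
```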
