Model Evaluation

yzcwansui

License: CC 4.0 BY-SA

Category: math

Article link: https://blog.youkuaiyun.com/yzcwansui/article/details/90733146

classification

Confusion matrix

One dimension holds the predicted values and the other holds the actual values. For binary classification:

| | Actual: True | Actual: False |
| --- | --- | --- |
| Predicted: Positive | TP | FP |
| Predicted: Negative | FN | TN |

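The quantities related to the confusion matrix are as follows.

The meaning of TP, TN, FP, FN:

TP (True Positive)
predicted positive, and the actual value is true
TN (True Negative)
predicted negative, and the actual value is false
FP (False Positive) (Type I error)
predicted positive, but the actual value is false

FP is a Type I error.
Type I error: the computed result falls in the rejection region of the null hypothesis although the null hypothesis is actually true, so the null hypothesis is wrongly rejected.
FN (False Negative) (Type II error)
predicted negative, but the actual value is true

FN is a Type II error.
Type II error: the computed result does not fall in the rejection region although the null hypothesis is actually false, so the null hypothesis is wrongly accepted.

FP and FN are essentially the Type I error and Type II error of hypothesis testing.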
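The four counts can be tallied directly from paired labels. The sketch below is only an illustration: the function name `confusion_counts`, the 0/1 label encoding, and the sample data are assumptions, not part of the original post.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (assumed: 1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        elif predicted == 1 and actual == 0:
            fp += 1  # Type I error
        else:
            fn += 1  # Type II error
    return tp, tn, fp, fn


# Hypothetical labels, chosen only for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (3, 3, 1, 1)
```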

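An example of FP and FN:
Plot the samples according to their values: blue is Positive and red is Negative, and the figure shows that the samples have an order. Prediction can be viewed roughly as a ranking problem: sort the samples, then choose a threshold that splits them into Positive and Negative.
[Figure: samples ordered by value with a threshold; blue = Positive, red = Negative]
The blue × in the black box is actually Positive, but its value falls on the Negative side of the threshold, so it is classified as Negative. It is an FN, a Type II error.

[Figure: the same samples with a different threshold]
After the threshold is moved, the × in the red box is still a Type II error, and one red Negative sample now falls in the Positive range: the sample in the blue box is an FP, a Type I error.

A humorous picture helps in understanding these:
[Figure]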

Accuracy

Accuracy describes the closeness of a measurement to the true value. In binary classification, accuracy can be expressed with TP, TN, FP and FN:
$Accuracy = \frac{n^{correct}}{n^{total}} = \frac{TP + TN}{TP+TN+FP+FN}$

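Advantage:

Simple and intuitive.

Disadvantages:

When the classes are imbalanced, the majority class tends to become the dominant factor in accuracy.
For example: if negative samples make up 99% of the data and everything is predicted negative, accuracy = 99%, but this does not reflect how good the model really is.
Overall accuracy can be high while the accuracy on one particular class is very low.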
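The imbalance problem is easy to reproduce numerically. The following is a minimal sketch with made-up counts (1000 samples, 10 of them positive) and a hypothetical classifier that predicts everything as negative.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Assumed counts: 990 negatives, 10 positives, every sample predicted negative.
tp, fp = 0, 0
tn, fn = 990, 10

acc = accuracy(tp, tn, fp, fn)
positive_recall = tp / (tp + fn)  # share of actual positives that were found

print(acc)              # 0.99
print(positive_recall)  # 0.0
```

Accuracy is 99% even though not a single positive sample is detected.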

Precision

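The number of correctly classified positive samples as a proportion of the samples that the classifier labels positive. Precision emphasizes exactness.
$Precision = \frac{TP}{TP+FP}$

Recall

The number of correctly classified positive samples as a proportion of the samples that are actually positive. Recall emphasizes completeness.
$Recall = \frac{TP}{TP+FN}$

Precision and Recall are two metrics that are both in tension and complementary. To raise Precision, the classifier should predict positive only when it is confident, but being this conservative usually misses many positive samples it is unsure about, which lowers Recall.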
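The trade-off shows up as soon as the decision threshold moves. The sketch below uses hypothetical scores and labels; the function name `precision_recall` and the convention of returning precision 1.0 when nothing is predicted positive are assumptions made for this example.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when samples with score >= threshold are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention for "no positives predicted"
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical model scores and true labels (1 = positive).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

strict = precision_recall(scores, labels, 0.85)  # conservative threshold
loose = precision_recall(scores, labels, 0.35)   # permissive threshold
print(strict)  # (1.0, 0.5)
print(loose)   # precision 4/6, recall 1.0
```

The strict threshold gives perfect precision but finds only half of the positives; the loose one finds all positives at the cost of precision.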

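P-R curve

[Figure: P-R curve]

Horizontal axis: Recall
Vertical axis: Precision
Each point on the curve corresponds to one threshold: the model labels results above the threshold as positive and results below it as negative.
The whole P-R curve is produced by moving the threshold from high to low.
The part of the curve nearest the origin (recall close to 0) gives the model's precision and recall at the largest threshold.
The Precision and Recall at a single point do not measure model performance comprehensively; only the overall shape of the P-R curve gives a fuller evaluation.
If the P-R curve of one learner is completely enclosed by the curve of another, the latter can be said to outperform the former. If the curves cross, it is hard to say in general which is better. To rank them anyway, one can compare the area under the P-R curve, which to some extent reflects how often the learner achieves high precision and high recall together, but this value is hard to compute.
The Break-Even Point (BEP) is another measure that combines recall and precision: it is the value at which recall = precision (the larger the better).

[Figure: P-R curves of several learners with their break-even points]
As shown in the figure, based on BEP, learner A can be considered better than learner B.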
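A P-R curve can be traced by lowering the threshold one sample at a time. The sketch below assumes no tied scores and at least one positive sample; `pr_curve` and the data are illustrative, and the break-even point is approximated by the sampled point where recall and precision are closest.

```python
def pr_curve(scores, labels):
    """Return (recall, precision) points as the threshold moves from high to low."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


# Hypothetical scores and labels (1 = positive).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = pr_curve(scores, labels)
# Approximate break-even point: sampled point with the smallest |recall - precision|.
bep = min(points, key=lambda rp: abs(rp[0] - rp[1]))
print(points[0])  # (0.25, 1.0)
print(bep)        # (0.75, 0.75)
```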

F1 score

Besides the P-R curve, the F1 score also reflects the performance of a ranking model in a comprehensive way.
The F1 score is based on the harmonic mean.

Harmonic mean

$\frac{n}{\frac{1}{x_{1}} + \frac{1}{x_{2}} + ...+ \frac{1}{x_{n}}} = \frac{n}{\sum_{i = 1}^{n} \frac{1}{x_{i}}} = (\frac{\sum_{i = 1}^{n} \frac{1}{x_{i}}}{n})^{-1}$

The result is not sensitive to extremely large values. On the other hand, not all outliers are ignored: extremely low values have a significant influence on the result.
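A quick numerical check makes this asymmetry visible. The helper `harmonic_mean` and the sample values below are illustrative assumptions; the inputs must be positive.

```python
def harmonic_mean(values):
    """n divided by the sum of reciprocals; assumes all values are positive."""
    return len(values) / sum(1 / v for v in values)


def arithmetic_mean(values):
    return sum(values) / len(values)


large_outlier = [1, 1, 1000]
small_outlier = [0.9, 0.9, 0.01]

# A huge value barely moves the harmonic mean but dominates the arithmetic mean.
print(harmonic_mean(large_outlier), arithmetic_mean(large_outlier))
# A tiny value drags the harmonic mean close to itself.
print(harmonic_mean(small_outlier), arithmetic_mean(small_outlier))
```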

F1 score

The F1 score is based on precision and recall:
$F_{1} = (\frac{Recall^{-1} + Precision^{-1}}{2})^{-1} = 2\frac{Recall \times Precision}{Recall + Precision}$

[Figure: F1 score as a function of precision and recall]
The figure shows that the F1 score is 0 when Precision = 1, Recall = 0 or when Precision = 0, Recall = 1, and that it reaches its maximum when Precision = 1, Recall = 1.

The F1 score emphasizes the lowest value. If one of the two parameters is small, the other one no longer matters much.
If the F1 score is high, both the precision and the recall of the classifier are good.
If the F1 score is low, we cannot tell whether the problem lies with false positives or with false negatives.
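These properties can be verified with a few lines of code. This is a minimal sketch; returning 0.0 when both inputs are 0 is an assumed convention to avoid division by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are 0, by assumption)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_score(1.0, 0.0))  # 0.0
print(f1_score(0.0, 1.0))  # 0.0
print(f1_score(1.0, 1.0))  # 1.0
# One low value dominates: the arithmetic mean would be 0.5, F1 is about 0.18.
print(f1_score(0.9, 0.1))
```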

$F_{\beta}$ Score

The formula of $F_{\beta}$ is:
$F_{\beta} = (1+\beta^{2}) \cdot \frac{precision \cdot recall }{(\beta^{2} \cdot precision) + recall} = (\frac{1}{1+\beta ^ {2}} \cdot (\frac{1}{precision} + \frac{\beta^{2}}{recall}))^{-1}$

Because only one term of the denominator is multiplied by $\beta^{2}$, we can use $\beta$ to make $F_{\beta}$ more sensitive to low values of either precision or recall.

$\beta > 0$ measures the relative importance of recall with respect to precision; $\beta = 1$ gives the standard F1 score.
$\beta > 1$: recall has more influence. With the $\beta^{2}$ factor, $\frac{\beta^{2}}{recall}$ grows faster as recall shrinks. Because the reciprocal of a sum is mostly determined by its larger term, $\frac{\beta^{2}}{recall}$ tends to dominate the sum.
$\beta < 1$: precision has more influence, by the reverse of the argument for $\beta > 1$.

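When $\beta = 2$:
[Figure: F2 score as a function of precision and recall]
The [0, 1] axes for precision and recall are reversed in this plot, but it still illustrates the point. The figure shows that when recall < 0.2, precision has almost no effect and the F2 score is dominated by recall.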
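The effect of $\beta$ can also be checked numerically. The sketch below implements the formula given earlier in this section; the function name `f_beta` and the sample precision/recall values are assumptions for illustration.

```python
def f_beta(precision, recall, beta):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); 0.0 if the denominator is 0."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# High precision, low recall: a larger beta punishes the low recall harder.
for beta in (0.5, 1, 2):
    print(beta, f_beta(0.9, 0.1, beta))

# Swapping the roles (low precision, high recall) reverses the ordering.
for beta in (0.5, 1, 2):
    print(beta, f_beta(0.1, 0.9, beta))
```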

When should we use the $F_{\beta}$ score instead of the F1 score?

When one of the metrics (precision or recall) is more important from the business perspective. It depends on how we are going to use the classifier and which kind of error is more problematic.

macro-F1 and micro-F1

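Sometimes we have several confusion matrices:

training/testing is repeated multiple times
training/testing is run on several datasets to estimate the algorithm's overall performance
in a multi-class task, each pair of classes corresponds to one confusion matrix

In short, when we want to assess recall and precision jointly over n confusion matrices, we need macro-F1 and micro-F1.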

macro-F1
Compute recall and precision on each confusion matrix, denoted $(P_{1}, R_{1}), (P_{2}, R_{2}), \ldots, (P_{n}, R_{n})$, then average them to obtain macro-P and macro-R, and from these compute macro-F1:
$\text{macro-}P = \frac{1}{n} \sum_{i = 1}^{n} P_{i}$
$\text{macro-}R = \frac{1}{n} \sum_{i = 1}^{n} R_{i}$
$\text{macro-}F1 = \frac{2 \times \text{macro-}P \times \text{macro-}R}{\text{macro-}P + \text{macro-}R}$
micro-F1
First compute TP, FP, TN, FN for each confusion matrix, then take their averages $\overline{TP}$, $\overline{FP}$, $\overline{TN}$, $\overline{FN}$. From these averages compute micro-P, micro-R and micro-F1:

$\text{micro-}P = \frac{\overline{TP}}{\overline{TP} + \overline{FP}}$
$\text{micro-}R = \frac{\overline{TP}}{\overline{TP} + \overline{FN}}$
$\text{micro-}F1 = \frac{2 \times \text{micro-}P \times \text{micro-}R}{\text{micro-}P + \text{micro-}R}$
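The two averaging schemes can give different results on the same matrices. The sketch below follows the definitions in this section (macro-F1 computed from macro-P and macro-R; some other sources average the per-matrix F1 values instead). The `(TP, FP, FN)` tuple layout and the counts are hypothetical, and each matrix is assumed to have TP + FP > 0 and TP + FN > 0.

```python
def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0


def macro_f1(matrices):
    """matrices: list of (TP, FP, FN) tuples; average P and R first, then combine."""
    n = len(matrices)
    macro_p = sum(tp / (tp + fp) for tp, fp, fn in matrices) / n
    macro_r = sum(tp / (tp + fn) for tp, fp, fn in matrices) / n
    return f1(macro_p, macro_r)


def micro_f1(matrices):
    """Average the raw counts first, then compute P, R and F1 from the averages."""
    n = len(matrices)
    mean_tp = sum(tp for tp, fp, fn in matrices) / n
    mean_fp = sum(fp for tp, fp, fn in matrices) / n
    mean_fn = sum(fn for tp, fp, fn in matrices) / n
    micro_p = mean_tp / (mean_tp + mean_fp)
    micro_r = mean_tp / (mean_tp + mean_fn)
    return f1(micro_p, micro_r)


# Hypothetical confusion matrices as (TP, FP, FN).
matrices = [(8, 2, 2), (1, 1, 4)]
print(macro_f1(matrices))  # about 0.565
print(micro_f1(matrices))  # about 0.667
```

Here micro-F1 is higher because the first matrix contributes most of the counts, whereas macro averaging weights both matrices equally.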

ROC curve

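In different tasks we can choose different cut points according to the task's needs: if precision matters more, cut at a position near the top of the ranking; if recall matters more, cut further down. The quality of the ranking itself therefore reflects how good the learner's "expected generalization performance" is across different tasks, and the ROC curve is a powerful tool for studying a learner's generalization performance from this angle.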

ROC vertical axis: True Positive Rate (TPR)
$TPR = \frac{TP}{TP + FN}$
ROC horizontal axis: False Positive Rate (FPR)
$FPR = \frac{FP}{TN + FP}$

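Interpretation of special points

(0, 0)
All points are classified as negative: TP = 0, TPR = 0, FP = 0, FPR = 0.
(1, 1)
All points are classified as positive: FN = 0, TPR = 1, TN = 0, FPR = 1.
(0, 1)
FN = 0, TPR = 1, FP = 0, FPR = 0, so this point represents the ideal model that ranks every positive before every negative.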

Special positions of the curve

[Figure]
This is the ideal situation. When the two curves do not overlap at all, the model has an ideal measure of separability: it is perfectly able to distinguish between the positive class and the negative class.

When the two distributions overlap, we introduce Type I and Type II errors. Depending on the threshold, we can minimize or maximize them. When AUC is 0.7, there is a 70% chance that the model will be able to distinguish between the positive class and the negative class.

[Figure]
This is the worst situation. When AUC is approximately 0.5, the model has no capacity to discriminate between the positive class and the negative class.

When AUC is approximately 0, the model is actually reversing the classes: it predicts the negative class as positive and vice versa.

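Drawing the ROC curve

Given $m^{+}$ positive examples and $m^{-}$ negative examples, sort them by the learner's predictions. First set the classification threshold to the maximum, so that every example is predicted negative; TPR and FPR are then both 0, and a mark is placed at (0, 0). Next, set the threshold to each example's predicted value in turn, so that the examples are classified as positive one at a time. Let the previous mark be (x, y). If the current example is a true positive, the new mark is $(x, y+\frac{1}{m^{+}})$; if it is a false positive, the new mark is $(x+\frac{1}{m^{-}}, y)$. Finally, connect the marks with line segments.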
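The drawing procedure translates almost line by line into code. This is a sketch under the assumptions that no two samples share a score and that both classes are present; `roc_points` and the data are illustrative.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) marks, starting at (0, 0), for labels with 1 = positive."""
    m_pos = sum(labels)
    m_neg = len(labels) - m_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: all predicted negative
    for i in order:
        if labels[i] == 1:
            tp += 1  # true positive: move up by 1/m+
        else:
            fp += 1  # false positive: move right by 1/m-
        points.append((fp / m_neg, tp / m_pos))
    return points


# Hypothetical scores and labels.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points[:4])  # [(0.0, 0.0), (0.0, 0.25), (0.0, 0.5), (0.25, 0.5)]
print(points[-1])  # (1.0, 1.0)
```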

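Computing AUC

As with the P-R plot, if the ROC curve of one learner is completely enclosed by the curve of another, the latter can be said to outperform the former. If the two curves cross, it is hard to say in general which is better. If a comparison is required anyway, a reasonable criterion is the area under the ROC curve, i.e. AUC (Area Under ROC Curve).

Suppose the ROC curve is formed by the points $\{(x_{1}, y_{1}), (x_{2}, y_{2}), ..., (x_{n}, y_{n})\}$ in order. Then AUC can be estimated as
$AUC = \frac{1}{2}\sum_{i = 1}^{n - 1}(x_{i+1} - x_{i})(y_{i} + y_{i+1})$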
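The estimate is the trapezoidal rule applied to consecutive ROC points. The sketch below uses a hypothetical list of (FPR, TPR) points sorted by increasing FPR; `auc_trapezoid` is an illustrative name.

```python
def auc_trapezoid(points):
    """Sum of trapezoid areas between consecutive (x, y) points on the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical ROC points from (0, 0) to (1, 1).
points = [
    (0.0, 0.0), (0.0, 0.25), (0.0, 0.5), (0.25, 0.5), (0.25, 0.75),
    (0.5, 0.75), (0.5, 1.0), (0.75, 1.0), (1.0, 1.0),
]
print(auc_trapezoid(points))  # 0.8125
```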

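Statistical interpretation of AUC

TPR and FPR can each be viewed as a probability. Let the random variable X be the ranking score that the learner computes for a sample x. Given a threshold parameter T, the sample is classified as positive when X > T:

TPR
If the sample's true class is positive, X follows the probability density function $f_{1}(x)$. The true positive rate can therefore be defined as
$TPR(T) = P(X > T \mid x\ is\ positive) = \int_{T}^{+\infty} f_{1}(x)\,dx$
FPR
If the sample's true class is negative, X follows the probability density function $f_{0}(x)$. The false positive rate can therefore be defined as
$FPR(T) = P(X > T \mid x\ is\ negative) = \int_{T}^{+\infty} f_{0}(x)\,dx$

Putting this together with the earlier material: for each model, TPR and FPR are themselves probabilities determined by these distributions. By sampling repeatedly we use statistics to estimate TPR and FPR, and we evaluate the model with the estimates.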

So besides being the area under the ROC curve, AUC also has a probabilistic interpretation (the derivation is hard and is skipped here):

The AUC is the probability the model will score a randomly chosen positive class higher than a randomly 
chosen negative class.

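In other words, if we randomly pick one positive sample and one negative sample, the probability that the positive sample scores higher than the negative one is the AUC.

If the positive and the negative sample have the same score, the order is decided at random: the positive sample ranks above the negative one with probability 1/2 and below it with probability 1/2.

Thanks to this probabilistic interpretation, AUC can be computed without drawing the ROC curve first; it can be computed directly from the samples. In a dataset with M positive samples and N negative samples there are $M\times N$ pairs (one pair = one positive and one negative). Count, among these $M\times N$ pairs, those in which the positive sample's predicted probability is greater than the negative sample's:
$AUC = \frac{ \sum_{x^{+} \in D^{+}} \sum_{x^{-}\in D^{-}} I(x^{+}, x^{-})}{M \times N}$
$I(x^{+}, x^{-}) = \left\{\begin{matrix} 1, & f(x^{+}) > f(x^{-})\\ 0.5, & f(x^{+}) = f(x^{-})\\ 0, & f(x^{+}) < f(x^{-}) \end{matrix}\right.$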
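The pair-counting formula can be implemented with two nested loops. This is a simple O(M × N) sketch meant for illustration; `auc_pairwise` and the scores are assumptions.

```python
def auc_pairwise(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted scores for the positive and negative samples.
pos_scores = [0.95, 0.9, 0.7, 0.4]
neg_scores = [0.8, 0.6, 0.3, 0.2]
print(auc_pairwise(pos_scores, neg_scores))  # 0.8125 (13 of 16 pairs)
```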

loss function

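Formally, AUC considers the ranking quality of the predictions, so it is closely tied to ranking error. Given $m^{+}$ positive examples and $m^{-}$ negative examples, let $D^{+}$ and $D^{-}$ denote the sets of positive and negative examples. The ranking loss can then be defined as:
$l_{rank} = \frac{1}{m^{+}m^{-}} \sum_{x^{+} \in D^{+}} \sum_{x^{-} \in D^{-}} \left(I(f(x^{+}) < f(x^{-})) + \frac{1}{2}I(f(x^{+}) = f(x^{-}))\right)$
That is, for every positive-negative pair, one penalty point is counted if the positive example's predicted value is smaller than the negative example's, and 0.5 penalty points if they are equal. $l_{rank}$ corresponds to the area above the ROC curve: if a positive example corresponds to the mark (x, y) on the ROC curve, then x is exactly the proportion of negative examples ranked before it. Therefore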
$AUC = 1 - l_{rank}$
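The identity can be checked on a small example by computing the ranking loss and the pairwise AUC side by side. The function names and scores below are illustrative assumptions.

```python
def rank_loss(pos_scores, neg_scores):
    """One penalty per mis-ordered (positive, negative) pair, half a penalty per tie."""
    penalty = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p < n:
                penalty += 1.0
            elif p == n:
                penalty += 0.5
    return penalty / (len(pos_scores) * len(neg_scores))


def auc_pairwise(pos_scores, neg_scores):
    """Correctly ordered pairs count 1, ties count 0.5."""
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return correct / (len(pos_scores) * len(neg_scores))


# Hypothetical scores, including one tie (0.6 appears in both lists).
pos_scores = [0.95, 0.9, 0.6, 0.4]
neg_scores = [0.8, 0.6, 0.3, 0.2]

loss = rank_loss(pos_scores, neg_scores)
auc = auc_pairwise(pos_scores, neg_scores)
print(loss, auc, loss + auc)  # the two values sum to 1
```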