Why Neural Networks Use Cross Entropy Instead of Classification Error

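This post uses a worked example to show why neural networks are trained with cross-entropy error rather than classification error. It compares cross-entropy error with classification error and mean squared error as measures of model quality, and explains what cross entropy captures that classification error misses.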


Intro

Classification neural networks use cross entropy, not classification error, to compute the cost. Why is that? I found an article through Google that explains it.

Analysis

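Suppose a neural network predicts political party affiliation. It uses softmax as the output activation and outputs a probability for each of 3 classes. For example: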

computed       | targets              | correct?
-----------------------------------------------
0.3  0.3  0.4  | 0  0  1 (democrat)   | yes
0.3  0.4  0.3  | 0  1  0 (republican) | yes
0.1  0.2  0.7  | 1  0  0 (other)      | no

The classification error is $ \frac{1}{3} $ and the accuracy is $ \frac{2}{3} $. The cross-entropy error for this network's first input is $ -\ln{0.3} \times 0 - \ln{0.3} \times 0 - \ln{0.4} \times 1 = -\ln{0.4} $. Terms with a target of 0 vanish, so only the probability assigned to the true class contributes. The average cross-entropy error over the three inputs is
$ -\frac{1}{3} \left( \ln{0.4} + \ln{0.4} + \ln{0.1} \right) = 1.38 $.
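For reference, the average computed above can be written in general form. This is only a sketch of the notation: the symbols $N$ (number of inputs), $K$ (number of classes), $y_{ik}$ (one-hot target) and $p_{ik}$ (softmax output) are introduced here for illustration and do not appear in the original article.

```latex
\mathrm{ACE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \ln p_{ik}
```

With $N = 3$ and $K = 3$, and the outputs and targets from the table, this reduces to $ -\frac{1}{3} \left( \ln{0.4} + \ln{0.4} + \ln{0.1} \right) $.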

computed       | targets              | correct?
-----------------------------------------------
0.1  0.2  0.7  | 0  0  1 (democrat)   | yes
0.1  0.7  0.2  | 0  1  0 (republican) | yes
0.3  0.4  0.3  | 1  0  0 (other)      | no

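Likewise, for this second network the classification error is $ \frac{1}{3} $ and the accuracy is $ \frac{2}{3} $. Its average cross-entropy error is
$ -\frac{1}{3} \left( \ln{0.7} + \ln{0.7} + \ln{0.3} \right) = 0.64 $. Although the classification errors are identical, the two cross-entropy errors differ, and the second is smaller than the first. This reflects that the second network assigns a higher probability to the true class on every input (0.7, 0.7, 0.3 versus 0.4, 0.4, 0.1), a difference that classification error cannot see.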
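The comparison can be checked numerically. The following is a minimal sketch in plain Python, using the outputs and targets from the two tables; the names `net1`, `net2`, `classification_error` and `average_cross_entropy` are made up for this illustration.

```python
import math

# (softmax output, one-hot target) pairs copied from the two tables.
net1 = [
    ([0.3, 0.3, 0.4], [0, 0, 1]),
    ([0.3, 0.4, 0.3], [0, 1, 0]),
    ([0.1, 0.2, 0.7], [1, 0, 0]),
]
net2 = [
    ([0.1, 0.2, 0.7], [0, 0, 1]),
    ([0.1, 0.7, 0.2], [0, 1, 0]),
    ([0.3, 0.4, 0.3], [1, 0, 0]),
]

def classification_error(samples):
    """Fraction of samples whose highest-probability class is not the target."""
    wrong = 0
    for computed, target in samples:
        predicted = computed.index(max(computed))
        if predicted != target.index(1):
            wrong += 1
    return wrong / len(samples)

def average_cross_entropy(samples):
    """Mean over samples of -sum_k target_k * ln(computed_k)."""
    total = 0.0
    for computed, target in samples:
        total += -sum(t * math.log(p) for p, t in zip(computed, target))
    return total / len(samples)

for name, net in (("net1", net1), ("net2", net2)):
    print(name,
          f"error={classification_error(net):.2f}",
          f"cross_entropy={average_cross_entropy(net):.2f}")
# net1 error=0.33 cross_entropy=1.38
# net2 error=0.33 cross_entropy=0.64
```

Both networks have the same classification error, while the average cross entropy separates them.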

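MSE (mean squared error), that is, the squared L2 distance between output and target averaged over the inputs, is actually not a bad choice either. The first network's error is $(0.54 + 0.54 + 1.34) / 3 = 0.81$, and the second network's error is $(0.14 + 0.14 + 0.74) / 3 = 0.34$. However, compared with cross entropy, MSE puts too much emphasis on the incorrect outputs. Cross entropy does not have this problem: with one-hot targets it depends only on the probability assigned to the true class.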
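The MSE figures can be reproduced the same way. This is a minimal sketch in plain Python, assuming the per-input error is the sum of squared differences across the three outputs, which is what matches the numbers quoted above; `net1`, `net2` and `mean_squared_error` are illustrative names.

```python
# (softmax output, one-hot target) pairs copied from the two tables.
net1 = [
    ([0.3, 0.3, 0.4], [0, 0, 1]),
    ([0.3, 0.4, 0.3], [0, 1, 0]),
    ([0.1, 0.2, 0.7], [1, 0, 0]),
]
net2 = [
    ([0.1, 0.2, 0.7], [0, 0, 1]),
    ([0.1, 0.7, 0.2], [0, 1, 0]),
    ([0.3, 0.4, 0.3], [1, 0, 0]),
]

def squared_error(computed, target):
    """Sum of squared differences between one output vector and its target."""
    return sum((p - t) ** 2 for p, t in zip(computed, target))

def mean_squared_error(samples):
    """Average of the per-input squared errors."""
    return sum(squared_error(c, t) for c, t in samples) / len(samples)

for name, net in (("net1", net1), ("net2", net2)):
    per_input = [round(squared_error(c, t), 2) for c, t in net]
    print(name, per_input, f"mse={mean_squared_error(net):.2f}")
# net1 [0.54, 0.54, 1.34] mse=0.81
# net2 [0.14, 0.14, 0.74] mse=0.34
```

Unlike cross entropy, every output contributes to the squared error, including the probabilities assigned to the wrong classes.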