The difference between Softmax and Softmax-Loss

This article takes a close look at the Softmax function and the Softmax-Loss layer in deep learning, explains how the two are related and how they differ, and focuses in particular on numerical stability. An experiment compares the accuracy and efficiency of computing everything in a single combined layer versus splitting the computation into two layers.


Reposted from: The difference between Softmax and Softmax-Loss

The softmax loss layer computes the multinomial logistic loss of the softmax of its inputs. It’s conceptually identical to a softmax layer followed by a multinomial logistic loss layer, but provides a more numerically stable gradient.

This is how Caffe's official documentation introduces the softmax-loss layer. Caffe is a very popular C++/CUDA library for deep convolutional neural networks (CNNs); thanks to its clear code structure and clean design, both academia and industry like to use and extend it for machine learning work.

Today I want to discuss the words quoted above: in numerical computing (and in any other engineering area), there is a considerable distance between the textbook algorithm and a program that works well in practice. There are lots of small tricks that may seem insignificant but end up making a huge difference to the result. The reason is simple: theoretical work is precise and clean because it ignores "irrelevant" details through simplification and abstraction, so that everything can be manipulated under a series of assumptions inside a well-defined theoretical system. When we apply the theory in practice, however, we have to add back all the details that were overlooked, otherwise things fall apart. Figuring out what happens once those assumptions are no longer strictly satisfied is genuinely difficult; making the original theory fit reality "well enough" is exactly what an engineer has to do.

So the softmax function σ(z) = (σ_1(z), σ_2(z), …, σ_m(z)) is defined as:

σ_i(z) = exp(z_i) / Σ_j exp(z_j),   i = 1, …, m

Its role in logistic regression is to translate linear predictions into category probabilities: imagine z_i = w_i·x + b_i is the linear prediction for category i. Exponentiating makes every z_i non-negative, and normalizing by the sum turns them into a distribution, so each o_i = σ_i(z) can be interpreted as the probability (or likelihood) that the data point x belongs to category i.
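
As a quick illustration (a toy example with made-up numbers, written in Julia like the experiments below):

z = [2.0, 1.0, -1.0]           # hypothetical linear predictions z_i = w_i*x + b_i
o = exp.(z) ./ sum(exp.(z))    # softmax: every o_i is non-negative and they sum to 1
# o ≈ [0.705, 0.259, 0.035], so category 1 gets the highest probability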

Based on the maximum-likelihood principle, we can define the objective of logistic regression (also known as the multinomial logistic loss) as

ℓ(y, o) = -log(o_y)

so what we need to do is maximize o_y, the probability assigned to the correct category y, which is the same as minimizing ℓ.
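
For instance, with made-up numbers: if the softmax output is o = (0.7, 0.2, 0.1) and the true label is y = 1, the loss is -log(0.7) ≈ 0.357; if the model had instead assigned only o_y = 0.1 to the correct category, the loss would grow to -log(0.1) ≈ 2.30.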

And what the softmax-loss function does is simply combine these two steps into one:

ℓ(y, z) = -log(σ_y(z)) = -z_y + log Σ_j exp(z_j)
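
To make the equivalence concrete, here is a minimal sketch (the function names are mine, not Caffe's):

# Two-step computation: softmax first, then the multinomial logistic loss.
softmax_then_loss(z, y) = -log((exp.(z) ./ sum(exp.(z)))[y])

# Fused softmax-loss: -z[y] + log(sum(exp(z))), mathematically identical.
fused_softmax_loss(z, y) = -z[y] + log(sum(exp.(z)))

z = [2.0, 1.0, -1.0]
softmax_then_loss(z, 1), fused_softmax_loss(z, 1)   # both ≈ 0.349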

For a problem like plain (two-class or multi-class) logistic regression, it is convenient to use the fused softmax-loss directly. But if you are designing a deep-neural-network library, you might prefer to provide the two pieces separately: a deep learning model is a layered structure, and the main job of a computing library is to provide a wide variety of layers that users can combine into different network architectures. For example, a user who only wants the per-category probability (likelihood) values needs just a Softmax layer, with no multinomial logistic loss operation attached. Providing two separate layers is therefore more flexible and more modular than providing only a single softmax-loss layer. But it raises a numerical-stability problem.

First, we need to understand what backpropagation means. Take the 3-layer neural network shown in the figure: every layer except the initial data layer L0 has input nodes and output nodes. Typically, a layer's input nodes are simply a "copy" of the previous layer's output nodes, because all computational operations happen inside each layer's structure. For an ordinary neural network, the computation of a layer is usually a linear map followed by a sigmoid nonlinearity, for example:

o = f(W·x + b),   where f(t) = 1 / (1 + exp(-t)) is the sigmoid applied element-wise.
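
A minimal sketch of such a layer's forward computation (the names sigmoid_forward, W, b, x are mine, for illustration only):

sigmoid(t) = 1 ./ (1 .+ exp.(-t))

# Forward pass of one layer: linear map followed by an element-wise sigmoid.
function sigmoid_forward(W, b, x)
    a = W*x .+ b         # linear part
    return sigmoid(a)    # nonlinear part: o = f(a)
end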

Using the chain rule from calculus, we get the following formula for the gradient of the loss with respect to the parameters W^(1) of the first layer:

∂Loss/∂W^(1) = (∂Loss/∂o^(1)) · (∂o^(1)/∂W^(1))

where o^(1) denotes the output of the first layer.

Notice that the second factor, ∂o^(1)/∂W^(1), depends only on the internal structure of the first layer and can be computed from that layer's local information. The first factor, ∂Loss/∂o^(1), involves only the layer's output nodes; and since we just said that the output nodes are equal to the next layer's input nodes, it can be computed by the layers above without knowing anything about the first layer.

This is how backpropagation proceeds: each layer receives ∂Loss/∂(output) from the layer above, uses its local structure to turn it into ∂Loss/∂(parameters) for the gradient update and into ∂Loss/∂(input), and passes the latter down to the layer below, where the same step repeats.
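
Continuing the sketch above (again with hypothetical names), the backward pass of the same layer takes ∂Loss/∂o from the layer above and produces the gradient for its own parameters plus the quantity to hand down:

# Backward pass: given dLoss/do from above, return dLoss/dW, dLoss/db
# (for the parameter update) and dLoss/dx (to pass to the layer below).
function sigmoid_backward(W, b, x, dLoss_do)
    o  = sigmoid_forward(W, b, x)    # recompute the forward pass for simplicity
    da = dLoss_do .* o .* (1 .- o)   # chain rule through the sigmoid: f'(a) = f(a)(1 - f(a))
    dW = da * x'                     # dLoss/dW (outer product)
    db = da                          # dLoss/db
    dx = W' * da                     # dLoss/dx, handed to the layer below
    return dW, db, dx
end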

Let's go back to the softmax-loss layer. Since this layer has no parameters of its own, in the backward pass we only need to compute the derivative with respect to its input; and since it is the topmost layer, we can compute the derivative of the final output (the loss) directly, without applying the chain rule across layers. As mentioned before, the softmax-loss layer has two inputs: the true label y, which comes directly from the data layer at the bottom (and there are no parameters to update by gradient descent there), and the vector z from the computation layer below, which is a fully connected linear (inner product) layer both for logistic regression and for ordinary DNNs. With a bit of calculus we can work out:

∂ℓ(y, z)/∂z_k = σ_k(z) - δ_{ky}

that is, σ_k(z) - 1 when k = y, and σ_k(z) otherwise.

Here σ_k(z) is the result of the softmax computation, which is the intermediate step inside the softmax-loss layer.

What if the Softmax layer and the multinomial logistic loss layer are split into two separate layers? Write the output of the Softmax layer, i.e. the input of the loss layer, as o_i = σ_i(z). Then we first need to compute the derivative at the top (loss) layer:

∂ℓ(y, o)/∂o_i = -1/o_y  if i = y,  and 0 otherwise.

As we pass this derivative down and reach the Softmax layer, we apply the chain rule together with the Jacobian of the softmax, ∂o_i/∂z_k = o_i(δ_{ik} - o_k):

∂ℓ/∂z_k = Σ_i (∂ℓ/∂o_i)(∂o_i/∂z_k)

In matrix form this is the product (diag(o) - o·oᵀ) · ∂ℓ/∂o, which is exactly what the separated implementation below computes.

If you carry out this chain-rule product explicitly, only the i = y term survives:

∂ℓ/∂z_k = (-1/o_y) · o_y(δ_{yk} - o_k) = o_k - δ_{ky} = σ_k(z) - δ_{ky}

Although the end result is the same, we can already conclude that splitting the computation into two layers requires noticeably more work. What we care about even more, however, is numerical stability. Floating-point numbers have limited precision, and every operation accumulates some error; if o_y is very inaccurate, in particular when the probability of the correct category is very small (close to 0), there is a danger of overflow when dividing by it. We can run some experiments (written in Julia):

function softmax(z)
    # z = z - maximum(z)   # numerically stable variant, discussed below
    o = exp(z)
    return o / sum(o)
end

# Fused softmax-loss gradient: dl/dz = softmax(z) - onehot(y)
function gradient_together(z, y)
    o = softmax(z)
    o[y] -= 1.0
    return o
end

# Separated version: gradient of the loss w.r.t. o, multiplied by the
# softmax Jacobian diag(o) - o*o'
function gradient_separated(z, y)
    o = softmax(z)
    ∂o_∂z = diagm(o) - o*o'
    ∂f_∂o = zeros(size(o))
    ∂f_∂o[y] = -1.0 / o[y]
    return ∂o_∂z * ∂f_∂o
end
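
As a quick sanity check (my own snippet, not part of the original experiment), the two implementations agree on a well-behaved input:

z = [2.0, 1.0, -1.0]
gradient_together(z, 1)    # ≈ [-0.295, 0.259, 0.035]
gradient_separated(z, 1)   # same values, computed the long way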

Because float (Float32) has lower precision than double (Float64), we use the result computed in double as the approximate "true value", and then compare, for both implementations, how far the float result deviates from it. The plotting code is as follows:

using DataFrames
using Gadfly

M = 100
y = 1

zy = vec(10f0 .^ (-38:5:38))   # float range is roughly [1.2*10^-38, 3.4*10^38]
zy = [-reverse(zy); zy]

srand(12345)
n_rep = 50
discrepancy_together = zeros(length(zy), n_rep)
discrepancy_separated = zeros(length(zy), n_rep)

for i = 1:n_rep
    z = rand(Float32, M)   # use float instead of double
    discrepancy_together[:,i] = [begin
        z[y] = x
        true_grad = gradient_together(convert(Array{Float64}, z), y)
        got_grad = gradient_together(z, y)
        abs(true_grad[y] - got_grad[y])
    end for x in zy]
    discrepancy_separated[:,i] = [begin
        z[y] = x
        true_grad = gradient_together(convert(Array{Float64}, z), y)
        got_grad = gradient_separated(z, y)
        abs(true_grad[y] - got_grad[y])
    end for x in zy]
end

df1 = DataFrame(x=zy, y=vec(mean(discrepancy_together, 2)),
                label="together")
df2 = DataFrame(x=zy, y=vec(mean(discrepancy_separated, 2)),
                label="separated")
df = vcat(df1, df2)

format_func(x) = @sprintf("%s10<sup>%d</sup>", x<0?"-":"", int(log10(abs(x))))
the_plot = plot(df, x="x", y="y", color="label",
                Geom.point, Geom.line, Geom.errorbar,
                Guide.xticks(ticks=int(linspace(1, length(zy), 10))),
                Scale.x_discrete(labels=format_func),
                Guide.xlabel("z[y]"), Guide.ylabel("discrepancy"))

Note that z[y] also takes negative values of very large magnitude (the left half of the horizontal axis). In the resulting plot we can compare the two methods across this whole range: the horizontal axis is the value of z[y], and the vertical axis is the difference between each method's result and the "true value".

The first thing you notice is that computing everything directly in a single layer is indeed more accurate than splitting it into two layers, although the gap is actually very small. Now look at the left side: the yellow points disappear because their result has become NaN. For example, once o_y underflows to zero (falling below the precision range of float), 1/o_y becomes Inf, and multiplying Inf by zero in the next step produces NaN, Not a Number. Looking at the blue line, it may seem strange that the accuracy appears to improve towards the left; that is because our "true value" underflows as well: double has far more range than float, but it is still finite. According to Wikipedia, the representable range of float is roughly 10^-38 to 10^38, and that of double roughly 10^-308 to 10^308, so we can take the point x = -10² in the plot as a test case.
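
The NaN mechanism is easy to reproduce on its own (a tiny illustration, not part of the original experiment):

o_y = 0.0f0        # the correct-class probability after underflow
g = -1.0f0 / o_y   # dividing by the underflowed zero gives -Inf32
g * 0.0f0          # Inf multiplied by 0 gives NaN32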

So for x = -10², exp(x) is around 10^-44, which already falls outside what float can hold accurately (after normalization o_y underflows to zero), though it is still well within the range of double; the difference between the float result and the Float64 one is therefore so small that it is invisible in the plot. If the exponent becomes even more negative, the double value underflows as well, our "true value" becomes exactly 0, and the measured "error" drops to 0. The other problem is that when x reaches 10², both the blue and the yellow lines are gone, because exp(x) overflows in float and both methods end up with NaN (Not a Number).
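
A quick check of the two boundary points (my own snippet; 170 stands in for a typical value of the normalizing sum when the other 99 entries are uniform in [0, 1)):

exp(-100f0)            # ≈ 3.7e-44: representable in Float32 only as a subnormal
exp(-100f0) / 170f0    # the normalized probability o_y underflows to 0.0f0
exp(-100.0) / 170.0    # ≈ 2.2e-46: still fine in Float64 (double)
exp(100f0)             # overflows to Inf32 ...
exp(100.0)             # ... but is ≈ 2.7e43 in Float64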

One solution to this problem is shown in the second, commented-out line of the softmax function: we subtract the maximum of z from every element before exponentiating. The largest entry becomes 0, so the exponentiation can no longer overflow; the other entries are shifted down by a large amount and may become large negative numbers that underflow, but since the underflow result is 0 (a perfectly meaningful approximation here), no strange NaN shows up in the subsequent computations.
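
A minimal sketch of the stabilized version (the commented-out line enabled, written with explicit broadcasting):

function softmax_stable(z)
    z = z .- maximum(z)    # shift so the largest exponent is 0: exp can no longer overflow
    o = exp.(z)            # entries far below the maximum may underflow to 0, which is fine
    return o ./ sum(o)     # the sum is at least 1, so the division is safe
end

softmax_stable(Float32[100, 1, -1])   # ≈ [1.0, 0.0, 0.0], no Inf or NaN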
