C2_W2_SoftMax_吴恩达_中英_Pytorch_Tensorflow_Numpy

Optional Lab - Softmax Function

In this lab, we will explore the softmax function. This function is used in both Softmax Regression and in Neural Networks when solving Multiclass Classification problems.


import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from IPython.display import display, Markdown, Latex
from sklearn.datasets import make_blobs
%matplotlib widget
from matplotlib.widgets import Slider
from lab_utils_common import dlc
from lab_utils_softmax import plt_softmax
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
tf.autograph.set_verbosity(0)

Note: Normally, in this course, the notebooks use the convention of starting counts with 0 and ending with N-1, $\sum_{i=0}^{N-1}$, while lectures start with 1 and end with N, $\sum_{i=1}^{N}$. This is because code will typically start iteration with 0 while in lecture, counting 1 to N leads to cleaner, more succinct equations. This notebook has more equations than is typical for a lab and thus will break with the convention and will count 1 to N.

Softmax Function

In both softmax regression and neural networks with Softmax outputs, N outputs are generated and one output is selected as the predicted category. In both cases a vector $\mathbf{z}$ is generated by a linear function and then passed to a softmax function. The softmax function converts $\mathbf{z}$ into a probability distribution as described below. After applying softmax, each output will be between 0 and 1 and the outputs will add to 1, so that they can be interpreted as probabilities. The larger inputs will correspond to larger output probabilities.

The softmax function can be written:
$$a_j = \frac{e^{z_j}}{ \sum_{k=1}^{N}{e^{z_k}}} \tag{1}$$
The output $\mathbf{a}$ is a vector of length N, so for softmax regression, you could also write:

$$\mathbf{a}(x) = \begin{bmatrix} P(y = 1 | \mathbf{x}; \mathbf{w},b) \\ \vdots \\ P(y = N | \mathbf{x}; \mathbf{w},b) \end{bmatrix} = \frac{1}{ \sum_{k=1}^{N}{e^{z_k}}} \begin{bmatrix} e^{z_1} \\ \vdots \\ e^{z_{N}} \end{bmatrix} \tag{2}$$

This shows that the output is a vector of probabilities. The first entry is the probability that the input belongs to the first category, given the input $\mathbf{x}$ and parameters $\mathbf{w}$ and $\mathbf{b}$.

Let’s create a NumPy implementation:

def my_softmax(z):
    ez = np.exp(z)              # element-wise exponential
    sm = ez/np.sum(ez)
    return sm
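
As a quick sanity check, here is a minimal usage sketch (not part of the original lab); the input values are made up. It shows that each output lies between 0 and 1 and that the outputs sum to one:

z = np.array([1., 2., 3., 4.])
a = my_softmax(z)
print(a)            # roughly [0.032 0.087 0.237 0.644]
print(np.sum(a))    # 1.0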

Below, vary the values of the z inputs using the sliders.

plt.close("all")
plt_softmax(my_softmax)


As you are varying the values of the z’s above, there are a few things to note:

  • the exponential in the numerator of the softmax magnifies small differences in the values
  • the output values sum to one
  • the softmax spans all of the outputs. A change in z0, for example, will change the values of a0-a3. Compare this to other activations such as ReLU or Sigmoid, which have a single input and a single output (see the sketch after this list).
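
The coupling across outputs can also be seen directly in code. Below is a minimal sketch (not part of the original lab) that changes only z0 and compares the softmax, where every output moves, with an element-wise sigmoid, where only the first output moves:

z1 = np.array([1., 2., 3., 4.])
z2 = z1.copy()
z2[0] = 3.                                   # change only z0
print(my_softmax(z1), my_softmax(z2))        # all four outputs change
sigmoid = lambda v: 1 / (1 + np.exp(-v))     # element-wise activation for comparison
print(sigmoid(z1), sigmoid(z2))              # only the first output changes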

Cost


The loss function associated with Softmax, the cross-entropy loss, is:
$$L(\mathbf{a},y)=\begin{cases} -\log(a_1), & \text{if } y=1\\ \quad\vdots \\ -\log(a_N), & \text{if } y=N \end{cases} \tag{3}$$

Where y is the target category for this example and $\mathbf{a}$ is the output of a softmax function. In particular, the values in $\mathbf{a}$ are probabilities that sum to one.

Recall: In this course, Loss is for one example while Cost covers all examples.

Note in (3) above, only the line that corresponds to the target contributes to the loss, other lines are zero. To write the cost equation we need an ‘indicator function’ that will be 1 when the index matches the target and zero otherwise.
$$\mathbf{1}\{y == n\} = \begin{cases} 1, & \text{if } y==n\\ 0, & \text{otherwise} \end{cases}$$
Now the cost is:
$$J(\mathbf{w},b) = -\frac{1}{m}\left[ \sum_{i=1}^{m} \sum_{j=1}^{N} 1\left\{y^{(i)} == j\right\} \log \frac{e^{z^{(i)}_j}}{\sum_{k=1}^N e^{z^{(i)}_k}}\right] \tag{4}$$

Where $m$ is the number of examples and $N$ is the number of outputs. This is the average of all the losses.
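
For concreteness, here is a minimal NumPy sketch (not part of the original lab) of cost (4); it assumes Z is an (m, N) matrix of logits and y holds integer targets, 0-indexed as is usual in code:

def my_cost(Z, y):
    A = np.exp(Z) / np.sum(np.exp(Z), axis=1, keepdims=True)   # row-wise softmax
    m = Z.shape[0]
    return -np.mean(np.log(A[np.arange(m), y]))                # average of -log(a) at the target index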

Tensorflow

This lab will discuss two ways of implementing the softmax, cross-entropy loss in Tensorflow, the ‘obvious’ method and the ‘preferred’ method. The former is the most straightforward while the latter is more numerically stable.

Let’s start by creating a dataset to train a multiclass classification model.

# make  dataset for example
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0,random_state=30)

The Obvious organization

The model below is implemented with the softmax as an activation in the final Dense layer.

The loss function is separately specified in the compile directive.

The loss function is SparseCategoricalCrossentropy, the loss described in (3) above. In this model, the softmax takes place in the last layer, and the loss function takes in the softmax output, which is a vector of probabilities.

model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'softmax')    # < softmax activation here
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001),
)

model.fit(
    X_train,y_train,
    epochs=10
)
        
Epoch 1/10
63/63 [==============================] - 0s 2ms/step - loss: 1.0336
Epoch 2/10
63/63 [==============================] - 0s 2ms/step - loss: 0.5932
Epoch 3/10
63/63 [==============================] - 0s 2ms/step - loss: 0.4330
Epoch 4/10
63/63 [==============================] - 0s 2ms/step - loss: 0.2878
Epoch 5/10
63/63 [==============================] - 0s 2ms/step - loss: 0.1397
Epoch 6/10
63/63 [==============================] - 0s 2ms/step - loss: 0.0817
Epoch 7/10
63/63 [==============================] - 0s 2ms/step - loss: 0.0609
Epoch 8/10
63/63 [==============================] - 0s 2ms/step - loss: 0.0507
Epoch 9/10
63/63 [==============================] - 0s 2ms/step - loss: 0.0450
Epoch 10/10
63/63 [==============================] - 0s 2ms/step - loss: 0.0404





<keras.callbacks.History at 0x2009a247520>

Because the softmax is integrated into the output layer, the output is a vector of probabilities.

p_nonpreferred = model.predict(X_train)
print(p_nonpreferred [:2])
print("largest value", np.max(p_nonpreferred), "smallest value", np.min(p_nonpreferred))
[[1.73e-03 6.38e-03 9.69e-01 2.24e-02]
 [9.98e-01 1.48e-03 4.05e-06 3.23e-04]]
largest value 0.99999964 smallest value 1.0463555e-10

Preferred

Recall from lecture, more stable and accurate results can be obtained if the softmax and loss are combined during training. This is enabled by the ‘preferred’ organization shown here.

In the preferred organization the final layer has a linear activation. For historical reasons, the outputs in this form are referred to as logits. The loss function has an additional argument: from_logits = True. This informs the loss function that the softmax operation should be included in the loss calculation. This allows for an optimized implementation.
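
To see that the two organizations compute the same quantity (aside from numerical issues), here is a minimal sketch (not part of the original lab); the logits and label are made up for illustration:

logits = tf.constant([[2.0, 1.0, 0.1, -1.0]])
y_true = tf.constant([0])
scc_from_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
scc_on_probs    = tf.keras.losses.SparseCategoricalCrossentropy()
print(scc_from_logits(y_true, logits).numpy())                 # loss computed directly from logits
print(scc_on_probs(y_true, tf.nn.softmax(logits)).numpy())     # same loss from an explicit softmax output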

preferred_model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'linear')   #<-- Note
    ]
)
preferred_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  #<-- Note
    optimizer=tf.keras.optimizers.Adam(0.001),
)

preferred_model.fit(
    X_train,y_train,
    epochs=10
)
        
Epoch 1/10
63/63 [==============================] - 0s 2ms/step - loss: 0.9375
Epoch 2/10
63/63 [==============================] - 0s 2ms/step - loss: 0.3674
Epoch 3/10
63/63 [==============================] - 0s 2ms/step - loss: 0.1787
Epoch 4/10
63/63 [==============================] - 0s 2ms/step - loss: 0.1044
Epoch 5/10
63/63 [==============================] - 0s 2ms/step - loss: 0.0747
Epoch 6/10
63/63 [==============================] - 0s 2ms/step - loss: 0.0607
Epoch 7/10
63/63 [==============================] - 0s 2ms/step - loss: 0.0521
Epoch 8/10
63/63 [==============================] - 0s 2ms/step - loss: 0.0467
Epoch 9/10
63/63 [==============================] - 0s 2ms/step - loss: 0.0425
Epoch 10/10
63/63 [==============================] - 0s 2ms/step - loss: 0.0395





<keras.callbacks.History at 0x200e66b69d0>
Output Handling

Notice that in the preferred model, the outputs are not probabilities, but can range from large negative numbers to large positive numbers. The output must be sent through a softmax when performing a prediction that expects a probability.

Let’s look at the preferred model outputs:

p_preferred = preferred_model.predict(X_train)
print(f"two example output vectors:\n {p_preferred[:2]}")
print("largest value", np.max(p_preferred), "smallest value", np.min(p_preferred))
two example output vectors:
 [[-4.   -1.55  3.53 -1.07]
 [ 5.49  0.48 -4.25 -8.32]]
largest value 12.004236 smallest value -17.230433

The output predictions are not probabilities!

If the desired outputs are probabilities, the output should be processed by a softmax.

sm_preferred = tf.nn.softmax(p_preferred).numpy()
print(f"two example output vectors:\n {sm_preferred[:2]}")
print("largest value", np.max(sm_preferred), "smallest value", np.min(sm_preferred))
two example output vectors:
 [[5.27e-04 6.11e-03 9.83e-01 9.88e-03]
 [9.93e-01 6.57e-03 5.80e-05 9.92e-07]]
largest value 0.9999994 smallest value 8.883205e-12

To select the most likely category, the softmax is not required. One can find the index of the largest output using np.argmax().

for i in range(5):
    print( f"{p_preferred[i]}, category: {np.argmax(p_preferred[i])}")
[-4.   -1.55  3.53 -1.07], category: 2
[ 5.49  0.48 -4.25 -8.32], category: 0
[ 3.94  0.7  -3.16 -6.49], category: 0
[-1.42  3.89 -0.8  -1.69], category: 1
[-1.2  -3.11  4.69 -7.12], category: 2

SparseCategoricalCrossentropy or CategoricalCrossentropy

Tensorflow has two potential formats for target values and the selection of the loss defines which is expected.

  • SparseCategoricalCrossentropy: expects the target to be an integer corresponding to the index. For example, if there are 10 potential target values, y would be between 0 and 9.
  • CategoricalCrossentropy: expects the target value of an example to be one-hot encoded, where the value at the target index is 1 while the other N-1 entries are zero. An example with 10 potential target values, where the target is 2, would be [0,0,1,0,0,0,0,0,0,0]. The sketch after this list applies both losses to the same data.
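
The difference is only in the target format, not in the loss value. Below is a minimal sketch (not part of the original lab) applying both losses to the same made-up logits for this lab's 4 classes:

y_int    = np.array([2, 0])                           # integer targets for SparseCategoricalCrossentropy
y_onehot = tf.one_hot(y_int, depth=4).numpy()         # one-hot targets for CategoricalCrossentropy
logits   = np.array([[1.0, 2.0, 5.0, 0.5],
                     [4.0, 0.1, -1.0, 0.3]])
sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
onehot_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(sparse_loss(y_int, logits).numpy())             # same value from both
print(onehot_loss(y_onehot, logits).numpy())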

Congratulations!

In this lab you

  • Became more familiar with the softmax function and its use in softmax regression and in softmax activations in neural networks.
  • Learned the preferred model construction in Tensorflow:
    • No activation on the final layer (same as linear activation)
    • SparseCategoricalCrossentropy loss function
    • use from_logits=True
  • Recognized that unlike ReLU and Sigmoid, the softmax spans multiple outputs.

Numerical Stability (optional)

This section discusses some of the methods employed to improve numerical stability. This is for the interested reader and is not at all required.

Softmax Numerical Stability

The inputs to the softmax are the outputs of a linear layer $z_j = \mathbf{w_j} \cdot \mathbf{x}^{(i)}+b$. These may be large numbers. The first step of the softmax algorithm computes $e^{z_j}$. This can result in an overflow error if the number gets too large. Try running the cell below:

for z in [500,600,700,800]:
    ez = np.exp(z)
    zs = "{" + f"{z}" + "}"
    print(f"e^{zs} = {ez:0.2e}")
e^{500} = 1.40e+217
e^{600} = 3.77e+260
e^{700} = 1.01e+304
e^{800} = inf


C:\Users\10766\AppData\Local\Temp\ipykernel_17368\1141864107.py:2: RuntimeWarning: overflow encountered in exp
  ez = np.exp(z)

The operation will generate an overflow if the exponent gets too large. Naturally, my_softmax() will generate the same errors:

z_tmp = np.array([[500,600,700,800]])
my_softmax(z_tmp)
C:\Users\10766\AppData\Local\Temp\ipykernel_17368\1989128138.py:2: RuntimeWarning: overflow encountered in exp
  ez = np.exp(z)              #element-wise exponenial
C:\Users\10766\AppData\Local\Temp\ipykernel_17368\1989128138.py:3: RuntimeWarning: invalid value encountered in true_divide
  sm = ez/np.sum(ez)





array([[ 0.,  0.,  0., nan]])

Numerical stability can be improved by reducing the size of the exponent.

Recall
$$e^{a + b} = e^a e^b$$
If the $b$ were the opposite sign of $a$, this would reduce the size of the exponent. Specifically, if you multiplied the softmax by a fraction:
$$a_j = \frac{e^{z_j}}{ \sum_{i=1}^{N}{e^{z_i}}} \frac{e^{-b}}{e^{-b}}$$
the exponent would be reduced and the value of the softmax would not change. If $b$ in $e^b$ were the largest value of the $z_j$'s, $max_j(\mathbf{z})$, the exponent would be reduced to its smallest value.
$$\begin{align} a_j &= \frac{e^{z_j}}{ \sum_{i=1}^{N}{e^{z_i}}} \frac{e^{-max_j(\mathbf{z})}}{e^{-max_j(\mathbf{z})}} \\ &= \frac{e^{z_j-max_j(\mathbf{z})}}{ \sum_{i=1}^{N}{e^{z_i-max_j(\mathbf{z})}}} \end{align}$$
It is customary to say $C=max_j(\mathbf{z})$ since the equation would be correct with any constant C.

$$a_j = \frac{e^{z_j-C}}{ \sum_{i=1}^{N}{e^{z_i-C}}} \quad\quad\text{where}\quad C=max_j(\mathbf{z}) \tag{5}$$

If we look at our troublesome example where $\mathbf{z}$ contains 500, 600, 700, 800, $C=max_j(\mathbf{z})=800$:

$$\mathbf{a}(x) = \frac{1}{ e^{500-800} + e^{600-800} + e^{700-800} + e^{800-800}} \begin{bmatrix} e^{500-800} \\ e^{600-800} \\ e^{700-800} \\ e^{800-800} \end{bmatrix} = \begin{bmatrix} 5.15e-131 \\ 1.38e-87 \\ 3.7e-44 \\ 1.0 \end{bmatrix}$$

Let’s rewrite my_softmax to improve its numerical stability.

def my_softmax_ns(z):
    """numerical stability improved"""
    bigz = np.max(z)
    ez = np.exp(z-bigz)              # minimize the size of the exponent
    sm = ez/np.sum(ez)
    return sm

Let’s try this and compare it to the tensorflow implementation:

z_tmp = np.array([500.,600,700,800])
print(tf.nn.softmax(z_tmp).numpy(), "\n", my_softmax_ns(z_tmp))
[5.15e-131 1.38e-087 3.72e-044 1.00e+000] 
 [5.15e-131 1.38e-087 3.72e-044 1.00e+000]

Large values no longer cause an overflow.

Cross Entropy Loss Numerical Stability

The loss function associated with Softmax, the cross-entropy loss, is repeated here:

$$L(\mathbf{a},y)=\begin{cases} -\log(a_1), & \text{if } y=1\\ \quad\vdots \\ -\log(a_N), & \text{if } y=N \end{cases}$$

Where y is the target category for this example and $\mathbf{a}$ is the output of a softmax function. In particular, the values in $\mathbf{a}$ are probabilities that sum to one.

Let's consider a case where the target is two ($y=2$) and just look at the loss for that case. This will result in the loss being:
$$L(\mathbf{a})= -\log(a_2)$$
Recall that $a_2$ is the output of the softmax function described above, so this can be written:
$$L(\mathbf{z})= -\log\left(\frac{e^{z_2}}{ \sum_{i=1}^{N}{e^{z_i}}}\right) \tag{6}$$
This can be optimized. However, to make those optimizations, the softmax and the loss must be calculated together as shown in the ‘preferred’ Tensorflow implementation you saw above.

Starting from (6) above and using the fact that $\log(\frac{a}{b}) = \log(a) - \log(b)$, the loss for the case of y=2 can be rewritten:
$$L(\mathbf{z})= -\left[\log(e^{z_2}) - \log \sum_{i=1}^{N}{e^{z_i}} \right] \tag{7}$$
The first term can be simplified to just $z_2$:
$$L(\mathbf{z})= -\left[z_2 - \log\left( \sum_{i=1}^{N}{e^{z_i}} \right)\right] = \underbrace{\log \sum_{i=1}^{N}{e^{z_i}}}_\text{logsumexp()} - z_2 \tag{8}$$
It turns out that the $\log \sum_{i=1}^{N}{e^{z_i}}$ term in the above equation is used so often that many libraries provide an implementation. In Tensorflow this is tf.math.reduce_logsumexp(). An issue with this sum is that the exponent in the sum could overflow if $z_i$ is large. To fix this, we might like to subtract $max_j(\mathbf{z})$ from each exponent as we did above, but this will require a bit of work:
$$\begin{align} \log \sum_{i=1}^{N}{e^{z_i}} &= \log \sum_{i=1}^{N}{e^{(z_i - max_j(\mathbf{z}) + max_j(\mathbf{z}))}} \tag{9}\\ &= \log \sum_{i=1}^{N}{e^{(z_i - max_j(\mathbf{z}))} e^{max_j(\mathbf{z})}} \\ &= \log(e^{max_j(\mathbf{z})}) + \log \sum_{i=1}^{N}{e^{(z_i - max_j(\mathbf{z}))}} \\ &= max_j(\mathbf{z}) + \log \sum_{i=1}^{N}{e^{(z_i - max_j(\mathbf{z}))}} \end{align}$$
Now, the exponential is less likely to overflow. It is customary to say $C=max_j(\mathbf{z})$ since the equation would be correct with any constant C. We can now write the loss equation:

$$L(\mathbf{z})= C + \log\left( \sum_{i=1}^{N}{e^{z_i-C}} \right) - z_2 \;\;\;\text{where } C=max_j(\mathbf{z}) \tag{10}$$
This is a computationally simpler, more stable version of the loss. The above is for an example where the target is y=2, but it generalizes to any target.
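
For the interested reader, here is a minimal NumPy sketch (not part of the original lab) of the stable loss in (10) for a single example; target is the 0-indexed true class:

def stable_ce_loss(z, target):
    C = np.max(z)                                        # C = max_j(z)
    return C + np.log(np.sum(np.exp(z - C))) - z[target]

z_tmp = np.array([500., 600., 700., 800.])
print(stable_ce_loss(z_tmp, 2))                          # ~100.0, computed without overflow

This is, in essence, the kind of computation that the 'preferred' organization with from_logits=True allows Tensorflow to perform internally.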

Pytorch


import torch
import numpy as np
from sklearn.datasets import make_blobs
from torch.utils.data import TensorDataset, DataLoader
from torch import nn
# define the training device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# create the dataset
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0,random_state=30)

print(X_train.shape, y_train.shape)
print(X_train[:5])

X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.long)

dataset = TensorDataset(X_train, y_train)
dataloader = DataLoader(dataset, batch_size=10, shuffle=False)
(2000, 2) (2000,)
[[ 1.55508243  0.84801682]
 [-5.33749882  1.03397255]
 [-4.09353183  0.67843096]
 [-1.35928349 -1.49568732]
 [-0.67987836  3.15016353]]

Before defining the model, we need to introduce two loss functions in PyTorch.

Cross entropy: it describes the difference between two probability distributions. A neural network outputs a raw vector rather than a probability distribution, so the softmax function is needed to normalize the vector into a probability distribution.

From the earlier labs we know that
the loss for binary classification is:
$$L = -y_i \cdot \log(p_i) - (1-y_i) \cdot \log(1-p_i) \tag{1}$$
and the loss for multiclass classification is:
$$L = -y_i \cdot \log(a_i) \tag{2}$$
Comparing (1) and (2), we see that (2) is simply the extension of (1) to the multiclass case: in (1) the label can only take two values (1 or 0), while in (2) $a_i$ comes from the probability distribution output by the softmax function.

Another way to write it:
$$L(\mathbf{a},y)=\begin{cases} -\log(a_1), & \text{if } y=1\\ \quad\vdots \\ -\log(a_N), & \text{if } y=N \end{cases} \tag{3}$$

Comparing (2) and (3): (2) is a compact form of (3), because

  • $y_i$ is 1 when $i$ is the true class of the example and 0 otherwise, so (3) simply drops the terms where $y_i = 0$
  • $\mathbf{a}$ is the output of the softmax function

NLLLoss: takes the output at the target class and negates it. For input = [-1.255, 2.588, 5.525] and true label 2 (class=2), the loss is -5.525.

$$loss(input, class) = -input[class]$$

  • class: the true class index

CrossEntropyLoss: in PyTorch, CrossEntropyLoss is not a bare loss applied to probabilities; it first applies a (log) softmax to the input and then computes the cross-entropy against the target. According to the PyTorch documentation, CrossEntropyLoss is equivalent to the combination of NLLLoss and LogSoftmax.
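
To check this relation numerically, here is a minimal sketch (not part of the original post); it reuses the NLLLoss example values above:

logits = torch.tensor([[-1.255, 2.588, 5.525]])        # one example, 3 classes
target = torch.tensor([2])                             # true class index
log_probs = nn.LogSoftmax(dim=1)(logits)
print(nn.NLLLoss()(log_probs, target))                 # cross entropy via LogSoftmax + NLLLoss
print(nn.CrossEntropyLoss()(logits, target))           # same value, computed directly from logits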

So to implement the cross-entropy loss in PyTorch we only need to call CrossEntropyLoss. The code is as follows.

# define the neural network model
class SoftmaxModel(nn.Module):
    def __init__(self):
        super(SoftmaxModel,self).__init__()
        self.model = nn.Sequential(
            nn.Linear(2,25),
            nn.ReLU(),
            nn.Linear(25,15),
            nn.ReLU(),
            nn.Linear(15,4),
        )

    def forward(self,x):
        x = self.model(x)
        return x

The model is built; let's start training.

SoftmaxModel = SoftmaxModel()

# define the loss function
loss_fn = nn.CrossEntropyLoss()

# define the optimizer
learning_rate = 1e-2
optimizer = torch.optim.Adam(SoftmaxModel.parameters(), lr=learning_rate)

# counter of training steps
total_train_step = 0

# number of training epochs
epoch = 20

for i in range(epoch):
    print("--------- epoch {} starts ---------".format(i+1))

    # training steps begin
    SoftmaxModel.train()
    for data,target in dataloader:
        output = SoftmaxModel(data)
        loss = loss_fn(output,target)

        # the optimizer updates the model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_train_step += 1
        if total_train_step % 10 == 0:
            print("training step: {}, loss: {}".format(total_train_step, loss.item()))

    if i == 10:
        torch.save(SoftmaxModel.state_dict(), "SoftmaxModel_{}.pkl".format(i+1))
        print("model saved")

--------- epoch 1 starts ---------
training step: 10, loss: 0.8931765556335449
training step: 20, loss: 0.5435497760772705
...
training step: 190, loss: 0.008321215398609638
training step: 200, loss: 0.0013814402045682073

--------- epoch 20 starts ---------
training step: 3810, loss: 0.0015082083409652114
...
training step: 4000, loss: 0.00017103322898037732
predict = SoftmaxModel(X_train)
print(predict[:2])
print(predict.argmax(dim=1)[:2])
tensor([[ -7.7822,  -3.5647,   5.0030,  -3.4751],
        [ 12.1620,  -5.6852,  -4.0641, -11.0309]], grad_fn=<SliceBackward0>)
tensor([2, 0])
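
As in the Tensorflow section, the outputs above are logits, not probabilities. Here is a minimal sketch (not part of the original post) of obtaining probabilities from them:

probs = torch.softmax(predict, dim=1)    # convert logits to probabilities
print(probs[:2])
print(probs.sum(dim=1)[:2])              # each row sums to 1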

Congratulations!
