Matrix Derivatives, Ref 2

Note: this article is by Dwzb. Because the merged article exceeded CSDN's length limit, it is serialized in parts; this installment is Ref 2.

It has been slightly rearranged; if anything looks off, please consult the original.


A Summary of Matrix Derivatives (Part 2)

Dwzb 2020-01-13 00:00:00

This article continues from the previous installment.

Chain rule

When the objective function has a layered structure, the chain rule may look like the natural tool. For example, for an objective function $l = f(\mathbf{Y})$ where $\mathbf{Y} = g(\mathbf{X})$, one could compute $\frac{\partial l}{\partial \mathbf{Y}}$ and $\frac{\partial \mathbf{Y}}{\partial \mathbf{X}}$ separately and then connect them in some way. Nevertheless, the chain rule is not recommended, for the following reasons:

  1. Computing $\frac{\partial \mathbf{Y}}{\partial \mathbf{X}}$ may involve matrix-by-matrix or vector-by-vector derivatives, which usually makes the problem more complicated.
  2. The chain-rule formula depends on the derivative layout, so it is easy to misremember.
  3. Even with multiple layers, the derivative can be obtained without the chain rule; the worked examples below show how.

Introducing the chain rule

This section presents the chain rule in the various cases.

1. Vector-by-vector derivatives: suppose three vectors satisfy the dependency $\mathbf{x} \to \mathbf{y} \to \mathbf{z}$, with lengths $a$, $b$, and $c$ respectively. The chain rule is:

  • Numerator layout
    $$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$$
    Note the dimensions: $(c \times a) = (c \times b) \times (b \times a)$

  • Denominator layout
    $$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \frac{\partial \mathbf{z}}{\partial \mathbf{y}}$$
    Note the dimensions: $(a \times c) = (a \times b) \times (b \times c)$

These two formulas apply only when all three variables are vectors. Observe that the two layouts yield different formulas. The numerator-layout form matches the familiar picture of the chain rule, but it composes poorly with other conventions. For example, when $\mathbf{z}$ degenerates to a scalar, scalar-by-vector derivatives are usually written in denominator layout while vector-by-vector derivatives use numerator layout. Mixing layouts not only breeds confusion but also changes the chain-rule formula itself, as the next part shows.
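
The dimension bookkeeping above is easy to verify numerically. Below is a minimal numpy sketch, assuming hypothetical linear maps $\mathbf{y} = \mathbf{C}\mathbf{x}$ and $\mathbf{z} = \mathbf{B}\mathbf{y}$ so that the true Jacobian is known in closed form:

```python
# A minimal numerical check of the two layout conventions above, using
# hypothetical linear maps y = C x and z = B y (so the true Jacobian is known).
import numpy as np

rng = np.random.default_rng(0)
a, b, c = 4, 3, 2
C = rng.standard_normal((b, a))   # y = C x, so dy/dx (numerator layout) = C
B = rng.standard_normal((c, b))   # z = B y, so dz/dy (numerator layout) = B

# Numerator layout: dz/dx = (dz/dy)(dy/dx), shape (c, a)
J_num = B @ C
# Denominator layout stores the transposes, and the order reverses:
# dz/dx = (dy/dx)(dz/dy), shape (a, c)
J_den = C.T @ B.T

# Finite-difference Jacobian J[i, j] = dz_i / dx_j for comparison
x = rng.standard_normal(a)
eps = 1e-6
f = lambda v: B @ (C @ v)
J_fd = np.stack([(f(x + eps * e) - f(x)) / eps
                 for e in np.eye(a)], axis=1)  # shape (c, a)

print(np.allclose(J_fd, J_num, atol=1e-4))    # True
print(np.allclose(J_fd.T, J_den, atol=1e-4))  # True: denominator layout is the transpose
```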

2. Scalar-by-vector derivatives

  • Numerator layout
    $$\frac{\partial z}{\partial \mathbf{x}} = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)^T \frac{\partial z}{\partial \mathbf{y}}$$
    Note the dimensions: $(a \times 1) = (a \times b) \times (b \times 1)$

  • Denominator layout
    $$\frac{\partial z}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \frac{\partial z}{\partial \mathbf{y}}$$
    Note the dimensions: $(a \times 1) = (a \times b) \times (b \times 1)$

As you can see, the denominator layout gives a more uniform formula, but the ordering contradicts the usual intuition for the chain rule and is hard to remember: roughly speaking, you write the factors from right to left, in completely reversed order.

With more variables, say $\mathbf{y}_1 \to \mathbf{y}_2 \to \cdots \to \mathbf{y}_n \to z$, the denominator-layout chain rule reads:
$$\frac{\partial z}{\partial \mathbf{y}_1} = \frac{\partial \mathbf{y}_2}{\partial \mathbf{y}_1} \frac{\partial \mathbf{y}_3}{\partial \mathbf{y}_2} \cdots \frac{\partial \mathbf{y}_n}{\partial \mathbf{y}_{n-1}} \frac{\partial z}{\partial \mathbf{y}_n}$$

3. Scalar-by-matrix derivatives: because the vectorization involved changes the structure of the matrices, a chain rule is awkward to write down directly. Suppose the dependency is $\mathbf{X} \to \mathbf{Y} \to z$, with the two matrices of dimensions $m \times n$ and $p \times q$. The derivatives then have the following dimensions (denominator layout only):
$$\frac{\partial z}{\partial \mathbf{X}}: \ m \times n, \qquad \frac{\partial z}{\partial \mathbf{Y}}: \ p \times q, \qquad \frac{\partial \mathbf{Y}}{\partial \mathbf{X}}: \ mn \times pq$$

Judging from these dimensions, the relation between the three cannot be
$$\frac{\partial z}{\partial \mathbf{X}} = \frac{\partial \mathbf{Y}}{\partial \mathbf{X}} \frac{\partial z}{\partial \mathbf{Y}}$$
but might instead be
$$\mathrm{vec}\left(\frac{\partial z}{\partial \mathbf{X}}\right) = \frac{\partial \mathbf{Y}}{\partial \mathbf{X}} \, \mathrm{vec}\left(\frac{\partial z}{\partial \mathbf{Y}}\right)$$
I have not found references confirming this identity, but it checks out on several examples. As the worked example below also shows, even if the identity holds, the computation it entails is far too cumbersome.
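
For what it is worth, the conjectured identity can be tested numerically. The sketch below uses the hypothetical example $z = \mathrm{tr}(\mathbf{Y}^T\mathbf{Y})$ with $\mathbf{Y} = \mathbf{A}\mathbf{X}$, where the true gradient $2\mathbf{A}^T\mathbf{A}\mathbf{X}$ is known, and column-major vec throughout:

```python
# A small numerical test of the conjectured identity
# vec(dz/dX) = (dY/dX) vec(dz/dY), on the hypothetical example
# z = tr(Y^T Y) with Y = A X (column-major vec throughout).
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 3, 2, 4
A = rng.standard_normal((p, m))
X = rng.standard_normal((m, n))
Y = A @ X                                  # p x n, so q = n here

vec = lambda M: M.reshape(-1, order="F")   # column-stacking vec
dz_dY = 2 * Y                              # denominator layout, p x q
dY_dX = np.kron(np.eye(n), A).T            # mn x pq, since vec(dY) = (I_n ⊗ A) vec(dX)

lhs = dY_dX @ vec(dz_dY)                   # candidate vec(dz/dX)

# Finite-difference gradient of z = tr(Y^T Y) w.r.t. X
z = lambda X_: np.trace((A @ X_).T @ (A @ X_))
eps = 1e-6
G = np.zeros_like(X)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(X); E[i, j] = eps
        G[i, j] = (z(X + E) - z(X)) / eps

print(np.allclose(lhs, vec(G), atol=1e-4))  # True; also equals vec(2 A^T A X)
```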

4. Summary: the chain rule is not recommended. If it must be used, only the usage in case 1 above is advisable, i.e., the denominator layout with vectors only, but its scope is too narrow. The two worked examples below demonstrate the recommended way to differentiate.

Worked examples

1. Scalar-by-vector: given $l = \mathbf{z}^T \mathbf{z}$ and $\mathbf{z} = \mathbf{A}\mathbf{x}$, find $\frac{\partial l}{\partial \mathbf{x}}$.

  • Using the chain rule: since
    $$\frac{\partial l}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{x}} \frac{\partial l}{\partial \mathbf{z}}$$
    we need to find $\frac{\partial \mathbf{z}}{\partial \mathbf{x}}$ and $\frac{\partial l}{\partial \mathbf{z}}$ separately.

First, differentiate $l$:
$$\mathrm{d}l = \mathrm{tr}[\mathrm{d}(\mathbf{z}^T\mathbf{z})] = \mathrm{tr}[\mathrm{d}\mathbf{z}^T\,\mathbf{z} + \mathbf{z}^T\,\mathrm{d}\mathbf{z}] = \mathrm{tr}[2\mathbf{z}^T\,\mathrm{d}\mathbf{z}]$$

Next, compute $\mathrm{d}\mathbf{z}$:
$$\mathrm{d}\mathbf{z} = \mathrm{d}(\mathbf{A}\mathbf{x}) = \mathbf{A}\,\mathrm{d}\mathbf{x}$$

Hence $\frac{\partial l}{\partial \mathbf{z}} = 2\mathbf{z}$ and $\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \mathbf{A}^T$.

Finally, $\frac{\partial l}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{x}} \frac{\partial l}{\partial \mathbf{z}} = 2\mathbf{A}^T\mathbf{z} = 2\mathbf{A}^T\mathbf{A}\mathbf{x}$.

  • Differential-only method (recommended): first differentiate $l$:
    $$\mathrm{d}l = \mathrm{tr}[\mathrm{d}(\mathbf{z}^T\mathbf{z})] = \mathrm{tr}[\mathrm{d}\mathbf{z}^T\,\mathbf{z} + \mathbf{z}^T\,\mathrm{d}\mathbf{z}] = \mathrm{tr}[2\mathbf{z}^T\,\mathrm{d}\mathbf{z}]$$

The expression contains $\mathrm{d}\mathbf{z}$, so compute it:
$$\mathrm{d}\mathbf{z} = \mathrm{d}(\mathbf{A}\mathbf{x}) = \mathbf{A}\,\mathrm{d}\mathbf{x}$$

d z \mathrm {d}\mathbf {z} dz 代入上式可得:
d l = tr [ 2 z T d z ] = tr [ 2 z T A d x ] = tr [ 2 A T z d x ] \mathrm {d} l = \text {tr}[2\mathbf {z}^T \mathrm {d}\mathbf {z}] = \text {tr}[2\mathbf {z}^T \mathbf {A} \mathrm {d}\mathbf {x}] = \text {tr}[2\mathbf {A}^T \mathbf {z} \mathrm {d}\mathbf {x}] dl=tr[2zTdz]=tr[2zTAdx]=tr[2ATzdx]

Therefore,
$$\frac{\partial l}{\partial \mathbf{x}} = 2\mathbf{A}^T\mathbf{z} = 2\mathbf{A}^T\mathbf{A}\mathbf{x}$$

  • Summary: comparing the two methods, the computations are essentially the same; both differentiate the two given expressions. The difference is that the second method plugs the differentials in directly rather than solving for the intermediate derivatives. It requires no extra formulas to memorize and adds no computation. A numerical sanity check follows below, and the section "Comprehensive examples - Neural network" shows the method at work on a complex case.
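
Here is a quick finite-difference check of example 1's result $\frac{\partial l}{\partial \mathbf{x}} = 2\mathbf{A}^T\mathbf{A}\mathbf{x}$, with a randomly generated $\mathbf{A}$ and $\mathbf{x}$ (illustrative only):

```python
# A quick sanity check of example 1's result dl/dx = 2 A^T A x,
# using finite differences (A, x randomly generated for illustration).
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
x = rng.standard_normal(3)

l = lambda v: (A @ v) @ (A @ v)         # l = z^T z with z = A x
grad = 2 * A.T @ A @ x                  # closed form derived above

eps = 1e-6
grad_fd = np.array([(l(x + eps * e) - l(x)) / eps for e in np.eye(3)])
print(np.allclose(grad, grad_fd, atol=1e-4))  # True
```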

2. Scalar-by-matrix: given $l = \mathbf{z}^T\mathbf{z}$ and $\mathbf{z} = \mathbf{X}\beta$, find $\frac{\partial l}{\partial \mathbf{X}}$.

  • Using the chain rule: since
    $$\mathrm{d}l = \mathrm{tr}[\mathrm{d}(\mathbf{z}^T\mathbf{z})] = \mathrm{tr}[\mathrm{d}\mathbf{z}^T\,\mathbf{z} + \mathbf{z}^T\,\mathrm{d}\mathbf{z}] = \mathrm{tr}[2\mathbf{z}^T\,\mathrm{d}\mathbf{z}]$$
    $$\mathrm{d}\mathbf{z} = \mathrm{d}(\mathbf{X}\beta) = \mathrm{d}\mathbf{X}\,\beta$$

it follows that
$$\frac{\partial l}{\partial \mathbf{z}} = 2\mathbf{z}, \qquad \frac{\partial \mathbf{z}}{\partial \mathbf{X}} = \beta \otimes I_n$$

The dimensions of the various matrices are:
$$\mathbf{X}: n \times p, \qquad \beta: p \times 1, \qquad \mathbf{z}: n \times 1$$
$$\frac{\partial l}{\partial \mathbf{z}}: n \times 1, \qquad \frac{\partial \mathbf{z}}{\partial \mathbf{X}}: np \times n$$


∂ l ∂ X = ∂ z ∂ X ∂ l ∂ z = 2 [ β ⊗ I n ] z ( ∂ l ∂ X : n p × 1 ) \frac {\partial l}{\partial \mathbf {X}} = \frac {\partial \mathbf {z}}{\partial \mathbf {X}} \frac {\partial l}{\partial \mathbf {z}} = 2 [\beta \otimes I_n] \mathbf {z} \quad \left (\frac {\partial l}{\partial \mathbf {X}}: np \times 1\right) Xl=Xzzl=2[βIn]z(Xl:np×1)

Undoing the vectorization yields:
$$\frac{\partial l}{\partial \mathbf{X}} = 2\mathbf{z}\beta^T = 2\mathbf{X}\beta\beta^T \qquad \left(\frac{\partial l}{\partial \mathbf{X}}: n \times p\right)$$

Note: this method is clearly cumbersome, requiring all sorts of adjustments to the matrix structure. Here $\mathbf{z}$ being a vector is still the easy case; if $\mathbf{Z}$ were a matrix, the two derivatives could not even be multiplied directly, e.g. $\mathbf{Z} = f(\mathbf{Y})$ with $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{B}$. As an aside, for this particular relation one has $\frac{\partial \mathbf{Z}}{\partial \mathbf{X}} = \mathbf{A}^T \frac{\partial \mathbf{Z}}{\partial \mathbf{Y}}$. This result can be derived with the chain rule above (very tediously), or obtained almost effortlessly with the differential-only method below. Once you master that method, there is no need to memorize such special-case relations.

  • Differential-only method (recommended): first differentiate $l$:
    $$\mathrm{d}l = \mathrm{tr}[\mathrm{d}(\mathbf{z}^T\mathbf{z})] = \mathrm{tr}[\mathrm{d}\mathbf{z}^T\,\mathbf{z} + \mathbf{z}^T\,\mathrm{d}\mathbf{z}] = \mathrm{tr}[2\mathbf{z}^T\,\mathrm{d}\mathbf{z}]$$

Then compute $\mathrm{d}\mathbf{z}$:
$$\mathrm{d}\mathbf{z} = \mathrm{d}(\mathbf{X}\beta) = \mathrm{d}\mathbf{X}\,\beta$$

Substituting this differential into the expression above gives:
$$\mathrm{d}l = \mathrm{tr}[2\mathbf{z}^T\,\mathrm{d}\mathbf{z}] = \mathrm{tr}[2\mathbf{z}^T\,\mathrm{d}\mathbf{X}\,\beta] = \mathrm{tr}[2\beta\mathbf{z}^T\,\mathrm{d}\mathbf{X}]$$

Therefore,
$$\frac{\partial l}{\partial \mathbf{X}} = 2\mathbf{z}\beta^T = 2\mathbf{X}\beta\beta^T$$
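
The same style of finite-difference check confirms this result (random $\mathbf{X}$ and $\beta$, illustrative only):

```python
# Finite-difference check for example 2:
# dl/dX = 2 X beta beta^T (shapes as in the text: X is n x p, beta is p x 1).
import numpy as np

rng = np.random.default_rng(3)
n, p = 4, 3
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)

l = lambda M: (M @ beta) @ (M @ beta)   # l = z^T z with z = X beta
grad = 2 * np.outer(X @ beta, beta)     # 2 z beta^T = 2 X beta beta^T

eps = 1e-6
grad_fd = np.zeros_like(X)
for i in range(n):
    for j in range(p):
        E = np.zeros_like(X); E[i, j] = eps
        grad_fd[i, j] = (l(X + E) - l(X)) / eps
print(np.allclose(grad, grad_fd, atol=1e-4))  # True
```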

Comprehensive examples

Logistic regression (binary classification)

The log-likelihood is:
$$\begin{aligned} l &= \sum_{i=1}^{n} y_i \log p_i + (1 - y_i)\log(1 - p_i) \\ &= \sum_{i=1}^{n} y_i \log\frac{e^{\mathbf{x}_i^T\beta}}{1 + e^{\mathbf{x}_i^T\beta}} + (1 - y_i)\log\frac{1}{1 + e^{\mathbf{x}_i^T\beta}} \\ &= \sum_{i=1}^{n} \left[ y_i\,\mathbf{x}_i^T\beta - \log(1 + \exp(\mathbf{x}_i^T\beta)) \right] \\ &= \mathbf{y}^T\mathbf{X}\beta - \mathbf{1}^T\log(1 + \exp(\mathbf{X}\beta)) \end{aligned}$$

The last step collects everything into matrix form, eliminating the summation. One could also carry the summation through the differentiation and assemble the matrix form only at the end; the trick when assembling is to keep track of the dimension of the target and of every matrix and vector involved. The differential is:
$$\begin{aligned} \mathrm{d}l &= \mathrm{tr}\left[\mathbf{y}^T\mathbf{X}\,\mathrm{d}\beta - \mathbf{1}^T\left(\frac{1}{1+\exp(\mathbf{X}\beta)} \odot \mathrm{d}\exp(\mathbf{X}\beta)\right)\right] \\ &= \mathrm{tr}\left[\mathbf{y}^T\mathbf{X}\,\mathrm{d}\beta - \left(\mathbf{1} \odot \frac{1}{1+\exp(\mathbf{X}\beta)}\right)^T \mathrm{d}\exp(\mathbf{X}\beta)\right] \\ &= \mathrm{tr}\left[\mathbf{y}^T\mathbf{X}\,\mathrm{d}\beta - \left(\frac{1}{1+\exp(\mathbf{X}\beta)}\right)^T\left(\exp(\mathbf{X}\beta) \odot \mathbf{X}\,\mathrm{d}\beta\right)\right] \\ &= \mathrm{tr}\left[\mathbf{y}^T\mathbf{X}\,\mathrm{d}\beta - \left[\frac{1}{1+\exp(\mathbf{X}\beta)} \odot \exp(\mathbf{X}\beta)\right]^T\mathbf{X}\,\mathrm{d}\beta\right] \\ &= \mathrm{tr}\left[\mathbf{y}^T\mathbf{X}\,\mathrm{d}\beta - \sigma(\mathbf{X}\beta)^T\mathbf{X}\,\mathrm{d}\beta\right] \\ &= \mathrm{tr}\left[\left(\mathbf{y}^T - \sigma(\mathbf{X}\beta)^T\right)\mathbf{X}\,\mathrm{d}\beta\right] \end{aligned}$$

Therefore,
$$\nabla_\beta l = \mathbf{X}^T(\mathbf{y} - \sigma(\mathbf{X}\beta))$$
where $\sigma(\mathbf{x}) = \frac{e^{\mathbf{x}}}{1 + e^{\mathbf{x}}}$, applied elementwise.

∇ β 2 l \nabla^2_{\beta} l β2l 的过程是向量对向量求导,两端同时取微分:
d ∇ β l = − X T d σ ( X β ) = − X T [ σ ′ ( X β ) ⊙ X d β ] = − X T diag [ σ ′ ( X β ) ] X d β \begin {align*} \mathrm {d}\nabla_{\beta} l &= -\mathbf {X}^T \mathrm {d}\sigma (\mathbf {X} \beta) \\ &= -\mathbf {X}^T [\sigma'(\mathbf {X} \beta) \odot \mathbf {X} \mathrm {d}\beta] \\ &= -\mathbf {X}^T \text {diag}[\sigma'(\mathbf {X} \beta)] \mathbf {X} \mathrm {d}\beta \end {align*} dβl=XTdσ(Xβ)=XT[σ(Xβ)Xdβ]=XTdiag[σ(Xβ)]Xdβ

Therefore,
$$\nabla^2_\beta l = -\mathbf{X}^T\,\mathrm{diag}[\sigma'(\mathbf{X}\beta)]\,\mathbf{X}$$

Keeping the per-sample summation, this can be written as
$$\nabla^2_\beta l = \frac{\partial^2 l(\beta)}{\partial\beta\,\partial\beta^T} = -\sum_{i=1}^{n} \mathbf{x}_i\mathbf{x}_i^T\,\sigma(\mathbf{x}_i^T\beta)\left(1 - \sigma(\mathbf{x}_i^T\beta)\right)$$
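
These two formulas are easy to check numerically. The sketch below uses randomly generated data (purely illustrative) and the identity $\sigma' = \sigma(1-\sigma)$:

```python
# A numerical check of the logistic-regression gradient and Hessian above,
# with randomly generated X, y, beta (purely illustrative).
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 3
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, n).astype(float)
beta = rng.standard_normal(p)

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))                 # e^t / (1 + e^t)
loglik = lambda b: y @ (X @ b) - np.sum(np.log1p(np.exp(X @ b)))

grad = X.T @ (y - sigma(X @ beta))                          # ∇_β l
s = sigma(X @ beta)
hess = -X.T @ np.diag(s * (1 - s)) @ X                      # ∇²_β l, using σ' = σ(1-σ)

eps = 1e-6
grad_fd = np.array([(loglik(beta + eps * e) - loglik(beta)) / eps
                    for e in np.eye(p)])
print(np.allclose(grad, grad_fd, atol=1e-3))                # True

# The Hessian is the Jacobian of the gradient
gfun = lambda b: X.T @ (y - sigma(X @ b))
hess_fd = np.stack([(gfun(beta + eps * e) - gfun(beta)) / eps
                    for e in np.eye(p)], axis=1)
print(np.allclose(hess, hess_fd, atol=1e-3))                # True
```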

Softmax (multiclass classification)

First, the variable dimensions:
$$\mathbf{Y}: n \times c, \quad \mathbf{y}_i: c \times 1$$
$$\mathbf{X}: n \times d, \quad \mathbf{x}_i: d \times 1$$
$$\mathbf{W}: d \times c$$
$$\mathbf{1}_c: c \times 1, \quad \mathbf{1}_n: n \times 1$$

The log-likelihood is:
$$\begin{aligned} l &= \sum_{i=1}^n \mathbf{y}_i^T \log\frac{\exp(\mathbf{W}^T\mathbf{x}_i)}{\mathbf{1}_c^T\exp(\mathbf{W}^T\mathbf{x}_i)} \\ &= \sum_{i=1}^n \mathbf{y}_i^T\mathbf{W}^T\mathbf{x}_i - \mathbf{y}_i^T\mathbf{1}_c\,\log(\mathbf{1}_c^T\exp(\mathbf{W}^T\mathbf{x}_i)) \\ &= \sum_{i=1}^n \mathbf{y}_i^T\mathbf{W}^T\mathbf{x}_i - \log(\mathbf{1}_c^T\exp(\mathbf{W}^T\mathbf{x}_i)) \\ &= \mathrm{tr}(\mathbf{X}\mathbf{W}\mathbf{Y}^T) - \mathbf{1}_n^T\log[\exp(\mathbf{X}\mathbf{W})\mathbf{1}_c] \end{aligned}$$
(the third line uses $\mathbf{y}_i^T\mathbf{1}_c = 1$, since each $\mathbf{y}_i$ is one-hot)

The last step collects everything into matrix form, eliminating the summation. One could also carry the summation through the differentiation and assemble the matrix form only at the end; the trick when assembling is to keep track of the dimension of the target and of every matrix and vector involved. The differential is:
$$\begin{aligned} \mathrm{d}l &= \mathrm{tr}(\mathbf{X}\,\mathrm{d}\mathbf{W}\,\mathbf{Y}^T) - \mathrm{tr}\left(\mathbf{1}_n^T\left[\frac{1}{\exp(\mathbf{X}\mathbf{W})\mathbf{1}_c} \odot \mathrm{d}\exp(\mathbf{X}\mathbf{W})\,\mathbf{1}_c\right]\right) \\ &= \mathrm{tr}(\mathbf{Y}^T\mathbf{X}\,\mathrm{d}\mathbf{W}) - \mathrm{tr}\left(\left[\mathbf{1}_n \odot \frac{1}{\exp(\mathbf{X}\mathbf{W})\mathbf{1}_c}\right]^T \mathrm{d}\exp(\mathbf{X}\mathbf{W})\,\mathbf{1}_c\right) \\ &= \mathrm{tr}(\mathbf{Y}^T\mathbf{X}\,\mathrm{d}\mathbf{W}) - \mathrm{tr}\left(\left[\frac{1}{\exp(\mathbf{X}\mathbf{W})\mathbf{1}_c}\right]^T\left[\exp(\mathbf{X}\mathbf{W}) \odot \mathbf{X}\,\mathrm{d}\mathbf{W}\right]\mathbf{1}_c\right) \\ &= \mathrm{tr}(\mathbf{Y}^T\mathbf{X}\,\mathrm{d}\mathbf{W}) - \mathrm{tr}\left(\left[\frac{1}{\exp(\mathbf{X}\mathbf{W})\mathbf{1}_c}\mathbf{1}_c^T\right]^T\left[\exp(\mathbf{X}\mathbf{W}) \odot \mathbf{X}\,\mathrm{d}\mathbf{W}\right]\right) \\ &= \mathrm{tr}(\mathbf{Y}^T\mathbf{X}\,\mathrm{d}\mathbf{W}) - \mathrm{tr}\left(\left[\frac{1}{\exp(\mathbf{X}\mathbf{W})\mathbf{1}_c}\mathbf{1}_c^T \odot \exp(\mathbf{X}\mathbf{W})\right]^T\mathbf{X}\,\mathrm{d}\mathbf{W}\right) \\ &= \mathrm{tr}(\mathbf{Y}^T\mathbf{X}\,\mathrm{d}\mathbf{W}) - \mathrm{tr}(\mathrm{Softmax}(\mathbf{X}\mathbf{W})^T\mathbf{X}\,\mathrm{d}\mathbf{W}) \\ &= \mathrm{tr}\left(\left(\mathbf{Y}^T - \mathrm{Softmax}(\mathbf{X}\mathbf{W})^T\right)\mathbf{X}\,\mathrm{d}\mathbf{W}\right) \end{aligned}$$

Therefore,
$$\nabla_\mathbf{W} l = \mathbf{X}^T(\mathbf{Y} - \mathrm{Softmax}(\mathbf{X}\mathbf{W}))$$
where $\mathrm{Softmax}(\mathbf{X}\mathbf{W})$ is an $n \times c$ matrix obtained by applying, to every row of $\mathbf{X}\mathbf{W}$,
$$\mathrm{softmax}(\mathbf{x}) = \frac{\exp(\mathbf{x})}{\mathbf{1}^T\exp(\mathbf{x})}, \qquad (\mathbf{x}: c \times 1)$$

Keeping the per-sample summation, the first derivative can be written as
$$\nabla_\mathbf{W} l = \sum_{i=1}^{n} \mathbf{x}_i\left(\mathbf{y}_i - \mathrm{softmax}(\mathbf{W}^T\mathbf{x}_i)\right)^T$$
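
To check the gradient just derived, here is a small numpy sketch (shapes and data are made up for illustration; the row-wise softmax is shifted by the row max for numerical stability, which does not change its value):

```python
# A finite-difference check of ∇_W l = X^T (Y - Softmax(XW)),
# with one-hot rows Y and random X, W (illustrative sizes only).
import numpy as np

rng = np.random.default_rng(5)
n, d, c = 10, 4, 3
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, c))
Y = np.eye(c)[rng.integers(0, c, n)]              # one-hot labels, n x c

def softmax_rows(M):
    E = np.exp(M - M.max(axis=1, keepdims=True))  # row-wise softmax, stabilized
    return E / E.sum(axis=1, keepdims=True)

def loglik(W_):
    P = softmax_rows(X @ W_)
    return np.sum(Y * np.log(P))

grad = X.T @ (Y - softmax_rows(X @ W))

eps = 1e-6
grad_fd = np.zeros_like(W)
for i in range(d):
    for j in range(c):
        E = np.zeros_like(W); E[i, j] = eps
        grad_fd[i, j] = (loglik(W + E) - loglik(W)) / eps
print(np.allclose(grad, grad_fd, atol=1e-3))      # True
```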

∇ W 2 l \nabla^2_{\mathbf {W}} l W2l 的过程是向量对向量求导,两端同时取微分:
d ∇ W l = − ∑ i = 1 n x i d [ s o f t m a x ( W T x i ) ) T ] = − ∑ i = 1 n x i d [ exp ⁡ ( W T x i ) 1 c T exp ⁡ ( W T x i ) ] T = − ∑ i = 1 n x i [ exp ⁡ ( W T x i ) ⊙ d W T x i 1 c T exp ⁡ ( W T x i ) − exp ⁡ ( W T x i ) ( exp ⁡ ( W T x i ) T d W T x i ) [ 1 c T exp ⁡ ( W T x i ) ] 2 ] T = − ∑ i = 1 n x i [ d i a g [ exp ⁡ ( W T x i ) ] d W T x i 1 c T exp ⁡ ( W T x i ) − exp ⁡ ( W T x i ) ( exp ⁡ ( W T x i ) T d W T x i ) [ 1 c T exp ⁡ ( W T x i ) ] 2 ] T = − ∑ i = 1 n x i x i T d W [ d i a g ( s o f t m a x ( W T x i ) ) − s o f t m a x ( W T x i ) s o f t m a x ( W T x i ) T ] T = − ∑ i = 1 n x i x i T d W D ( W T x i ) T \begin {align*} \mathrm {d}\nabla_{\mathbf {W}} l &= -\sum_{i = 1}^{n} \mathbf {x}_i \mathrm {d}[\mathrm {softmax}(\mathbf {W}^T \mathbf {x}_i))^T] \\ &= -\sum_{i = 1}^{n} \mathbf {x}_i \mathrm {d}\left [\frac {\exp (\mathbf {W}^T \mathbf {x}_i)}{\mathbf {1}_c^T \exp (\mathbf {W}^T \mathbf {x}_i)}\right]^T \\ &= -\sum_{i = 1}^{n} \mathbf {x}_i \left [\frac {\exp (\mathbf {W}^T \mathbf {x}_i) \odot \mathrm {d}\mathbf {W}^T \mathbf {x}_i}{\mathbf {1}_c^T \exp (\mathbf {W}^T \mathbf {x}_i)} - \frac {\exp (\mathbf {W}^T \mathbf {x}_i)(\exp (\mathbf {W}^T \mathbf {x}_i)^T \mathrm {d}\mathbf {W}^T \mathbf {x}_i)}{[\mathbf {1}_c^T \exp (\mathbf {W}^T \mathbf {x}_i)]^2}\right]^T \\ &= -\sum_{i = 1}^{n} \mathbf {x}_i \left [\frac {\mathrm {diag}[\exp (\mathbf {W}^T \mathbf {x}_i)] \mathrm {d}\mathbf {W}^T \mathbf {x}_i}{\mathbf {1}_c^T \exp (\mathbf {W}^T \mathbf {x}_i)} - \frac {\exp (\mathbf {W}^T \mathbf {x}_i)(\exp (\mathbf {W}^T \mathbf {x}_i)^T \mathrm {d}\mathbf {W}^T \mathbf {x}_i)}{[\mathbf {1}_c^T \exp (\mathbf {W}^T \mathbf {x}_i)]^2}\right]^T \\ &= -\sum_{i = 1}^{n} \mathbf {x}_i \mathbf {x}_i^T \mathrm {d}\mathbf {W} \left [\mathrm {diag}(\mathrm {softmax}(\mathbf {W}^T \mathbf {x}_i)) - \mathrm {softmax}(\mathbf {W}^T \mathbf {x}_i) \mathrm {softmax}(\mathbf {W}^T \mathbf {x}_i)^T\right]^T \\ &= -\sum_{i = 1}^{n} \mathbf {x}_i \mathbf {x}_i^T \mathrm {d}\mathbf {W} \mathbf {D}(\mathbf {W}^T \mathbf {x}_i)^T \end {align*} dWl=i=1nxid[softmax(WTxi))T]=i=1nxid[1cTexp(WTxi)exp(WTxi)]T=i=1nxi[1cTexp(WTxi)exp(WTxi)dWTxi[1cTexp(WTxi)]2exp(WTxi)(exp(WTxi)TdWTxi)]T=i=1nxi[1cTexp(WTxi)diag[exp(WTxi)]dWTxi[1cTexp(WTxi)]2exp(WTxi)(exp(WTxi)TdWTxi)]T=i=1nxixiTdW[diag(softmax(WTxi))softmax(WTxi)softmax(WTxi)T]T=i=1nxixiTdWD(WTxi)T

where
$$\mathbf{D}(\mathbf{a}) = \mathrm{diag}(\mathrm{softmax}(\mathbf{a})) - \mathrm{softmax}(\mathbf{a})\,\mathrm{softmax}(\mathbf{a})^T$$

Vectorizing then gives:
$$\mathrm{vec}(\mathrm{d}\nabla_\mathbf{W} l) = -\sum_{i=1}^{n}\left(\mathbf{D}(\mathbf{W}^T\mathbf{x}_i) \otimes \mathbf{x}_i\mathbf{x}_i^T\right)\mathrm{vec}(\mathrm{d}\mathbf{W})$$

Therefore,
$$\nabla^2_\mathbf{W} l = -\sum_{i=1}^{n} \mathbf{D}(\mathbf{W}^T\mathbf{x}_i)^T \otimes \mathbf{x}_i\mathbf{x}_i^T$$
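
The Kronecker-product form of the Hessian can likewise be tested numerically. The sketch below, with made-up sizes, compares it against a finite-difference Jacobian of the gradient, using column-major vec to match the convention above:

```python
# Checking the Kronecker form ∇²_W l = -Σ_i D(W^T x_i) ⊗ x_i x_i^T against a
# finite-difference Jacobian of the gradient (column-major vec throughout).
import numpy as np

rng = np.random.default_rng(6)
n, d, c = 6, 3, 2
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, c))
Y = np.eye(c)[rng.integers(0, c, n)]

def softmax(v):                       # vector softmax, as defined in the text
    e = np.exp(v)
    return e / e.sum()

def grad(W_):                         # ∇_W l = Σ_i x_i (y_i - softmax(W^T x_i))^T
    return sum(np.outer(X[i], Y[i] - softmax(W_.T @ X[i])) for i in range(n))

def D(a):                             # D(a) = diag(softmax(a)) - softmax(a) softmax(a)^T
    s = softmax(a)
    return np.diag(s) - np.outer(s, s)

H = -sum(np.kron(D(W.T @ X[i]), np.outer(X[i], X[i])) for i in range(n))

vec = lambda M: M.reshape(-1, order="F")
eps = 1e-6
H_fd = np.zeros((d * c, d * c))
for k in range(d * c):
    E = np.zeros(d * c); E[k] = eps
    H_fd[:, k] = (vec(grad(W + E.reshape(d, c, order="F"))) - vec(grad(W))) / eps
print(np.allclose(H, H_fd, atol=1e-3))  # True (D is symmetric, so D^T = D)
```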

Neural network

First, the variable dimensions:
$$\begin{aligned} &\mathbf{Y}: n \times c, \quad \mathbf{y}_i: c \times 1 \\ &\mathbf{X}: n \times p, \quad \mathbf{x}_i: p \times 1 \\ &\mathbf{W}_1: p \times d, \quad \mathbf{b}_1: d \times 1 \\ &\mathbf{W}_2: d \times c, \quad \mathbf{b}_2: c \times 1 \\ &\mathbf{1}_c: c \times 1, \quad \mathbf{1}_n: n \times 1 \end{aligned}$$

The log-likelihood is:
$$l = \sum_{i=1}^{n} \mathbf{y}_i^T \log\mathrm{softmax}\left(\mathbf{W}_2^T\,\sigma(\mathbf{W}_1^T\mathbf{x}_i + \mathbf{b}_1) + \mathbf{b}_2\right)$$

where the $\mathrm{softmax}$ function is defined as:
$$\mathrm{softmax}(\mathbf{x}) = \frac{\exp(\mathbf{x})}{\mathbf{1}^T\exp(\mathbf{x})}, \qquad (\mathbf{x}: c \times 1)$$

The likelihood can be decomposed into several equations:
$$\begin{aligned} l &= \sum_{i=1}^{n} \mathbf{y}_i^T \log\mathrm{softmax}(\mathbf{a}_{2i}) \\ \mathbf{a}_{2i} &= \mathbf{W}_2^T\mathbf{h}_{1i} + \mathbf{b}_2 \\ \mathbf{h}_{1i} &= \sigma(\mathbf{a}_{1i}) \\ \mathbf{a}_{1i} &= \mathbf{W}_1^T\mathbf{x}_i + \mathbf{b}_1 \end{aligned}$$

Dropping the per-sample summation, the derivation parallels the softmax classification of the previous section, so the result is given directly:
$$\begin{aligned} l &= \mathrm{tr}(A_2\mathbf{Y}^T) - \mathbf{1}_n^T\log[\exp(A_2)\mathbf{1}_c] \\ A_2 &= H_1\mathbf{W}_2 + \mathbf{1}_n\mathbf{b}_2^T \\ H_1 &= \sigma(A_1) \\ A_1 &= \mathbf{X}\mathbf{W}_1 + \mathbf{1}_n\mathbf{b}_1^T \end{aligned}$$

At the same time we obtain:
$$\mathrm{d}l = \mathrm{tr}\left(\left[\frac{\partial l}{\partial A_2}\right]^T \mathrm{d}A_2\right), \quad \text{where } \frac{\partial l}{\partial A_2} = \mathbf{Y} - \mathrm{Softmax}(A_2) \tag{4}$$

A 2 A_2 A2 求微分如下:
d A 2 = d H 1 W 2 + H 1 d W 2 + 1 n d b 2 T \mathrm {d} A_2 = \mathrm {d} H_1 \mathbf {W}_2 + H_ 1 \mathrm {d}\mathbf {W}_2 + \mathbf {1}_n \mathrm {d}\mathbf {b}_2^T dA2=dH1W2+H1dW2+1ndb2T

Substituting into (4) gives:
$$\begin{aligned} \mathrm{d}l &= \mathrm{tr}\left(\left[\frac{\partial l}{\partial A_2}\right]^T\left[\mathrm{d}H_1\,\mathbf{W}_2 + H_1\,\mathrm{d}\mathbf{W}_2 + \mathbf{1}_n\,\mathrm{d}\mathbf{b}_2^T\right]\right) \\ &= \mathrm{tr}\left(\mathbf{W}_2\left[\frac{\partial l}{\partial A_2}\right]^T\mathrm{d}H_1 + \left[\frac{\partial l}{\partial A_2}\right]^T H_1\,\mathrm{d}\mathbf{W}_2 + \mathbf{1}_n^T\left[\frac{\partial l}{\partial A_2}\right]\mathrm{d}\mathbf{b}_2\right) \\ &= \mathrm{tr}\left(\left[\frac{\partial l}{\partial H_1}\right]^T\mathrm{d}H_1 + \left[\frac{\partial l}{\partial \mathbf{W}_2}\right]^T\mathrm{d}\mathbf{W}_2 + \left[\frac{\partial l}{\partial \mathbf{b}_2}\right]^T\mathrm{d}\mathbf{b}_2\right) \end{aligned}$$

where
$$\frac{\partial l}{\partial H_1} = \frac{\partial l}{\partial A_2}\mathbf{W}_2^T, \qquad \frac{\partial l}{\partial \mathbf{W}_2} = H_1^T\frac{\partial l}{\partial A_2}, \qquad \frac{\partial l}{\partial \mathbf{b}_2} = \left[\frac{\partial l}{\partial A_2}\right]^T\mathbf{1}_n$$

Next, differentiate $H_1$:
$$\mathrm{d}H_1 = \sigma'(A_1) \odot \mathrm{d}A_1$$

l l l 微分的第一部分可以表示为:
d l 1 = t r ( [ ∂ l ∂ H 1 ] T [ σ ′ ( A 1 ) ⊙ d A 1 ] ) = t r ( [ ∂ l ∂ H 1 ⊙ σ ′ ( A 1 ) ] T d A 1 ) = t r ( [ ∂ l ∂ A 1 ] T d A 1 ) \begin {align} \mathrm {d} l_1 &= \mathrm {tr}\left (\left [\frac {\partial l}{\partial H_1}\right]^T [\sigma'(A_1) \odot \mathrm {d} A_1]\right) \\ &= \mathrm {tr}\left (\left [\frac {\partial l}{\partial H_1} \odot \sigma'(A_1)\right]^T \mathrm {d} A_1\right) \\ &= \mathrm {tr}\left (\left [\frac {\partial l}{\partial A_1}\right]^T \mathrm {d} A_1\right) \tag {5} \end {align} dl1=tr([H1l]T[σ(A1)dA1])=tr([H1lσ(A1)]TdA1)=tr([A1l]TdA1)(5)

where
$$\frac{\partial l}{\partial A_1} = \frac{\partial l}{\partial H_1} \odot \sigma'(A_1)$$

Now compute the differential of $A_1$:
$$\mathrm{d}A_1 = \mathbf{X}\,\mathrm{d}\mathbf{W}_1 + \mathbf{1}_n\,\mathrm{d}\mathbf{b}_1^T$$

Substituting into (5) gives:
$$\begin{aligned} \mathrm{d}l_1 &= \mathrm{tr}\left(\left[\frac{\partial l}{\partial A_1}\right]^T\left[\mathbf{X}\,\mathrm{d}\mathbf{W}_1 + \mathbf{1}_n\,\mathrm{d}\mathbf{b}_1^T\right]\right) \\ &= \mathrm{tr}\left(\left[\frac{\partial l}{\partial A_1}\right]^T\mathbf{X}\,\mathrm{d}\mathbf{W}_1 + \mathbf{1}_n^T\left[\frac{\partial l}{\partial A_1}\right]\mathrm{d}\mathbf{b}_1\right) \\ &= \mathrm{tr}\left(\left[\frac{\partial l}{\partial \mathbf{W}_1}\right]^T\mathrm{d}\mathbf{W}_1 + \left[\frac{\partial l}{\partial \mathbf{b}_1}\right]^T\mathrm{d}\mathbf{b}_1\right) \end{aligned}$$

where
$$\frac{\partial l}{\partial \mathbf{W}_1} = \mathbf{X}^T\frac{\partial l}{\partial A_1}, \qquad \frac{\partial l}{\partial \mathbf{b}_1} = \left[\frac{\partial l}{\partial A_1}\right]^T\mathbf{1}_n$$

The derivation is complete; substituting back layer by layer yields the derivatives of $l$ with respect to $\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{b}_1$, and $\mathbf{b}_2$.
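
As a final illustration, here is a compact numpy sketch of the backward pass these formulas describe, with a spot check of one entry of $\partial l/\partial \mathbf{W}_1$ by finite differences. The activation $\sigma$ is assumed to be the logistic function (the derivation above keeps it generic), and all shapes follow the table at the start of this section:

```python
# A compact numpy sketch of the backward pass derived above, with a
# finite-difference spot check on W1. sigma is assumed logistic here.
import numpy as np

rng = np.random.default_rng(7)
n, p, d, c = 8, 5, 4, 3
X = rng.standard_normal((n, p))
Y = np.eye(c)[rng.integers(0, c, n)]
W1, b1 = rng.standard_normal((p, d)), rng.standard_normal(d)
W2, b2 = rng.standard_normal((d, c)), rng.standard_normal(c)

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

def forward(W1_, b1_, W2_, b2_):
    A1 = X @ W1_ + b1_          # X W1 + 1_n b1^T
    H1 = sigma(A1)
    A2 = H1 @ W2_ + b2_         # H1 W2 + 1_n b2^T
    P = np.exp(A2) / np.exp(A2).sum(axis=1, keepdims=True)  # row-wise softmax
    l = np.sum(Y * np.log(P))
    return l, A1, H1, A2, P

l, A1, H1, A2, P = forward(W1, b1, W2, b2)

dA2 = Y - P                               # dl/dA2 = Y - Softmax(A2)
dW2 = H1.T @ dA2                          # dl/dW2 = H1^T dl/dA2
db2 = dA2.T @ np.ones(n)                  # dl/db2 = (dl/dA2)^T 1_n
dH1 = dA2 @ W2.T                          # dl/dH1 = dl/dA2 W2^T
dA1 = dH1 * sigma(A1) * (1 - sigma(A1))   # dl/dA1 = dl/dH1 ⊙ σ'(A1)
dW1 = X.T @ dA1                           # dl/dW1 = X^T dl/dA1
db1 = dA1.T @ np.ones(n)                  # dl/db1 = (dl/dA1)^T 1_n

eps = 1e-6
E = np.zeros_like(W1); E[0, 0] = eps
print(np.isclose(dW1[0, 0],
                 (forward(W1 + E, b1, W2, b2)[0] - l) / eps, atol=1e-3))  # True
```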

