A Summary of Matrix Derivatives (Part 2)
Dwzb 2020-01-13 00:00:00
This article continues from the previous installment.
Chain Rule
When the objective function has a layered structure, the chain rule may seem like a natural fit. For example, for an objective $l = f(\mathbf{Y})$ with $\mathbf{Y} = g(\mathbf{X})$, one can compute $\frac{\partial l}{\partial \mathbf{Y}}$ and $\frac{\partial \mathbf{Y}}{\partial \mathbf{X}}$ separately and then combine them in some way. Nevertheless, the chain rule is not recommended, for the following reasons:
- Computing $\frac{\partial \mathbf{Y}}{\partial \mathbf{X}}$ may require a matrix-by-matrix or vector-by-vector derivative, which usually makes the problem harder.
- The chain-rule formula depends on the derivative layout, so it is easy to misremember.
- Even with a multi-layer structure, the derivative can be obtained without the chain rule; the method is demonstrated in the worked examples.
Overview of the chain rule
This section presents the chain rule in several different settings.
1. Vector-by-vector derivative: suppose three vectors satisfy the dependency $\mathbf{x} \to \mathbf{y} \to \mathbf{z}$, with lengths $a$, $b$, and $c$ respectively. The chain rule is:
- Numerator layout:
$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$$
Dimension check: $(c \times a) = (c \times b) \times (b \times a)$.
- Denominator layout:
$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \frac{\partial \mathbf{z}}{\partial \mathbf{y}}$$
Dimension check: $(a \times c) = (a \times b) \times (b \times c)$.
These two formulas apply only when all three variables are vectors. Notice that the two layouts give different formulas. The numerator-layout form matches the familiar chain rule, but it composes poorly with other conventions. For instance, when $\mathbf{z}$ degenerates to a scalar, scalar-by-vector derivatives are usually written in denominator layout while vector-by-vector derivatives use numerator layout; mixing layouts not only invites confusion but also changes the chain-rule formula itself, as the next part shows.
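As a quick numeric sanity check of the numerator-layout formula, here is a small sketch of my own (not from the original text), using linear maps $\mathbf{y} = \mathbf{A}\mathbf{x}$ and $\mathbf{z} = \mathbf{B}\mathbf{y}$ so that the numerator-layout Jacobians are simply $\mathbf{A}$ and $\mathbf{B}$:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = 4, 3, 2
A = rng.standard_normal((b, a))   # y = A x, numerator-layout Jacobian dy/dx = A  (b x a)
B = rng.standard_normal((c, b))   # z = B y, numerator-layout Jacobian dz/dy = B  (c x b)

# Numerator-layout chain rule: dz/dx = (dz/dy)(dy/dx), dims (c x a) = (c x b)(b x a)
J_chain = B @ A

# Finite-difference Jacobian of the composite map z(x) = B A x
def z(x):
    return B @ (A @ x)

x0 = rng.standard_normal(a)
eps = 1e-6
J_num = np.column_stack([(z(x0 + eps * e) - z(x0 - eps * e)) / (2 * eps)
                         for e in np.eye(a)])
print(np.allclose(J_chain, J_num, atol=1e-6))  # True
```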
2. Scalar-by-vector derivative:
- Numerator layout:
$$\frac{\partial z}{\partial \mathbf{x}} = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)^T \frac{\partial z}{\partial \mathbf{y}}$$
Dimension check: $(a \times 1) = (a \times b) \times (b \times 1)$.
- Denominator layout:
$$\frac{\partial z}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \frac{\partial z}{\partial \mathbf{y}}$$
Dimension check: $(a \times 1) = (a \times b) \times (b \times 1)$.
With the denominator layout the formula is more uniform, but the order contradicts the familiar chain rule and is hard to remember: it is essentially written from right to left, in completely reversed order.
With more variables, say $\mathbf{y}_1 \to \mathbf{y}_2 \to \cdots \to \mathbf{y}_n \to z$, the denominator-layout chain rule becomes:
$$\frac{\partial z}{\partial \mathbf{y}_1} = \frac{\partial \mathbf{y}_2}{\partial \mathbf{y}_1} \frac{\partial \mathbf{y}_3}{\partial \mathbf{y}_2} \cdots \frac{\partial \mathbf{y}_n}{\partial \mathbf{y}_{n-1}} \frac{\partial z}{\partial \mathbf{y}_n}$$
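To see the reversed order concretely, here is a small numeric sketch of my own, with linear links chosen for convenience: $\mathbf{y}_2 = \mathbf{A}\mathbf{y}_1$, $\mathbf{y}_3 = \mathbf{B}\mathbf{y}_2$, $z = \mathbf{c}^T\mathbf{y}_3$, so each denominator-layout factor is a transpose and the product runs right to left:

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, d3 = 5, 4, 3
A = rng.standard_normal((d2, d1))  # y2 = A y1
B = rng.standard_normal((d3, d2))  # y3 = B y2
c = rng.standard_normal(d3)        # z  = c^T y3

# Denominator-layout factors: dy2/dy1 = A^T, dy3/dy2 = B^T, dz/dy3 = c
grad_chain = A.T @ B.T @ c         # (d1 x d2)(d2 x d3)(d3 x 1)

# Direct gradient: z = c^T B A y1 is linear in y1, so dz/dy1 = (c^T B A)^T
grad_direct = c @ B @ A
print(np.allclose(grad_chain, grad_direct))  # True
```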
3. Scalar-by-matrix derivative: because vectorization changes the structure of the matrices involved, a chain rule is awkward to write down directly. Suppose the dependency is $\mathbf{X} \to \mathbf{Y} \to z$, where the two matrices have dimensions $m \times n$ and $p \times q$. The derivatives then have the following dimensions (considering only the denominator layout):
$$\begin{align*} \frac{\partial z}{\partial \mathbf{X}} &: \quad m \times n \\ \frac{\partial z}{\partial \mathbf{Y}} &: \quad p \times q \\ \frac{\partial \mathbf{Y}}{\partial \mathbf{X}} &: \quad mn \times pq \end{align*}$$
Judging by dimensions alone, the relation among the three cannot be
$$\frac{\partial z}{\partial \mathbf{X}} = \frac{\partial \mathbf{Y}}{\partial \mathbf{X}} \frac{\partial z}{\partial \mathbf{Y}}$$
but might instead be
$$\text{vec}\left(\frac{\partial z}{\partial \mathbf{X}}\right) = \frac{\partial \mathbf{Y}}{\partial \mathbf{X}} \, \text{vec}\left(\frac{\partial z}{\partial \mathbf{Y}}\right)$$
I have not found a reference that confirms this identity, but it held on the several examples I checked. As the worked example below also shows, even if the identity is correct, computing with it is far too cumbersome.
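Here is a small numeric check of the identity, a sketch of my own; the concrete choices $\mathbf{Y} = \mathbf{X}\mathbf{B}$ and $z = \|\mathbf{Y}\|_F^2$ are assumptions made for illustration. For these choices one can check that, in denominator layout, $\frac{\partial \mathbf{Y}}{\partial \mathbf{X}} = \mathbf{B} \otimes \mathbf{I}_m$, $\frac{\partial z}{\partial \mathbf{Y}} = 2\mathbf{Y}$, and $\frac{\partial z}{\partial \mathbf{X}} = 2\mathbf{X}\mathbf{B}\mathbf{B}^T$:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, q = 3, 4, 2
X = rng.standard_normal((m, n))
B = rng.standard_normal((n, q))
Y = X @ B                          # Y: m x q, and z = ||Y||_F^2

def vec(M):                        # column-major vectorization, as in the text
    return M.reshape(-1, order="F")

dz_dY = 2 * Y                      # denominator layout, m x q
dz_dX = 2 * X @ B @ B.T            # denominator layout, m x n
dY_dX = np.kron(B, np.eye(m))      # denominator layout, mn x mq

# Check: vec(dz/dX) = (dY/dX) vec(dz/dY)
print(np.allclose(vec(dz_dX), dY_dX @ vec(dz_dY)))  # True
```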
4. Summary: the chain rule is not recommended. If it must be used, only the usage in formula (1) is advisable, i.e., the denominator layout with vectors only, and even that applies to too narrow a range of problems. The two worked examples below demonstrate the recommended way to differentiate.
Worked Examples
1. Scalar-by-vector: given $l = \mathbf{z}^T \mathbf{z}$ and $\mathbf{z} = \mathbf{A}\mathbf{x}$, find $\frac{\partial l}{\partial \mathbf{x}}$.
- Using the chain rule: since
$$\frac{\partial l}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{x}} \frac{\partial l}{\partial \mathbf{z}}$$
we need $\frac{\partial \mathbf{z}}{\partial \mathbf{x}}$ and $\frac{\partial l}{\partial \mathbf{z}}$ separately.
First take the differential of $l$:
$$\mathrm{d}l = \text{tr}[\mathrm{d}(\mathbf{z}^T \mathbf{z})] = \text{tr}[\mathrm{d}\mathbf{z}^T \mathbf{z} + \mathbf{z}^T \mathrm{d}\mathbf{z}] = \text{tr}[2\mathbf{z}^T \mathrm{d}\mathbf{z}]$$
Then compute $\mathrm{d}\mathbf{z}$:
$$\mathrm{d}\mathbf{z} = \mathrm{d}(\mathbf{A}\mathbf{x}) = \mathbf{A}\,\mathrm{d}\mathbf{x}$$
Hence $\frac{\partial l}{\partial \mathbf{z}} = 2\mathbf{z}$ and $\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \mathbf{A}^T$.
Finally, $\frac{\partial l}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{x}} \frac{\partial l}{\partial \mathbf{z}} = 2\mathbf{A}^T \mathbf{z} = 2\mathbf{A}^T \mathbf{A}\mathbf{x}$.
- Differential-only method (recommended): first take the differential of $l$:
$$\mathrm{d}l = \text{tr}[\mathrm{d}(\mathbf{z}^T \mathbf{z})] = \text{tr}[\mathrm{d}\mathbf{z}^T \mathbf{z} + \mathbf{z}^T \mathrm{d}\mathbf{z}] = \text{tr}[2\mathbf{z}^T \mathrm{d}\mathbf{z}]$$
The expression contains $\mathrm{d}\mathbf{z}$, so compute it:
$$\mathrm{d}\mathbf{z} = \mathrm{d}(\mathbf{A}\mathbf{x}) = \mathbf{A}\,\mathrm{d}\mathbf{x}$$
Substituting $\mathrm{d}\mathbf{z}$ into the expression above:
$$\mathrm{d}l = \text{tr}[2\mathbf{z}^T \mathrm{d}\mathbf{z}] = \text{tr}[2\mathbf{z}^T \mathbf{A}\,\mathrm{d}\mathbf{x}] = \text{tr}[(2\mathbf{A}^T \mathbf{z})^T \mathrm{d}\mathbf{x}]$$
Therefore,
$$\frac{\partial l}{\partial \mathbf{x}} = 2\mathbf{A}^T \mathbf{z} = 2\mathbf{A}^T \mathbf{A}\mathbf{x}$$
- Comparison: the two methods involve essentially the same computations; both take the differential of the two given equations. The difference is that the second method substitutes the differentials directly rather than extracting intermediate derivatives. It needs no extra memorized formulas and adds no computation. The section "Comprehensive Examples - Neural network" shows this method at work on a complex case.
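A quick finite-difference check of the result, a sketch of my own with random data:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 5, 4
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

def loss(x):
    z = A @ x
    return z @ z                    # l = z^T z

grad_analytic = 2 * A.T @ A @ x     # derived: dl/dx = 2 A^T A x

eps = 1e-6
grad_num = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])
print(np.allclose(grad_analytic, grad_num, atol=1e-5))  # True
```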
2. Scalar-by-matrix: given $l = \mathbf{z}^T \mathbf{z}$ and $\mathbf{z} = \mathbf{X}\beta$, find $\frac{\partial l}{\partial \mathbf{X}}$.
- Using the chain rule: since
$$\mathrm{d}l = \mathrm{tr}[\mathrm{d}(\mathbf{z}^T \mathbf{z})] = \mathrm{tr}[\mathrm{d}\mathbf{z}^T \mathbf{z} + \mathbf{z}^T \mathrm{d}\mathbf{z}] = \mathrm{tr}[2\mathbf{z}^T \mathrm{d}\mathbf{z}]$$
$$\mathrm{d}\mathbf{z} = \mathrm{d}(\mathbf{X}\beta) = \mathrm{d}\mathbf{X}\,\beta$$
we have
$$\frac{\partial l}{\partial \mathbf{z}} = 2\mathbf{z}, \quad \frac{\partial \mathbf{z}}{\partial \mathbf{X}} = \beta \otimes I_n$$
The dimensions of the matrices involved are:
$$\mathbf{X}: n \times p, \quad \beta: p \times 1, \quad \mathbf{z}: n \times 1$$
$$\frac{\partial l}{\partial \mathbf{z}}: n \times 1, \quad \frac{\partial \mathbf{z}}{\partial \mathbf{X}}: np \times n$$
Then
$$\frac{\partial l}{\partial \mathbf{X}} = \frac{\partial \mathbf{z}}{\partial \mathbf{X}} \frac{\partial l}{\partial \mathbf{z}} = 2[\beta \otimes I_n]\mathbf{z} \quad \left(\frac{\partial l}{\partial \mathbf{X}}: np \times 1\right)$$
Undoing the vectorization then gives:
$$\frac{\partial l}{\partial \mathbf{X}} = 2\mathbf{z}\beta^T = 2\mathbf{X}\beta\beta^T \quad \left(\frac{\partial l}{\partial \mathbf{X}}: n \times p\right)$$
Note: this approach is clearly cumbersome, requiring repeated adjustments to the matrix structure. Here $\mathbf{z}$ being a vector keeps things manageable; if $\mathbf{Z}$ were a matrix, the two derivatives could not be multiplied directly at all, for example $\mathbf{Z} = f(\mathbf{Y})$ with $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{B}$. As an aside, for this particular relation one has $\frac{\partial \mathbf{Z}}{\partial \mathbf{X}} = \mathbf{A}^T \frac{\partial \mathbf{Z}}{\partial \mathbf{Y}}$. This result can be derived with the chain rule above (tediously), or obtained very easily with the differential-only method below; once that method is mastered, there is no need to memorize such special-case relations.
- Differential-only method (recommended): first take the differential of $l$:
$$\mathrm{d}l = \text{tr}[\mathrm{d}(\mathbf{z}^T \mathbf{z})] = \text{tr}[\mathrm{d}\mathbf{z}^T \mathbf{z} + \mathbf{z}^T \mathrm{d}\mathbf{z}] = \text{tr}[2\mathbf{z}^T \mathrm{d}\mathbf{z}]$$
Then compute $\mathrm{d}\mathbf{z}$:
$$\mathrm{d}\mathbf{z} = \mathrm{d}(\mathbf{X}\beta) = \mathrm{d}\mathbf{X}\,\beta$$
Substituting the differential back:
$$\mathrm{d}l = \text{tr}[2\mathbf{z}^T \mathrm{d}\mathbf{z}] = \text{tr}[2\mathbf{z}^T \mathrm{d}\mathbf{X}\,\beta] = \text{tr}[2\beta\mathbf{z}^T \mathrm{d}\mathbf{X}]$$
Therefore,
$$\frac{\partial l}{\partial \mathbf{X}} = 2\mathbf{z}\beta^T = 2\mathbf{X}\beta\beta^T$$
Comprehensive Examples
Logistic binary classification
The log-likelihood is:
$$\begin{align*} l &= \sum_{i=1}^{n} y_i \log p_i + (1 - y_i) \log (1 - p_i) \\ &= \sum_{i=1}^{n} y_i \log \frac{e^{\mathbf{x}_i^T \beta}}{1 + e^{\mathbf{x}_i^T \beta}} + (1 - y_i) \log \frac{1}{1 + e^{\mathbf{x}_i^T \beta}} \\ &= \sum_{i=1}^{n} \left[ y_i \mathbf{x}_i^T \beta - \log (1 + \exp (\mathbf{x}_i^T \beta)) \right] \\ &= \mathbf{y}^T \mathbf{X} \beta - \mathbf{1}^T \log (1 + \exp (\mathbf{X} \beta)) \end{align*}$$
The last step assembles everything into matrix form, removing the summation over samples. One could instead carry the summation through the differentiation and only assemble the matrix form at the end; the trick when assembling is to track the dimension of the target and of every matrix and vector involved. The differential is:
$$\begin{align*} \mathrm{d} l &= \text{tr}\left[\mathbf{y}^T \mathbf{X} \mathrm{d}\beta - \mathbf{1}^T \left(\frac{1}{1 + \exp(\mathbf{X}\beta)} \odot \mathrm{d}\exp(\mathbf{X}\beta)\right)\right] \\ &= \text{tr}\left[\mathbf{y}^T \mathbf{X} \mathrm{d}\beta - \left(\mathbf{1} \odot \frac{1}{1 + \exp(\mathbf{X}\beta)}\right)^T \mathrm{d}\exp(\mathbf{X}\beta)\right] \\ &= \text{tr}\left[\mathbf{y}^T \mathbf{X} \mathrm{d}\beta - \left(\frac{1}{1 + \exp(\mathbf{X}\beta)}\right)^T \left(\exp(\mathbf{X}\beta) \odot \mathbf{X} \mathrm{d}\beta\right)\right] \\ &= \text{tr}\left[\mathbf{y}^T \mathbf{X} \mathrm{d}\beta - \left[\left(\frac{1}{1 + \exp(\mathbf{X}\beta)}\right) \odot \exp(\mathbf{X}\beta)\right]^T \mathbf{X} \mathrm{d}\beta\right] \\ &= \text{tr}\left[\mathbf{y}^T \mathbf{X} \mathrm{d}\beta - \sigma(\mathbf{X}\beta)^T \mathbf{X} \mathrm{d}\beta\right] \\ &= \text{tr}\left[(\mathbf{y}^T - \sigma(\mathbf{X}\beta)^T)\, \mathbf{X} \mathrm{d}\beta\right] \end{align*}$$
Therefore,
$$\nabla_{\beta} l = \mathbf{X}^T (\mathbf{y} - \sigma(\mathbf{X}\beta))$$
where $\sigma(\mathbf{x}) = \frac{e^{\mathbf{x}}}{1 + e^{\mathbf{x}}}$ applies elementwise.
Finding $\nabla^2_{\beta} l$ is a vector-by-vector derivative; take differentials on both sides:
$$\begin{align*} \mathrm{d}\nabla_{\beta} l &= -\mathbf{X}^T \mathrm{d}\sigma(\mathbf{X}\beta) \\ &= -\mathbf{X}^T [\sigma'(\mathbf{X}\beta) \odot \mathbf{X} \mathrm{d}\beta] \\ &= -\mathbf{X}^T \text{diag}[\sigma'(\mathbf{X}\beta)]\, \mathbf{X} \mathrm{d}\beta \end{align*}$$
Therefore,
$$\nabla^2_{\beta} l = -\mathbf{X}^T \text{diag}[\sigma'(\mathbf{X}\beta)]\, \mathbf{X}$$
Keeping the summation over samples, this can be written as
$$\nabla^2_{\beta} l = \frac{\partial^2 l(\beta)}{\partial \beta \partial \beta^T} = -\sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^T \sigma(\mathbf{x}_i^T \beta)(1 - \sigma(\mathbf{x}_i^T \beta))$$
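A matching check for the Hessian, again a sketch of my own, using $\sigma' = \sigma(1-\sigma)$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 3
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = rng.integers(0, 2, size=n).astype(float)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
grad = lambda b: X.T @ (y - sigmoid(X @ b))        # gradient derived above

s = sigmoid(X @ beta)
H = -X.T @ np.diag(s * (1 - s)) @ X                # derived Hessian, sigma' = sigma(1 - sigma)

# Numeric Hessian: finite differences of the gradient
eps = 1e-6
H_num = np.column_stack([(grad(beta + eps * e) - grad(beta - eps * e)) / (2 * eps)
                         for e in np.eye(p)])
print(np.allclose(H, H_num, atol=1e-4))  # True
```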
Softmax multiclass classification
First define the variable dimensions:
$$\mathbf{Y}: n \times c, \quad \mathbf{y}_i: c \times 1$$
$$\mathbf{X}: n \times d, \quad \mathbf{x}_i: d \times 1$$
$$\mathbf{W}: d \times c$$
$$\mathbf{1}_c: c \times 1, \quad \mathbf{1}_n: n \times 1$$
The log-likelihood is:
$$\begin{align*} l &= \sum_{i=1}^n \mathbf{y}_i^T \log \frac{\exp(\mathbf{W}^T \mathbf{x}_i)}{\mathbf{1}_c^T \exp(\mathbf{W}^T \mathbf{x}_i)} \\ &= \sum_{i=1}^n \mathbf{y}_i^T \mathbf{W}^T \mathbf{x}_i - \mathbf{y}_i^T \mathbf{1}_c \log(\mathbf{1}_c^T \exp(\mathbf{W}^T \mathbf{x}_i)) \\ &= \sum_{i=1}^n \mathbf{y}_i^T \mathbf{W}^T \mathbf{x}_i - \log(\mathbf{1}_c^T \exp(\mathbf{W}^T \mathbf{x}_i)) \\ &= \mathrm{tr}(\mathbf{X} \mathbf{W} \mathbf{Y}^T) - \mathbf{1}_n^T \log[\exp(\mathbf{X}\mathbf{W}) \mathbf{1}_c] \end{align*}$$
The third line uses $\mathbf{y}_i^T \mathbf{1}_c = 1$, since each $\mathbf{y}_i$ is one-hot.
As before, the last step assembles the matrix form and removes the summation over samples; one could also keep the summation and assemble the matrix form at the end, tracking the dimensions of the target and of each matrix and vector. The differential is:
$$\begin{align*} \mathrm{d} l &= \mathrm{tr}(\mathbf{X} \mathrm{d}\mathbf{W} \mathbf{Y}^T) - \mathrm{tr}\left(\mathbf{1}_n^T \left[\frac{1}{\exp(\mathbf{X}\mathbf{W})\mathbf{1}_c} \odot \mathrm{d}\exp(\mathbf{X}\mathbf{W})\mathbf{1}_c\right]\right) \\ &= \mathrm{tr}(\mathbf{Y}^T \mathbf{X} \mathrm{d}\mathbf{W}) - \mathrm{tr}\left(\left[\mathbf{1}_n \odot \frac{1}{\exp(\mathbf{X}\mathbf{W})\mathbf{1}_c}\right]^T \mathrm{d}\exp(\mathbf{X}\mathbf{W})\mathbf{1}_c\right) \\ &= \mathrm{tr}(\mathbf{Y}^T \mathbf{X} \mathrm{d}\mathbf{W}) - \mathrm{tr}\left(\left[\frac{1}{\exp(\mathbf{X}\mathbf{W})\mathbf{1}_c}\right]^T \left[\exp(\mathbf{X}\mathbf{W}) \odot \mathbf{X}\mathrm{d}\mathbf{W}\right] \mathbf{1}_c\right) \\ &= \mathrm{tr}(\mathbf{Y}^T \mathbf{X} \mathrm{d}\mathbf{W}) - \mathrm{tr}\left(\left[\frac{1}{\exp(\mathbf{X}\mathbf{W})\mathbf{1}_c} \mathbf{1}_c^T\right]^T \left[\exp(\mathbf{X}\mathbf{W}) \odot \mathbf{X}\mathrm{d}\mathbf{W}\right]\right) \\ &= \mathrm{tr}(\mathbf{Y}^T \mathbf{X} \mathrm{d}\mathbf{W}) - \mathrm{tr}\left(\left[\frac{1}{\exp(\mathbf{X}\mathbf{W})\mathbf{1}_c} \mathbf{1}_c^T \odot \exp(\mathbf{X}\mathbf{W})\right]^T \mathbf{X}\mathrm{d}\mathbf{W}\right) \\ &= \mathrm{tr}(\mathbf{Y}^T \mathbf{X} \mathrm{d}\mathbf{W}) - \mathrm{tr}(\mathrm{Softmax}(\mathbf{X}\mathbf{W})^T \mathbf{X}\mathrm{d}\mathbf{W}) \\ &= \mathrm{tr}((\mathbf{Y}^T - \mathrm{Softmax}(\mathbf{X}\mathbf{W})^T)\, \mathbf{X}\mathrm{d}\mathbf{W}) \end{align*}$$
Therefore,
$$\nabla_{\mathbf{W}} l = \mathbf{X}^T (\mathbf{Y} - \mathrm{Softmax}(\mathbf{X}\mathbf{W}))$$
where $\mathrm{Softmax}(\mathbf{X}\mathbf{W})$ is an $n \times c$ matrix obtained by applying to each row of $\mathbf{X}\mathbf{W}$ the map
$$\mathrm{softmax}(\mathbf{x}) = \frac{\exp(\mathbf{x})}{\mathbf{1}^T \exp(\mathbf{x})}, \qquad (\mathbf{x}: c \times 1)$$
Keeping the summation over samples, the first derivative can be written as
$$\nabla_{\mathbf{W}} l = \sum_{i=1}^{n} \mathbf{x}_i (\mathbf{y}_i - \mathrm{softmax}(\mathbf{W}^T \mathbf{x}_i))^T$$
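A numeric check of this gradient, a sketch of my own with random one-hot labels:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, c = 30, 4, 3
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, c))
Y = np.eye(c)[rng.integers(0, c, size=n)]          # one-hot labels, n x c

def softmax_rows(M):
    E = np.exp(M - M.max(axis=1, keepdims=True))   # numerically stabilized row softmax
    return E / E.sum(axis=1, keepdims=True)

def loglik(W):
    A = X @ W
    # l = tr(X W Y^T) - 1_n^T log[exp(XW) 1_c]
    return np.trace(A @ Y.T) - np.sum(np.log(np.exp(A).sum(axis=1)))

grad = X.T @ (Y - softmax_rows(X @ W))             # derived: X^T (Y - Softmax(XW))

eps = 1e-6
grad_num = np.zeros_like(W)
for i in range(d):
    for j in range(c):
        E = np.zeros_like(W); E[i, j] = eps
        grad_num[i, j] = (loglik(W + E) - loglik(W - E)) / (2 * eps)
print(np.allclose(grad, grad_num, atol=1e-5))  # True
```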
Finding $\nabla^2_{\mathbf{W}} l$ is again a vector-by-vector derivative; take differentials on both sides:
$$\begin{align*} \mathrm{d}\nabla_{\mathbf{W}} l &= -\sum_{i=1}^{n} \mathbf{x}_i\, \mathrm{d}[\mathrm{softmax}(\mathbf{W}^T \mathbf{x}_i)^T] \\ &= -\sum_{i=1}^{n} \mathbf{x}_i\, \mathrm{d}\left[\frac{\exp(\mathbf{W}^T \mathbf{x}_i)}{\mathbf{1}_c^T \exp(\mathbf{W}^T \mathbf{x}_i)}\right]^T \\ &= -\sum_{i=1}^{n} \mathbf{x}_i \left[\frac{\exp(\mathbf{W}^T \mathbf{x}_i) \odot \mathrm{d}\mathbf{W}^T \mathbf{x}_i}{\mathbf{1}_c^T \exp(\mathbf{W}^T \mathbf{x}_i)} - \frac{\exp(\mathbf{W}^T \mathbf{x}_i)(\exp(\mathbf{W}^T \mathbf{x}_i)^T \mathrm{d}\mathbf{W}^T \mathbf{x}_i)}{[\mathbf{1}_c^T \exp(\mathbf{W}^T \mathbf{x}_i)]^2}\right]^T \\ &= -\sum_{i=1}^{n} \mathbf{x}_i \left[\frac{\mathrm{diag}[\exp(\mathbf{W}^T \mathbf{x}_i)]\, \mathrm{d}\mathbf{W}^T \mathbf{x}_i}{\mathbf{1}_c^T \exp(\mathbf{W}^T \mathbf{x}_i)} - \frac{\exp(\mathbf{W}^T \mathbf{x}_i)(\exp(\mathbf{W}^T \mathbf{x}_i)^T \mathrm{d}\mathbf{W}^T \mathbf{x}_i)}{[\mathbf{1}_c^T \exp(\mathbf{W}^T \mathbf{x}_i)]^2}\right]^T \\ &= -\sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^T \mathrm{d}\mathbf{W} \left[\mathrm{diag}(\mathrm{softmax}(\mathbf{W}^T \mathbf{x}_i)) - \mathrm{softmax}(\mathbf{W}^T \mathbf{x}_i)\, \mathrm{softmax}(\mathbf{W}^T \mathbf{x}_i)^T\right]^T \\ &= -\sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^T \mathrm{d}\mathbf{W}\, \mathbf{D}(\mathbf{W}^T \mathbf{x}_i)^T \end{align*}$$
where
$$\mathbf{D}(\mathbf{a}) = \mathrm{diag}(\mathrm{softmax}(\mathbf{a})) - \mathrm{softmax}(\mathbf{a})\, \mathrm{softmax}(\mathbf{a})^T$$
Vectorizing then gives:
$$\text{vec}(\mathrm{d}\nabla_{\mathbf{W}} l) = -\sum_{i=1}^{n} (\mathbf{D}(\mathbf{W}^T \mathbf{x}_i) \otimes \mathbf{x}_i \mathbf{x}_i^T)\, \text{vec}(\mathrm{d}\mathbf{W})$$
Therefore,
$$\nabla^2_{\mathbf{W}} l = -\sum_{i=1}^{n} \mathbf{D}(\mathbf{W}^T \mathbf{x}_i)^T \otimes \mathbf{x}_i \mathbf{x}_i^T$$
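The Hessian formula can also be checked numerically; the sketch below is my own. Note that the Kronecker form refers to column-major vectorization of $\mathbf{W}$, so the numeric Hessian must be taken with respect to $\text{vec}(\mathbf{W})$ in the same order (the $\mathbf{Y}$ term of $l$ is linear in $\mathbf{W}$ and drops out of the Hessian, so it is set to zero here):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, c = 10, 3, 4
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, c))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Derived formula: Hessian = -sum_i D(W^T x_i) kron x_i x_i^T  (D is symmetric)
H = np.zeros((d * c, d * c))
for x in X:
    s = softmax(W.T @ x)
    D = np.diag(s) - np.outer(s, s)
    H -= np.kron(D, np.outer(x, x))

# Numeric Hessian: finite differences of the gradient w.r.t. vec(W), column-major
def grad_vec(w):
    Wm = w.reshape((d, c), order="F")
    S = np.vstack([softmax(Wm.T @ x) for x in X])   # n x c matrix of softmax rows
    G = -X.T @ S                                    # gradient with Y = 0
    return G.reshape(-1, order="F")

w0 = W.reshape(-1, order="F")
eps = 1e-6
H_num = np.column_stack([(grad_vec(w0 + eps * e) - grad_vec(w0 - eps * e)) / (2 * eps)
                         for e in np.eye(d * c)])
print(np.allclose(H, H_num, atol=1e-4))  # True
```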
Neural network
First define the variable dimensions:
$$\begin{align*} &\mathbf{Y}: n \times c, \quad \mathbf{y}_i: c \times 1 \\ &\mathbf{X}: n \times p, \quad \mathbf{x}_i: p \times 1 \\ &\mathbf{W}_1: p \times d, \quad \mathbf{b}_1: d \times 1 \\ &\mathbf{W}_2: d \times c, \quad \mathbf{b}_2: c \times 1 \\ &\mathbf{1}_c: c \times 1, \quad \mathbf{1}_n: n \times 1 \end{align*}$$
The log-likelihood is:
$$l = \sum_{i=1}^{n} \mathbf{y}_i^T \log \mathrm{softmax}(\mathbf{W}_2^T \sigma(\mathbf{W}_1^T \mathbf{x}_i + \mathbf{b}_1) + \mathbf{b}_2)$$
where the $\mathrm{softmax}$ function is defined as:
$$\mathrm{softmax}(\mathbf{x}) = \frac{\exp(\mathbf{x})}{\mathbf{1}^T \exp(\mathbf{x})}, \qquad (\mathbf{x}: c \times 1)$$
The likelihood can be decomposed into several equations:
$$\begin{align*} l &= \sum_{i=1}^{n} \mathbf{y}_i^T \log \mathrm{softmax}(\mathbf{a}_{2i}) \\ \mathbf{a}_{2i} &= \mathbf{W}_2^T \mathbf{h}_{1i} + \mathbf{b}_2 \\ \mathbf{h}_{1i} &= \sigma(\mathbf{a}_{1i}) \\ \mathbf{a}_{1i} &= \mathbf{W}_1^T \mathbf{x}_i + \mathbf{b}_1 \end{align*}$$
Dropping the summation over samples, the derivation parallels the softmax multiclass section above, so the results are stated directly:
$$\begin{align*} l &= \mathrm{tr}(A_2 \mathbf{Y}^T) - \mathbf{1}_n^T \log[\exp(A_2) \mathbf{1}_c] \\ A_2 &= H_1 \mathbf{W}_2 + \mathbf{1}_n \mathbf{b}_2^T \\ H_1 &= \sigma(A_1) \\ A_1 &= \mathbf{X} \mathbf{W}_1 + \mathbf{1}_n \mathbf{b}_1^T \end{align*}$$
From that derivation we also obtain:
$$\mathrm{d} l = \mathrm{tr}\left(\left[\frac{\partial l}{\partial A_2}\right]^T \mathrm{d} A_2\right) \quad \left(\text{where } \frac{\partial l}{\partial A_2} = \mathbf{Y} - \mathrm{Softmax}(A_2)\right) \tag{4}$$
The differential of $A_2$ is:
$$\mathrm{d} A_2 = \mathrm{d} H_1 \mathbf{W}_2 + H_1 \mathrm{d}\mathbf{W}_2 + \mathbf{1}_n \mathrm{d}\mathbf{b}_2^T$$
Substituting into (4):
$$\begin{align*} \mathrm{d} l &= \mathrm{tr}\left(\left[\frac{\partial l}{\partial A_2}\right]^T \left[\mathrm{d} H_1 \mathbf{W}_2 + H_1 \mathrm{d}\mathbf{W}_2 + \mathbf{1}_n \mathrm{d}\mathbf{b}_2^T\right]\right) \\ &= \mathrm{tr}\left(\mathbf{W}_2 \left[\frac{\partial l}{\partial A_2}\right]^T \mathrm{d} H_1 + \left[\frac{\partial l}{\partial A_2}\right]^T H_1 \mathrm{d}\mathbf{W}_2 + \mathbf{1}_n^T \left[\frac{\partial l}{\partial A_2}\right] \mathrm{d}\mathbf{b}_2\right) \\ &= \mathrm{tr}\left(\left[\frac{\partial l}{\partial H_1}\right]^T \mathrm{d} H_1 + \left[\frac{\partial l}{\partial \mathbf{W}_2}\right]^T \mathrm{d}\mathbf{W}_2 + \left[\frac{\partial l}{\partial \mathbf{b}_2}\right]^T \mathrm{d}\mathbf{b}_2\right) \end{align*}$$
where
$$\frac{\partial l}{\partial H_1} = \frac{\partial l}{\partial A_2} \mathbf{W}_2^T, \quad \frac{\partial l}{\partial \mathbf{W}_2} = H_1^T \frac{\partial l}{\partial A_2}, \quad \frac{\partial l}{\partial \mathbf{b}_2} = \left[\frac{\partial l}{\partial A_2}\right]^T \mathbf{1}_n$$
Next take the differential of $H_1$:
$$\mathrm{d} H_1 = \sigma'(A_1) \odot \mathrm{d} A_1$$
The first part of the differential of $l$ can then be written as:
$$\begin{align} \mathrm{d} l_1 &= \mathrm{tr}\left(\left[\frac{\partial l}{\partial H_1}\right]^T [\sigma'(A_1) \odot \mathrm{d} A_1]\right) \notag \\ &= \mathrm{tr}\left(\left[\frac{\partial l}{\partial H_1} \odot \sigma'(A_1)\right]^T \mathrm{d} A_1\right) \notag \\ &= \mathrm{tr}\left(\left[\frac{\partial l}{\partial A_1}\right]^T \mathrm{d} A_1\right) \tag{5} \end{align}$$
where
$$\frac{\partial l}{\partial A_1} = \frac{\partial l}{\partial H_1} \odot \sigma'(A_1)$$
Now compute the differential of $A_1$:
$$\mathrm{d} A_1 = \mathbf{X} \mathrm{d}\mathbf{W}_1 + \mathbf{1}_n \mathrm{d}\mathbf{b}_1^T$$
Substituting into (5):
$$\begin{align*} \mathrm{d} l_1 &= \mathrm{tr}\left(\left[\frac{\partial l}{\partial A_1}\right]^T \left[\mathbf{X} \mathrm{d}\mathbf{W}_1 + \mathbf{1}_n \mathrm{d}\mathbf{b}_1^T\right]\right) \\ &= \mathrm{tr}\left(\left[\frac{\partial l}{\partial A_1}\right]^T \mathbf{X} \mathrm{d}\mathbf{W}_1 + \mathbf{1}_n^T \left[\frac{\partial l}{\partial A_1}\right] \mathrm{d}\mathbf{b}_1\right) \\ &= \mathrm{tr}\left(\left[\frac{\partial l}{\partial \mathbf{W}_1}\right]^T \mathrm{d}\mathbf{W}_1 + \left[\frac{\partial l}{\partial \mathbf{b}_1}\right]^T \mathrm{d}\mathbf{b}_1\right) \end{align*}$$
where
$$\frac{\partial l}{\partial \mathbf{W}_1} = \mathbf{X}^T \frac{\partial l}{\partial A_1}, \quad \frac{\partial l}{\partial \mathbf{b}_1} = \left[\frac{\partial l}{\partial A_1}\right]^T \mathbf{1}_n$$
The derivation is complete; substituting back layer by layer yields the derivatives of $l$ with respect to $\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{b}_1$, and $\mathbf{b}_2$.
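The formulas above translate directly into a backpropagation routine. Below is a minimal sketch of my own, taking $\sigma$ to be the logistic sigmoid (an assumption; the text leaves $\sigma$ generic), with random data and a finite-difference spot check on one entry of $\mathbf{W}_1$:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, d, c = 20, 5, 4, 3
X = rng.standard_normal((n, p))
Y = np.eye(c)[rng.integers(0, c, size=n)]           # one-hot labels
W1, b1 = rng.standard_normal((p, d)), rng.standard_normal(d)
W2, b2 = rng.standard_normal((d, c)), rng.standard_normal(c)

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))          # assumed logistic activation

def forward(W1, b1, W2, b2):
    A1 = X @ W1 + np.outer(np.ones(n), b1)          # A1 = X W1 + 1_n b1^T
    H1 = sigma(A1)
    A2 = H1 @ W2 + np.outer(np.ones(n), b2)         # A2 = H1 W2 + 1_n b2^T
    return A1, H1, A2

def loglik(W1, b1, W2, b2):
    _, _, A2 = forward(W1, b1, W2, b2)
    # l = tr(A2 Y^T) - 1_n^T log[exp(A2) 1_c]
    return np.trace(A2 @ Y.T) - np.sum(np.log(np.exp(A2).sum(axis=1)))

A1, H1, A2 = forward(W1, b1, W2, b2)
S = np.exp(A2) / np.exp(A2).sum(axis=1, keepdims=True)   # row-wise Softmax(A2)

dA2 = Y - S                                         # dl/dA2
dW2 = H1.T @ dA2                                    # dl/dW2 = H1^T dl/dA2
db2 = dA2.T @ np.ones(n)                            # dl/db2 = [dl/dA2]^T 1_n
dH1 = dA2 @ W2.T                                    # dl/dH1 = dl/dA2 W2^T
dA1 = dH1 * (sigma(A1) * (1 - sigma(A1)))           # dl/dA1 = dl/dH1 (Hadamard) sigma'(A1)
dW1 = X.T @ dA1                                     # dl/dW1 = X^T dl/dA1
db1 = dA1.T @ np.ones(n)                            # dl/db1 = [dl/dA1]^T 1_n

# Finite-difference spot check on W1[0, 0]
eps = 1e-6
E = np.zeros_like(W1); E[0, 0] = eps
num = (loglik(W1 + E, b1, W2, b2) - loglik(W1 - E, b1, W2, b2)) / (2 * eps)
print(np.isclose(dW1[0, 0], num, atol=1e-5))  # True
```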