Matrix Derivatives, Ref 1

Note: this article is by Dwzb; because the merged post exceeded CSDN's length limit, it is serialized in installments. This installment is Ref 1.

Lightly reformatted; if anything looks wrong, please consult the original post.


Matrix Derivatives: A Summary (Part 1)

Dwzb 2020-01-12 00:00:00

Derivatives of a Scalar with Respect to a Vector or Matrix

Basic approach

Let $y$ be a scalar, $\mathbf{x}$ a vector, and $\mathbf{A}$ a matrix. Differentiating a scalar with respect to a vector or matrix means differentiating with respect to each element.

  • $\frac{\partial y}{\partial \mathbf{x}}$ is a vector with the same dimensions as $\mathbf{x}$.

  • $\frac{\partial y}{\partial \mathbf{A}}$ is a matrix with the same dimensions as $\mathbf{A}$.

In practice, given a formula such as $l = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)$, there are two approaches to finding $\frac{\partial l}{\partial \beta}$:

  • Expand the matrices into scalar form: introduce the various sums $\sum_{i=1}^{n}$, differentiate $l$ with respect to each $\beta_i$, and assemble the results into the shape the derivative should have.

    • This works when $l$ has a simple form; for complicated forms, use the differential method.
  • Differential method: wrap the right-hand side in a trace, take differentials on both sides, and aim to write the result in the form $\mathrm{d}l = \operatorname{tr}(\mathbf{b}^T \mathrm{d}\beta)$; then $\frac{\partial l}{\partial \beta} = \mathbf{b}$.

    • The same holds for a scalar differentiated with respect to a matrix: write $\mathrm{d}l = \operatorname{tr}(\mathbf{A}^T \mathrm{d}\mathbf{X})$, and then $\frac{\partial l}{\partial \mathbf{X}} = \mathbf{A}$.

      The derivation after wrapping in a trace and taking differentials mainly uses the rules of matrix differentials and the properties of the trace, both listed below. Additional remarks:

      • The right-hand side can be wrapped in a trace because both sides of the equation are scalars; the point of taking the trace is to make the right-hand side easy to manipulate while its value stays unchanged. For example, if the derivation ends in the form $\mathrm{d}l = \operatorname{tr}(\mathbf{b}\,\mathrm{d}\beta^T)$, then $\frac{\partial l}{\partial \beta} = \mathbf{b}$, using the trace's invariance under transposition and cyclic permutation.

      • Often $\mathrm{d}\beta$ does not end up at the far right of the right-hand side; then use cyclic permutation inside the trace to move it there. The guiding principle when permuting is that the matrix products must stay well defined, which is also an effective way to avoid mistakes. Usually the whole block following $\mathrm{d}\beta$ is moved to the front.

    • The differential $\mathrm{d}\mathbf{X}$ has the same dimensions as $\mathbf{X}$; this property helps check that the matrix products remain well defined.
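As a numerical sanity check of the differential method (a minimal NumPy sketch, not from the original post; data and seed are arbitrary): for the least-squares objective above, the method yields $\mathrm{d}l = \operatorname{tr}(2(\mathbf{X}\beta - \mathbf{y})^T\mathbf{X}\,\mathrm{d}\beta)$, hence $\frac{\partial l}{\partial \beta} = 2\mathbf{X}^T(\mathbf{X}\beta - \mathbf{y})$, which can be compared against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)
beta = rng.standard_normal(k)

def loss(b):
    r = y - X @ b
    return r @ r  # (y - Xb)^T (y - Xb)

# Gradient from the differential method: dl = tr(2 (X beta - y)^T X d beta)
grad = 2 * X.T @ (X @ beta - y)

# Independent check: central finite differences, one coordinate at a time
eps = 1e-6
fd = np.array([(loss(beta + eps * e) - loss(beta - eps * e)) / (2 * eps)
               for e in np.eye(k)])

assert np.allclose(grad, fd, atol=1e-5)
```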

Rules of matrix differentials

  • Constant: $\mathrm{d}\mathbf{X} = \mathbf{O}$ if $\mathbf{X}$ consists of constants, where $\mathbf{O}$ has the same dimensions as $\mathbf{X}$.

  • Sum and difference: $\mathrm{d}(\mathbf{X} + \mathbf{Y}) = \mathrm{d}\mathbf{X} + \mathrm{d}\mathbf{Y}$ and $\mathrm{d}(\mathbf{X} - \mathbf{Y}) = \mathrm{d}\mathbf{X} - \mathrm{d}\mathbf{Y}$.

  • Product: $\mathrm{d}(\mathbf{X}\mathbf{Y}) = (\mathrm{d}\mathbf{X})\mathbf{Y} + \mathbf{X}(\mathrm{d}\mathbf{Y})$.

  • Transpose: $\mathrm{d}(\mathbf{X}^T) = (\mathrm{d}\mathbf{X})^T$.

  • Trace: $\mathrm{d}\operatorname{tr}(\mathbf{X}) = \operatorname{tr}(\mathrm{d}\mathbf{X})$.

  • Hadamard product: $\mathrm{d}(\mathbf{X} \odot \mathbf{Y}) = \mathbf{X} \odot \mathrm{d}\mathbf{Y} + \mathrm{d}\mathbf{X} \odot \mathbf{Y}$.

  • Elementwise function: $\mathrm{d}\sigma(\mathbf{X}) = \sigma'(\mathbf{X}) \odot \mathrm{d}\mathbf{X}$, where $\sigma$ applies a function to each element of $\mathbf{X}$ and the result has the same dimensions as $\mathbf{X}$; the $(i,j)$ entry of the differential is $\sigma'(x_{ij})\,\mathrm{d}x_{ij}$.

  • Inverse: $\mathrm{d}\mathbf{X}^{-1} = -\mathbf{X}^{-1}(\mathrm{d}\mathbf{X})\mathbf{X}^{-1}$. This follows by taking differentials of both sides of $\mathbf{X}\mathbf{X}^{-1} = \mathbf{I}$.

  • Determinant: $\mathrm{d}|\mathbf{X}| = |\mathbf{X}|\operatorname{tr}(\mathbf{X}^{-1}\mathrm{d}\mathbf{X})$. Here $\mathbf{X}$ is assumed invertible, since otherwise $|\mathbf{X}| = 0$.

    A more general form is $\mathrm{d}|\mathbf{X}| = \operatorname{tr}(\mathbf{X}^{\#}\mathrm{d}\mathbf{X})$, where $\mathbf{X}^{\#}$ is the adjugate of $\mathbf{X}$.

    Intuition: the cofactor expansion $|\mathbf{X}| = \sum_{j=1}^{n} x_{ij}\mathbf{X}^{\#}_{ji}$ holds for every $i$, so the derivative of $|\mathbf{X}|$ with respect to $x_{ij}$ should be $\mathbf{X}^{\#}_{ji}$. Hence $\frac{\partial |\mathbf{X}|}{\partial \mathbf{X}} = (\mathbf{X}^{\#})^T$, and the differential form is $\mathrm{d}|\mathbf{X}| = \operatorname{tr}(\mathbf{X}^{\#}\mathrm{d}\mathbf{X})$.

    Note: one might instead write $n|\mathbf{X}| = \sum_{i=1}^{n}\sum_{j=1}^{n} x_{ij}\mathbf{X}^{\#}_{ji}$ and conclude that $\frac{\partial |\mathbf{X}|}{\partial \mathbf{X}} = \frac{1}{n}(\mathbf{X}^{\#})^T$. That is wrong, because the entries $\mathbf{X}^{\#}_{mn}$ themselves contain terms in $x_{ij}$, so it is not that simple (whereas in the expansion above, $\mathbf{X}^{\#}_{ji}$ genuinely contains no $x_{ij}$ terms).
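The determinant rule, in both its inverse and adjugate forms, can be checked numerically with a small perturbation (a NumPy sketch with arbitrary test data, added here for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
X = rng.standard_normal((n, n)) + 3 * np.eye(n)  # keep X comfortably invertible
dX = 1e-6 * rng.standard_normal((n, n))          # a small perturbation

# First-order change of the determinant
lhs = np.linalg.det(X + dX) - np.linalg.det(X)

# d|X| = |X| tr(X^{-1} dX)
rhs = np.linalg.det(X) * np.trace(np.linalg.inv(X) @ dX)

# Adjugate form d|X| = tr(X^# dX), using X^# = |X| X^{-1} for invertible X
adj = np.linalg.det(X) * np.linalg.inv(X)
rhs_adj = np.trace(adj @ dX)

scale = abs(np.linalg.det(X))
assert abs(lhs - rhs) < 1e-8 * scale      # agree up to second-order terms
assert abs(lhs - rhs_adj) < 1e-8 * scale
```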

Properties of the trace

  • The trace of a scalar is the scalar itself: $\operatorname{tr}(a) = a$.

  • Transpose: $\operatorname{tr}(\mathbf{A}^T) = \operatorname{tr}(\mathbf{A})$.

  • Linearity: $\operatorname{tr}(\mathbf{A} \pm \mathbf{B}) = \operatorname{tr}(\mathbf{A}) \pm \operatorname{tr}(\mathbf{B})$.

  • Exchange: $\operatorname{tr}(\mathbf{A}^T\mathbf{B}) = \operatorname{tr}(\mathbf{B}^T\mathbf{A})$, where $\mathbf{A}$ and $\mathbf{B}$ have the same dimensions; the trace equals $\sum_{i,j} A_{ij}B_{ij}$. Similarly, $\operatorname{tr}(\mathbf{A}^T(\mathbf{B} \odot \mathbf{C})) = \operatorname{tr}((\mathbf{A} \odot \mathbf{B})^T\mathbf{C})$, where $\mathbf{A}, \mathbf{B}, \mathbf{C}$ have the same dimensions; the trace equals $\sum_{i,j} A_{ij}B_{ij}C_{ij}$.
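The two exchange identities above can be confirmed directly (a small NumPy sketch; the matrices are arbitrary random test data):

```python
import numpy as np

rng = np.random.default_rng(2)
A, B, C = (rng.standard_normal((3, 4)) for _ in range(3))

# tr(A^T B) = tr(B^T A) = sum_ij A_ij B_ij
t = np.trace(A.T @ B)
assert np.isclose(t, np.trace(B.T @ A))
assert np.isclose(t, np.sum(A * B))  # elementwise product, then sum

# tr(A^T (B ⊙ C)) = tr((A ⊙ B)^T C) = sum_ij A_ij B_ij C_ij
t3 = np.trace(A.T @ (B * C))
assert np.isclose(t3, np.trace((A * B).T @ C))
assert np.isclose(t3, np.sum(A * B * C))
```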

Why the differential method works

Why is it that for a scalar over a vector, writing $\mathrm{d}l = \operatorname{tr}(\mathbf{b}^T \mathrm{d}\beta)$ gives $\frac{\partial l}{\partial \beta} = \mathbf{b}$, while for a scalar over a matrix, writing $\mathrm{d}l = \operatorname{tr}(\mathbf{A}^T\mathrm{d}\mathbf{X})$ gives $\frac{\partial l}{\partial \mathbf{X}} = \mathbf{A}$?

  • Scalar over vector: the right-hand side is just $\sum_i b_i\,\mathrm{d}\beta_i$, so $\frac{\partial l}{\partial \beta_i} = b_i$, and therefore $\frac{\partial l}{\partial \beta} = \mathbf{b}$.

  • Scalar over matrix works the same way: the right-hand side is $\sum_{i,j} a_{ij}\,\mathrm{d}X_{ij}$, so $\frac{\partial l}{\partial X_{ij}} = a_{ij}$, and therefore $\frac{\partial l}{\partial \mathbf{X}} = \mathbf{A}$.

This mirrors the total differential of multivariable calculus, which is the inner product of the gradient vector and the differential vector: $\mathrm{d}f = \sum_i \frac{\partial f}{\partial x_i}\mathrm{d}x_i = \left[\frac{\partial f}{\partial x}\right]^T \mathrm{d}x$.

With this principle in mind, other notations work just as well, for instance inner products:

  • Scalar over vector: write $\mathrm{d}l = \langle \mathbf{b}, \mathrm{d}\beta \rangle$, and then $\frac{\partial l}{\partial \beta} = \mathbf{b}$.

  • Scalar over matrix: write $\mathrm{d}l = \langle \mathbf{A}, \mathrm{d}\mathbf{X} \rangle$, and then $\frac{\partial l}{\partial \mathbf{X}} = \mathbf{A}$.

Note: the inner product of two matrices multiplies corresponding entries and sums them all. An application of the inner-product form appears in the next section on handling Hadamard products.

Handling the Hadamard product

When a term of the form $\odot\,\mathrm{d}\mathbf{y}$ shows up, the goal is still to transform it into a form we already know how to handle. Here is one example, worked three ways.

Problem: $l = \mathbf{x}^T \exp(\mathbf{y})$; find $\frac{\partial l}{\partial \mathbf{y}}$.

1. Inner product: $\mathrm{d}l = \mathbf{x}^T[\exp(\mathbf{y}) \odot \mathrm{d}\mathbf{y}] = \langle \mathbf{x}, \exp(\mathbf{y}) \odot \mathrm{d}\mathbf{y} \rangle = \langle \mathbf{x} \odot \exp(\mathbf{y}), \mathrm{d}\mathbf{y} \rangle$, so $\frac{\partial l}{\partial \mathbf{y}} = \mathbf{x} \odot \exp(\mathbf{y})$. The last equality is a standard property, easy to see by writing both sides out as $\sum_i x_i \exp(y_i)\,\mathrm{d}y_i$; it also holds when all three operands are matrices.

2. Trace property: for Hadamard products, the trace has a property analogous to the inner-product one above: when $\mathbf{A}, \mathbf{B}, \mathbf{C}$ have the same dimensions, $\operatorname{tr}((\mathbf{A} \odot \mathbf{B})^T\mathbf{C}) = \operatorname{tr}(\mathbf{A}^T(\mathbf{B} \odot \mathbf{C}))$. Using it, one can write directly $\mathrm{d}l = \operatorname{tr}([\mathbf{x} \odot \exp(\mathbf{y})]^T \mathrm{d}\mathbf{y})$.

3. Matrix multiplication: when the Hadamard product involves vectors, there is a third option. Let $\mathbf{Z} = \operatorname{diag}(\exp(\mathbf{y}))$; then $\mathrm{d}l = \mathbf{x}^T[\exp(\mathbf{y}) \odot \mathrm{d}\mathbf{y}] = \mathbf{x}^T\mathbf{Z}\,\mathrm{d}\mathbf{y}$, which is the familiar form.
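All three methods give $\frac{\partial l}{\partial \mathbf{y}} = \mathbf{x} \odot \exp(\mathbf{y})$, which a finite-difference check confirms (a minimal NumPy sketch with arbitrary test vectors):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
x = rng.standard_normal(n)
y = rng.standard_normal(n)

def l(v):
    return x @ np.exp(v)  # l = x^T exp(y)

grad = x * np.exp(y)  # x ⊙ exp(y), the answer from all three methods

# Central finite differences over each coordinate of y
eps = 1e-6
fd = np.array([(l(y + eps * e) - l(y - eps * e)) / (2 * eps)
               for e in np.eye(n)])
assert np.allclose(grad, fd, atol=1e-6)

# Method 3: dl = x^T Z dy with Z = diag(exp(y)), so the gradient is Z^T x
Z = np.diag(np.exp(y))
assert np.allclose(Z.T @ x, grad)
```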

Worked examples

1. Scalar over vector. Given $l = \mathbf{x}^T\mathbf{A}\mathbf{x}$, find $\frac{\partial l}{\partial \mathbf{x}}$.

  • Solution 1: write the right-hand side in scalar form, $l = \sum_{ij} x_i a_{ij} x_j$. Differentiating with respect to each element of the vector:

    $$\frac{\partial l}{\partial x_k} = \sum_{j \neq k} a_{kj}x_j + \sum_{i \neq k} x_i a_{ik} + 2a_{kk}x_k = \sum_j a_{kj}x_j + \sum_i x_i a_{ik} = \mathbf{A}_{k,:}\mathbf{x} + \mathbf{x}^T\mathbf{A}_{:,k} = \mathbf{A}_{k,:}\mathbf{x} + (\mathbf{A}^T)_{k,:}\mathbf{x}$$

    Stacking these results gives

    $$\frac{\partial l}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}$$

  • Solution 2: differential method. $\mathrm{d}l = \mathrm{d}[\operatorname{tr}(\mathbf{x}^T\mathbf{A}\mathbf{x})] = \operatorname{tr}[\mathrm{d}(\mathbf{x}^T\mathbf{A}\mathbf{x})] = \operatorname{tr}[\mathrm{d}(\mathbf{x}^T\mathbf{A})\mathbf{x} + \mathbf{x}^T\mathbf{A}\,\mathrm{d}\mathbf{x}] = \operatorname{tr}[\mathbf{x}^T\mathbf{A}^T\mathrm{d}\mathbf{x} + \mathbf{x}^T\mathbf{A}\,\mathrm{d}\mathbf{x}] = \operatorname{tr}[\mathbf{x}^T(\mathbf{A}^T + \mathbf{A})\mathrm{d}\mathbf{x}]$. Therefore $\frac{\partial l}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}$.
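A numerical check of example 1 (a NumPy sketch with an arbitrary, deliberately non-symmetric $\mathbf{A}$, so that $(\mathbf{A} + \mathbf{A}^T)\mathbf{x} \neq 2\mathbf{A}\mathbf{x}$):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n))  # deliberately non-symmetric
x = rng.standard_normal(n)

def l(v):
    return v @ A @ v  # x^T A x

grad = (A + A.T) @ x  # the derived gradient

# Central finite differences over each coordinate of x
eps = 1e-6
fd = np.array([(l(x + eps * e) - l(x - eps * e)) / (2 * eps)
               for e in np.eye(n)])
assert np.allclose(grad, fd, atol=1e-5)
```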

2. Scalar over matrix. Given $l = \mathbf{a}^T\mathbf{X}\mathbf{b}$, find $\frac{\partial l}{\partial \mathbf{X}}$.

  • Solution 1: write the right-hand side in scalar form, $l = \sum_{ij} a_i x_{ij} b_j$. Differentiating with respect to each element of the matrix gives $\frac{\partial l}{\partial X_{ij}} = a_i b_j$, so $\frac{\partial l}{\partial \mathbf{X}} = \mathbf{a}\mathbf{b}^T$.

  • Solution 2: differential method. $\mathrm{d}l = \mathrm{d}[\operatorname{tr}(\mathbf{a}^T\mathbf{X}\mathbf{b})] = \operatorname{tr}[\mathrm{d}(\mathbf{a}^T\mathbf{X}\mathbf{b})] = \operatorname{tr}[\mathbf{a}^T\mathrm{d}(\mathbf{X}\mathbf{b})] = \operatorname{tr}[\mathbf{a}^T(\mathrm{d}\mathbf{X})\mathbf{b}] = \operatorname{tr}[\mathbf{b}\mathbf{a}^T\mathrm{d}\mathbf{X}]$. Therefore $\frac{\partial l}{\partial \mathbf{X}} = \mathbf{a}\mathbf{b}^T$.

3. Maximum likelihood estimation of $\Sigma$ for a multivariate normal distribution requires the derivative with respect to $\Sigma$ of the (rescaled) log-likelihood objective

$$l = \log|\Sigma| + \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \overline{\mathbf{x}})^T\Sigma^{-1}(\mathbf{x}_i - \overline{\mathbf{x}})$$

Using the differential method:

$$\begin{aligned} \mathrm{d}l &= \frac{1}{|\Sigma|}\mathrm{d}|\Sigma| + \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \overline{\mathbf{x}})^T\mathrm{d}(\Sigma^{-1})(\mathbf{x}_i - \overline{\mathbf{x}}) \\ &= \operatorname{tr}(\Sigma^{-1}\mathrm{d}\Sigma) - \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \overline{\mathbf{x}})^T\Sigma^{-1}(\mathrm{d}\Sigma)\Sigma^{-1}(\mathbf{x}_i - \overline{\mathbf{x}}) \\ &= \operatorname{tr}(\Sigma^{-1}\mathrm{d}\Sigma) - \frac{1}{N}\sum_{i=1}^{N}\operatorname{tr}\left(\Sigma^{-1}(\mathbf{x}_i - \overline{\mathbf{x}})(\mathbf{x}_i - \overline{\mathbf{x}})^T\Sigma^{-1}\mathrm{d}\Sigma\right) \\ &= \operatorname{tr}\left(\left[\Sigma^{-1} - \frac{1}{N}\sum_{i=1}^{N}\Sigma^{-1}(\mathbf{x}_i - \overline{\mathbf{x}})(\mathbf{x}_i - \overline{\mathbf{x}})^T\Sigma^{-1}\right]\mathrm{d}\Sigma\right) \\ &= \operatorname{tr}\left(\left[\Sigma^{-1} - \Sigma^{-1}\mathbf{S}\Sigma^{-1}\right]\mathrm{d}\Sigma\right) \end{aligned}$$

where $\mathbf{S} = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \overline{\mathbf{x}})(\mathbf{x}_i - \overline{\mathbf{x}})^T$ is the sample covariance matrix.

Therefore

$$\frac{\partial l}{\partial \Sigma} = (\Sigma^{-1} - \Sigma^{-1}\mathbf{S}\Sigma^{-1})^T$$
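This derivative can also be checked numerically, treating every entry of $\Sigma$ as a free parameter, as the derivation does (a NumPy sketch with arbitrary synthetic data; the test point $\Sigma$ is chosen merely to be positive definite):

```python
import numpy as np

rng = np.random.default_rng(5)
d, N = 3, 50
X = rng.standard_normal((N, d))      # rows are the samples x_i
xbar = X.mean(axis=0)
R = X - xbar                         # centered samples
S = R.T @ R / N                      # sample covariance, 1/N convention

def l(Sigma):
    # log|Sigma| + (1/N) sum_i (x_i - xbar)^T Sigma^{-1} (x_i - xbar)
    return np.log(np.linalg.det(Sigma)) + np.trace(R @ np.linalg.inv(Sigma) @ R.T) / N

Sigma = S + 0.5 * np.eye(d)          # some positive-definite test point
Si = np.linalg.inv(Sigma)
grad = (Si - Si @ S @ Si).T          # the result derived above

# Finite differences, entry by entry
eps = 1e-6
fd = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d)); E[i, j] = eps
        fd[i, j] = (l(Sigma + E) - l(Sigma - E)) / (2 * eps)

assert np.allclose(grad, fd, atol=1e-5)
```

Note that the gradient vanishes at $\Sigma = \mathbf{S}$, recovering the familiar MLE.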

Derivatives Between Vectors and Matrices

In machine learning, the common case is a scalar differentiated with respect to a vector or matrix, but second derivatives or the chain rule require differentiating a vector with respect to a vector, or a matrix with respect to a matrix.

Vector with respect to vector

Let $\mathbf{y}$ be a vector of length $m$ and $\mathbf{x}$ a vector of length $n$. There are two conventions for writing $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$:

  • Numerator layout: an $m \times n$ matrix, usually called the Jacobian matrix.

  • Denominator layout: an $n \times m$ matrix, usually called the gradient matrix.

The two are essentially the same, just written differently: each is the transpose of the other. The scalar-over-vector and scalar-over-matrix derivatives above used denominator layout, so the rest of this article uses denominator layout throughout, in which

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \end{pmatrix}$$

For vector-over-vector derivatives, simply arrange things in the form $\mathrm{d}\mathbf{y} = \left[\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right]^T \mathrm{d}\mathbf{x}$; compared with the scalar case, the only difference is that the trace is gone.

Matrix with respect to matrix

Differentiating a matrix $\mathbf{Y}_{p \times q}$ with respect to a matrix $\mathbf{X}_{m \times n}$ must produce $pq \times mn$ values. To avoid high-dimensional arrays, flatten both $\mathbf{X}$ and $\mathbf{Y}$ into vectors by stacking their columns:

$$\operatorname{vec}(\mathbf{X}) = [X_{11}, \ldots, X_{m1}, X_{12}, \ldots, X_{m2}, \ldots, X_{1n}, \ldots, X_{mn}]^T \quad (mn \times 1)$$

Matrix-over-matrix differentiation then reduces to vector-over-vector: $\frac{\partial \mathbf{Y}}{\partial \mathbf{X}} = \frac{\partial \operatorname{vec}(\mathbf{Y})}{\partial \operatorname{vec}(\mathbf{X})}$ (of dimension $mn \times pq$), and the derivative relates to the differential by

$$\operatorname{vec}(\mathrm{d}\mathbf{Y}) = \left[\frac{\partial \mathbf{Y}}{\partial \mathbf{X}}\right]^T \operatorname{vec}(\mathrm{d}\mathbf{X})$$

So the procedure for matrix-over-matrix derivatives is: take differentials of both sides, apply $\operatorname{vec}$ to both sides, then move $\operatorname{vec}(\mathrm{d}\mathbf{X})$ to the far right. This uses the properties of vectorization together with identities for the Kronecker product and the commutation matrix.

Vectorization

  • Linearity: $\operatorname{vec}(\mathbf{A} + \mathbf{B}) = \operatorname{vec}(\mathbf{A}) + \operatorname{vec}(\mathbf{B})$.

  • Matrix multiplication: $\operatorname{vec}(\mathbf{A}\mathbf{X}\mathbf{B}) = (\mathbf{B}^T \otimes \mathbf{A})\operatorname{vec}(\mathbf{X})$; in particular, $\operatorname{vec}(\mathbf{A}\mathbf{X}) = \operatorname{vec}(\mathbf{A}\mathbf{X}\mathbf{I}) = (\mathbf{I} \otimes \mathbf{A})\operatorname{vec}(\mathbf{X})$. Here $\otimes$ is the Kronecker product: $\mathbf{A}_{m \times n} \otimes \mathbf{B}_{p \times q} = [A_{ij}\mathbf{B}]_{mp \times nq}$.

  • Transpose: $\operatorname{vec}(\mathbf{A}^T) = \mathbf{K}_{mn}\operatorname{vec}(\mathbf{A}_{m \times n})$, where $\mathbf{K}_{mn}$ is the commutation matrix, of dimension $mn \times mn$, which turns column-major vectorization into row-major vectorization.

  • Elementwise multiplication: $\operatorname{vec}(\mathbf{A} \odot \mathbf{X}) = \operatorname{diag}(\mathbf{A})\operatorname{vec}(\mathbf{X})$, where $\operatorname{diag}(\mathbf{A})$ is the $mn \times mn$ diagonal matrix whose diagonal holds the elements of $\mathbf{A}$ in column-major order.
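The vectorization identities above are easy to confirm with NumPy (a minimal sketch; note column-stacking corresponds to `order="F"` in `reshape`, and the shapes are arbitrary):

```python
import numpy as np

def vec(M):
    # column-major (column-stacking) vectorization
    return M.reshape(-1, order="F")

rng = np.random.default_rng(6)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# vec(AXB) = (B^T ⊗ A) vec(X)
assert np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))

# vec(AX) = (I ⊗ A) vec(X)
assert np.allclose(vec(A @ X), np.kron(np.eye(4), A) @ vec(X))

# vec(A ⊙ X) = diag(vec(A)) vec(X), for same-shaped operands
A2 = rng.standard_normal(X.shape)
assert np.allclose(vec(A2 * X), np.diag(vec(A2)) @ vec(X))
```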

Identities for the Kronecker product and the commutation matrix

  • $(\mathbf{A} \otimes \mathbf{B})^T = \mathbf{A}^T \otimes \mathbf{B}^T$

  • $\operatorname{vec}(\mathbf{a}\mathbf{b}^T) = \mathbf{b} \otimes \mathbf{a}$

  • $(\mathbf{A} \otimes \mathbf{B})(\mathbf{C} \otimes \mathbf{D}) = (\mathbf{A}\mathbf{C}) \otimes (\mathbf{B}\mathbf{D})$

  • $\mathbf{K}_{mn} = \mathbf{K}_{nm}^T$, and $\mathbf{K}_{mn}\mathbf{K}_{nm} = \mathbf{I}$

  • $\mathbf{K}_{pm}(\mathbf{A} \otimes \mathbf{B})\mathbf{K}_{nq} = \mathbf{B} \otimes \mathbf{A}$, where $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $p \times q$
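The commutation matrix is straightforward to build explicitly, which makes these identities checkable (a NumPy sketch; the `commutation` helper and its index bookkeeping are my own construction, following the definition $\operatorname{vec}(\mathbf{A}^T) = \mathbf{K}_{mn}\operatorname{vec}(\mathbf{A})$):

```python
import numpy as np

def vec(M):
    return M.reshape(-1, order="F")  # column-stacking vectorization

def commutation(m, n):
    # K_{mn}: the mn x mn 0/1 matrix with vec(A^T) = K_{mn} vec(A), A of shape m x n.
    # vec(A) puts A[i, j] at index i + j*m; vec(A^T) puts it at index j + i*n.
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[j + i * n, i + j * m] = 1.0
    return K

rng = np.random.default_rng(7)
m, n, p, q = 2, 3, 4, 5
A = rng.standard_normal((m, n))
B = rng.standard_normal((p, q))

K_mn = commutation(m, n)
assert np.allclose(vec(A.T), K_mn @ vec(A))                 # defining property
assert np.allclose(K_mn, commutation(n, m).T)               # K_mn = K_nm^T
assert np.allclose(K_mn @ commutation(n, m), np.eye(m * n)) # K_mn K_nm = I

# K_pm (A ⊗ B) K_nq = B ⊗ A
assert np.allclose(commutation(p, m) @ np.kron(A, B) @ commutation(n, q),
                   np.kron(B, A))
```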

Vector-over-matrix and matrix-over-vector derivatives follow the same matrix-over-matrix procedure; the only simplification is that $\operatorname{vec}$ of a vector is the vector itself.

Worked examples

1. Vector over vector: $\mathbf{y} = \mathbf{A}\mathbf{x}$; find $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$.

Solution: $\mathrm{d}\mathbf{y} = \mathbf{A}\,\mathrm{d}\mathbf{x}$, so $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \mathbf{A}^T$.

2. Matrix over matrix: $\mathbf{Y} = \mathbf{A}\mathbf{X}$; find $\frac{\partial \mathbf{Y}}{\partial \mathbf{X}}$.

Solution: $\mathrm{d}\mathbf{Y} = \mathbf{A}\,\mathrm{d}\mathbf{X}$; vectorizing gives:

$$\operatorname{vec}(\mathrm{d}\mathbf{Y}) = \operatorname{vec}(\mathbf{A}\,\mathrm{d}\mathbf{X}) = (\mathbf{I} \otimes \mathbf{A})\operatorname{vec}(\mathrm{d}\mathbf{X})$$

So $\frac{\partial \mathbf{Y}}{\partial \mathbf{X}} = \mathbf{I} \otimes \mathbf{A}^T$.

3. Second derivative: $f = \log|\mathbf{X}|$ with $\mathbf{X}$ of dimension $n \times n$; find $\nabla_{\mathbf{X}} f$ and $\nabla^2_{\mathbf{X}} f$.

Solution: from the determinant rule, $\nabla_{\mathbf{X}} f = (\mathbf{X}^{-1})^T$. Taking differentials of both sides and vectorizing:

$$\begin{aligned} \operatorname{vec}(\mathrm{d}\nabla_{\mathbf{X}} f) &= \operatorname{vec}(\mathrm{d}(\mathbf{X}^{-1})^T) \\ &= -\operatorname{vec}([\mathbf{X}^{-1}(\mathrm{d}\mathbf{X})\mathbf{X}^{-1}]^T) \\ &= -\mathbf{K}_{nn}\operatorname{vec}(\mathbf{X}^{-1}(\mathrm{d}\mathbf{X})\mathbf{X}^{-1}) \\ &= -\mathbf{K}_{nn}((\mathbf{X}^{-1})^T \otimes \mathbf{X}^{-1})\operatorname{vec}(\mathrm{d}\mathbf{X}) \end{aligned}$$

Therefore $\nabla^2_{\mathbf{X}} f = -\mathbf{K}_{nn}((\mathbf{X}^{-1})^T \otimes \mathbf{X}^{-1})$, which is a symmetric matrix. When $\mathbf{X}$ is symmetric, this acts on symmetric perturbations as $-\mathbf{X}^{-1} \otimes \mathbf{X}^{-1}$, since then $\mathbf{K}_{nn}\operatorname{vec}(\mathrm{d}\mathbf{X}) = \operatorname{vec}(\mathrm{d}\mathbf{X}^T) = \operatorname{vec}(\mathrm{d}\mathbf{X})$.
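The Hessian formula and its symmetry can be verified by differentiating the gradient $(\mathbf{X}^{-1})^T$ numerically, entry by entry (a NumPy sketch; the `vec`/`commutation` helpers are my own constructions following the definitions above):

```python
import numpy as np

def vec(M):
    return M.reshape(-1, order="F")

def commutation(m, n):
    # K_{mn}: vec(A^T) = K_{mn} vec(A) for A of shape m x n
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[j + i * n, i + j * m] = 1.0
    return K

rng = np.random.default_rng(8)
n = 3
X = rng.standard_normal((n, n)) + 3 * np.eye(n)  # well-conditioned test point
Xi = np.linalg.inv(X)

hess = -commutation(n, n) @ np.kron(Xi.T, Xi)    # the derived Hessian

# Numerical derivative of vec(∇f) = vec(X^{-T}) with respect to each X[i, j];
# column index i + j*n matches column-major vec ordering
eps = 1e-6
num = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = eps
        dgrad = (np.linalg.inv(X + E).T - np.linalg.inv(X - E).T) / (2 * eps)
        num[:, i + j * n] = vec(dgrad)

assert np.allclose(hess, hess.T)        # the Hessian is indeed symmetric
assert np.allclose(hess, num, atol=1e-6)
```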

4. Elementwise function: $\mathbf{F} = \mathbf{A}\exp(\mathbf{X}\mathbf{B})$, with dimensions $\mathbf{A}_{l \times m}$, $\mathbf{X}_{m \times n}$, $\mathbf{B}_{n \times p}$; find $\frac{\partial \mathbf{F}}{\partial \mathbf{X}}$.

Solution: taking differentials of both sides and vectorizing:

$$\begin{aligned} \operatorname{vec}(\mathrm{d}\mathbf{F}) &= \operatorname{vec}(\mathbf{A}\,\mathrm{d}\exp(\mathbf{X}\mathbf{B})) \\ &= \operatorname{vec}(\mathbf{A}[\exp(\mathbf{X}\mathbf{B}) \odot \mathrm{d}(\mathbf{X}\mathbf{B})]) \\ &= (\mathbf{I}_p \otimes \mathbf{A})\operatorname{vec}(\exp(\mathbf{X}\mathbf{B}) \odot \mathrm{d}(\mathbf{X}\mathbf{B})) \\ &= (\mathbf{I}_p \otimes \mathbf{A})\operatorname{diag}(\exp(\mathbf{X}\mathbf{B}))\operatorname{vec}(\mathrm{d}(\mathbf{X}\mathbf{B})) \\ &= (\mathbf{I}_p \otimes \mathbf{A})\operatorname{diag}(\exp(\mathbf{X}\mathbf{B}))(\mathbf{B}^T \otimes \mathbf{I}_m)\operatorname{vec}(\mathrm{d}\mathbf{X}) \end{aligned}$$

Therefore

$$\frac{\partial \mathbf{F}}{\partial \mathbf{X}} = (\mathbf{B} \otimes \mathbf{I}_m)\operatorname{diag}(\exp(\mathbf{X}\mathbf{B}))(\mathbf{I}_p \otimes \mathbf{A}^T)$$
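As a final check, this formula can be compared against a numerical Jacobian of $\operatorname{vec}(\mathbf{F})$ with respect to $\operatorname{vec}(\mathbf{X})$, remembering that the denominator-layout derivative is the transpose of that Jacobian (a NumPy sketch with arbitrary small dimensions):

```python
import numpy as np

def vec(M):
    return M.reshape(-1, order="F")

rng = np.random.default_rng(9)
l_, m, n, p = 2, 3, 4, 2
A = rng.standard_normal((l_, m))
X = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

def F(M):
    return A @ np.exp(M @ B)  # F = A exp(XB), exp elementwise

# ∂F/∂X = (B ⊗ I_m) diag(vec(exp(XB))) (I_p ⊗ A^T), of shape mn x lp
dFdX = (np.kron(B, np.eye(m))
        @ np.diag(vec(np.exp(X @ B)))
        @ np.kron(np.eye(p), A.T))

# Numerical Jacobian J = d vec(F) / d vec(X) in numerator layout (lp x mn)
eps = 1e-6
J = np.zeros((l_ * p, m * n))
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n)); E[i, j] = eps
        J[:, i + j * m] = vec((F(X + E) - F(X - E)) / (2 * eps))

assert dFdX.shape == (m * n, l_ * p)
assert np.allclose(dFdX, J.T, atol=1e-5)
```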

The chain rule and more worked examples are covered in the next installment, Ref 2.


