Matrix Differentiation (Part 1)
Standard Gradient Formulas
Scalar argument
$$Df(x) = \lim_{t\to 0} \frac{f(x+t)-f(x)}{t}$$
Vector argument
$$D_{\mathbf{w}}f(\mathbf{x}) = \lim_{t\to 0} \frac{f(\mathbf{x} + t\mathbf{w}) - f(\mathbf{x})}{t}$$
Matrix argument
$$D_{\mathbf{W}}f(\mathbf{X}) = \lim_{t\to 0} \frac{f(\mathbf{X}+t\mathbf{W})-f(\mathbf{X})}{t}$$
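The matrix-argument formula can be approximated numerically by choosing a small $t$. The sketch below is only an illustration: the helper name `directional_derivative`, the example function $f(\mathbf{X}) = \sum_{ij} x_{ij}^2$, and the use of a central difference (rather than the one-sided quotient in the definition) are assumptions made here, not part of the original text. For a differentiable $f$ both quotients have the same limit.

```python
import numpy as np

def directional_derivative(f, X, W, t=1e-6):
    """Approximate D_W f(X) with the central difference
    (f(X + tW) - f(X - tW)) / (2t)."""
    return (f(X + t * W) - f(X - t * W)) / (2.0 * t)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 2))   # point at which we differentiate
W = rng.standard_normal((3, 2))   # direction, same shape as X

# Example function (an assumption for illustration): sum of squared entries.
f = lambda M: np.sum(M ** 2)

approx = directional_derivative(f, X, W)
exact = 2.0 * np.sum(X * W)       # analytic directional derivative of this f

print(approx, exact)
```

The two printed values should agree up to floating-point rounding.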
Properties of the Matrix Trace
Property 1
$$\operatorname{tr}(A) = \operatorname{tr}(A^{T})$$
Property 2
$$\operatorname{tr}(AB) = \operatorname{tr}(BA)$$
$$\operatorname{tr}(ABCD) = \operatorname{tr}(DABC) = \operatorname{tr}(CDAB) = \operatorname{tr}(BCDA)$$
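A quick numerical check of the cyclic property, as a sketch only. The matrix shapes below are arbitrary choices made for this example; they are picked so that every product in the chain is defined and square.

```python
import numpy as np

rng = np.random.default_rng(0)
# Shapes chosen so that ABCD, DABC, CDAB and BCDA are all defined.
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))
D = rng.standard_normal((5, 2))

t_abcd = np.trace(A @ B @ C @ D)   # trace of a 2 x 2 product
t_dabc = np.trace(D @ A @ B @ C)   # trace of a 5 x 5 product
t_cdab = np.trace(C @ D @ A @ B)   # trace of a 4 x 4 product
t_bcda = np.trace(B @ C @ D @ A)   # trace of a 3 x 3 product

print(t_abcd, t_dabc, t_cdab, t_bcda)
```

The four products have different sizes, yet their traces coincide up to rounding.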
Property 3
$$\operatorname{tr}(A+B) = \operatorname{tr}(A) + \operatorname{tr}(B)$$
Property 4
$$\operatorname{tr}(\alpha A) = \alpha \operatorname{tr}(A)$$
Property 5
Let $H$ and $U$ both be $n \times m$ matrices. Then:
$$\sum_{j=1}^{m} \sum_{i=1}^{n} h_{ij}u_{ij} = \sum_{j=1}^{m} \sum_{i=1}^{n} (h^{T})_{ji}u_{ij} = \operatorname{tr}(H^{T}U)$$
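Property 5 says that the sum of elementwise products equals $\operatorname{tr}(H^{T}U)$. A minimal numerical sketch, with shapes and random entries chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
H = rng.standard_normal((n, m))
U = rng.standard_normal((n, m))

elementwise_sum = np.sum(H * U)    # sum over i, j of h_ij * u_ij
trace_form = np.trace(H.T @ U)     # tr(H^T U), an m x m product

print(elementwise_sum, trace_form)
```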
Properties of Matrix Differentiation
Let $f$ be a scalar-valued function of an $m \times n$ matrix $A$, written $f(A)$. The derivative of $f(A)$ with respect to $A$ is:
$$\nabla_{A}f(A) = \frac{\partial f(A)}{\partial A} = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}} & \cdots & \frac{\partial f}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \frac{\partial f}{\partial A_{m2}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$
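The definition can be mirrored in code by perturbing one entry at a time. This is a sketch under assumptions: `numerical_gradient` is a hypothetical helper written for this example, it uses central differences, and the test function $f(A) = \sum_{ij} A_{ij}^2$ (whose gradient is $2A$) is chosen only for illustration.

```python
import numpy as np

def numerical_gradient(f, A, t=1e-6):
    """Return the matrix whose (i, j) entry approximates df/dA_ij."""
    grad = np.zeros(A.shape)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros(A.shape)
            E[i, j] = 1.0          # perturb only the (i, j) entry
            grad[i, j] = (f(A + t * E) - f(A - t * E)) / (2.0 * t)
    return grad

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
f = lambda M: np.sum(M ** 2)       # example function, gradient is 2A

G = numerical_gradient(f, A)
print(G)
```

The result has the same shape as $A$, matching the layout of the matrix of partial derivatives above.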
Property 1
$$\nabla_{A^{T}}f(A) = (\nabla_{A}f(A))^{T}$$
Proof
$$\nabla_{A^{T}}f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{21}} & \cdots & \frac{\partial f}{\partial A_{m1}} \\ \frac{\partial f}{\partial A_{12}} & \frac{\partial f}{\partial A_{22}} & \cdots & \frac{\partial f}{\partial A_{m2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{1n}} & \frac{\partial f}{\partial A_{2n}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix} = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}} & \cdots & \frac{\partial f}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \frac{\partial f}{\partial A_{m2}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}^{T} = \left(\frac{\partial f(A)}{\partial A}\right)^{T} = (\nabla_{A}f(A))^{T}$$
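Property 1 can also be checked numerically by treating the same function as a function of $B = A^{T}$. The following is a sketch: `numerical_gradient` is a hypothetical central-difference helper, and the weighted-square function `f` with fixed weights `K` is an arbitrary example, not something taken from the text.

```python
import numpy as np

def numerical_gradient(f, A, t=1e-6):
    """Matrix of central-difference partial derivatives of scalar f at A."""
    grad = np.zeros(A.shape)
    for idx in np.ndindex(*A.shape):
        E = np.zeros(A.shape)
        E[idx] = 1.0
        grad[idx] = (f(A + t * E) - f(A - t * E)) / (2.0 * t)
    return grad

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
K = rng.standard_normal((2, 3))        # fixed weights (illustrative)

f = lambda M: np.sum(K * M ** 2)       # f as a function of A
g = lambda B: f(B.T)                   # the same f as a function of B = A^T

grad_wrt_A = numerical_gradient(f, A)       # shape (2, 3)
grad_wrt_AT = numerical_gradient(g, A.T)    # shape (3, 2)

print(np.allclose(grad_wrt_AT, grad_wrt_A.T, atol=1e-6))
```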
Property 2
Suppose there exists a matrix $U$ such that the following identity holds for every direction $\mathbf{W}$:
$$D_{\mathbf{W}}f(\mathbf{X}) = \lim_{t\to 0} \frac{f(\mathbf{X}+t\mathbf{W})-f(\mathbf{X})}{t} = \operatorname{tr}(W^{T}U)$$
This can hold, for example, when $f(\mathbf{X})$ is a trace expression.
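Because the identity holds for arbitrary $W$, it holds in particular when $W$ is the matrix $E_{ij}$ whose $(i,j)$ entry is 1 and whose other entries are all 0. The directional derivative along $E_{ij}$ is the partial derivative with respect to the single entry $x_{ij}$, so by Property 5 of the trace:
$$\frac{\partial f(\mathbf{X})}{\partial x_{ij}} = D_{E_{ij}}f(\mathbf{X}) = \operatorname{tr}(E_{ij}^{T}U) = \sum_{l} \sum_{k} (E_{ij})_{kl}\,u_{kl} = u_{ij}$$
This is just the usual notion of a partial derivative: only the $(i,j)$ entry varies and every other entry is treated as a constant, so every term of the sum except the one containing $u_{ij}$ vanishes, and only $u_{ij}$ remains.
Differentiating with respect to a single entry $x_{ij}$ gives $u_{ij}$, so doing this for every $(i,j)$ and collecting the results into a matrix gives exactly $U$. Hence the derivative of $f(\mathbf{X})$ with respect to $\mathbf{X}$ equals $U$: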
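$$\frac{\partial f(\mathbf{X})}{\partial \mathbf{X}} = \mathbf{U}$$
The value of this identity is that a problem of the form "given a function $f(\mathbf{X})$ of a matrix $\mathbf{X}$, find its derivative with respect to $\mathbf{X}$" can immediately be turned into the problem of finding $U$, and $U$ can be found from the standard derivative formula above. To summarize the steps: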
- Compute $\lim_{t\to 0} \frac{f(\mathbf{X}+t\mathbf{W})-f(\mathbf{X})}{t}$ and simplify it until it has the form $\operatorname{tr}(W^{T}Q)$.
- Since $\frac{\partial f(\mathbf{X})}{\partial \mathbf{X}} = \mathbf{U}$ means $\operatorname{tr}(W^{T}Q) = \operatorname{tr}(W^{T}U)$ for every $W$, conclude that $\frac{\partial f(\mathbf{X})}{\partial \mathbf{X}} = U = Q$.
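As a worked illustration of these two steps, consider $f(X) = \operatorname{tr}(X^{T}X)$. This example is not from the original text; it is chosen here only because the algebra is short.

```latex
\[
\begin{aligned}
D_{W}f(X)
  &= \lim_{t\to 0} \frac{\operatorname{tr}\big((X+tW)^{T}(X+tW)\big) - \operatorname{tr}(X^{T}X)}{t} \\
  &= \lim_{t\to 0} \frac{t\,\operatorname{tr}(W^{T}X) + t\,\operatorname{tr}(X^{T}W) + t^{2}\operatorname{tr}(W^{T}W)}{t} \\
  &= \operatorname{tr}(W^{T}X) + \operatorname{tr}(X^{T}W) \\
  &= 2\operatorname{tr}(W^{T}X) \qquad \text{since } \operatorname{tr}(X^{T}W) = \operatorname{tr}\big((X^{T}W)^{T}\big) = \operatorname{tr}(W^{T}X) \\
  &= \operatorname{tr}\big(W^{T}(2X)\big)
\end{aligned}
\]
```

Step 1 therefore yields $Q = 2X$, and step 2 gives $\frac{\partial \operatorname{tr}(X^{T}X)}{\partial X} = 2X$.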
Property 3
$$\frac{\partial \operatorname{tr}(AX)}{\partial X} = A^{T}$$
Proof
Let:
$$f(X) = \operatorname{tr}(AX)$$
By the conclusion above, it suffices to simplify the following limit in order to obtain $\frac{\partial \operatorname{tr}(AX)}{\partial X}$:
$$\begin{aligned} D_{W}f(X) &= \lim_{t\to 0} \frac{f(X+tW)-f(X)}{t} \\ &= \lim_{t\to 0} \frac{\operatorname{tr}(A(X + tW)) - \operatorname{tr}(AX)}{t} \\ &= \lim_{t\to 0} \frac{\operatorname{tr}(AX + tAW) - \operatorname{tr}(AX)}{t} \\ &= \lim_{t\to 0} \frac{\operatorname{tr}(AX) + \operatorname{tr}(tAW) - \operatorname{tr}(AX)}{t} \\ &= \lim_{t\to 0} \frac{\operatorname{tr}(tAW)}{t} \\ &= \lim_{t\to 0} \frac{t\operatorname{tr}(AW)}{t} \\ &= \lim_{t\to 0} \operatorname{tr}(AW) \\ &= \operatorname{tr}(AW) \\ &= \operatorname{tr}((AW)^{T}) \\ &= \operatorname{tr}(W^{T}A^{T}) \end{aligned}$$
Therefore:
$$D_{W}f(X) = \operatorname{tr}(W^{T}A^{T}) = \operatorname{tr}(W^{T}U)$$
$$U = A^{T}$$
This proves:
$$\frac{\partial \operatorname{tr}(AX)}{\partial X} = U = A^{T}$$
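The result can be sanity-checked numerically. In this sketch, `numerical_gradient` is a hypothetical central-difference helper and the shapes of `A` and `X` are arbitrary choices that make $AX$ square.

```python
import numpy as np

def numerical_gradient(f, X, t=1e-6):
    """Matrix of central-difference partial derivatives of scalar f at X."""
    grad = np.zeros(X.shape)
    for idx in np.ndindex(*X.shape):
        E = np.zeros(X.shape)
        E[idx] = 1.0
        grad[idx] = (f(X + t * E) - f(X - t * E)) / (2.0 * t)
    return grad

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 2))
W = rng.standard_normal((3, 2))    # an arbitrary direction

f = lambda M: np.trace(A @ M)

G = numerical_gradient(f, X)       # should match A^T, shape (3, 2)

# The last two steps of the proof: tr(AW) = tr(W^T A^T).
lhs = np.trace(A @ W)
rhs = np.trace(W.T @ A.T)

print(np.allclose(G, A.T, atol=1e-6), np.isclose(lhs, rhs))
```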
Property 4
$$\frac{\partial \operatorname{tr}(X^{T}A^{T})}{\partial X} = A^{T}$$
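The text states Property 4 without proof. One short way to see it, sketched here using only Property 1 of the trace and Property 3 above:

```latex
\[
\operatorname{tr}(X^{T}A^{T}) = \operatorname{tr}\big((AX)^{T}\big) = \operatorname{tr}(AX)
\quad\Longrightarrow\quad
\frac{\partial \operatorname{tr}(X^{T}A^{T})}{\partial X}
  = \frac{\partial \operatorname{tr}(AX)}{\partial X}
  = A^{T}
\]
```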
Property 5
$$\nabla_{X}\operatorname{tr}(X) = \operatorname{tr}(\nabla_{X}X)$$
Property 6
$$\nabla_{X}\operatorname{tr}(AXBX^{T}C) = A^{T}C^{T}XB^{T} + CAXB$$
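The text gives Property 6 without a derivation. A sketch of one, following the same limit-then-trace recipe used for Property 3 (the intermediate steps are filled in here and are not from the original):

```latex
\[
\begin{aligned}
f(X+tW) &= \operatorname{tr}\big(A(X+tW)B(X+tW)^{T}C\big) \\
        &= \operatorname{tr}(AXBX^{T}C) + t\operatorname{tr}(AWBX^{T}C)
           + t\operatorname{tr}(AXBW^{T}C) + t^{2}\operatorname{tr}(AWBW^{T}C) \\[4pt]
D_{W}f(X) &= \operatorname{tr}(AWBX^{T}C) + \operatorname{tr}(AXBW^{T}C) \\[4pt]
\operatorname{tr}(AWBX^{T}C) &= \operatorname{tr}(WBX^{T}CA)
   = \operatorname{tr}\big((WBX^{T}CA)^{T}\big)
   = \operatorname{tr}(A^{T}C^{T}XB^{T}W^{T})
   = \operatorname{tr}(W^{T}A^{T}C^{T}XB^{T}) \\
\operatorname{tr}(AXBW^{T}C) &= \operatorname{tr}(W^{T}CAXB) \\[4pt]
D_{W}f(X) &= \operatorname{tr}\big(W^{T}(A^{T}C^{T}XB^{T} + CAXB)\big)
\end{aligned}
\]
```

Reading off the matrix multiplying $W^{T}$ gives $\nabla_{X}\operatorname{tr}(AXBX^{T}C) = A^{T}C^{T}XB^{T} + CAXB$.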
Property 7
$$\nabla_{X}\operatorname{tr}(XBX^{T}C) = C^{T}XB^{T} + CXB$$
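Properties 6 and 7 can be checked against finite differences. This is a sketch under assumptions: `numerical_gradient` is a hypothetical central-difference helper, and the dimensions `m`, `n`, `p` and the random matrices are arbitrary choices that make the products well defined.

```python
import numpy as np

def numerical_gradient(f, X, t=1e-6):
    """Matrix of central-difference partial derivatives of scalar f at X."""
    grad = np.zeros(X.shape)
    for idx in np.ndindex(*X.shape):
        E = np.zeros(X.shape)
        E[idx] = 1.0
        grad[idx] = (f(X + t * E) - f(X - t * E)) / (2.0 * t)
    return grad

rng = np.random.default_rng(0)
m, n, p = 3, 2, 4
X = rng.standard_normal((m, n))
B = rng.standard_normal((n, n))

# Property 6: f(X) = tr(A X B X^T C) with A of shape (p, m), C of shape (m, p).
A = rng.standard_normal((p, m))
C = rng.standard_normal((m, p))
f6 = lambda M: np.trace(A @ M @ B @ M.T @ C)
numeric6 = numerical_gradient(f6, X)
formula6 = A.T @ C.T @ X @ B.T + C @ A @ X @ B

# Property 7: f(X) = tr(X B X^T C7) with a square C7 of shape (m, m).
C7 = rng.standard_normal((m, m))
f7 = lambda M: np.trace(M @ B @ M.T @ C7)
numeric7 = numerical_gradient(f7, X)
formula7 = C7.T @ X @ B.T + C7 @ X @ B

print(np.allclose(numeric6, formula6, atol=1e-5),
      np.allclose(numeric7, formula7, atol=1e-5))
```

Property 7 is the special case of Property 6 in which $A$ is the identity matrix, which is why the two checks share the same structure.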