矩阵求导的技术,在统计学、控制论、机器学习等领域有广泛的应用。本文来做个科普,分作两篇,上篇讲标量对矩阵的求导术,下篇讲矩阵对矩阵的求导术。本文使用小写字母xxx表示标量,粗体小写字母x\boldsymbol{x}x表示(列)向量,大写字母XXX表示矩阵。
首先来琢磨一下定义,标量fff对矩阵XXX的导数,定义为∂f∂X=[∂f∂Xij]\frac{\partial f}{\partial X}=\left [ \frac{\partial f}{\partial X_{ij}} \right ]∂X∂f=[∂Xij∂f],即fff对XXX逐元素求导排成与XXX尺寸相同的矩阵。然而,这个定义在计算中并不好用,实用上的原因是对函数较复杂的情形难以逐元素求导;哲理上的原因是逐元素求导破坏了整体性。试想,为何要将fff看做矩阵XXX而不是各元素XijX_{ij}Xij的函数呢?答案是用矩阵运算更整洁。所以在求导时不宜拆开矩阵,而是要找一个从整体出发的算法。
为此,我们来回顾,一元微积分中的导数(标量对标量的导数)与微分有联系:df=f′(x)dxdf={f}'(x)dxdf=f′(x)dx;多元微积分中的梯度(标量对向量的导数)也与微分有联系:df=∑i=1n∂f∂xidxi=∂fT∂xdxd f=\sum_{i=1}^n \frac{\partial f}{\partial x_i}dx_i=\frac{\partial f^T}{\partial \boldsymbol x}d\boldsymbol xdf=∑i=1n∂xi∂fdxi=∂x∂fTdx,这里第一个等号是全微分公式,第二个等号表达了梯度与微分的联系:全微分dfdfdf是梯度向量∂f∂x(n×1)\frac{\partial f}{\partial \boldsymbol x}(n\times 1)∂x∂f(n×1)与微分向量dx(x×1)d \boldsymbol x(x \times 1)dx(x×1)的内积;受此启发,我们将矩阵导数与微分建立联系:df=∑i=1m∑j=1n∂f∂XijdXij=tr(∂fT∂XdX)df=\sum_{i=1}^m\sum_{j=1}^n\frac{\partial f}{\partial X_{ij}} dX_{ij}=tr\left ( \frac{\partial f^T}{\partial X} dX \right )df=∑i=1m∑j=1n∂Xij∂fdXij=tr(∂X∂fTdX)。其中tr代表迹(trace)是方阵对角线元素之和,满足性质:对尺寸相同的矩阵A,B,tr(ATB)=∑i,jAijBijtr\left ( A^TB \right )=\sum_{i,j}A_{ij}B_{ij}tr(ATB)=∑i,jAijBij,即tr(ATB)tr\left ( A^TB \right )tr(ATB)是矩阵A,B的内积。与梯度相似,这里第一个等号是全微分公式,第二个等号表达了矩阵导数与微分的联系:全微分dfdfdf是导数∂f∂X(m×n)\frac{\partial f}{\partial X}(m\times n)∂X∂f(m×n)与微分矩阵dX(m×n)dX(m\times n)dX(m×n)的内积。
然后来建立运算法则。回想遇到较复杂的一元函数如f=log(2+sin x)exf=log(2+sin \ x)e^{\sqrt{x}}f=log(2+sin x)ex,我们是如何求导的呢?通常不是从定义开始求极限,而是先建立了初等函数求导和四则运算、复合等法则,再来运用这些法则。故而,我们来创立常用的矩阵微分的运算法则:
1.
加减法:d(X±Y)=dX±dYd(X\pm Y)=dX\pm dYd(X±Y)=dX±dY
矩阵乘法:d(XY)=(dX)Y+X(dY)d(XY)=(dX)Y+X(dY)d(XY)=(dX)Y+X(dY)
转置:d(XT)=(dX)Td(X^T)=(dX)^Td(XT)=(dX)T
迹:dtr(X)=tr(dX)dtr(X)=tr(dX)dtr(X)=tr(dX)
2.
逆:dX−1=−X−1dXX−1dX^{-1}=-X^{-1}dXX^{-1}dX−1=−X−1dXX−1。此式可在XX−1=IXX^{-1}=IXX−1=I两侧求微分来证明。
3.
行列式:d∣X∣=tr(X#dX)d\left | X \right |=tr(X^{\#}dX)d∣X∣=tr(X#dX),其中X#X^{\#}X#表示XXX的伴随矩阵,在XXX可逆时又可以写作d∣X∣=∣X∣tr(X−1dX)d\left | X \right |=\left | X \right |tr\left ( X^{-1}dX \right )d∣X∣=∣X∣tr(X−1dX)。此式可用Laplace展开来证明,详见张贤达《矩阵分析与应用》第279页。
4.
逐元素乘法:d(X⊙Y)=dX⊙Y+X⊙dYd(X\odot Y)=dX \odot Y + X \odot dYd(X⊙Y)=dX⊙Y+X⊙dY,⊙\odot⊙表示尺寸相同的矩阵X,Y逐元素相乘。
5.
逐元素函数:dσ(X)=σ′(X)⊙dX, σ(X)=[σ(Xij)]d \sigma (X)= \sigma'(X) \odot dX,\ \sigma (X)=\left [ \sigma (X_{ij}) \right ]dσ(X)=σ′(X)⊙dX, σ(X)=[σ(Xij)],是逐元素标量函数运算,σ′(X)=[σ′(Xij)]{\sigma }'(X)=\left [ {\sigma }'(X_{ij}) \right ]σ′(X)=[σ′(Xij)]是逐元素求导数。举个例子:
X=[x11x12x21x22], dsin(X)=[cos x11dx11cos x12dx12cos x21dx21cos x22dx22]=cos(X)⊙dXX=\begin{bmatrix} x_{11} & x_{12}\\ x_{21} & x_{22} \end{bmatrix},\ d\sin(X)=\begin{bmatrix} cos \ x_{11}dx_{11} &cos \ x_{12}dx_{12} \\ cos \ x_{21}dx_{21} &cos \ x_{22}dx_{22} \end{bmatrix}=cos(X) \odot dXX=[x11x21x12x22], dsin(X)=[cos x11dx11cos x21dx21cos x12dx12cos x22dx22]=cos(X)⊙dX
我们试图利用矩阵导数与微分的联系df=tr(∂fT∂XdX)df = tr\left ( \frac{\partial f^T}{\partial X} dX\right )df=tr(∂X∂fTdX),在求出左侧的微分dfdfdf后,该如何写成右侧的形式并得到导数呢?这需要一些迹技巧(trace trick):
1.标量套上迹:a=tr(a)a=tr(a)a=tr(a)
2.转置:tr(AT)=tr(A)tr(A^T)=tr(A)tr(AT)=tr(A)
3.线性:tr(A±B)=tr(A)±tr(B)tr(A\pm B)=tr(A)\pm tr(B)tr(A±B)=tr(A)±tr(B)
4.矩阵乘法交换:tr(AB)=tr(BA)tr(AB)=tr(BA)tr(AB)=tr(BA),其中AAA与BTB^TBT尺寸相同。两侧都等于∑ijAijBji\sum_{ij}A_{ij}B_{ji}∑ijAijBji。
5.矩阵乘法/逐元素乘法交换:tr(AT(B⊙C))=tr((A⊙B)TC)tr(A^T(B \odot C))=tr((A \odot B)^TC)tr(AT(B⊙C))=tr((A⊙B)TC)其中尺寸相同。两侧都等于∑i,jAijBijCij\sum_{i,j}A_{ij}B_{ij}C_{ij}∑i,jAijBijCij
观察一下可以断言,若标量函数fff是矩阵XXX经加减乘法、逆、行列式、逐元素函数等运算构成,则使用相应的运算法则对fff求微分,再使用迹技巧给dfdfdf套上迹并将其它项交换至dXdXdX左侧,即能得到导数。
在建立法则的最后,来谈一谈复合:假设已求得∂f∂Y\frac{\partial f}{\partial Y}∂Y∂f,而Y是X的函数,如何求∂f∂X\frac{\partial f}{\partial X}∂X∂f呢?在微积分中有标量求导的链式法则∂f∂x=∂f∂y∂y∂x\frac{\partial f}{\partial x}=\frac{\partial f}{\partial y}\frac{\partial y}{\partial x}∂x∂f=∂y∂f∂x∂y,但这里我们不能沿用链式法则,因为矩阵对矩阵的导数∂Y∂X\frac{\partial Y}{\partial X}∂X∂Y截至目前仍是未定义的。于是我们继续追本溯源,链式法则是从何而来?源头仍然是微分。我们直接从微分入手建立复合法则:先写出df=tr(∂fT∂YdY)df=tr(\frac{\partial f^T}{\partial Y}dY)df=tr(∂Y∂fTdY)再将dYdYdY用dXdXdX表示出来代入,并使用迹技巧将其他项交换至dXdXdX左侧,即可得到∂f∂X\frac{\partial f}{\partial X}∂X∂f。
接下来演示一些算例。特别提醒要依据已经建立的运算法则来计算,不能随意套用微积分中标量导数的结论,比如认为AXAXAX对XXX的导数为AAA,这是没有根据、意义不明的。
例1:f=aTXbf=\boldsymbol a^TX \boldsymbol bf=aTXb,求∂f∂X\frac{\partial f}{\partial X}∂X∂f。其中a\boldsymbol aa是m×1m \times 1m×1列向量,XXX是m×nm \times nm×n矩阵,b\boldsymbol bb是n×1n \times 1n×1列向量, fff是标量。
解:先使用矩阵乘法法则求微分,这里的a,b\boldsymbol a,\boldsymbol ba,b是常量,da=0,db=0d\boldsymbol a=0,d\boldsymbol b=0da=0,db=0,得到:df=aTdXbdf=\boldsymbol a^TdX \boldsymbol bdf=aTdXb,再套上迹并做矩阵乘法交换:df=tr(aTdXb)=tr(baTdX)df=tr(\boldsymbol a^TdX\boldsymbol b)=tr(\boldsymbol b \boldsymbol a^TdX)df=tr(aTdXb)=tr(baTdX),注意这里我们根据tr(AB)=tr(BA)tr(AB)=tr(BA)tr(AB)=tr(BA)交换了aTdX\boldsymbol a^TdXaTdX与b\boldsymbol bb。对照导数与微分的联系df=tr(∂fT∂XdX)df=tr\left ( \frac{\partial f^T}{\partial X} dX \right )df=tr(∂X∂fTdX),得到∂f∂X=(baT)T=abT\frac{\partial f}{\partial X}=(\boldsymbol b \boldsymbol a^T)^T=\boldsymbol a \boldsymbol b^T∂X∂f=(baT)T=abT。
注意:这里不能用∂f∂X=aT∂X∂Xb=?\frac{\partial f}{\partial X}=\boldsymbol a^T\frac{\partial X}{\partial X}\boldsymbol b=?∂X∂f=aT∂X∂Xb=?,导数与乘常数矩阵的交换是不合法则的运算(而微分是合法的)。有些资料在计算矩阵导数时,会略过求微分这一步,这是逻辑上解释不通的。
例2:f=aTexp(Xb)f= \boldsymbol a^Texp(X \boldsymbol b)f=aTexp(Xb),求∂f∂X\frac{\partial f}{\partial X}∂X∂f。其中a\boldsymbol aa是m×1m \times 1m×1列向量,其中XXX是m×nm \times nm×n矩阵,其中b\boldsymbol bb是n×1n \times 1n×1列向量,expexpexp表示逐元素求指数,fff是标量。
解:先使用矩阵乘法、逐元素函数法则求微分:df=aT(exp(Xb))⊙(dXb)df=\boldsymbol a^T(exp(X\boldsymbol b))\odot(dX\boldsymbol b)df=aT(exp(Xb))⊙(dXb),再套上迹并做交换:df=tr(aT(exp(Xb))⊙(dXb)))=tr((a⊙exp(Xb))TdXb)=tr(b(a⊙exp(Xb))TdX)df=tr(\boldsymbol a^T(exp(X\boldsymbol b))\odot(dX\boldsymbol b)))=tr((\boldsymbol a \odot exp(X\boldsymbol b))^TdX\boldsymbol b)=tr(\boldsymbol b(\boldsymbol a \odot exp(X\boldsymbol b))^TdX)df=tr(aT(exp(Xb))⊙(dXb)))=tr((a⊙exp(Xb))TdXb)=tr(b(a⊙exp(Xb))TdX),注意这里我们先根据tr(AT(B⊙C))=tr((A⊙B)TC)tr(A^T(B \odot C))=tr((A \odot B)^T C)tr(AT(B⊙C))=tr((A⊙B)TC)交换了a,exp(Xb)\boldsymbol a,exp(X\boldsymbol b)a,exp(Xb)与dXbdX\boldsymbol bdXb,再根据tr(AB)=tr(BA)tr(AB)=tr(BA)tr(AB)=tr(BA)交换了(a⊙exp(Xb))TdX(\boldsymbol a \odot exp(X\boldsymbol b))^TdX(a⊙exp(Xb))TdX与b\boldsymbol bb。对照导数与微分的联系df=tr(∂fT∂XdX)df=tr\left ( \frac{\partial f^T}{\partial X} dX\right )df=tr(∂X∂fTdX),得到∂f∂X=(b(a⊙exp(Xb))T)T=(a⊙exp(Xb))bT\frac{\partial f}{\partial X}=(\boldsymbol b(\boldsymbol a \odot exp(X\boldsymbol b))^T)^T=(\boldsymbol a \odot exp(X\boldsymbol b))\boldsymbol b^T∂X∂f=(b(a⊙exp(Xb))T)T=(a⊙exp(Xb))bT
例3:f=tr(YTMY),Y=σ(WX)f=tr(Y^TMY),Y=\sigma(WX)f=tr(YTMY),Y=σ(WX)求∂f∂X\frac{\partial f}{\partial X}∂X∂f其中WWW是l×ml \times ml×m矩阵,其中XXX是m×nm \times nm×n矩阵,其中YYY是l×nl \times nl×n矩阵,其中MMM是l×ll \times ll×l矩阵,σ\sigmaσ是逐元素函数,fff是标量。
解:先求∂f∂Y\frac{\partial f}{\partial Y}∂Y∂f,求微分,使用矩阵乘法、转置法则:df=tr((dY)TMY)+tr(YTMdY)=tr(YTMTdY)+tr(YTMdY)=tr(YT(M+MT)dY)df=tr((dY)^TMY)+tr(Y^TMdY)=tr(Y^TM^TdY)+tr(Y^TMdY)=tr(Y^T(M+M^T)dY)df=tr((dY)TMY)+tr(YTMdY)=tr(YTMTdY)+tr(YTMdY)=tr(YT(M+MT)dY),对照导数与微分的联系,得到∂f∂Y=(M+MT)Y=2MY\frac{\partial f}{\partial Y}=(M+M^T)Y=2MY∂Y∂f=(M+MT)Y=2MY,这里是对称矩阵。为求∂f∂X\frac{\partial f}{\partial X}∂X∂f,写出df=tr(∂fT∂YdY)df=tr\left ( \frac{\partial f^T}{\partial Y} dY \right )df=tr(∂Y∂fTdY),再将dYdYdY用dXdXdX表示出来代入,并使用矩阵乘法/逐元素乘法交换:df=tr(∂fT∂Y(σ′(WX)⊙(WdX))=tr((∂f∂Y⊙σ′(WX))TWdX)df=tr\left ( \frac{\partial f^T}{\partial Y}({\sigma}'(WX) \odot (WdX) \right )=tr\left ( {\left ( \frac{\partial f}{\partial Y} \odot {\sigma}'(WX) \right )}^T WdX \right )df=tr(∂Y∂fT(σ′(WX)⊙(WdX))=tr((∂Y∂f⊙σ′(WX))TWdX),对照导数与微分的联系,得到∂f∂X=WT(∂f∂Y⊙σ′(WX))=WT((2Mσ(WXX))⊙σ′(WX))\frac{\partial f}{\partial X}=W^T\left ( \frac{\partial f}{\partial Y} \odot {\sigma}' (WX)\right )=W^T((2M\sigma(WXX)) \odot {\sigma}'(WX))∂X∂f=WT(∂Y∂f⊙σ′(WX))=WT((2Mσ(WXX))⊙σ′(WX))
例4【线性回归】:l=∥Xω−y∥2l=\left \| X \boldsymbol \omega - \boldsymbol y \right \|^2l=∥Xω−y∥2, 求ω\boldsymbol \omegaω的最小二乘估计,即求∂l∂ω\frac{\partial l}{\partial \boldsymbol \omega }∂ω∂l的零点。其中y\boldsymbol yy是m×1m \times 1m×1列向量,XXX是m×nm \times nm×n矩阵,ω\boldsymbol \omegaω是n×1n \times 1n×1列向量,lll是标量。
解:严格来说这是标量对向量的导数,不过可以把向量看做矩阵的特例。先将向量模平方改写成向量与自身的内积:l=(Xω−y)T(Xω−y)l=(X\boldsymbol \omega - \boldsymbol y)^T(X\boldsymbol \omega - \boldsymbol y)l=(Xω−y)T(Xω−y),求微分,使用矩阵乘法、转置等法则:dl=(Xdω)T(Xω−y)+(Xω−y)T(Xdω)=2(Xω−y)TXdωdl=(Xd\boldsymbol \omega )^T(X\boldsymbol \omega - \boldsymbol y)+(X\boldsymbol \omega - \boldsymbol y)^T(Xd\boldsymbol \omega )=2(X\boldsymbol \omega - \boldsymbol y)^TXd\boldsymbol \omegadl=(Xdω)T(Xω−y)+(Xω−y)T(Xdω)=2(Xω−y)TXdω。对照导数与微分的联系dl=∂lT∂ωdωdl=\frac{\partial l^T}{\partial \boldsymbol \omega }d\boldsymbol \omegadl=∂ω∂lTdω,得到∂l∂ω=(2(Xω−y)TX)T=2XT(Xω−y)\frac{\partial l}{\partial \boldsymbol \omega }=(2(X\boldsymbol \omega - \boldsymbol y)^TX)^T=2X^T(X\boldsymbol \omega - \boldsymbol y)∂ω∂l=(2(Xω−y)TX)T=2XT(Xω−y)。∂l∂ω\frac{\partial l}{\partial \boldsymbol \omega }∂ω∂l零点即ω\boldsymbol \omegaω的最小二乘估计为ω=(XTX)−1XTy\boldsymbol \omega =(X^TX)^{-1}X^T\boldsymbol yω=(XTX)−1XTy。
例5【方差的最大似然估计】:样本x1,...,xN∼λ(μ,Σ)\boldsymbol x_1,...,\boldsymbol x_N\sim \lambda (\boldsymbol \mu , \Sigma )x1,...,xN∼λ(μ,Σ),求方差Σ\SigmaΣ的最大似然估计。写成数学式是:l=log∣Σ∣+1N∑i=1N(xi−xˉ)TΣ−1(xi−xˉ)l=log\left | \Sigma \right |+\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar \boldsymbol x)^T\Sigma ^{-1}(\boldsymbol x_i-\bar \boldsymbol x)l=log∣Σ∣+N1∑i=1N(xi−xˉ)TΣ−1(xi−xˉ),求∂l∂Σ\frac{\partial l}{\partial \Sigma }∂Σ∂l的零点。其中xi\boldsymbol x_{i}xi是m×1m \times 1m×1向量,Σ\SigmaΣ是m×mm \times mm×m对称正定矩阵,xˉ=1N∑i=1Nxi\bar \boldsymbol x= \frac{1}{N} \sum_{i=1}^{N}\boldsymbol x_ixˉ=N1∑i=1Nxi是样本均值,lll是标量,log表示自然对数。
解:首先求微分,使用矩阵乘法、行列式、逆等运算法则,第一项是dlog∣Σ∣=∣Σ∣−1d∣Σ∣=tr(∣Σ∣−1dΣ)dlog\left | \Sigma \right |=\left | \Sigma \right |^{-1}d\left | \Sigma \right |=tr(\left | \Sigma \right |^{-1}d\Sigma )dlog∣Σ∣=∣Σ∣−1d∣Σ∣=tr(∣Σ∣−1dΣ),第二项是1N∑i=1N(xi−xˉ)TdΣ−1(xi−xˉ)=−1N∑i=1N(xi−xˉ)TΣ−1dΣΣ−1(xi−xˉ)\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar \boldsymbol x)^Td{\Sigma} ^{-1}(\boldsymbol x_i-\bar \boldsymbol x)= - \frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar \boldsymbol x)^T {\Sigma} ^{-1} d\Sigma{\Sigma} ^{-1}(\boldsymbol x_i-\bar \boldsymbol x)N1∑i=1N(xi−xˉ)TdΣ−1(xi−xˉ)=−N1∑i=1N(xi−xˉ)TΣ−1dΣΣ−1(xi−xˉ)。再给第二项套上迹做交换:tr(1N∑i=1N(xi−xˉ)TΣ−1dΣΣ−1(xi−xˉ))=1N∑i=1Ntr((xi−xˉ)TΣ−1dΣΣ−1(xi−xˉ))=1N∑i=1Ntr(Σ−1(xi−xˉ)(xi−xˉ)TΣ−1dΣ=tr(Σ−1SΣ−1dΣ)tr(\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar \boldsymbol x)^T {\Sigma} ^{-1} d\Sigma{\Sigma} ^{-1}(\boldsymbol x_i-\bar \boldsymbol x))= \frac{1}{N}\sum_{i=1}^N tr((\boldsymbol x_i-\bar \boldsymbol x)^T {\Sigma} ^{-1} d\Sigma{\Sigma} ^{-1}(\boldsymbol x_i-\bar \boldsymbol x))= \frac{1}{N}\sum_{i=1}^N tr({\Sigma} ^{-1} (\boldsymbol x_i-\bar \boldsymbol x)(\boldsymbol x_i-\bar \boldsymbol x)^T {\Sigma} ^{-1} d\Sigma=tr({\Sigma} ^{-1} S{\Sigma} ^{-1} d\Sigma)tr(N1∑i=1N(xi−xˉ)TΣ−1dΣΣ−1(xi−xˉ))=N1∑i=1Ntr((xi−xˉ)TΣ−1dΣΣ−1(xi−xˉ))=N1∑i=1Ntr(Σ−1(xi−xˉ)(xi−xˉ)TΣ−1dΣ=tr(Σ−1SΣ−1dΣ),其中先交换迹与求和,然后将Σ−1(xi−xˉ)\Sigma ^{-1}(\boldsymbol x_i - \bar \boldsymbol x)Σ−1(xi−xˉ)交换到左边,最后再交换迹与求和,并定义S=1N∑i=1N(xi−xˉ)(xi−xˉ)TS=\frac{1}{N} \sum_{i=1}^N(\boldsymbol x_i - \bar \boldsymbol x)(\boldsymbol x_i - \bar \boldsymbol x)^TS=N1∑i=1N(xi−xˉ)(xi−xˉ)T为样本方差矩阵。得到dl=tr((Σ−1−Σ−1SΣ−1)dΣ)dl=tr(({\Sigma} ^{-1} - \Sigma^{-1} S{\Sigma} ^{-1}) d\Sigma)dl=tr((Σ−1−Σ−1SΣ−1)dΣ)。对照导数与微分的联系,有∂l∂Σ=(Σ−1−Σ−1SΣ−1)T\frac{\partial l}{\partial \Sigma}=({\Sigma} ^{-1} - \Sigma^{-1} S{\Sigma} ^{-1})^T∂Σ∂l=(Σ−1−Σ−1SΣ−1)T,其零点即Σ\SigmaΣ的最大似然估计为Σ=S\Sigma = SΣ=S。
例6【多元logistic回归】:l=−yTlog softmax(Wx)l=- \boldsymbol y^Tlog \ softmax(W \boldsymbol x)l=−yTlog softmax(Wx),求∂l∂W\frac{\partial l}{\partial W}∂W∂l。其中y\boldsymbol yy是除一个元素为1外其它元素为0的m×1m \times 1m×1列向量,WWW是m×nm \times nm×n矩阵,x\boldsymbol xx是n×1n \times 1n×1矩阵,lll是标量;log表示自然对数,softmax(a)=exp(a)1Texp(a)softmax( \boldsymbol a)=\frac{exp( \boldsymbol a)}{ \boldsymbol 1^T exp( \boldsymbol a)}softmax(a)=1Texp(a)exp(a),其中exp(a)exp(a)exp(a)表示逐元素求指数, 1\boldsymbol 11代表全1向量。
解1:首先将softmax函数代入并写成dl=−yT(log(exp(Wx))−1log(1Texp(Wx)))=−yTWx+log(1Texp(Wx))dl=- \boldsymbol y^T(log(exp(W \boldsymbol x))- \boldsymbol 1log( \boldsymbol 1^Texp(W \boldsymbol x)))=- \boldsymbol y^TW \boldsymbol x+log( \boldsymbol 1^Texp(W \boldsymbol x))dl=−yT(log(exp(Wx))−1log(1Texp(Wx)))=−yTWx+log(1Texp(Wx)),这里要注意逐元素log满足等式log(u/c)=log(u)−1log(c)log( \boldsymbol u /c)=log( \boldsymbol u)- \boldsymbol 1log(c)log(u/c)=log(u)−1log(c),以及y\boldsymbol yy满足yT1=1\boldsymbol y^T \boldsymbol 1 = 1yT1=1。求微分,使用矩阵乘法、逐元素函数等法则:dl=−yTdWx+1T(exp(Wx)⊙(dWx))1Texp(Wx)dl= - \boldsymbol y^TdW\boldsymbol x + \frac{\boldsymbol 1^T(exp(W\boldsymbol x) \odot (dW\boldsymbol x))}{\boldsymbol 1^Texp(W\boldsymbol x)}dl=−yTdWx+1Texp(Wx)1T(exp(Wx)⊙(dWx))。再套上迹并做交换,注意可化简1T(exp(Wx)⊙(dWx))=exp(Wx)TdWx\boldsymbol 1^T(exp(W\boldsymbol x) \odot (dW\boldsymbol x))=exp(W\boldsymbol x)^TdW\boldsymbol x1T(exp(Wx)⊙(dWx))=exp(Wx)TdWx,这是根据等式1T(u⊙v)=uTv\boldsymbol 1^T(\boldsymbol u \odot \boldsymbol v)=\boldsymbol u^T\boldsymbol v1T(u⊙v)=uTv,故dl=tr(−yTdWx+exp(Wx)TdWx1Texp(Wx))=tr(x(softmax(Wx)−yT)dW)dl=tr\left ( -\boldsymbol y^TdW\boldsymbol x + \frac{exp(W\boldsymbol x)^TdW\boldsymbol x}{1^T exp(W\boldsymbol x)} \right )=tr(\boldsymbol x(softmax(W\boldsymbol x)-\boldsymbol y^T)dW)dl=tr(−yTdWx+1Texp(Wx)exp(Wx)TdWx)=tr(x(softmax(Wx)−yT)dW)。对照导数与微分的联系,得到∂l∂W=(softmax(Wx)−y)xT\frac{\partial l}{\partial W}=(softmax(W\boldsymbol x)-\boldsymbol y)\boldsymbol x^T∂W∂l=(softmax(Wx)−y)xT。
解2:定义a=Wx\boldsymbol a=W\boldsymbol xa=Wx,则l=−yTlog softmax(a)l=-\boldsymbol y^Tlog \ softmax(\boldsymbol a)l=−yTlog softmax(a),先同上求出∂l∂a=softmax(a)−y\frac{\partial l}{\partial \boldsymbol a} = softmax(\boldsymbol a)-\boldsymbol y∂a∂l=softmax(a)−y,再利用复合法则:dl=tr(∂lT∂ada)=tr(∂lT∂adWx)=tr(x∂lT∂adW)dl= tr\left ( \frac{\partial l^T}{\partial \boldsymbol a}d\boldsymbol a \right )=tr\left ( \frac{\partial l^T}{\partial \boldsymbol a}dW\boldsymbol x \right )=tr\left ( \boldsymbol x \frac{\partial l^T}{\partial \boldsymbol a}dW \right )dl=tr(∂a∂lTda)=tr(∂a∂lTdWx)=tr(x∂a∂lTdW),得到∂l∂W=∂l∂axT\frac{\partial l}{\partial W}= \frac{\partial l}{\partial \boldsymbol a}\boldsymbol x^T∂W∂l=∂a∂lxT。
最后一例留给经典的神经网络。神经网络的求导术是学术史上的重要成果,还有个专门的名字叫做BP算法,我相信如今很多人在初次推导BP算法时也会颇费一番脑筋,事实上使用矩阵求导术来推导并不复杂。为简化起见,我们推导二层神经网络的BP算法。
例7【二层神经网络】:l=−yTlog softmax(W2σ(W1x))l=-\boldsymbol y^Tlog \ softmax(W_2 \sigma(W_1\boldsymbol x))l=−yTlog softmax(W2σ(W1x)),求∂l∂W1\frac{\partial l}{\partial W_1}∂W1∂l和∂l∂W2\frac{\partial l}{\partial W_2}∂W2∂l。其中y\boldsymbol yy是除一个元素为1外其它元素为0的的m×1m \times 1m×1列向量,W2W_2W2是m×pm \times pm×p矩阵,W1W_1W1是p×np \times np×n矩阵,x\boldsymbol xx是n×1n \times 1n×1矩阵,lll是标量;log表示自然对数,softmax(a)=exp(a)1Texp(a)softmax(\boldsymbol a)=\frac{exp(\boldsymbol a)}{\boldsymbol 1^Texp(\boldsymbol a)}softmax(a)=1Texp(a)exp(a)同上,σ\sigmaσ是逐元素sigmoid函数。
解:定义a1=W1x,h1=σ(a1),a2=W2h1\boldsymbol a_1=W_1\boldsymbol x,\boldsymbol h_1=\sigma(\boldsymbol a_1),\boldsymbol a_2=W_2\boldsymbol h_1a1=W1x,h1=σ(a1),a2=W2h1,则l=−yTlog softmax(a2)l=-\boldsymbol y^Tlog \ softmax(\boldsymbol a_2)l=−yTlog softmax(a2)。在前例中已求出∂l∂a2=softmax(a2)−y\frac{\partial l}{\partial \boldsymbol a_2}=softmax(\boldsymbol a_2)-\boldsymbol y∂a2∂l=softmax(a2)−y。使用复合法则,dl=tr(∂lT∂a2da2)=tr(∂lT∂a2dW2h1)+tr(∂lT∂a2W2dh1)⎵dl2dl=tr\left ( \frac{\partial l^T}{ \partial \boldsymbol a_2} d\boldsymbol a_2 \right)=tr\left ( \frac{\partial l^T}{ \partial \boldsymbol a_2} dW_2\boldsymbol h_1 \right ) + \underbrace{tr\left ( \frac{\partial l^T}{ \partial \boldsymbol a_2} W_2d\boldsymbol h_1 \right ) }_{dl_2}dl=tr(∂a2∂lTda2)=tr(∂a2∂lTdW2h1)+dl2tr(∂a2∂lTW2dh1),使用矩阵乘法交换的迹技巧从第一项得到∂l∂W2=∂l∂a2h1T\frac{\partial l}{\partial W_2}=\frac{\partial l}{\partial \boldsymbol a_2}h^T_1∂W2∂l=∂a2∂lh1T,从第二项得到∂l∂h1=W2T∂l∂a2\frac{\partial l}{\partial \boldsymbol h_1}=W_2^T\frac{\partial l}{\partial \boldsymbol a_2}∂h1∂l=W2T∂a2∂l。接下来对第二项继续使用复合法则来求∂l∂a1\frac{\partial l}{\partial \boldsymbol a_1}∂a1∂l,并利用矩阵乘法和逐元素乘法交换的迹技巧:dl2=tr(∂lT∂h1dh1)=tr(∂lT∂h1(σ′(a1)⊙da1))=tr((∂l∂h1⊙σ′(a1))Tda1)dl_2=tr\left ( \frac{\partial l^T}{\partial \boldsymbol h_1}d\boldsymbol h_1 \right )=tr\left ( \frac{\partial l^T}{\partial \boldsymbol h_1}(\sigma'(\boldsymbol a_1) \odot d\boldsymbol a_1) \right )=tr\left ( \left ( \frac{\partial l}{\partial \boldsymbol h_1} \odot \sigma'(\boldsymbol a_1)\right )^Td\boldsymbol a_1 \right )dl2=tr(∂h1∂lTdh1)=tr(∂h1∂lT(σ′(a1)⊙da1))=tr((∂h1∂l⊙σ′(a1))Tda1),得到∂l∂a1=∂l∂h1⊙σ′(a1)\frac{\partial l}{\partial \boldsymbol a_1}=\frac{\partial l}{\partial \boldsymbol h_1} \odot \sigma'(\boldsymbol a_1)∂a1∂l=∂h1∂l⊙σ′(a1)为求∂l∂W\frac{\partial l}{\partial W}∂W∂l,再用一次复合法则:dl2=tr(∂lT∂a1da1)=tr(∂lT∂a1dW1x)=tr(x∂lT∂a1dW1)dl_2=tr\left ( \frac{\partial l^T}{\partial \boldsymbol a_1}d\boldsymbol a_1 \right )=tr\left ( \frac{\partial l^T}{\partial \boldsymbol a_1}dW_1\boldsymbol x \right )=tr\left ( \boldsymbol x \frac{\partial l^T}{\partial \boldsymbol a_1}dW_1 \right )dl2=tr(∂a1∂lTda1)=tr(∂a1∂lTdW1x)=tr(x∂a1∂lTdW1),得到∂l∂W1=∂l∂a1xT\frac{\partial l}{\partial W_1}=\frac{\partial l}{\partial \boldsymbol a_1}\boldsymbol x^T∂W1∂l=∂a1∂lxT
推广:样本(x1,y1),....,(xN,yN)(\boldsymbol x_1,\boldsymbol y_1),....,(\boldsymbol x_N,\boldsymbol y_N)(x1,y1),....,(xN,yN),l=−∑i=1NyiTlog softmax(W2σ(W1xi+b1)+b2)l=- \sum_{i=1}^N\boldsymbol y_i^T log \ softmax(W_2 \sigma(W_1\boldsymbol x_i+\boldsymbol b_1)+\boldsymbol b_2)l=−∑i=1NyiTlog softmax(W2σ(W1xi+b1)+b2),b1\boldsymbol b_1b1是p×1p \times 1p×1列向量,b2\boldsymbol b_2b2是m×1m \times 1m×1列向量其余定义同上。
解1:定义a1,i=W1xi+b1,h1,i=σ(a1,i),a2,i=W2h1,i+b2\boldsymbol a_{1,i}=W_1\boldsymbol x_i+\boldsymbol b_1,\boldsymbol h_{1,i}=\sigma(\boldsymbol a_{1,i}),\boldsymbol a_{2,i}=W_2\boldsymbol h_{1,i}+\boldsymbol b_2a1,i=W1xi+b1,h1,i=σ(a1,i),a2,i=W2h1,i+b2,则l=−∑i=1NyiTlog softmax(a2,i)l=-\sum_{i=1}^N\boldsymbol y_i^T log \ softmax(\boldsymbol a_{2,i})l=−∑i=1NyiTlog softmax(a2,i)。先同上可求出∂l∂a2,i=softmax(a2,i)−yi\frac{\partial l}{\partial \boldsymbol a_{2,i}}=softmax(\boldsymbol a_{2,i})-\boldsymbol y_i∂a2,i∂l=softmax(a2,i)−yi。使用复合法则,dl=tr(∑i=1N∂lT∂a2,ida2,i)=tr(∑i=1N∂lT∂a2,idW2h1,i)+tr(∑i=1N∂lT∂a2,iW2dh1,i)⎵dl2+tr(∑i=1N∂lT∂a2,idb2)dl=tr\left ( \sum_{i=1}^N\frac{\partial l^T}{\partial \boldsymbol a_{2,i}} d\boldsymbol a_{2,i}\right )=tr\left ( \sum_{i=1}^N\frac{\partial l^T}{\partial \boldsymbol a_{2,i}} dW_2\boldsymbol h_{1,i} \right ) + \underbrace{tr\left ( \sum_{i=1}^N\frac{\partial l^T}{\partial \boldsymbol a_{2,i}}W_ 2d\boldsymbol h_{1,i} \right ) }_{dl_2} + tr\left ( \sum_{i=1}^N\frac{\partial l^T}{\partial \boldsymbol a_{2,i}} d\boldsymbol b_2 \right )dl=tr(∑i=1N∂a2,i∂lTda2,i)=tr(∑i=1N∂a2,i∂lTdW2h1,i)+dl2tr(i=1∑N∂a2,i∂lTW2dh1,i)+tr(∑i=1N∂a2,i∂lTdb2),从第一项得到得到∂l∂W2=∑i=1N∂l∂a2,ih1,iT\frac{\partial l}{\partial W_2}=\sum_{i=1}^N\frac{\partial l}{\partial \boldsymbol a_{2,i} }\boldsymbol h_{1,i}^T∂W2∂l=∑i=1N∂a2,i∂lh1,iT,从第二项得到∂l∂h1,i=W2T∂l∂a2,i\frac{\partial l}{\partial \boldsymbol h_{1,i}}=W_2^T\frac{\partial l}{\partial \boldsymbol a_{2,i} }∂h1,i∂l=W2T∂a2,i∂l,从第三项得到∂l∂b2=∑i=1N∂l∂a2,i\frac{\partial l}{\partial \boldsymbol b_2}=\sum_{i=1}^N\frac{\partial l}{\partial \boldsymbol a_{2,i} }∂b2∂l=∑i=1N∂a2,i∂l。接下来对第二项继续使用复合法则,得到∂l∂a1,i=∂l∂h1,i⊙σ′(a1,i)\frac{\partial l}{\partial \boldsymbol a_{1,i}}=\frac{\partial l}{\partial \boldsymbol h_{1,i} } \odot \sigma'(\boldsymbol a_{1,i})∂a1,i∂l=∂h1,i∂l⊙σ′(a1,i)。为求∂l∂W1,∂l∂b1\frac{\partial l}{\partial W_1},\frac{\partial l}{\partial \boldsymbol b_1}∂W1∂l,∂b1∂l,再用一次复合法则:dl2=tr(∑i=1N∂lT∂a1,ida1,i)=tr(∑i=1N∂lT∂a1,idW1xi)+tr(∑i=1N∂lT∂a1,idb1)dl_2=tr\left ( \sum_{i=1}^N\frac{\partial l^T}{\partial a_{1,i}} d\boldsymbol a_{1,i}\right )=tr\left ( \sum_{i=1}^N\frac{\partial l^T}{\partial \boldsymbol a_{1,i}} dW_1\boldsymbol x_{i} \right ) + tr\left ( \sum_{i=1}^N\frac{\partial l^T}{\partial \boldsymbol a_{1,i}} d\boldsymbol b_1 \right )dl2=tr(∑i=1N∂a1,i∂lTda1,i)=tr(∑i=1N∂a1,i∂lTdW1xi)+tr(∑i=1N∂a1,i∂lTdb1)得到∂l∂W1=∑i=1N∂l∂a1,ixiT,∂l∂b1=∑i=1N∂l∂a1,i\frac{\partial l}{\partial W_1}= \sum_{i=1}^{N}\frac{\partial l}{\partial \boldsymbol a_{1, i}}x_i^T,\frac{\partial l}{\partial \boldsymbol b_1}=\sum_{i=1}^N \frac{\partial l}{\partial \boldsymbol a_{1, i}}∂W1∂l=∑i=1N∂a1,i∂lxiT,∂b1∂l=∑i=1N∂a1,i∂l
解2:可以用矩阵来表示N个样本,以简化形式。定义X=[x1,...,xN],A1=[a1,1,..,a1,N]=W1X+b11T,H1=[h1,1,...,h1,N]=σ(A1),A2=[a2,1,..,a2,N]=W2H1+b21TX=[\boldsymbol x_1, ...,\boldsymbol x_N],A_1=[\boldsymbol a_{1,1},..,\boldsymbol a_{1,N}]=W_1X+\boldsymbol b_1\boldsymbol 1^T,H_1=[\boldsymbol h_{1,1},...,\boldsymbol h_{1,N}]=\sigma(A_1),A_2=[\boldsymbol a_{2,1},..,\boldsymbol a_{2,N}]=W_2H_1+\boldsymbol b_2\boldsymbol 1^TX=[x1,...,xN],A1=[a1,1,..,a1,N]=W1X+b11T,H1=[h1,1,...,h1,N]=σ(A1),A2=[a2,1,..,a2,N]=W2H1+b21T,注意这里使用全1向量来扩展维度。先同上求出∂l∂A2=[softmax(a2,1)−y1,...,softmax(a2,N)−yN]\frac{\partial l}{\partial A_2}=[softmax(\boldsymbol a_{2,1})-\boldsymbol y_1,...,softmax(\boldsymbol a_{2,N})-\boldsymbol y_N]∂A2∂l=[softmax(a2,1)−y1,...,softmax(a2,N)−yN]。使用复合法则,dl=tr(∂lT∂A2dA2)=tr(∂lT∂A2dW2H1)+tr(∂lT∂A2W2dH1)⎵dl2+tr(∂lT∂A2db21T)dl=tr\left ( \frac{\partial l^T}{\partial A_2} dA_2\right )=tr\left ( \frac{\partial l^T}{\partial A_2} dW_2H_1 \right ) + \underbrace{tr\left ( \frac{\partial l^T}{\partial A_2} W_2dH_1 \right )}_{dl_2} + tr\left ( \frac{\partial l^T}{\partial A_2} d\boldsymbol b_2\boldsymbol 1^T \right )dl=tr(∂A2∂lTdA2)=tr(∂A2∂lTdW2H1)+dl2tr(∂A2∂lTW2dH1)+tr(∂A2∂lTdb21T),从第一项得到∂l∂W2=∂l∂A2H1T\frac{\partial l}{\partial W_2}=\frac{\partial l}{\partial A_2}H_1^T∂W2∂l=∂A2∂lH1T,从第二项得到∂l∂W1=W2T∂l∂A2\frac{\partial l}{\partial W_1}=W_2^T\frac{\partial l}{\partial A_2}∂W1∂l=W2T∂A2∂l,从第三项得到到∂l∂b2=∂l∂A21\frac{\partial l}{\partial \boldsymbol b_2}=\frac{\partial l}{\partial A_2}\boldsymbol 1∂b2∂l=∂A2∂l1。接下来对第二项继续使用复合法则,得到∂l∂A1=∂l∂H1⊙σ′(A1)\frac{\partial l}{\partial A_1}=\frac{\partial l}{\partial H_1} \odot \sigma'(A_1)∂A1∂l=∂H1∂l⊙σ′(A1)。为求∂l∂W1,∂l∂b1\frac{\partial l}{\partial W_1},\frac{\partial l}{\partial \boldsymbol b_1}∂W1∂l,∂b1∂l,再用一次复合法则:dl2=tr(∂lT∂A1dA1)=tr(∂lT∂A1dW1X)+tr(∂lT∂A1db11T)dl_2=tr\left ( \frac{\partial l^T}{\partial A_1}dA_1 \right )=tr\left ( \frac{\partial l^T}{\partial A_1}dW_1X \right )+tr\left ( \frac{\partial l^T}{\partial A_1}d\boldsymbol b_1\boldsymbol 1^T \right )dl2=tr(∂A1∂lTdA1)=tr(∂A1∂lTdW1X)+tr(∂A1∂lTdb11T),得到∂l∂W1=∂l∂A1XT,∂l∂b1=∂l∂A11\frac{\partial l}{\partial W_1}=\frac{\partial l}{\partial A_1}X^T,\frac{\partial l}{\partial \boldsymbol b_1}=\frac{\partial l}{\partial A_1}\boldsymbol 1∂W1∂l=∂A1∂lXT,∂b1∂l=∂A1∂l1