矩阵求导术

矩阵求导的技术,在统计学、控制论、机器学习等领域有广泛的应用。本文来做个科普,分作两篇,上篇讲标量对矩阵的求导术,下篇讲矩阵对矩阵的求导术。本文使用小写字母xxx表示标量,粗体小写字母x\boldsymbol{x}x表示(列)向量,大写字母XXX表示矩阵。

首先来琢磨一下定义,标量fff对矩阵XXX的导数,定义为∂f∂X=[∂f∂Xij]\frac{\partial f}{\partial X}=\left [ \frac{\partial f}{\partial X_{ij}} \right ]Xf=[Xijf],即fffXXX逐元素求导排成与XXX尺寸相同的矩阵。然而,这个定义在计算中并不好用,实用上的原因是对函数较复杂的情形难以逐元素求导;哲理上的原因是逐元素求导破坏了整体性。试想,为何要将fff看做矩阵XXX而不是各元素XijX_{ij}Xij的函数呢?答案是用矩阵运算更整洁。所以在求导时不宜拆开矩阵,而是要找一个从整体出发的算法。

为此,我们来回顾,一元微积分中的导数(标量对标量的导数)与微分有联系:df=f′(x)dxdf={f}'(x)dxdf=f(x)dx;多元微积分中的梯度(标量对向量的导数)也与微分有联系:df=∑i=1n∂f∂xidxi=∂fT∂xdxd f=\sum_{i=1}^n \frac{\partial f}{\partial x_i}dx_i=\frac{\partial f^T}{\partial \boldsymbol x}d\boldsymbol xdf=i=1nxifdxi=xfTdx,这里第一个等号是全微分公式,第二个等号表达了梯度与微分的联系:全微分dfdfdf是梯度向量∂f∂x(n×1)\frac{\partial f}{\partial \boldsymbol x}(n\times 1)xf(n×1)与微分向量dx(x×1)d \boldsymbol x(x \times 1)dx(x×1)的内积;受此启发,我们将矩阵导数与微分建立联系:df=∑i=1m∑j=1n∂f∂XijdXij=tr(∂fT∂XdX)df=\sum_{i=1}^m\sum_{j=1}^n\frac{\partial f}{\partial X_{ij}} dX_{ij}=tr\left ( \frac{\partial f^T}{\partial X} dX \right )df=i=1mj=1nXijfdXij=tr(XfTdX)​​​​​​​。其中tr代表迹(trace)是方阵对角线元素之和,满足性质:对尺寸相同的矩阵A,B,tr(ATB)=∑i,jAijBijtr\left ( A^TB \right )=\sum_{i,j}A_{ij}B_{ij}tr(ATB)=i,jAijBij​​​​​​​,即tr(ATB)tr\left ( A^TB \right )tr(ATB)是矩阵A,B的内积。与梯度相似,这里第一个等号是全微分公式,第二个等号表达了矩阵导数与微分的联系:全微分dfdfdf是导数∂f∂X(m×n)\frac{\partial f}{\partial X}(m\times n)Xf(m×n)与微分矩阵dX(m×n)dX(m\times n)dX(m×n)的内积。

然后来建立运算法则。回想遇到较复杂的一元函数如f=log(2+sin x)exf=log(2+sin \ x)e^{\sqrt{x}}f=log(2+sin x)ex,我们是如何求导的呢?通常不是从定义开始求极限,而是先建立了初等函数求导和四则运算、复合等法则,再来运用这些法则。故而,我们来创立常用的矩阵微分的运算法则:
1.
加减法:d(X±Y)=dX±dYd(X\pm Y)=dX\pm dYd(X±Y)=dX±dY
矩阵乘法:d(XY)=(dX)Y+X(dY)d(XY)=(dX)Y+X(dY)d(XY)=(dX)Y+X(dY)
转置:d(XT)=(dX)Td(X^T)=(dX)^Td(XT)=(dX)T
迹:dtr(X)=tr(dX)dtr(X)=tr(dX)dtr(X)=tr(dX)
2.
逆:dX−1=−X−1dXX−1dX^{-1}=-X^{-1}dXX^{-1}dX1=X1dXX1。此式可在XX−1=IXX^{-1}=IXX1=I两侧求微分来证明。
3.
行列式:d∣X∣=tr(X#dX)d\left | X \right |=tr(X^{\#}dX)dX=tr(X#dX),其中X#X^{\#}X#表示XXX的伴随矩阵,在XXX可逆时又可以写作d∣X∣=∣X∣tr(X−1dX)d\left | X \right |=\left | X \right |tr\left ( X^{-1}dX \right )dX=Xtr(X1dX)。此式可用Laplace展开来证明,详见张贤达《矩阵分析与应用》第279页。
4.
逐元素乘法:d(X⊙Y)=dX⊙Y+X⊙dYd(X\odot Y)=dX \odot Y + X \odot dYd(XY)=dXY+XdY⊙\odot表示尺寸相同的矩阵X,Y逐元素相乘。
5.
逐元素函数:dσ(X)=σ′(X)⊙dX, σ(X)=[σ(Xij)]d \sigma (X)= \sigma'(X) \odot dX,\ \sigma (X)=\left [ \sigma (X_{ij}) \right ]dσ(X)=σ(X)dX, σ(X)=[σ(Xij)],是逐元素标量函数运算,σ′(X)=[σ′(Xij)]{\sigma }'(X)=\left [ {\sigma }'(X_{ij}) \right ]σ(X)=[σ(Xij)]是逐元素求导数。举个例子:
X=[x11x12x21x22], dsin⁡(X)=[cos x11dx11cos x12dx12cos x21dx21cos x22dx22]=cos(X)⊙dXX=\begin{bmatrix} x_{11} & x_{12}\\ x_{21} & x_{22} \end{bmatrix},\ d\sin(X)=\begin{bmatrix} cos \ x_{11}dx_{11} &cos \ x_{12}dx_{12} \\ cos \ x_{21}dx_{21} &cos \ x_{22}dx_{22} \end{bmatrix}=cos(X) \odot dXX=[x11x21x12x22], dsin(X)=[cos x11dx11cos x21dx21cos x12dx12cos x22dx22]=cos(X)dX
我们试图利用矩阵导数与微分的联系df=tr(∂fT∂XdX)df = tr\left ( \frac{\partial f^T}{\partial X} dX\right )df=tr(XfTdX),在求出左侧的微分dfdfdf后,该如何写成右侧的形式并得到导数呢?这需要一些迹技巧(trace trick):
1.标量套上迹:a=tr(a)a=tr(a)a=tr(a)
2.转置:tr(AT)=tr(A)tr(A^T)=tr(A)tr(AT)=tr(A)
3.线性:tr(A±B)=tr(A)±tr(B)tr(A\pm B)=tr(A)\pm tr(B)tr(A±B)=tr(A)±tr(B)
4.矩阵乘法交换:tr(AB)=tr(BA)tr(AB)=tr(BA)tr(AB)=tr(BA),其中AAABTB^TBT尺寸相同。两侧都等于∑ijAijBji\sum_{ij}A_{ij}B_{ji}ijAijBji
5.矩阵乘法/逐元素乘法交换:tr(AT(B⊙C))=tr((A⊙B)TC)tr(A^T(B \odot C))=tr((A \odot B)^TC)tr(AT(BC))=tr((AB)TC)其中尺寸相同。两侧都等于∑i,jAijBijCij\sum_{i,j}A_{ij}B_{ij}C_{ij}i,jAijBijCij

观察一下可以断言,若标量函数fff是矩阵XXX经加减乘法、逆、行列式、逐元素函数等运算构成,则使用相应的运算法则对fff求微分,再使用迹技巧给dfdfdf套上迹并将其它项交换至dXdXdX左侧,即能得到导数。

在建立法则的最后,来谈一谈复合:假设已求得∂f∂Y\frac{\partial f}{\partial Y}Yf,而Y是X的函数,如何求∂f∂X\frac{\partial f}{\partial X}Xf呢?在微积分中有标量求导的链式法则∂f∂x=∂f∂y∂y∂x\frac{\partial f}{\partial x}=\frac{\partial f}{\partial y}\frac{\partial y}{\partial x}xf=yfxy,但这里我们不能沿用链式法则,因为矩阵对矩阵的导数∂Y∂X\frac{\partial Y}{\partial X}XY截至目前仍是未定义的。于是我们继续追本溯源,链式法则是从何而来?源头仍然是微分。我们直接从微分入手建立复合法则:先写出df=tr(∂fT∂YdY)df=tr(\frac{\partial f^T}{\partial Y}dY)df=tr(YfTdY)再将dYdYdYdXdXdX表示出来代入,并使用迹技巧将其他项交换至dXdXdX左侧,即可得到∂f∂X\frac{\partial f}{\partial X}Xf

接下来演示一些算例。特别提醒要依据已经建立的运算法则来计算,不能随意套用微积分中标量导数的结论,比如认为AXAXAXXXX的导数为AAA,这是没有根据、意义不明的。

例1:f=aTXbf=\boldsymbol a^TX \boldsymbol bf=aTXb,求∂f∂X\frac{\partial f}{\partial X}Xf。其中a\boldsymbol aam×1m \times 1m×1列向量,XXXm×nm \times nm×n矩阵,b\boldsymbol bbn×1n \times 1n×1列向量, fff是标量。

解:先使用矩阵乘法法则求微分,这里的a,b\boldsymbol a,\boldsymbol ba,b是常量,da=0,db=0d\boldsymbol a=0,d\boldsymbol b=0da=0,db=0,得到:df=aTdXbdf=\boldsymbol a^TdX \boldsymbol bdf=aTdXb,再套上迹并做矩阵乘法交换:df=tr(aTdXb)=tr(baTdX)df=tr(\boldsymbol a^TdX\boldsymbol b)=tr(\boldsymbol b \boldsymbol a^TdX)df=tr(aTdXb)=tr(baTdX),注意这里我们根据tr(AB)=tr(BA)tr(AB)=tr(BA)tr(AB)=tr(BA)交换了aTdX\boldsymbol a^TdXaTdXb\boldsymbol bb。对照导数与微分的联系df=tr(∂fT∂XdX)df=tr\left ( \frac{\partial f^T}{\partial X} dX \right )df=tr(XfTdX),得到∂f∂X=(baT)T=abT\frac{\partial f}{\partial X}=(\boldsymbol b \boldsymbol a^T)^T=\boldsymbol a \boldsymbol b^TXf=(baT)T=abT

注意:这里不能用∂f∂X=aT∂X∂Xb=?\frac{\partial f}{\partial X}=\boldsymbol a^T\frac{\partial X}{\partial X}\boldsymbol b=?Xf=aTXXb=?,导数与乘常数矩阵的交换是不合法则的运算(而微分是合法的)。有些资料在计算矩阵导数时,会略过求微分这一步,这是逻辑上解释不通的。

例2:f=aTexp(Xb)f= \boldsymbol a^Texp(X \boldsymbol b)f=aTexp(Xb),求∂f∂X\frac{\partial f}{\partial X}Xf。其中a\boldsymbol aam×1m \times 1m×1列向量,其中XXXm×nm \times nm×n矩阵,其中b\boldsymbol bbn×1n \times 1n×1列向量,expexpexp表示逐元素求指数,fff是标量。

解:先使用矩阵乘法、逐元素函数法则求微分:df=aT(exp(Xb))⊙(dXb)df=\boldsymbol a^T(exp(X\boldsymbol b))\odot(dX\boldsymbol b)df=aT(exp(Xb))(dXb),再套上迹并做交换:df=tr(aT(exp(Xb))⊙(dXb)))=tr((a⊙exp(Xb))TdXb)=tr(b(a⊙exp(Xb))TdX)df=tr(\boldsymbol a^T(exp(X\boldsymbol b))\odot(dX\boldsymbol b)))=tr((\boldsymbol a \odot exp(X\boldsymbol b))^TdX\boldsymbol b)=tr(\boldsymbol b(\boldsymbol a \odot exp(X\boldsymbol b))^TdX)df=tr(aT(exp(Xb))(dXb)))=tr((aexp(Xb))TdXb)=tr(b(aexp(Xb))TdX),注意这里我们先根据tr(AT(B⊙C))=tr((A⊙B)TC)tr(A^T(B \odot C))=tr((A \odot B)^T C)tr(AT(BC))=tr((AB)TC)交换了a,exp(Xb)\boldsymbol a,exp(X\boldsymbol b)a,exp(Xb)dXbdX\boldsymbol bdXb,再根据tr(AB)=tr(BA)tr(AB)=tr(BA)tr(AB)=tr(BA)交换了(a⊙exp(Xb))TdX(\boldsymbol a \odot exp(X\boldsymbol b))^TdX(aexp(Xb))TdXb\boldsymbol bb。对照导数与微分的联系df=tr(∂fT∂XdX)df=tr\left ( \frac{\partial f^T}{\partial X} dX\right )df=tr(XfTdX),得到∂f∂X=(b(a⊙exp(Xb))T)T=(a⊙exp(Xb))bT\frac{\partial f}{\partial X}=(\boldsymbol b(\boldsymbol a \odot exp(X\boldsymbol b))^T)^T=(\boldsymbol a \odot exp(X\boldsymbol b))\boldsymbol b^TXf=(b(aexp(Xb))T)T=(aexp(Xb))bT

例3:f=tr(YTMY),Y=σ(WX)f=tr(Y^TMY),Y=\sigma(WX)f=tr(YTMY),Y=σ(WX)∂f∂X\frac{\partial f}{\partial X}Xf其中WWWl×ml \times ml×m矩阵,其中XXXm×nm \times nm×n矩阵,其中YYYl×nl \times nl×n矩阵,其中MMMl×ll \times ll×l矩阵,σ\sigmaσ是逐元素函数,fff是标量。

解:先求∂f∂Y\frac{\partial f}{\partial Y}Yf,求微分,使用矩阵乘法、转置法则:df=tr((dY)TMY)+tr(YTMdY)=tr(YTMTdY)+tr(YTMdY)=tr(YT(M+MT)dY)df=tr((dY)^TMY)+tr(Y^TMdY)=tr(Y^TM^TdY)+tr(Y^TMdY)=tr(Y^T(M+M^T)dY)df=tr((dY)TMY)+tr(YTMdY)=tr(YTMTdY)+tr(YTMdY)=tr(YT(M+MT)dY),对照导数与微分的联系,得到∂f∂Y=(M+MT)Y=2MY\frac{\partial f}{\partial Y}=(M+M^T)Y=2MYYf=(M+MT)Y=2MY,这里是对称矩阵。为求∂f∂X\frac{\partial f}{\partial X}Xf,写出df=tr(∂fT∂YdY)df=tr\left ( \frac{\partial f^T}{\partial Y} dY \right )df=tr(YfTdY),再将dYdYdYdXdXdX表示出来代入,并使用矩阵乘法/逐元素乘法交换:df=tr(∂fT∂Y(σ′(WX)⊙(WdX))=tr((∂f∂Y⊙σ′(WX))TWdX)df=tr\left ( \frac{\partial f^T}{\partial Y}({\sigma}'(WX) \odot (WdX) \right )=tr\left ( {\left ( \frac{\partial f}{\partial Y} \odot {\sigma}'(WX) \right )}^T WdX \right )df=tr(YfT(σ(WX)(WdX))=tr((Yfσ(WX))TWdX),对照导数与微分的联系,得到∂f∂X=WT(∂f∂Y⊙σ′(WX))=WT((2Mσ(WXX))⊙σ′(WX))\frac{\partial f}{\partial X}=W^T\left ( \frac{\partial f}{\partial Y} \odot {\sigma}' (WX)\right )=W^T((2M\sigma(WXX)) \odot {\sigma}'(WX))Xf=WT(Yfσ(WX))=WT((2Mσ(WXX))σ(WX))

例4【线性回归】:l=∥Xω−y∥2l=\left \| X \boldsymbol \omega - \boldsymbol y \right \|^2l=Xωy2, 求ω\boldsymbol \omegaω的最小二乘估计,即求∂l∂ω\frac{\partial l}{\partial \boldsymbol \omega }ωl的零点。其中y\boldsymbol yym×1m \times 1m×1列向量,XXXm×nm \times nm×n矩阵,ω\boldsymbol \omegaωn×1n \times 1n×1列向量,lll是标量。

解:严格来说这是标量对向量的导数,不过可以把向量看做矩阵的特例。先将向量模平方改写成向量与自身的内积:l=(Xω−y)T(Xω−y)l=(X\boldsymbol \omega - \boldsymbol y)^T(X\boldsymbol \omega - \boldsymbol y)l=(Xωy)T(Xωy),求微分,使用矩阵乘法、转置等法则:dl=(Xdω)T(Xω−y)+(Xω−y)T(Xdω)=2(Xω−y)TXdωdl=(Xd\boldsymbol \omega )^T(X\boldsymbol \omega - \boldsymbol y)+(X\boldsymbol \omega - \boldsymbol y)^T(Xd\boldsymbol \omega )=2(X\boldsymbol \omega - \boldsymbol y)^TXd\boldsymbol \omegadl=(Xdω)T(Xωy)+(Xωy)T(Xdω)=2(Xωy)TXdω。对照导数与微分的联系dl=∂lT∂ωdωdl=\frac{\partial l^T}{\partial \boldsymbol \omega }d\boldsymbol \omegadl=ωlTdω,得到∂l∂ω=(2(Xω−y)TX)T=2XT(Xω−y)\frac{\partial l}{\partial \boldsymbol \omega }=(2(X\boldsymbol \omega - \boldsymbol y)^TX)^T=2X^T(X\boldsymbol \omega - \boldsymbol y)ωl=(2(Xωy)TX)T=2XT(Xωy)∂l∂ω\frac{\partial l}{\partial \boldsymbol \omega }ωl零点即ω\boldsymbol \omegaω的最小二乘估计为ω=(XTX)−1XTy\boldsymbol \omega =(X^TX)^{-1}X^T\boldsymbol yω=(XTX)1XTy

例5【方差的最大似然估计】:样本x1,...,xN∼λ(μ,Σ)\boldsymbol x_1,...,\boldsymbol x_N\sim \lambda (\boldsymbol \mu , \Sigma )x1,...,xNλ(μ,Σ),求方差Σ\SigmaΣ的最大似然估计。写成数学式是:l=log∣Σ∣+1N∑i=1N(xi−xˉ)TΣ−1(xi−xˉ)l=log\left | \Sigma \right |+\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar \boldsymbol x)^T\Sigma ^{-1}(\boldsymbol x_i-\bar \boldsymbol x)l=logΣ+N1i=1N(xixˉ)TΣ1(xixˉ),求∂l∂Σ\frac{\partial l}{\partial \Sigma }Σl的零点。其中xi\boldsymbol x_{i}xim×1m \times 1m×1向量,Σ\SigmaΣm×mm \times mm×m对称正定矩阵,xˉ=1N∑i=1Nxi\bar \boldsymbol x= \frac{1}{N} \sum_{i=1}^{N}\boldsymbol x_ixˉ=N1i=1Nxi是样本均值,lll是标量,log表示自然对数。

解:首先求微分,使用矩阵乘法、行列式、逆等运算法则,第一项是dlog∣Σ∣=∣Σ∣−1d∣Σ∣=tr(∣Σ∣−1dΣ)dlog\left | \Sigma \right |=\left | \Sigma \right |^{-1}d\left | \Sigma \right |=tr(\left | \Sigma \right |^{-1}d\Sigma )dlogΣ=Σ1dΣ=tr(Σ1dΣ),第二项是1N∑i=1N(xi−xˉ)TdΣ−1(xi−xˉ)=−1N∑i=1N(xi−xˉ)TΣ−1dΣΣ−1(xi−xˉ)\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar \boldsymbol x)^Td{\Sigma} ^{-1}(\boldsymbol x_i-\bar \boldsymbol x)= - \frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar \boldsymbol x)^T {\Sigma} ^{-1} d\Sigma{\Sigma} ^{-1}(\boldsymbol x_i-\bar \boldsymbol x)N1i=1N(xixˉ)TdΣ1(xixˉ)=N1i=1N(xixˉ)TΣ1dΣΣ1(xixˉ)。再给第二项套上迹做交换:tr(1N∑i=1N(xi−xˉ)TΣ−1dΣΣ−1(xi−xˉ))=1N∑i=1Ntr((xi−xˉ)TΣ−1dΣΣ−1(xi−xˉ))=1N∑i=1Ntr(Σ−1(xi−xˉ)(xi−xˉ)TΣ−1dΣ=tr(Σ−1SΣ−1dΣ)tr(\frac{1}{N}\sum_{i=1}^N(\boldsymbol x_i-\bar \boldsymbol x)^T {\Sigma} ^{-1} d\Sigma{\Sigma} ^{-1}(\boldsymbol x_i-\bar \boldsymbol x))= \frac{1}{N}\sum_{i=1}^N tr((\boldsymbol x_i-\bar \boldsymbol x)^T {\Sigma} ^{-1} d\Sigma{\Sigma} ^{-1}(\boldsymbol x_i-\bar \boldsymbol x))= \frac{1}{N}\sum_{i=1}^N tr({\Sigma} ^{-1} (\boldsymbol x_i-\bar \boldsymbol x)(\boldsymbol x_i-\bar \boldsymbol x)^T {\Sigma} ^{-1} d\Sigma=tr({\Sigma} ^{-1} S{\Sigma} ^{-1} d\Sigma)tr(N1i=1N(xixˉ)TΣ1dΣΣ1(xixˉ))=N1i=1Ntr((xixˉ)TΣ1dΣΣ1(xixˉ))=N1i=1Ntr(Σ1(xixˉ)(xixˉ)TΣ1dΣ=tr(Σ1SΣ1dΣ),其中先交换迹与求和,然后将Σ−1(xi−xˉ)\Sigma ^{-1}(\boldsymbol x_i - \bar \boldsymbol x)Σ1(xixˉ)交换到左边,最后再交换迹与求和,并定义S=1N∑i=1N(xi−xˉ)(xi−xˉ)TS=\frac{1}{N} \sum_{i=1}^N(\boldsymbol x_i - \bar \boldsymbol x)(\boldsymbol x_i - \bar \boldsymbol x)^TS=N1i=1N(xixˉ)(xixˉ)T为样本方差矩阵。得到dl=tr((Σ−1−Σ−1SΣ−1)dΣ)dl=tr(({\Sigma} ^{-1} - \Sigma^{-1} S{\Sigma} ^{-1}) d\Sigma)dl=tr((Σ1Σ1SΣ1)dΣ)。对照导数与微分的联系,有∂l∂Σ=(Σ−1−Σ−1SΣ−1)T\frac{\partial l}{\partial \Sigma}=({\Sigma} ^{-1} - \Sigma^{-1} S{\Sigma} ^{-1})^TΣl=(Σ1Σ1SΣ1)T,其零点即Σ\SigmaΣ的最大似然估计为Σ=S\Sigma = SΣ=S

例6【多元logistic回归】:l=−yTlog softmax(Wx)l=- \boldsymbol y^Tlog \ softmax(W \boldsymbol x)l=yTlog softmax(Wx),求∂l∂W\frac{\partial l}{\partial W}Wl。其中y\boldsymbol yy是除一个元素为1外其它元素为0的m×1m \times 1m×1列向量,WWWm×nm \times nm×n矩阵,x\boldsymbol xxn×1n \times 1n×1矩阵,lll是标量;log表示自然对数,softmax(a)=exp(a)1Texp(a)softmax( \boldsymbol a)=\frac{exp( \boldsymbol a)}{ \boldsymbol 1^T exp( \boldsymbol a)}softmax(a)=1Texp(a)exp(a),其中exp(a)exp(a)exp(a)表示逐元素求指数, 1\boldsymbol 11代表全1向量。

解1:首先将softmax函数代入并写成dl=−yT(log(exp(Wx))−1log(1Texp(Wx)))=−yTWx+log(1Texp(Wx))dl=- \boldsymbol y^T(log(exp(W \boldsymbol x))- \boldsymbol 1log( \boldsymbol 1^Texp(W \boldsymbol x)))=- \boldsymbol y^TW \boldsymbol x+log( \boldsymbol 1^Texp(W \boldsymbol x))dl=yT(log(exp(Wx))1log(1Texp(Wx)))=yTWx+log(1Texp(Wx)),这里要注意逐元素log满足等式log(u/c)=log(u)−1log(c)log( \boldsymbol u /c)=log( \boldsymbol u)- \boldsymbol 1log(c)log(u/c)=log(u)1log(c),以及y\boldsymbol yy满足yT1=1\boldsymbol y^T \boldsymbol 1 = 1yT1=1。求微分,使用矩阵乘法、逐元素函数等法则:dl=−yTdWx+1T(exp(Wx)⊙(dWx))1Texp(Wx)dl= - \boldsymbol y^TdW\boldsymbol x + \frac{\boldsymbol 1^T(exp(W\boldsymbol x) \odot (dW\boldsymbol x))}{\boldsymbol 1^Texp(W\boldsymbol x)}dl=yTdWx+1Texp(Wx)1T(exp(Wx)(dWx))。再套上迹并做交换,注意可化简1T(exp(Wx)⊙(dWx))=exp(Wx)TdWx\boldsymbol 1^T(exp(W\boldsymbol x) \odot (dW\boldsymbol x))=exp(W\boldsymbol x)^TdW\boldsymbol x1T(exp(Wx)(dWx))=exp(Wx)TdWx,这是根据等式1T(u⊙v)=uTv\boldsymbol 1^T(\boldsymbol u \odot \boldsymbol v)=\boldsymbol u^T\boldsymbol v1T(uv)=uTv,故dl=tr(−yTdWx+exp(Wx)TdWx1Texp(Wx))=tr(x(softmax(Wx)−yT)dW)dl=tr\left ( -\boldsymbol y^TdW\boldsymbol x + \frac{exp(W\boldsymbol x)^TdW\boldsymbol x}{1^T exp(W\boldsymbol x)} \right )=tr(\boldsymbol x(softmax(W\boldsymbol x)-\boldsymbol y^T)dW)dl=tr(yTdWx+1Texp(Wx)exp(Wx)TdWx)=tr(x(softmax(Wx)yT)dW)。对照导数与微分的联系,得到∂l∂W=(softmax(Wx)−y)xT\frac{\partial l}{\partial W}=(softmax(W\boldsymbol x)-\boldsymbol y)\boldsymbol x^TWl=(softmax(Wx)y)xT

解2:定义a=Wx\boldsymbol a=W\boldsymbol xa=Wx,则l=−yTlog softmax(a)l=-\boldsymbol y^Tlog \ softmax(\boldsymbol a)l=yTlog softmax(a),先同上求出∂l∂a=softmax(a)−y\frac{\partial l}{\partial \boldsymbol a} = softmax(\boldsymbol a)-\boldsymbol yal=softmax(a)y,再利用复合法则:dl=tr(∂lT∂ada)=tr(∂lT∂adWx)=tr(x∂lT∂adW)dl= tr\left ( \frac{\partial l^T}{\partial \boldsymbol a}d\boldsymbol a \right )=tr\left ( \frac{\partial l^T}{\partial \boldsymbol a}dW\boldsymbol x \right )=tr\left ( \boldsymbol x \frac{\partial l^T}{\partial \boldsymbol a}dW \right )dl=tr(alTda)=tr(alTdWx)=tr(xalTdW),得到∂l∂W=∂l∂axT\frac{\partial l}{\partial W}= \frac{\partial l}{\partial \boldsymbol a}\boldsymbol x^TWl=alxT

最后一例留给经典的神经网络。神经网络的求导术是学术史上的重要成果,还有个专门的名字叫做BP算法,我相信如今很多人在初次推导BP算法时也会颇费一番脑筋,事实上使用矩阵求导术来推导并不复杂。为简化起见,我们推导二层神经网络的BP算法。

例7【二层神经网络】:l=−yTlog softmax(W2σ(W1x))l=-\boldsymbol y^Tlog \ softmax(W_2 \sigma(W_1\boldsymbol x))l=yTlog softmax(W2σ(W1x)),求∂l∂W1\frac{\partial l}{\partial W_1}W1l∂l∂W2\frac{\partial l}{\partial W_2}W2l。其中y\boldsymbol yy是除一个元素为1外其它元素为0的的m×1m \times 1m×1列向量,W2W_2W2m×pm \times pm×p矩阵,W1W_1W1p×np \times np×n矩阵,x\boldsymbol xxn×1n \times 1n×1矩阵,lll是标量;log表示自然对数,softmax(a)=exp(a)1Texp(a)softmax(\boldsymbol a)=\frac{exp(\boldsymbol a)}{\boldsymbol 1^Texp(\boldsymbol a)}softmax(a)=1Texp(a)exp(a)同上,σ\sigmaσ是逐元素sigmoid函数。

解:定义a1=W1x,h1=σ(a1),a2=W2h1\boldsymbol a_1=W_1\boldsymbol x,\boldsymbol h_1=\sigma(\boldsymbol a_1),\boldsymbol a_2=W_2\boldsymbol h_1a1=W1x,h1=σ(a1),a2=W2h1,则l=−yTlog softmax(a2)l=-\boldsymbol y^Tlog \ softmax(\boldsymbol a_2)l=yTlog softmax(a2)。在前例中已求出∂l∂a2=softmax(a2)−y\frac{\partial l}{\partial \boldsymbol a_2}=softmax(\boldsymbol a_2)-\boldsymbol ya2l=softmax(a2)y。使用复合法则,dl=tr(∂lT∂a2da2)=tr(∂lT∂a2dW2h1)+tr(∂lT∂a2W2dh1)⎵dl2dl=tr\left ( \frac{\partial l^T}{ \partial \boldsymbol a_2} d\boldsymbol a_2 \right)=tr\left ( \frac{\partial l^T}{ \partial \boldsymbol a_2} dW_2\boldsymbol h_1 \right ) + \underbrace{tr\left ( \frac{\partial l^T}{ \partial \boldsymbol a_2} W_2d\boldsymbol h_1 \right ) }_{dl_2}dl=tr(a2lTda2)=tr(a2lTdW2h1)+dl2tr(a2lTW2dh1),使用矩阵乘法交换的迹技巧从第一项得到∂l∂W2=∂l∂a2h1T\frac{\partial l}{\partial W_2}=\frac{\partial l}{\partial \boldsymbol a_2}h^T_1W2l=a2lh1T,从第二项得到∂l∂h1=W2T∂l∂a2\frac{\partial l}{\partial \boldsymbol h_1}=W_2^T\frac{\partial l}{\partial \boldsymbol a_2}h1l=W2Ta2l。接下来对第二项继续使用复合法则来求∂l∂a1\frac{\partial l}{\partial \boldsymbol a_1}a1l,并利用矩阵乘法和逐元素乘法交换的迹技巧:dl2=tr(∂lT∂h1dh1)=tr(∂lT∂h1(σ′(a1)⊙da1))=tr((∂l∂h1⊙σ′(a1))Tda1)dl_2=tr\left ( \frac{\partial l^T}{\partial \boldsymbol h_1}d\boldsymbol h_1 \right )=tr\left ( \frac{\partial l^T}{\partial \boldsymbol h_1}(\sigma'(\boldsymbol a_1) \odot d\boldsymbol a_1) \right )=tr\left ( \left ( \frac{\partial l}{\partial \boldsymbol h_1} \odot \sigma'(\boldsymbol a_1)\right )^Td\boldsymbol a_1 \right )dl2=tr(h1lTdh1)=tr(h1lT(σ(a1)da1))=tr((h1lσ(a1))Tda1),得到∂l∂a1=∂l∂h1⊙σ′(a1)\frac{\partial l}{\partial \boldsymbol a_1}=\frac{\partial l}{\partial \boldsymbol h_1} \odot \sigma'(\boldsymbol a_1)a1l=h1lσ(a1)为求∂l∂W\frac{\partial l}{\partial W}Wl,再用一次复合法则:dl2=tr(∂lT∂a1da1)=tr(∂lT∂a1dW1x)=tr(x∂lT∂a1dW1)dl_2=tr\left ( \frac{\partial l^T}{\partial \boldsymbol a_1}d\boldsymbol a_1 \right )=tr\left ( \frac{\partial l^T}{\partial \boldsymbol a_1}dW_1\boldsymbol x \right )=tr\left ( \boldsymbol x \frac{\partial l^T}{\partial \boldsymbol a_1}dW_1 \right )dl2=tr(a1lTda1)=tr(a1lTdW1x)=tr(xa1lTdW1),得到∂l∂W1=∂l∂a1xT\frac{\partial l}{\partial W_1}=\frac{\partial l}{\partial \boldsymbol a_1}\boldsymbol x^TW1l=a1lxT

推广:样本(x1,y1),....,(xN,yN)(\boldsymbol x_1,\boldsymbol y_1),....,(\boldsymbol x_N,\boldsymbol y_N)(x1,y1),....,(xN,yN)l=−∑i=1NyiTlog softmax(W2σ(W1xi+b1)+b2)l=- \sum_{i=1}^N\boldsymbol y_i^T log \ softmax(W_2 \sigma(W_1\boldsymbol x_i+\boldsymbol b_1)+\boldsymbol b_2)l=i=1NyiTlog softmax(W2σ(W1xi+b1)+b2)b1\boldsymbol b_1b1p×1p \times 1p×1列向量,b2\boldsymbol b_2b2m×1m \times 1m×1列向量其余定义同上。

解1:定义a1,i=W1xi+b1,h1,i=σ(a1,i),a2,i=W2h1,i+b2\boldsymbol a_{1,i}=W_1\boldsymbol x_i+\boldsymbol b_1,\boldsymbol h_{1,i}=\sigma(\boldsymbol a_{1,i}),\boldsymbol a_{2,i}=W_2\boldsymbol h_{1,i}+\boldsymbol b_2a1,i=W1xi+b1,h1,i=σ(a1,i),a2,i=W2h1,i+b2​​​​​,则l=−∑i=1NyiTlog softmax(a2,i)l=-\sum_{i=1}^N\boldsymbol y_i^T log \ softmax(\boldsymbol a_{2,i})l=i=1NyiTlog softmax(a2,i)。先同上可求出∂l∂a2,i=softmax(a2,i)−yi\frac{\partial l}{\partial \boldsymbol a_{2,i}}=softmax(\boldsymbol a_{2,i})-\boldsymbol y_ia2,il=softmax(a2,i)yi。使用复合法则,dl=tr(∑i=1N∂lT∂a2,ida2,i)=tr(∑i=1N∂lT∂a2,idW2h1,i)+tr(∑i=1N∂lT∂a2,iW2dh1,i)⎵dl2+tr(∑i=1N∂lT∂a2,idb2)dl=tr\left ( \sum_{i=1}^N\frac{\partial l^T}{\partial \boldsymbol a_{2,i}} d\boldsymbol a_{2,i}\right )=tr\left ( \sum_{i=1}^N\frac{\partial l^T}{\partial \boldsymbol a_{2,i}} dW_2\boldsymbol h_{1,i} \right ) + \underbrace{tr\left ( \sum_{i=1}^N\frac{\partial l^T}{\partial \boldsymbol a_{2,i}}W_ 2d\boldsymbol h_{1,i} \right ) }_{dl_2} + tr\left ( \sum_{i=1}^N\frac{\partial l^T}{\partial \boldsymbol a_{2,i}} d\boldsymbol b_2 \right )dl=tr(i=1Na2,ilTda2,i)=tr(i=1Na2,ilTdW2h1,i)+dl2tr(i=1Na2,ilTW2dh1,i)+tr(i=1Na2,ilTdb2),从第一项得到得到∂l∂W2=∑i=1N∂l∂a2,ih1,iT\frac{\partial l}{\partial W_2}=\sum_{i=1}^N\frac{\partial l}{\partial \boldsymbol a_{2,i} }\boldsymbol h_{1,i}^TW2l=i=1Na2,ilh1,iT,从第二项得到∂l∂h1,i=W2T∂l∂a2,i\frac{\partial l}{\partial \boldsymbol h_{1,i}}=W_2^T\frac{\partial l}{\partial \boldsymbol a_{2,i} }h1,il=W2Ta2,il,从第三项得到∂l∂b2=∑i=1N∂l∂a2,i\frac{\partial l}{\partial \boldsymbol b_2}=\sum_{i=1}^N\frac{\partial l}{\partial \boldsymbol a_{2,i} }b2l=i=1Na2,il。接下来对第二项继续使用复合法则,得到∂l∂a1,i=∂l∂h1,i⊙σ′(a1,i)\frac{\partial l}{\partial \boldsymbol a_{1,i}}=\frac{\partial l}{\partial \boldsymbol h_{1,i} } \odot \sigma'(\boldsymbol a_{1,i})a1,il=h1,ilσ(a1,i)。为求∂l∂W1,∂l∂b1\frac{\partial l}{\partial W_1},\frac{\partial l}{\partial \boldsymbol b_1}W1l,b1l,再用一次复合法则:dl2=tr(∑i=1N∂lT∂a1,ida1,i)=tr(∑i=1N∂lT∂a1,idW1xi)+tr(∑i=1N∂lT∂a1,idb1)dl_2=tr\left ( \sum_{i=1}^N\frac{\partial l^T}{\partial a_{1,i}} d\boldsymbol a_{1,i}\right )=tr\left ( \sum_{i=1}^N\frac{\partial l^T}{\partial \boldsymbol a_{1,i}} dW_1\boldsymbol x_{i} \right ) + tr\left ( \sum_{i=1}^N\frac{\partial l^T}{\partial \boldsymbol a_{1,i}} d\boldsymbol b_1 \right )dl2=tr(i=1Na1,ilTda1,i)=tr(i=1Na1,ilTdW1xi)+tr(i=1Na1,ilTdb1)得到∂l∂W1=∑i=1N∂l∂a1,ixiT,∂l∂b1=∑i=1N∂l∂a1,i\frac{\partial l}{\partial W_1}= \sum_{i=1}^{N}\frac{\partial l}{\partial \boldsymbol a_{1, i}}x_i^T,\frac{\partial l}{\partial \boldsymbol b_1}=\sum_{i=1}^N \frac{\partial l}{\partial \boldsymbol a_{1, i}}W1l=i=1Na1,ilxiT,b1l=i=1Na1,il

解2:可以用矩阵来表示N个样本,以简化形式。定义X=[x1,...,xN],A1=[a1,1,..,a1,N]=W1X+b11T,H1=[h1,1,...,h1,N]=σ(A1),A2=[a2,1,..,a2,N]=W2H1+b21TX=[\boldsymbol x_1, ...,\boldsymbol x_N],A_1=[\boldsymbol a_{1,1},..,\boldsymbol a_{1,N}]=W_1X+\boldsymbol b_1\boldsymbol 1^T,H_1=[\boldsymbol h_{1,1},...,\boldsymbol h_{1,N}]=\sigma(A_1),A_2=[\boldsymbol a_{2,1},..,\boldsymbol a_{2,N}]=W_2H_1+\boldsymbol b_2\boldsymbol 1^TX=[x1,...,xN],A1=[a1,1,..,a1,N]=W1X+b11T,H1=[h1,1,...,h1,N]=σ(A1),A2=[a2,1,..,a2,N]=W2H1+b21T,注意这里使用全1向量来扩展维度。先同上求出∂l∂A2=[softmax(a2,1)−y1,...,softmax(a2,N)−yN]\frac{\partial l}{\partial A_2}=[softmax(\boldsymbol a_{2,1})-\boldsymbol y_1,...,softmax(\boldsymbol a_{2,N})-\boldsymbol y_N]A2l=[softmax(a2,1)y1,...,softmax(a2,N)yN]。使用复合法则,dl=tr(∂lT∂A2dA2)=tr(∂lT∂A2dW2H1)+tr(∂lT∂A2W2dH1)⎵dl2+tr(∂lT∂A2db21T)dl=tr\left ( \frac{\partial l^T}{\partial A_2} dA_2\right )=tr\left ( \frac{\partial l^T}{\partial A_2} dW_2H_1 \right ) + \underbrace{tr\left ( \frac{\partial l^T}{\partial A_2} W_2dH_1 \right )}_{dl_2} + tr\left ( \frac{\partial l^T}{\partial A_2} d\boldsymbol b_2\boldsymbol 1^T \right )dl=tr(A2lTdA2)=tr(A2lTdW2H1)+dl2tr(A2lTW2dH1)+tr(A2lTdb21T),从第一项得到∂l∂W2=∂l∂A2H1T\frac{\partial l}{\partial W_2}=\frac{\partial l}{\partial A_2}H_1^TW2l=A2lH1T,从第二项得到∂l∂W1=W2T∂l∂A2\frac{\partial l}{\partial W_1}=W_2^T\frac{\partial l}{\partial A_2}W1l=W2TA2l,从第三项得到到∂l∂b2=∂l∂A21\frac{\partial l}{\partial \boldsymbol b_2}=\frac{\partial l}{\partial A_2}\boldsymbol 1b2l=A2l1。接下来对第二项继续使用复合法则,得到∂l∂A1=∂l∂H1⊙σ′(A1)\frac{\partial l}{\partial A_1}=\frac{\partial l}{\partial H_1} \odot \sigma'(A_1)A1l=H1lσ(A1)。为求∂l∂W1,∂l∂b1\frac{\partial l}{\partial W_1},\frac{\partial l}{\partial \boldsymbol b_1}W1l,b1l,再用一次复合法则:dl2=tr(∂lT∂A1dA1)=tr(∂lT∂A1dW1X)+tr(∂lT∂A1db11T)dl_2=tr\left ( \frac{\partial l^T}{\partial A_1}dA_1 \right )=tr\left ( \frac{\partial l^T}{\partial A_1}dW_1X \right )+tr\left ( \frac{\partial l^T}{\partial A_1}d\boldsymbol b_1\boldsymbol 1^T \right )dl2=tr(A1lTdA1)=tr(A1lTdW1X)+tr(A1lTdb11T),得到∂l∂W1=∂l∂A1XT,∂l∂b1=∂l∂A11\frac{\partial l}{\partial W_1}=\frac{\partial l}{\partial A_1}X^T,\frac{\partial l}{\partial \boldsymbol b_1}=\frac{\partial l}{\partial A_1}\boldsymbol 1W1l=A1lXT,b1l=A1l1

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值