Matrix Calculus Supplement
Convention 1:
Let $\mathbf y=f(\mathbf x)$, where $\mathbf y$ is a vector with $m$ elements and $\mathbf x$ is a vector with $n$ elements. Then:
$$
\frac{\partial \mathbf y}{\partial \mathbf x} =\left[
\begin{matrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} &\cdots& \frac{\partial y_1}{\partial x_n}\\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} &\cdots& \frac{\partial y_2}{\partial x_n}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} &\cdots& \frac{\partial y_m}{\partial x_n}\\
\end{matrix}
\right]
$$
This $m \times n$ matrix collects the partial derivatives of $\mathbf y$ with respect to $\mathbf x$. It is called the Jacobian matrix.
Note: if $\mathbf x$ is a scalar, the Jacobian reduces to an $m \times 1$ column vector. If $\mathbf y$ is a scalar, the Jacobian reduces to a $1 \times n$ row vector.
Proposition 1:
Let
$$
\mathbf y=A\mathbf x
$$
where $\mathbf y$ is an $m \times 1$ column vector, $\mathbf x$ is an $n\times 1$ column vector, and $A$ is an $m \times n$ matrix that does not depend on $\mathbf x$. Then
$$
\frac{\partial \mathbf y}{\partial \mathbf x}=A
$$
Proof:
The $i$-th element of $\mathbf y$ is
$$
y_i=\sum^n_{k=1}a_{ik}x_k
$$
It follows immediately that
$$
\frac{\partial y_i}{\partial x_j}=a_{ij}
$$
for all $i=1,2,\cdots,m$ and $j=1,2,\cdots,n$. Hence
$$
\frac{\partial \mathbf y}{\partial \mathbf x}=A
$$
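As a numerical sanity check of Proposition 1, the sketch below approximates the Jacobian of $\mathbf y=A\mathbf x$ by central differences and compares it with $A$. The helper names `matvec` and `jacobian`, the step size `h`, and the sample values of `A` and `x0` are illustrative assumptions rather than part of the original notes. Only the Python standard library is used.

```python
def matvec(A, x):
    """Return the matrix-vector product A x for a list-of-rows matrix A."""
    return [sum(a_ik * x_k for a_ik, x_k in zip(row, x)) for row in A]


def jacobian(f, x, h=1e-6):
    """Central-difference Jacobian in numerator layout: J[i][j] = d f_i / d x_j."""
    columns = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        columns.append([(p - m) / (2 * h) for p, m in zip(f_plus, f_minus)])
    # columns[j][i] holds d f_i / d x_j, so transpose to index rows by i.
    return [list(row) for row in zip(*columns)]


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # m = 2, n = 3 (arbitrary sample values)
x0 = [0.5, -1.0, 2.0]

J = jacobian(lambda x: matvec(A, x), x0)
max_err = max(abs(J[i][j] - A[i][j]) for i in range(2) for j in range(3))
print("largest deviation from A:", max_err)
```

Because $\mathbf y$ is linear in $\mathbf x$, the finite-difference estimate agrees with $A$ up to floating-point rounding, whatever point `x0` is chosen.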
Proposition 2:
Let
$$
\mathbf y=A\mathbf x
$$
where $\mathbf y$ is an $m \times 1$ column vector, $\mathbf x$ is an $n\times 1$ column vector, $A$ is an $m \times n$ matrix that does not depend on $\mathbf x$, and $\mathbf x$ is a function of a vector $\mathbf z$. Then
$$
\frac{\partial \mathbf y}{\partial \mathbf z}=A\frac{\partial \mathbf x}{\partial \mathbf z}
$$
Proof:
The $i$-th element of $\mathbf y$ is
$$
y_i=\sum^n_{k=1}a_{ik}x_k
$$
Differentiating with respect to $z_j$ gives
$$
\frac{\partial y_i}{\partial z_j}=\sum^n_{k=1}a_{ik}\frac{\partial x_k}{\partial z_j}
$$
This is exactly the $(i,j)$ element of $A\,{\partial \mathbf x}/{\partial \mathbf z}$, so
$$
\frac {\partial \mathbf y}{\partial \mathbf z}=
\frac {\partial \mathbf y}{\partial \mathbf x}
\frac {\partial \mathbf x}{\partial \mathbf z}=
A\frac {\partial \mathbf x}{\partial \mathbf z}
$$
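The chain rule in Proposition 2 can be checked the same way. In this sketch the inner map `x_of_z`, its analytic Jacobian `dx_dz`, and all numeric values are hypothetical choices made only for illustration; the finite-difference Jacobian of $\mathbf y(\mathbf z)=A\,\mathbf x(\mathbf z)$ is compared with $A\,\partial\mathbf x/\partial\mathbf z$.

```python
import math


def matvec(A, x):
    """Return the matrix-vector product A x."""
    return [sum(a * x_k for a, x_k in zip(row, x)) for row in A]


def matmul(A, B):
    """Return the matrix product A B for list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def jacobian(f, z, h=1e-6):
    """Central-difference Jacobian in numerator layout: J[i][j] = d f_i / d z_j."""
    columns = []
    for j in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += h
        z_minus[j] -= h
        columns.append([(p - m) / (2 * h) for p, m in zip(f(z_plus), f(z_minus))])
    return [list(row) for row in zip(*columns)]


def x_of_z(z):
    """Hypothetical inner map x(z) from R^2 to R^3."""
    z1, z2 = z
    return [z1 * z1, z1 * z2, math.sin(z2)]


def dx_dz(z):
    """Analytic Jacobian of x_of_z in numerator layout (3 x 2)."""
    z1, z2 = z
    return [[2 * z1, 0.0],
            [z2, z1],
            [0.0, math.cos(z2)]]


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
z0 = [0.7, -0.3]

numeric = jacobian(lambda z: matvec(A, x_of_z(z)), z0)   # dy/dz, 2 x 2
analytic = matmul(A, dx_dz(z0))                          # A (dx/dz), 2 x 2
max_err = max(abs(numeric[i][j] - analytic[i][j])
              for i in range(2) for j in range(2))
print("largest deviation:", max_err)
```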
Proposition 3:
Let the scalar $\alpha$ be defined as
$$
\alpha=\mathbf y^TA \mathbf x
$$
where $\mathbf y$ is an $m \times 1$ column vector, $\mathbf x$ is an $n\times 1$ column vector, and $A$ is an $m \times n$ matrix that depends on neither $\mathbf x$ nor $\mathbf y$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf x}=\mathbf y^TA
$$
and:
$$
\frac {\partial \alpha}{\partial \mathbf y}=\mathbf x^TA^T
$$
Proof:
Define
$$
\mathbf w^T=\mathbf y^TA
$$
so that $\alpha$ can be written as
$$
\alpha=\mathbf w^T \mathbf x
$$
By Proposition 1:
$$
\frac {\partial \alpha}{\partial \mathbf x}=\mathbf w^T=\mathbf y^TA
$$
This is the first result. Moreover, since $\alpha$ is a scalar:
$$
\alpha=\alpha^T=\mathbf x^TA^T\mathbf y
$$
Applying Proposition 1 once more gives:
$$
\frac {\partial \alpha}{\partial \mathbf y}=\mathbf x^TA^T
$$
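Both results of Proposition 3 can be verified numerically. The sketch below uses an assumed helper `gradient` (central differences for a scalar function) and arbitrary sample values for `A`, `x0` and `y0`. Gradients are returned as flat lists, so the row-versus-column question does not arise here.

```python
def gradient(f, v, h=1e-6):
    """Central-difference partial derivatives of a scalar function f at v."""
    grad = []
    for j in range(len(v)):
        v_plus, v_minus = list(v), list(v)
        v_plus[j] += h
        v_minus[j] -= h
        grad.append((f(v_plus) - f(v_minus)) / (2 * h))
    return grad


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # m = 2, n = 3
y0 = [1.5, -2.0]               # m entries
x0 = [0.5, -1.0, 2.0]          # n entries


def alpha(y, x):
    """alpha = y^T A x written as an explicit double sum."""
    return sum(y[i] * A[i][j] * x[j]
               for i in range(len(y)) for j in range(len(x)))


grad_x = gradient(lambda x: alpha(y0, x), x0)   # d alpha / d x
grad_y = gradient(lambda y: alpha(y, x0), y0)   # d alpha / d y

yTA = [sum(y0[i] * A[i][j] for i in range(2)) for j in range(3)]    # y^T A
xTAT = [sum(A[i][j] * x0[j] for j in range(3)) for i in range(2)]   # x^T A^T

err_x = max(abs(g - e) for g, e in zip(grad_x, yTA))
err_y = max(abs(g - e) for g, e in zip(grad_y, xTAT))
print("deviation w.r.t. x:", err_x, "deviation w.r.t. y:", err_y)
```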
Proposition 4:
In the special case where the scalar $\alpha$ is a quadratic form,
$$
\alpha=\mathbf x^TA\mathbf x
$$
where $\mathbf x$ is an $n\times 1$ column vector and $A$ is an $n \times n$ matrix that does not depend on $\mathbf x$, we have:
$$
\frac {\partial \alpha}{\partial \mathbf x}=\mathbf x^T(A+A^T)
$$
Proof:
By definition:
$$
\alpha=\sum^n_{j=1}\sum^n_{i=1}a_{ij}x_ix_j
$$
Differentiating with respect to the $k$-th element of $\mathbf x$:
$$
\frac{\partial \alpha}{\partial x_k}=\sum^n_{j=1}a_{kj}x_j+\sum^n_{i=1}a_{ik}x_i
$$
The first sum is the $k$-th entry of $\mathbf x^TA^T$ and the second is the $k$-th entry of $\mathbf x^TA$, so:
$$
\frac{\partial \alpha}{\partial \mathbf x}=\mathbf x^TA^T+\mathbf x^TA=\mathbf x^T(A^T+A)
$$
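The quadratic-form result is easy to get wrong when $A$ is not symmetric, so a small numerical check may help. The sketch below uses a deliberately non-symmetric sample matrix (an arbitrary choice) and an assumed `gradient` helper, and compares the finite-difference gradient with $\mathbf x^T(A+A^T)$. It also evaluates $2\mathbf x^TA$, which is not the correct gradient for this non-symmetric matrix.

```python
def gradient(f, v, h=1e-6):
    """Central-difference partial derivatives of a scalar function f at v."""
    grad = []
    for j in range(len(v)):
        v_plus, v_minus = list(v), list(v)
        v_plus[j] += h
        v_minus[j] -= h
        grad.append((f(v_plus) - f(v_minus)) / (2 * h))
    return grad


A = [[1.0, 2.0],
     [5.0, 3.0]]     # deliberately non-symmetric
x0 = [0.5, -1.5]
n = len(x0)


def alpha(x):
    """alpha = x^T A x written as an explicit double sum."""
    return sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


numeric = gradient(alpha, x0)
# k-th entry of x^T (A + A^T): sum_i x_i (a_ik + a_ki)
formula = [sum(x0[i] * (A[i][k] + A[k][i]) for i in range(n)) for k in range(n)]
# k-th entry of 2 x^T A, which only matches the gradient when A is symmetric
symmetric_only = [2 * sum(x0[i] * A[i][k] for i in range(n)) for k in range(n)]

max_err = max(abs(a - b) for a, b in zip(numeric, formula))
print("deviation from x^T (A + A^T):", max_err)
print("gap to 2 x^T A in first entry:", abs(symmetric_only[0] - numeric[0]))
```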
Note: this result differs slightly from the one in Chapter 4, which reads:
$$
\frac{\partial \alpha}{\partial \mathbf x}=(A^T+A) \mathbf x
$$
The two results differ only by a transpose:
$$
((A^T+A) \mathbf x)^T=\mathbf x^T(A+A^T)=\mathbf x^T(A^T+A)
$$
The derivative here is a vector, and for its individual entries a transpose only changes whether they are laid out horizontally or vertically ($(A^T+A) \mathbf x$ is a column vector, $\mathbf x^T(A+A^T)$ is a row vector); the entries themselves are the same. But why does this difference arise?
From the context, in Chapter 4 the derivative with respect to a matrix takes the shape of the variable (vector or matrix) itself:
$$
\nabla_Af(A) \in \mathbb{R}^{m \times n} =\left[
\begin{matrix}
\frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{12}} &\cdots& \frac{\partial f(A)}{\partial A_{1n}}\\
\frac{\partial f(A)}{\partial A_{21}} & \frac{\partial f(A)}{\partial A_{22}} &\cdots& \frac{\partial f(A)}{\partial A_{2n}}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial f(A)}{\partial A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} &\cdots& \frac{\partial f(A)}{\partial A_{mn}}\\
\end{matrix}
\right]
$$
In this supplement, however, derivatives always follow the Jacobian convention:
$$
\frac{\partial \mathbf y}{\partial \mathbf x} =\left[
\begin{matrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} &\cdots& \frac{\partial y_1}{\partial x_n}\\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} &\cdots& \frac{\partial y_2}{\partial x_n}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} &\cdots& \frac{\partial y_m}{\partial x_n}\\
\end{matrix}
\right]
$$
So under the Chapter 4 definition, the derivative of a scalar with respect to a column vector is a column vector, whereas under the Jacobian definition used in this supplement, the derivative of a scalar with respect to a vector is a row vector. That is where the transpose comes from.
In fact, matrix calculus has no single agreed notation for many of its derivatives. Broadly, though, the conventions fall into two layouts:
- Numerator layout
- Denominator layout
- Numerator layout
In
$$
\frac{\partial \mathbf y}{\partial \mathbf x}
$$
treat the numerator vector $\mathbf y$ as a column vector and the denominator vector $\mathbf x$ as a row vector (a vector on its own has no intrinsic row or column orientation; that is purely a convention). The result is the Jacobian matrix:
$$
\frac{\partial \mathbf y}{\partial \mathbf x} =\left[
\begin{matrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} &\cdots& \frac{\partial y_1}{\partial x_n}\\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} &\cdots& \frac{\partial y_2}{\partial x_n}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} &\cdots& \frac{\partial y_m}{\partial x_n}\\
\end{matrix}
\right]
$$
If the numerator vector $\mathbf y$ degenerates to a scalar $y$:
$$
\frac{\partial y}{\partial \mathbf x} =\left[
\begin{matrix}
\frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} &\cdots& \frac{\partial y}{\partial x_n}
\end{matrix}
\right]
$$
If the denominator vector $\mathbf x$ degenerates to a scalar $x$:
$$
\frac{\partial \mathbf y}{\partial x} =\left[
\begin{matrix}
\frac{\partial y_1}{\partial x} \\
\frac{\partial y_2}{\partial x} \\
\vdots \\
\frac{\partial y_m}{\partial x} \\
\end{matrix}
\right]
$$
The following case exists only in numerator layout: the numerator is a matrix $Y$ and the denominator is a scalar $x$:
$$
\frac{\partial Y}{\partial x} =\left[
\begin{matrix}
\frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} &\cdots& \frac{\partial y_{1n}}{\partial x}\\
\frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} &\cdots& \frac{\partial y_{2n}}{\partial x}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y_{m1}}{\partial x} & \frac{\partial y_{m2}}{\partial x} &\cdots& \frac{\partial y_{mn}}{\partial x}\\
\end{matrix}
\right]
$$
- Denominator layout
In
$$
\frac{\partial \mathbf y}{\partial \mathbf x}
$$
treat the numerator vector $\mathbf y$ as a row vector and the denominator vector $\mathbf x$ as a column vector (again, a vector on its own has no intrinsic row or column orientation; that is purely a convention). The result is:
$$
\frac{\partial \mathbf y}{\partial \mathbf x} =\left[
\begin{matrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} &\cdots& \frac{\partial y_m}{\partial x_1}\\
\frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} &\cdots& \frac{\partial y_m}{\partial x_2}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} &\cdots& \frac{\partial y_m}{\partial x_n}\\
\end{matrix}
\right]
$$
If the numerator vector $\mathbf y$ degenerates to a scalar $y$:
$$
\frac{\partial y}{\partial \mathbf x} =\left[
\begin{matrix}
\frac{\partial y}{\partial x_1} \\
\frac{\partial y}{\partial x_2} \\
\vdots \\
\frac{\partial y}{\partial x_n} \\
\end{matrix}
\right]
$$
If the denominator vector $\mathbf x$ degenerates to a scalar $x$:
$$
\frac{\partial \mathbf y}{\partial x} =\left[
\begin{matrix}
\frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} &\cdots& \frac{\partial y_m}{\partial x}
\end{matrix}
\right]
$$
The following case exists only in denominator layout: the numerator is a scalar $y$ and the denominator is a matrix $X$:
$$
\frac{\partial y}{\partial X} =\left[
\begin{matrix}
\frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{12}} &\cdots& \frac{\partial y}{\partial x_{1n}}\\
\frac{\partial y}{\partial x_{21}} & \frac{\partial y}{\partial x_{22}} &\cdots& \frac{\partial y}{\partial x_{2n}}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y}{\partial x_{m1}} & \frac{\partial y}{\partial x_{m2}} &\cdots& \frac{\partial y}{\partial x_{mn}}\\
\end{matrix}
\right]
$$
This denominator-layout form is exactly the gradient introduced in Chapter 4. Comparing the two, numerator layout and denominator layout differ in form only by a transpose.
A mnemonic for the two layouts: whichever part names the layout is the column, and whichever part names the layout stays fixed.
For example, in numerator layout the numerator is the column (the numerator is treated as a column vector), and the numerator stays fixed (all entries in any one row of the resulting matrix share the same numerator).
In practice, however, $\mathbf x$ and $\mathbf y$ are declared up front to be column vectors or row vectors (below, vectors are column vectors by default). Then:
Numerator layout:
$$
\frac{\partial \mathbf x}{\partial \mathbf x^T} =\left[
\begin{matrix}
1 & 0 &\cdots& 0\\
0 & 1&\cdots& 0\\
\vdots & \vdots & \ddots & \vdots\\
0 &0 &\cdots& 1\\
\end{matrix}
\right]=I
$$
Denominator layout:
$$
\frac{\partial \mathbf x^T}{\partial \mathbf x} =\left[
\begin{matrix}
1 & 0 &\cdots& 0\\
0 & 1&\cdots& 0\\
\vdots & \vdots & \ddots & \vdots\\
0 &0 &\cdots& 1\\
\end{matrix}
\right]=I
$$
(Note that column-by-column and row-by-row derivatives are not discussed here.) Writing this down happens to resolve something that had puzzled me for a while:
$$
\mathbf y = \mathbf x^TA
$$
where $\mathbf x$ is an $n \times 1$ column vector and $A$ is an $n \times m$ matrix, so $\mathbf y$ is clearly a $1 \times m$ row vector. We then obtain (denominator layout):
$$
\begin{aligned}
\frac{\partial \mathbf y}{\partial \mathbf x}&=\frac{\partial \mathbf x^TA}{\partial \mathbf x}\\
&=\frac{\partial {\left[
\begin{matrix}
\sum_{i=1}^na_{i1}x_i& \sum_{i=1}^na_{i2}x_i& \cdots & \sum_{i=1}^na_{im}x_i
\end{matrix}
\right]}}{\partial {\left[
\begin{matrix}
x_1\\ x_2\\ \vdots \\ x_n\\
\end{matrix}
\right]}}\\
&=\left[
\begin{matrix}
a_{11} & a_{12} &\cdots& a_{1m}\\
a_{21} & a_{22}&\cdots& a_{2m}\\
\vdots & \vdots & \ddots & \vdots\\
a_{n1} &a_{n2} &\cdots& a_{nm}\\
\end{matrix}
\right] \\
&=A
\end{aligned}
$$
I then wondered whether the derivative of the transpose $\mathbf y^T$ with respect to $\mathbf x$ was related to this result in some way. But $\mathbf y^T$ is a column vector and $\mathbf x$ is also a column vector, and the derivative of a column vector with respect to a column vector is again a column vector (namely the vectorization $vec(A)$ of $A$), which changes the form of the matrix $A$. So we should instead write (numerator layout):
$$
\begin{aligned}
\frac{\partial \mathbf y^T}{\partial \mathbf x^T}&=\frac{\partial A^T\mathbf x}{\partial \mathbf x^T}\\
&=\frac{\partial {\left[
\begin{matrix}
\sum_{i=1}^na_{i1}x_i\\ \sum_{i=1}^na_{i2}x_i\\ \vdots \\ \sum_{i=1}^na_{im}x_i
\end{matrix}
\right]}}{\partial {\left[
\begin{matrix}
x_1& x_2& \cdots & x_n
\end{matrix}
\right]}}\\
&=\left[
\begin{matrix}
a_{11} & a_{21} &\cdots& a_{n1}\\
a_{12} & a_{22}&\cdots& a_{n2}\\
\vdots & \vdots & \ddots & \vdots\\
a_{1m} &a_{2m} &\cdots& a_{nm}\\
\end{matrix}
\right]\\
&=A^T
\end{aligned}
$$
We can see that numerator layout and denominator layout differ only by a transpose, that is,
$$
\left(\frac{\partial \mathbf y}{\partial \mathbf x}\right)^T=\frac{\partial \mathbf y^T}{\partial \mathbf x^T}
$$
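The transpose relationship between the two layouts can be made concrete with a small sketch. Here the finite-difference helper `jacobian` is written to return the numerator layout (rows indexed by the entries of $\mathbf y$), and the denominator layout is obtained by transposing it. The helper names and the sample matrix are assumptions made for illustration only.

```python
def jacobian(f, x, h=1e-6):
    """Central-difference Jacobian in numerator layout: J[i][j] = d f_i / d x_j."""
    columns = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        columns.append([(p - m) / (2 * h) for p, m in zip(f(x_plus), f(x_minus))])
    return [list(row) for row in zip(*columns)]


def transpose(M):
    """Return the transpose of a list-of-rows matrix."""
    return [list(row) for row in zip(*M)]


A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]     # n = 3 rows, m = 2 columns


def y_of_x(x):
    """Entries of y = x^T A, returned as a flat list of length m."""
    return [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(len(A[0]))]


x0 = [0.2, -0.4, 1.0]
numerator_layout = jacobian(y_of_x, x0)            # m x n, expected to match A^T
denominator_layout = transpose(numerator_layout)   # n x m, expected to match A

err_num = max(abs(numerator_layout[j][i] - A[i][j])
              for i in range(3) for j in range(2))
err_den = max(abs(denominator_layout[i][j] - A[i][j])
              for i in range(3) for j in range(2))
print("numerator layout vs A^T:", err_num)
print("denominator layout vs A:", err_den)
```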
What happens if $\mathbf y$ degenerates to a scalar $y$? Let
$$
y=\mathbf x^T \mathbf a
$$
where $\mathbf x$ is an $n \times 1$ column vector and $\mathbf a$ is an $n \times 1$ column vector, so $y$ is clearly a scalar. Then (denominator layout, since the denominator is a column vector and the result is a column vector):
$$
\begin{aligned}
\frac{\partial y}{\partial \mathbf x}&=\frac{\partial \mathbf x^T\mathbf a}{\partial \mathbf x}\\
&=\frac{\partial {\left( \sum^n_{i=1}a_{i}x_i \right) }}{\partial {\left[
\begin{matrix}
x_1\\ x_2\\ \vdots \\ x_n
\end{matrix}
\right]}}\\
&=\left[
\begin{matrix}
a_{1} \\ a_{2} \\ \vdots \\ a_{n} \\
\end{matrix}
\right]\\
&=\mathbf a
\end{aligned}
$$
Since $y$ is a scalar, we have:
$$
y=(y)^T=\mathbf a^T \mathbf x
$$
and therefore:
$$
\frac{\partial \mathbf a^T\mathbf x}{\partial \mathbf x}=\mathbf a
$$
What if the scalar $y^T$ is differentiated with respect to $\mathbf x^T$ instead (numerator layout, since the denominator is a row vector and the result is a row vector)?
$$
\begin{aligned}
\frac{\partial y^T}{\partial \mathbf x^T}&=\frac{\partial \mathbf a^T\mathbf x}{\partial \mathbf x^T}\\
&=\frac{\partial {\left( \sum^n_{i=1}a_{i}x_i \right) }}{\partial {\left[
\begin{matrix}
x_1& x_2& \cdots & x_n
\end{matrix}
\right]}}\\
&=\left[
\begin{matrix}
a_{1} & a_{2} & \cdots & a_{n}
\end{matrix}
\right]\\
&=\mathbf a^T
\end{aligned}
$$
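Proposition 5:
Suppose the matrix $A$ in Proposition 4 is symmetric, with
$$
\alpha=\mathbf x^TA\mathbf x
$$
where $\mathbf x$ is an $n\times 1$ column vector and $A$ is an $n \times n$ matrix that does not depend on $\mathbf x$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf x}=2\mathbf x^TA
$$
Proof: this follows from Proposition 4, since $A^T=A$ gives $\mathbf x^T(A+A^T)=2\mathbf x^TA$.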
Proposition 6:
Let the scalar $\alpha$ be
$$
\alpha=\mathbf y^T\mathbf x
$$
where $\mathbf x$ is an $n\times 1$ column vector, $\mathbf y$ is an $n\times 1$ column vector, and both $\mathbf x$ and $\mathbf y$ are functions of a vector $\mathbf z$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf z}=\mathbf x^T\frac {\partial \mathbf y}{\partial \mathbf z}+\mathbf y^T\frac {\partial \mathbf x}{\partial \mathbf z}
$$
Proof:
$$
\alpha=\sum^n_{i=1}x_iy_i
$$
Differentiating with respect to the $k$-th element of $\mathbf z$:
$$
\frac {\partial \alpha}{\partial z_k}=\sum^n_{i=1} \left(x_i\frac {\partial y_i}{\partial z_k}+y_i\frac {\partial x_i}{\partial z_k}\right)
$$
Therefore:
$$
\frac {\partial \alpha}{\partial \mathbf z}=
\frac {\partial \alpha}{\partial \mathbf y}\frac {\partial \mathbf y}{\partial \mathbf z}+\frac {\partial \alpha}{\partial \mathbf x}\frac {\partial \mathbf x}{\partial \mathbf z}=\mathbf x^T\frac {\partial \mathbf y}{\partial \mathbf z}+\mathbf y^T\frac {\partial \mathbf x}{\partial \mathbf z}
$$
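The product rule of Proposition 6 can also be checked numerically. In the sketch below the maps `x_of_z` and `y_of_z`, their analytic Jacobians, and the evaluation point `z0` are hypothetical examples. The finite-difference gradient of $\alpha(\mathbf z)=\mathbf y^T\mathbf x$ is compared with $\mathbf x^T\,\partial\mathbf y/\partial\mathbf z+\mathbf y^T\,\partial\mathbf x/\partial\mathbf z$.

```python
def gradient(f, v, h=1e-6):
    """Central-difference partial derivatives of a scalar function f at v."""
    grad = []
    for j in range(len(v)):
        v_plus, v_minus = list(v), list(v)
        v_plus[j] += h
        v_minus[j] -= h
        grad.append((f(v_plus) - f(v_minus)) / (2 * h))
    return grad


def x_of_z(z):
    """Hypothetical map x(z) from R^2 to R^2."""
    z1, z2 = z
    return [z1 * z1, z2]


def y_of_z(z):
    """Hypothetical map y(z) from R^2 to R^2."""
    z1, z2 = z
    return [z1 * z2, z1 + z2]


def dx_dz(z):
    """Analytic Jacobian of x_of_z (numerator layout)."""
    z1, z2 = z
    return [[2 * z1, 0.0],
            [0.0, 1.0]]


def dy_dz(z):
    """Analytic Jacobian of y_of_z (numerator layout)."""
    z1, z2 = z
    return [[z2, z1],
            [1.0, 1.0]]


def row_times_matrix(v, M):
    """Return the row vector v^T M as a flat list."""
    return [sum(v[i] * M[i][k] for i in range(len(v))) for k in range(len(M[0]))]


def alpha(z):
    """alpha(z) = y(z)^T x(z)."""
    return sum(x_i * y_i for x_i, y_i in zip(x_of_z(z), y_of_z(z)))


z0 = [0.8, -1.2]
numeric = gradient(alpha, z0)
x0, y0 = x_of_z(z0), y_of_z(z0)
formula = [a + b for a, b in zip(row_times_matrix(x0, dy_dz(z0)),
                                 row_times_matrix(y0, dx_dz(z0)))]
max_err = max(abs(a - b) for a, b in zip(numeric, formula))
print("largest deviation:", max_err)
```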
Proposition 7:
Let the scalar $\alpha$ be
$$
\alpha=\mathbf x^T\mathbf x
$$
where $\mathbf x$ is an $n\times 1$ column vector and $\mathbf x$ is a function of a vector $\mathbf z$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf z}=2\mathbf x^T\frac {\partial \mathbf x}{\partial \mathbf z}
$$
Proof: this follows from Proposition 6 with $\mathbf y=\mathbf x$.
Proposition 8:
Let the scalar $\alpha$ be
$$
\alpha=\mathbf y^TA\mathbf x
$$
where $\mathbf x$ is an $n\times 1$ column vector, $\mathbf y$ is an $m\times 1$ column vector, $A$ is an $m \times n$ matrix that does not depend on $\mathbf z$, and $\mathbf x$ and $\mathbf y$ are functions of a vector $\mathbf z$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf z}=\mathbf x^TA^T\frac {\partial \mathbf y}{\partial \mathbf z}+\mathbf y^TA\frac {\partial \mathbf x}{\partial \mathbf z}
$$
Proof: define
$$
\mathbf w^T=\mathbf y^TA
$$
so that $\alpha$ can be written as
$$
\alpha=\mathbf w^T \mathbf x
$$
By Proposition 6:
$$
\frac {\partial \alpha}{\partial \mathbf z}=\mathbf x^T\frac {\partial \mathbf w}{\partial \mathbf z}+\mathbf w^T\frac {\partial \mathbf x}{\partial \mathbf z}
$$
Substituting $\mathbf w=A^T\mathbf y$ back in and applying Proposition 2 to $\partial (A^T\mathbf y)/\partial \mathbf z$:
$$
\begin{aligned}
\frac {\partial \alpha}{\partial \mathbf z}&=\mathbf x^T\frac {\partial ( A^T \mathbf y)}{\partial \mathbf z}+\mathbf y^TA\frac {\partial \mathbf x}{\partial \mathbf z}\\
&=\mathbf x^TA^T\frac {\partial \mathbf y}{\partial \mathbf z}+\mathbf y^TA\frac {\partial \mathbf x}{\partial \mathbf z}\\
\end{aligned}
$$
Proposition 9:
Let the scalar $\alpha$ be
$$
\alpha=\mathbf x^TA\mathbf x
$$
where $\mathbf x$ is an $n\times 1$ column vector, $A$ is an $n \times n$ matrix that does not depend on $\mathbf z$, and $\mathbf x$ is a function of a vector $\mathbf z$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf z}=\mathbf x^T(A+A^T)\frac {\partial \mathbf x}{\partial \mathbf z}
$$
Proof: this follows from Proposition 8 with $\mathbf y=\mathbf x$.
Proposition 10:
Let the scalar $\alpha$ be
$$
\alpha=\mathbf x^TA\mathbf x
$$
where $\mathbf x$ is an $n\times 1$ column vector, $A$ is a symmetric $n \times n$ matrix that does not depend on $\mathbf z$, and $\mathbf x$ is a function of a vector $\mathbf z$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf z}=2\mathbf x^TA\frac {\partial \mathbf x}{\partial \mathbf z}
$$
Proof: this follows from Proposition 9, since $A^T=A$.
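Proposition 11:
If $A$ is an $m\times m$ invertible matrix whose elements are functions of a scalar $\alpha$, then the derivative of $A^{-1}$ with respect to $\alpha$ is:
$$
\frac {\partial A^{-1}}{\partial \alpha}=-A^{-1}\frac {\partial A}{\partial \alpha}A^{-1}
$$
Proof: by definition,
$$
A^{-1}A=I
$$
Differentiating both sides with respect to the scalar $\alpha$ using the product rule:
$$
A^{-1}\frac {\partial A}{\partial \alpha}+\frac {\partial A^{-1}}{\partial \alpha}A=0
$$
Moving the first term to the right-hand side and multiplying on the right by $A^{-1}$:
$$
\frac {\partial A^{-1}}{\partial \alpha}=-A^{-1}\frac {\partial A}{\partial \alpha}A^{-1}
$$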
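To see the inverse-derivative formula at work, the sketch below differentiates $A^{-1}(\alpha)$ by central differences for a small $2 \times 2$ example and compares the result with $-A^{-1}\,(\partial A/\partial \alpha)\,A^{-1}$. The matrix function `A_of`, its derivative `dA_of`, the closed-form helper `inv2`, and the evaluation point `alpha0` are all assumptions chosen for illustration.

```python
import math


def inv2(M):
    """Inverse of a 2 x 2 matrix via the closed-form adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det],
            [-c / det, a / det]]


def matmul(A, B):
    """Return the matrix product A B for list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def A_of(alpha):
    """Hypothetical matrix-valued function A(alpha), invertible near alpha = 0.5."""
    return [[2.0 + alpha, alpha * alpha],
            [1.0, 3.0 + math.sin(alpha)]]


def dA_of(alpha):
    """Analytic entrywise derivative dA/dalpha of A_of."""
    return [[1.0, 2.0 * alpha],
            [0.0, math.cos(alpha)]]


alpha0, h = 0.5, 1e-6

# Finite-difference derivative of the inverse, entry by entry.
inv_plus, inv_minus = inv2(A_of(alpha0 + h)), inv2(A_of(alpha0 - h))
numeric = [[(inv_plus[i][j] - inv_minus[i][j]) / (2 * h) for j in range(2)]
           for i in range(2)]

# Closed form from Proposition 11: -A^{-1} (dA/dalpha) A^{-1}.
A_inv = inv2(A_of(alpha0))
product = matmul(matmul(A_inv, dA_of(alpha0)), A_inv)
formula = [[-product[i][j] for j in range(2)] for i in range(2)]

max_err = max(abs(numeric[i][j] - formula[i][j])
              for i in range(2) for j in range(2))
print("largest deviation:", max_err)
```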