Matrix Calculus Supplement
Convention 1:
Let $\mathbf y=f(\mathbf x)$, where $\mathbf y$ is a vector with $m$ elements and $\mathbf x$ is a vector with $n$ elements. Then:
$$
\frac{\partial \mathbf y}{\partial \mathbf x} =\left[ \begin{matrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} &\cdots& \frac{\partial y_1}{\partial x_n}\\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} &\cdots& \frac{\partial y_2}{\partial x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} &\cdots& \frac{\partial y_m}{\partial x_n} \end{matrix} \right]
$$
This $m \times n$ matrix collects the partial derivatives of $\mathbf y$ with respect to $\mathbf x$: its $(i,j)$ entry is $\partial y_i/\partial x_j$. It is called the Jacobian matrix.
Note: if $\mathbf x$ is a scalar, the Jacobian matrix is an $m \times 1$ column vector. If $\mathbf y$ is a scalar, the Jacobian matrix is a $1 \times n$ row vector.
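To make the shape convention concrete, here is a minimal sketch that approximates the Jacobian by central differences. The helper `numerical_jacobian` and the map `example_f` are hypothetical names chosen for illustration; they do not come from the original text.

```python
def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian: entry [i][j] approximates dy_i/dx_j."""
    m = len(f(x))
    n = len(x)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += h
        x_minus[j] -= h
        y_plus = f(x_plus)
        y_minus = f(x_minus)
        for i in range(m):
            jac[i][j] = (y_plus[i] - y_minus[i]) / (2 * h)
    return jac


def example_f(x):
    # Hypothetical map from R^2 to R^3, chosen only for illustration.
    return [x[0] * x[1], x[0] + x[1], x[1] ** 2]


J = numerical_jacobian(example_f, [2.0, 3.0])
# J has m = 3 rows and n = 2 columns. Analytically the Jacobian is
# [[x2, x1], [1, 1], [0, 2*x2]], i.e. [[3, 2], [1, 1], [0, 6]] at this point.
```

The result has one row per output element and one column per input element, which is exactly the $m \times n$ arrangement of Convention 1.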
Proposition 1:
Let $\mathbf y=A\mathbf x$, where $\mathbf y$ is an $m \times 1$ column vector, $\mathbf x$ is an $n\times 1$ column vector, and $A$ is an $m \times n$ matrix that does not depend on $\mathbf x$. Then
$$
\frac{\partial \mathbf y}{\partial \mathbf x}=A
$$
Proof:
The $i$-th element of $\mathbf y$ is
$$
y_i=\sum^n_{k=1}a_{ik}x_k
$$
It follows immediately that
$$
\frac{\partial y_i}{\partial x_j}=a_{ij}
$$
for all $i=1,2,\cdots,m,\ j=1,2,\cdots,n$. Hence
$$
\frac{\partial \mathbf y}{\partial \mathbf x}=A
$$
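As a quick numerical sanity check of Proposition 1, the sketch below differentiates $\mathbf y = A\mathbf x$ by central differences and compares the result with $A$ itself. The matrix, the evaluation point, and the helper names are arbitrary illustrative choices.

```python
def matvec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xk for a, xk in zip(row, x)) for row in A]


def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian: entry [i][j] approximates dy_i/dx_j."""
    m = len(f(x))
    jac = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[j] += h
        lo[j] -= h
        f_hi, f_lo = f(hi), f(lo)
        for i in range(m):
            jac[i][j] = (f_hi[i] - f_lo[i]) / (2 * h)
    return jac


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # m = 2, n = 3
x0 = [0.5, -1.0, 2.0]                    # arbitrary evaluation point
J = numerical_jacobian(lambda x: matvec(A, x), x0)

# Largest entrywise gap between the numerical Jacobian and A.
max_error = max(abs(J[i][j] - A[i][j]) for i in range(2) for j in range(3))
```

Because the map is linear, the Jacobian is the same at every point, so the choice of `x0` does not matter.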
Proposition 2:
Let $\mathbf y=A\mathbf x$, where $\mathbf y$ is an $m \times 1$ column vector, $\mathbf x$ is an $n\times 1$ column vector, and $A$ is an $m \times n$ matrix that does not depend on $\mathbf x$. Suppose further that $\mathbf x$ is a function of a vector $\mathbf z$, and that $A$ does not depend on $\mathbf z$ either. Then
$$
\frac{\partial \mathbf y}{\partial \mathbf z}=A\frac{\partial \mathbf x}{\partial \mathbf z}
$$
Proof:
The $i$-th element of $\mathbf y$ is
$$
y_i=\sum^n_{k=1}a_{ik}x_k
$$
Differentiating with respect to $z_j$ gives
$$
\frac{\partial y_i}{\partial z_j}=\sum^n_{k=1}a_{ik}\frac{\partial x_k}{\partial z_j}
$$
This is exactly the $(i,j)$ element of $A\,{\partial \mathbf x}/{\partial \mathbf z}$, so
$$
\frac {\partial \mathbf y}{\partial \mathbf z}= \frac {\partial \mathbf y}{\partial \mathbf x} \frac {\partial \mathbf x}{\partial \mathbf z}= A\frac {\partial \mathbf x}{\partial \mathbf z}
$$
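The chain rule $\partial \mathbf y/\partial \mathbf z = A\,\partial \mathbf x/\partial \mathbf z$ can be checked numerically. In this sketch, `x_of_z` is a hypothetical smooth map invented for the example, and its Jacobian `dx_dz` is written out by hand so that the two sides are computed independently.

```python
import math


def matvec(A, x):
    return [sum(a * xk for a, xk in zip(row, x)) for row in A]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def numerical_jacobian(f, v, h=1e-6):
    """Central-difference Jacobian: entry [i][j] approximates df_i/dv_j."""
    m = len(f(v))
    jac = [[0.0] * len(v) for _ in range(m)]
    for j in range(len(v)):
        hi = list(v)
        lo = list(v)
        hi[j] += h
        lo[j] -= h
        f_hi, f_lo = f(hi), f(lo)
        for i in range(m):
            jac[i][j] = (f_hi[i] - f_lo[i]) / (2 * h)
    return jac


A = [[1.0, -2.0, 0.5], [3.0, 0.0, 1.0]]   # m = 2, n = 3


def x_of_z(z):
    # Hypothetical smooth map from R^2 to R^3.
    return [z[0] * z[1], math.sin(z[0]), z[1] ** 2]


def dx_dz(z):
    # Hand-derived 3 x 2 Jacobian of x_of_z.
    return [[z[1], z[0]], [math.cos(z[0]), 0.0], [0.0, 2 * z[1]]]


z0 = [0.7, -1.3]
lhs = numerical_jacobian(lambda z: matvec(A, x_of_z(z)), z0)   # dy/dz, 2 x 2
rhs = matmul(A, dx_dz(z0))                                     # A dx/dz, 2 x 2
max_error = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
```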
Proposition 3:
Let the scalar $\alpha$ be defined as
$$
\alpha=\mathbf y^TA \mathbf x
$$
where $\mathbf y$ is an $m \times 1$ column vector, $\mathbf x$ is an $n\times 1$ column vector, and $A$ is an $m \times n$ matrix that does not depend on $\mathbf x$ or $\mathbf y$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf x}=\mathbf y^TA
$$
and:
$$
\frac {\partial \alpha}{\partial \mathbf y}=\mathbf x^TA^T
$$
Proof:
Define
$$
\mathbf w^T=\mathbf y^TA
$$
so that $\alpha$ can be written as
$$
\alpha=\mathbf w^T \mathbf x
$$
By Proposition 1,
$$
\frac {\partial \alpha}{\partial \mathbf x}=\mathbf w^T=\mathbf y^TA
$$
which is the first result. Since $\alpha$ is a scalar, it equals its own transpose:
$$
\alpha=\alpha^T=\mathbf x^TA^T\mathbf y
$$
Applying Proposition 1 again gives
$$
\frac {\partial \alpha}{\partial \mathbf y}=\mathbf x^TA^T
$$
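A small worked case may help here. The $2 \times 2$ matrix below is an arbitrary choice for illustration; expanding $\alpha$ entry by entry reproduces both results of Proposition 3.

```latex
\[
A=\begin{bmatrix}1 & 2\\ 3 & 4\end{bmatrix},\qquad
\alpha=\mathbf y^TA\mathbf x = y_1x_1 + 2y_1x_2 + 3y_2x_1 + 4y_2x_2
\]
\[
\frac{\partial \alpha}{\partial \mathbf x}
=\begin{bmatrix} y_1+3y_2 & 2y_1+4y_2 \end{bmatrix}
=\mathbf y^TA,
\qquad
\frac{\partial \alpha}{\partial \mathbf y}
=\begin{bmatrix} x_1+2x_2 & 3x_1+4x_2 \end{bmatrix}
=\mathbf x^TA^T
\]
```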
Proposition 4:
Consider the special case in which the scalar $\alpha$ is a quadratic form:
$$
\alpha=\mathbf x^TA\mathbf x
$$
where $\mathbf x$ is an $n\times 1$ column vector and $A$ is an $n \times n$ matrix that does not depend on $\mathbf x$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf x}=\mathbf x^T(A+A^T)
$$
Proof:
By definition,
$$
\alpha=\sum^n_{j=1}\sum^n_{i=1}a_{ij}x_ix_j
$$
Differentiating with respect to the $k$-th element of $\mathbf x$:
$$
\frac{\partial \alpha}{\partial x_k}=\sum^n_{j=1}a_{kj}x_j+\sum^n_{i=1}a_{ik}x_i
$$
The first sum is the $k$-th entry of $\mathbf x^TA^T$ and the second is the $k$-th entry of $\mathbf x^TA$, so:
$$
\frac{\partial \alpha}{\partial \mathbf x}=\mathbf x^TA^T+\mathbf x^TA=\mathbf x^T(A^T+A)
$$
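The gradient of the quadratic form can be verified numerically. This sketch uses a deliberately non-symmetric matrix, so that dropping either $A$ or $A^T$ from the formula would show up as an error. All names and values are illustrative assumptions.

```python
def quadratic_form(A, x):
    """alpha = x^T A x, written as the double sum from the proof."""
    n = len(x)
    return sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def numerical_gradient(f, x, h=1e-6):
    """Central differences; returns the row [df/dx_1, ..., df/dx_n]."""
    grad = []
    for k in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[k] += h
        lo[k] -= h
        grad.append((f(hi) - f(lo)) / (2 * h))
    return grad


A = [[1.0, 2.0, 0.0],
     [-1.0, 3.0, 4.0],
     [0.5, 0.0, -2.0]]          # deliberately not symmetric
x0 = [1.0, -2.0, 0.5]
n = len(x0)

# Closed form: the k-th entry of the row vector x^T (A + A^T).
closed_form = [sum(x0[i] * (A[i][k] + A[k][i]) for i in range(n)) for k in range(n)]
numeric = numerical_gradient(lambda x: quadratic_form(A, x), x0)
max_error = max(abs(c - g) for c, g in zip(closed_form, numeric))
```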
Note: this result differs slightly from the one in Chapter 4, which states:
$$
\frac{\partial \alpha}{\partial \mathbf x}=(A^T+A) \mathbf x
$$
The two results differ only by a transpose:
$$
((A^T+A) \mathbf x)^T=\mathbf x^T(A+A^T)=\mathbf x^T(A^T+A)
$$
The derivative here is a vector, and transposing a vector only changes whether its elements are laid out as a column or as a row ($(A^T+A) \mathbf x$ is a column vector, $\mathbf x^T(A+A^T)$ is a row vector), so the two carry the same information. But why does the difference arise at all?
From the context, Chapter 4 defines the derivative with respect to a vector or matrix so that the result takes the shape of the variable:
$$
\nabla_Af(A) =\left[ \begin{matrix} \frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{12}} &\cdots& \frac{\partial f(A)}{\partial A_{1n}}\\ \frac{\partial f(A)}{\partial A_{21}} & \frac{\partial f(A)}{\partial A_{22}} &\cdots& \frac{\partial f(A)}{\partial A_{2n}}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial f(A)}{\partial A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} &\cdots& \frac{\partial f(A)}{\partial A_{mn}} \end{matrix} \right] \in \mathbb{R}^{m \times n}
$$
In this supplement, by contrast, the derivative always follows the Jacobian form:
$$
\frac{\partial \mathbf y}{\partial \mathbf x} =\left[ \begin{matrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} &\cdots& \frac{\partial y_1}{\partial x_n}\\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} &\cdots& \frac{\partial y_2}{\partial x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} &\cdots& \frac{\partial y_m}{\partial x_n} \end{matrix} \right]
$$
So under the Chapter 4 definition, the derivative of a scalar with respect to a column vector is a column vector, whereas under the Jacobian definition used in this supplement it is a row vector. That is where the transpose comes from.
In fact, matrix calculus has no single agreed-upon notation for matrix derivatives. The conventions in use fall broadly into two layouts:
- Numerator layout
- Denominator layout
- Numerator layout
In
$$
\frac{\partial \mathbf y}{\partial \mathbf x}
$$
treat the numerator vector $\mathbf y$ as a column vector and the denominator vector $\mathbf x$ as a row vector (a vector on its own is neither a row nor a column; the orientation is only a convention). The result is the Jacobian matrix:
$$
\frac{\partial \mathbf y}{\partial \mathbf x} =\left[ \begin{matrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} &\cdots& \frac{\partial y_1}{\partial x_n}\\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} &\cdots& \frac{\partial y_2}{\partial x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} &\cdots& \frac{\partial y_m}{\partial x_n} \end{matrix} \right]
$$
If the numerator vector $\mathbf y$ degenerates to a scalar $y$:
$$
\frac{\partial y}{\partial \mathbf x} =\left[ \begin{matrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} &\cdots& \frac{\partial y}{\partial x_n} \end{matrix} \right]
$$
If the denominator vector $\mathbf x$ degenerates to a scalar $x$:
$$
\frac{\partial \mathbf y}{\partial x} =\left[ \begin{matrix} \frac{\partial y_1}{\partial x} \\ \frac{\partial y_2}{\partial x} \\ \vdots \\ \frac{\partial y_m}{\partial x} \end{matrix} \right]
$$
The following case exists only in numerator layout: the numerator is a matrix $Y$ and the denominator is a scalar $x$:
$$
\frac{\partial Y}{\partial x} =\left[ \begin{matrix} \frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} &\cdots& \frac{\partial y_{1n}}{\partial x}\\ \frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} &\cdots& \frac{\partial y_{2n}}{\partial x}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y_{m1}}{\partial x} & \frac{\partial y_{m2}}{\partial x} &\cdots& \frac{\partial y_{mn}}{\partial x} \end{matrix} \right]
$$
- Denominator layout
In
$$
\frac{\partial \mathbf y}{\partial \mathbf x}
$$
treat the numerator vector $\mathbf y$ as a row vector and the denominator vector $\mathbf x$ as a column vector (again, the orientation is only a convention). The result is:
$$
\frac{\partial \mathbf y}{\partial \mathbf x} =\left[ \begin{matrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} &\cdots& \frac{\partial y_m}{\partial x_1}\\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} &\cdots& \frac{\partial y_m}{\partial x_2}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} &\cdots& \frac{\partial y_m}{\partial x_n} \end{matrix} \right]
$$
If the numerator vector $\mathbf y$ degenerates to a scalar $y$:
$$
\frac{\partial y}{\partial \mathbf x} =\left[ \begin{matrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{matrix} \right]
$$
If the denominator vector $\mathbf x$ degenerates to a scalar $x$:
$$
\frac{\partial \mathbf y}{\partial x} =\left[ \begin{matrix} \frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} &\cdots& \frac{\partial y_m}{\partial x} \end{matrix} \right]
$$
The following case exists only in denominator layout: the numerator is a scalar $y$ and the denominator is a matrix $X$:
$$
\frac{\partial y}{\partial X} =\left[ \begin{matrix} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{12}} &\cdots& \frac{\partial y}{\partial x_{1n}}\\ \frac{\partial y}{\partial x_{21}} & \frac{\partial y}{\partial x_{22}} &\cdots& \frac{\partial y}{\partial x_{2n}}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y}{\partial x_{m1}} & \frac{\partial y}{\partial x_{m2}} &\cdots& \frac{\partial y}{\partial x_{mn}} \end{matrix} \right]
$$
This denominator layout is exactly the gradient described in Chapter 4. Comparing the two, numerator layout and denominator layout differ only by a transpose.
The two layouts can be summarized by a mnemonic: whichever part names the layout is treated as a column, and whichever part names the layout stays fixed along each row.
For example, in numerator layout the numerator is treated as a column vector, and the numerator is the same in every entry of a given row of the resulting matrix.
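The relationship between the two layouts can be seen directly from the shapes. This sketch builds the numerator-layout result by central differences and obtains the denominator-layout result by transposing it. The function `f` and both helper names are hypothetical, introduced only for this illustration.

```python
def numerator_layout(f, x, h=1e-6):
    """m x n matrix: row i holds the derivatives of y_i (the Jacobian)."""
    m = len(f(x))
    jac = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[j] += h
        lo[j] -= h
        f_hi, f_lo = f(hi), f(lo)
        for i in range(m):
            jac[i][j] = (f_hi[i] - f_lo[i]) / (2 * h)
    return jac


def denominator_layout(f, x, h=1e-6):
    """n x m matrix: row j holds the derivatives with respect to x_j."""
    return [list(col) for col in zip(*numerator_layout(f, x, h))]


def f(x):
    # Hypothetical map from R^3 to R^2.
    return [x[0] * x[1], x[1] + x[2]]


x0 = [1.0, 2.0, 3.0]
num = numerator_layout(f, x0)      # 2 x 3
den = denominator_layout(f, x0)    # 3 x 2
shapes = ((len(num), len(num[0])), (len(den), len(den[0])))
```

Entry `num[i][j]` and entry `den[j][i]` hold the same partial derivative; only the arrangement differs.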
In practice, however, one fixes from the start whether $\mathbf x,\mathbf y$ are column vectors or row vectors (column vectors are assumed below). Then:
Numerator layout:
$$
\frac{\partial \mathbf x}{\partial \mathbf x^T} =\left[ \begin{matrix} 1 & 0 &\cdots& 0\\ 0 & 1&\cdots& 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 &0 &\cdots& 1 \end{matrix} \right]=I
$$
Denominator layout:
$$
\frac{\partial \mathbf x^T}{\partial \mathbf x} =\left[ \begin{matrix} 1 & 0 &\cdots& 0\\ 0 & 1&\cdots& 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 &0 &\cdots& 1 \end{matrix} \right]=I
$$
(Derivatives of a column by a column, or of a row by a row, are not discussed here.) Writing this down happened to resolve something that had puzzled me for a while. Consider:
$$
\mathbf y = \mathbf x^TA
$$
where $\mathbf x$ is an $n \times 1$ column vector and $A$ is an $n \times m$ matrix, so $\mathbf y$ is a $1 \times m$ row vector. In denominator layout we obtain:
$$
\begin{aligned} \frac{\partial \mathbf y}{\partial \mathbf x}&=\frac{\partial \mathbf x^TA}{\partial \mathbf x}\\ &=\frac{\partial {\left[ \begin{matrix} \sum_{i=1}^na_{i1}x_i& \sum_{i=1}^na_{i2}x_i& \cdots & \sum_{i=1}^na_{im}x_i \end{matrix} \right]}}{\partial {\left[ \begin{matrix} x_1\\ x_2\\ \vdots \\ x_n \end{matrix} \right]}}\\ &=\left[ \begin{matrix} a_{11} & a_{12} &\cdots& a_{1m}\\ a_{21} & a_{22}&\cdots& a_{2m}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n1} &a_{n2} &\cdots& a_{nm} \end{matrix} \right] \\&=A \end{aligned}
$$
I then wondered whether the derivative of the transpose $\mathbf y^T$ with respect to $\mathbf x$ is related to this result in some way. But $\mathbf y^T$ is a column vector and $\mathbf x$ is also a column vector, and differentiating a column vector by a column vector again gives a column vector (namely the vectorization $\operatorname{vec}(A)$ of $A$), which loses the matrix shape of $A$. So it should instead be written as follows (numerator layout):
$$
\begin{aligned} \frac{\partial \mathbf y^T}{\partial \mathbf x^T}&=\frac{\partial A^T\mathbf x}{\partial \mathbf x^T}\\ &=\frac{\partial {\left[ \begin{matrix} \sum_{i=1}^na_{i1}x_i\\ \sum_{i=1}^na_{i2}x_i\\ \vdots \\ \sum_{i=1}^na_{im}x_i \end{matrix} \right]}}{\partial {\left[ \begin{matrix} x_1& x_2& \cdots & x_n \end{matrix} \right]}}\\ &=\left[ \begin{matrix} a_{11} & a_{21} &\cdots& a_{n1}\\ a_{12} & a_{22}&\cdots& a_{n2}\\ \vdots & \vdots & \ddots & \vdots\\ a_{1m} &a_{2m} &\cdots& a_{nm} \end{matrix} \right]\\ &=A^T \end{aligned}
$$
So numerator layout and denominator layout differ only by a transpose, that is:
$$
\left(\frac{\partial \mathbf y}{\partial \mathbf x}\right)^T=\frac{\partial \mathbf y^T}{\partial \mathbf x^T}
$$
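For a concrete instance, take $n=2$ and $m=3$ with an arbitrary matrix chosen for illustration. Writing out every partial derivative shows the denominator-layout result reproducing $A$ and the numerator-layout result reproducing $A^T$.

```latex
\[
A=\begin{bmatrix}1 & 2 & 3\\ 4 & 5 & 6\end{bmatrix},\qquad
\mathbf y=\mathbf x^TA=\begin{bmatrix} x_1+4x_2 & 2x_1+5x_2 & 3x_1+6x_2 \end{bmatrix}
\]
\[
\frac{\partial \mathbf y}{\partial \mathbf x}
=\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \frac{\partial y_3}{\partial x_1}\\
\frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_3}{\partial x_2}
\end{bmatrix}
=\begin{bmatrix}1 & 2 & 3\\ 4 & 5 & 6\end{bmatrix}=A,
\qquad
\frac{\partial \mathbf y^T}{\partial \mathbf x^T}
=\begin{bmatrix}1 & 4\\ 2 & 5\\ 3 & 6\end{bmatrix}=A^T
\]
```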
What happens if $\mathbf y$ degenerates to a scalar $y$? Let
$$
y=\mathbf x^T \mathbf a
$$
where $\mathbf x$ is an $n \times 1$ column vector and $\mathbf a$ is an $n \times 1$ column vector, so $y$ is clearly a scalar. Then (denominator layout, since the result is a column vector):
$$
\begin{aligned} \frac{\partial y}{\partial \mathbf x}&=\frac{\partial \mathbf x^T\mathbf a}{\partial \mathbf x}\\ &=\frac{\partial {\left( \sum^n_{i=1}a_{i}x_i \right) }}{\partial {\left[ \begin{matrix} x_1\\ x_2\\ \vdots \\ x_n \end{matrix} \right]}}\\ &=\left[ \begin{matrix} a_{1} \\ a_{2} \\ \vdots \\ a_{n} \end{matrix} \right]\\ &=\mathbf a \end{aligned}
$$
Since $y$ is a scalar, we have:
$$
y=(y)^T=\mathbf a^T \mathbf x
$$
and therefore:
$$
\frac{\partial \mathbf a^T\mathbf x}{\partial \mathbf x}=\mathbf a
$$
What if the scalar $y^T$ is differentiated with respect to $\mathbf x^T$ (numerator layout, since the result is a row vector)?
$$
\begin{aligned} \frac{\partial y^T}{\partial \mathbf x^T}&=\frac{\partial \mathbf a^T\mathbf x}{\partial \mathbf x^T}\\ &=\frac{\partial {\left( \sum^n_{i=1}a_{i}x_i \right) }}{\partial {\left[ \begin{matrix} x_1& x_2& \cdots & x_n \end{matrix} \right]}}\\ &=\left[ \begin{matrix} a_{1} & a_{2} & \cdots & a_{n} \end{matrix} \right]\\ &=\mathbf a^T \end{aligned}
$$
Proposition 5:
Suppose the matrix $A$ in Proposition 4 is symmetric:
$$
\alpha=\mathbf x^TA\mathbf x
$$
where $\mathbf x$ is an $n\times 1$ column vector and $A$ is an $n \times n$ matrix that does not depend on $\mathbf x$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf x}=2\mathbf x^TA
$$
Proof: this follows from Proposition 4, since $A^T=A$.
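As a small worked case, take an arbitrary symmetric $2 \times 2$ matrix (chosen for illustration) and differentiate the expanded quadratic form term by term.

```latex
\[
A=\begin{bmatrix}2 & 1\\ 1 & 3\end{bmatrix},\qquad
\alpha=\mathbf x^TA\mathbf x = 2x_1^2 + 2x_1x_2 + 3x_2^2
\]
\[
\frac{\partial \alpha}{\partial \mathbf x}
=\begin{bmatrix} 4x_1+2x_2 & 2x_1+6x_2 \end{bmatrix}
=2\begin{bmatrix} x_1 & x_2 \end{bmatrix}
\begin{bmatrix}2 & 1\\ 1 & 3\end{bmatrix}
=2\mathbf x^TA
\]
```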
Proposition 6:
Let the scalar $\alpha$ be
$$
\alpha=\mathbf y^T\mathbf x
$$
where $\mathbf x$ is an $n\times 1$ column vector, $\mathbf y$ is an $n\times 1$ column vector, and both $\mathbf x$ and $\mathbf y$ are functions of a vector $\mathbf z$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf z}=\mathbf x^T\frac {\partial \mathbf y}{\partial \mathbf z}+\mathbf y^T\frac {\partial \mathbf x}{\partial \mathbf z}
$$
Proof:
$$
\alpha=\sum^n_{i=1}x_iy_i
$$
Differentiating with respect to the $k$-th element of $\mathbf z$ and applying the product rule to each term:
$$
\frac {\partial \alpha}{\partial z_k}=\sum^n_{i=1} \left(x_i\frac {\partial y_i}{\partial z_k}+y_i\frac {\partial x_i}{\partial z_k}\right)
$$
Therefore:
$$
\frac {\partial \alpha}{\partial \mathbf z}= \frac {\partial \alpha}{\partial \mathbf y}\frac {\partial \mathbf y}{\partial \mathbf z}+\frac {\partial \alpha}{\partial \mathbf x}\frac {\partial \mathbf x}{\partial \mathbf z}=\mathbf x^T\frac {\partial \mathbf y}{\partial \mathbf z}+\mathbf y^T\frac {\partial \mathbf x}{\partial \mathbf z}
$$
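The product rule for $\alpha=\mathbf y^T\mathbf x$ can be checked against a finite-difference gradient. In this sketch the maps `x_of_z` and `y_of_z` are hypothetical, and their Jacobians are written out by hand so the closed form does not rely on the numerical derivative.

```python
import math


def x_of_z(z):
    # Hypothetical map from R^2 to R^3.
    return [z[0] ** 2, z[0] * z[1], math.cos(z[1])]


def dx_dz(z):
    return [[2 * z[0], 0.0], [z[1], z[0]], [0.0, -math.sin(z[1])]]


def y_of_z(z):
    # Hypothetical map from R^2 to R^3.
    return [z[1], math.exp(z[0]), z[0] + z[1]]


def dy_dz(z):
    return [[0.0, 1.0], [math.exp(z[0]), 0.0], [1.0, 1.0]]


def row_times_matrix(row, M):
    """Row vector (length r) times an r x c matrix."""
    return [sum(row[i] * M[i][k] for i in range(len(row))) for k in range(len(M[0]))]


def alpha(z):
    return sum(xi * yi for xi, yi in zip(x_of_z(z), y_of_z(z)))


def numerical_gradient(f, z, h=1e-6):
    grad = []
    for k in range(len(z)):
        hi = list(z)
        lo = list(z)
        hi[k] += h
        lo[k] -= h
        grad.append((f(hi) - f(lo)) / (2 * h))
    return grad


z0 = [0.4, 1.1]
term_y = row_times_matrix(x_of_z(z0), dy_dz(z0))   # x^T dy/dz
term_x = row_times_matrix(y_of_z(z0), dx_dz(z0))   # y^T dx/dz
closed_form = [a + b for a, b in zip(term_y, term_x)]
numeric = numerical_gradient(alpha, z0)
max_error = max(abs(c - g) for c, g in zip(closed_form, numeric))
```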
Proposition 7:
Let the scalar $\alpha$ be
$$
\alpha=\mathbf x^T\mathbf x
$$
where $\mathbf x$ is an $n\times 1$ column vector and $\mathbf x$ is a function of a vector $\mathbf z$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf z}=2\mathbf x^T\frac {\partial \mathbf x}{\partial \mathbf z}
$$
Proof: this follows from Proposition 6 with $\mathbf y=\mathbf x$.
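A simple illustration, using a map chosen only for this example: let $z$ be a scalar and let $\mathbf x(z)$ trace the unit circle. Then $\alpha$ is constant, and the formula correctly returns a zero derivative.

```latex
\[
\mathbf x(z)=\begin{bmatrix}\cos z\\ \sin z\end{bmatrix},\qquad
\alpha=\mathbf x^T\mathbf x=\cos^2 z+\sin^2 z=1,\qquad
\frac{\partial \mathbf x}{\partial z}=\begin{bmatrix}-\sin z\\ \cos z\end{bmatrix}
\]
\[
2\mathbf x^T\frac{\partial \mathbf x}{\partial z}
=2\left(-\cos z\sin z+\sin z\cos z\right)=0=\frac{\partial \alpha}{\partial z}
\]
```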
Proposition 8:
Let the scalar $\alpha$ be
$$
\alpha=\mathbf y^TA\mathbf x
$$
where $\mathbf x$ is an $n\times 1$ column vector, $\mathbf y$ is an $m\times 1$ column vector, $A$ is an $m \times n$ matrix that does not depend on $\mathbf z$, and $\mathbf x,\mathbf y$ are functions of a vector $\mathbf z$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf z}=\mathbf x^TA^T\frac {\partial \mathbf y}{\partial \mathbf z}+\mathbf y^TA\frac {\partial \mathbf x}{\partial \mathbf z}
$$
Proof: define
$$
\mathbf w^T=\mathbf y^TA
$$
so that $\alpha$ can be written as
$$
\alpha=\mathbf w^T \mathbf x
$$
By Proposition 6,
$$
\frac {\partial \alpha}{\partial \mathbf z}=\mathbf x^T\frac {\partial \mathbf w}{\partial \mathbf z}+\mathbf w^T\frac {\partial \mathbf x}{\partial \mathbf z}
$$
Substituting $\mathbf w=A^T\mathbf y$ back in and applying Proposition 2 to $\partial (A^T\mathbf y)/\partial \mathbf z$:
$$
\begin{aligned} \frac {\partial \alpha}{\partial \mathbf z}&=\mathbf x^T\frac {\partial ( A^T \mathbf y)}{\partial \mathbf z}+\mathbf y^TA\frac {\partial \mathbf x}{\partial \mathbf z}\\ &=\mathbf x^TA^T\frac {\partial \mathbf y}{\partial \mathbf z}+\mathbf y^TA\frac {\partial \mathbf x}{\partial \mathbf z} \end{aligned}
$$
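Here is a numerical check of the bilinear chain rule with $m=2$ and $n=3$, so that the two terms involve differently shaped Jacobians. The maps, the matrix, and the helper names are all hypothetical choices for this sketch.

```python
import math

A = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]   # m = 2, n = 3, independent of z


def x_of_z(z):
    # Hypothetical map from R^2 to R^3 (n = 3).
    return [z[0], z[0] * z[1], z[1] ** 2]


def dx_dz(z):
    return [[1.0, 0.0], [z[1], z[0]], [0.0, 2 * z[1]]]


def y_of_z(z):
    # Hypothetical map from R^2 to R^2 (m = 2).
    return [math.sin(z[0]), z[0] + 2 * z[1]]


def dy_dz(z):
    return [[math.cos(z[0]), 0.0], [1.0, 2.0]]


def row_times_matrix(row, M):
    """Row vector (length r) times an r x c matrix."""
    return [sum(row[i] * M[i][k] for i in range(len(row))) for k in range(len(M[0]))]


def alpha(z):
    x, y = x_of_z(z), y_of_z(z)
    return sum(y[i] * A[i][j] * x[j] for i in range(2) for j in range(3))


def numerical_gradient(f, z, h=1e-6):
    grad = []
    for k in range(len(z)):
        hi = list(z)
        lo = list(z)
        hi[k] += h
        lo[k] -= h
        grad.append((f(hi) - f(lo)) / (2 * h))
    return grad


z0 = [0.3, -0.8]
x, y = x_of_z(z0), y_of_z(z0)
x_At = [sum(A[i][j] * x[j] for j in range(3)) for i in range(2)]   # x^T A^T, length m
y_A = row_times_matrix(y, A)                                       # y^T A, length n
closed_form = [a + b for a, b in zip(row_times_matrix(x_At, dy_dz(z0)),
                                     row_times_matrix(y_A, dx_dz(z0)))]
numeric = numerical_gradient(alpha, z0)
max_error = max(abs(c - g) for c, g in zip(closed_form, numeric))
```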
Proposition 9:
Let the scalar $\alpha$ be
$$
\alpha=\mathbf x^TA\mathbf x
$$
where $\mathbf x$ is an $n\times 1$ column vector, $A$ is an $n \times n$ matrix that does not depend on $\mathbf z$, and $\mathbf x$ is a function of a vector $\mathbf z$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf z}=\mathbf x^T(A+A^T)\frac {\partial \mathbf x}{\partial \mathbf z}
$$
Proof: this follows from Proposition 8 with $\mathbf y=\mathbf x$.
Proposition 10:
Let the scalar $\alpha$ be the following, where $A$ is a symmetric matrix:
$$
\alpha=\mathbf x^TA\mathbf x
$$
where $\mathbf x$ is an $n\times 1$ column vector, $A$ is an $n \times n$ matrix that does not depend on $\mathbf z$, and $\mathbf x$ is a function of a vector $\mathbf z$. Then:
$$
\frac {\partial \alpha}{\partial \mathbf z}=2\mathbf x^TA\frac {\partial \mathbf x}{\partial \mathbf z}
$$
Proof: this follows from Proposition 9, since $A^T=A$.
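The last two propositions can be checked together. This sketch evaluates the general formula $\mathbf x^T(A+A^T)\,\partial\mathbf x/\partial\mathbf z$ for a non-symmetric matrix and the shortcut $2\mathbf x^TA\,\partial\mathbf x/\partial\mathbf z$ for a symmetric one, comparing each with a finite-difference gradient. The map `x_of_z` and both matrices are illustrative assumptions.

```python
def x_of_z(z):
    # Hypothetical map from R^2 to R^2.
    return [z[0] + z[1], z[0] * z[1]]


def dx_dz(z):
    return [[1.0, 1.0], [z[1], z[0]]]


def row_times_matrix(row, M):
    """Row vector (length r) times an r x c matrix."""
    return [sum(row[i] * M[i][k] for i in range(len(row))) for k in range(len(M[0]))]


def alpha(A, z):
    x = x_of_z(z)
    return sum(A[i][j] * x[i] * x[j] for i in range(2) for j in range(2))


def general_formula(A, z):
    """x^T (A + A^T) dx/dz, valid for any square A."""
    sym = [[A[i][j] + A[j][i] for j in range(2)] for i in range(2)]
    return row_times_matrix(row_times_matrix(x_of_z(z), sym), dx_dz(z))


def symmetric_formula(S, z):
    """2 x^T S dx/dz, valid only when S is symmetric."""
    return [2 * v for v in row_times_matrix(row_times_matrix(x_of_z(z), S), dx_dz(z))]


def numerical_gradient(f, z, h=1e-6):
    grad = []
    for k in range(len(z)):
        hi = list(z)
        lo = list(z)
        hi[k] += h
        lo[k] -= h
        grad.append((f(hi) - f(lo)) / (2 * h))
    return grad


z0 = [0.5, -1.5]
A = [[1.0, 2.0], [0.0, 3.0]]       # not symmetric
S = [[2.0, -1.0], [-1.0, 4.0]]     # symmetric

err_general = max(abs(a - b) for a, b in zip(
    general_formula(A, z0), numerical_gradient(lambda z: alpha(A, z), z0)))
err_symmetric = max(abs(a - b) for a, b in zip(
    symmetric_formula(S, z0), numerical_gradient(lambda z: alpha(S, z), z0)))
```

Applying `symmetric_formula` to the non-symmetric `A` would in general not match the numerical gradient, which is why the symmetry assumption matters.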
Proposition 11:
Let $A$ be an invertible $m\times m$ matrix whose elements are functions of a scalar $\alpha$. Then the derivative of $A^{-1}$ with respect to $\alpha$ is:
$$
\frac {\partial A^{-1}}{\partial \alpha}=-A^{-1}\frac {\partial A}{\partial \alpha}A^{-1}
$$
Proof: by definition,
$$
A^{-1}A=I
$$
Differentiating both sides with respect to the scalar $\alpha$ using the product rule:
$$
A^{-1}\frac {\partial A}{\partial \alpha}+\frac {\partial A^{-1}}{\partial \alpha}A=0
$$
Moving the first term to the right-hand side and multiplying on the right by $A^{-1}$:
$$
\frac {\partial A^{-1}}{\partial \alpha}=-A^{-1}\frac {\partial A}{\partial \alpha}A^{-1}
$$
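The derivative of the inverse can be sanity-checked on a $2 \times 2$ matrix, where the inverse has a simple closed form. The matrix function `A_of` below is a hypothetical example; its elementwise derivative `dA_of` is written out by hand.

```python
import math


def A_of(alpha):
    # Hypothetical 2 x 2 matrix whose entries depend on the scalar alpha.
    return [[2.0 + alpha, math.sin(alpha)], [alpha ** 2, 3.0]]


def dA_of(alpha):
    # Elementwise derivative of A_of with respect to alpha.
    return [[1.0, math.cos(alpha)], [2 * alpha, 0.0]]


def inverse_2x2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]


def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


alpha0, h = 0.6, 1e-6
inv = inverse_2x2(A_of(alpha0))

# Closed form: -A^{-1} (dA/dalpha) A^{-1}.
closed_form = [[-v for v in row] for row in matmul(matmul(inv, dA_of(alpha0)), inv)]

# Central-difference derivative of the inverse itself.
inv_hi = inverse_2x2(A_of(alpha0 + h))
inv_lo = inverse_2x2(A_of(alpha0 - h))
numeric = [[(inv_hi[i][j] - inv_lo[i][j]) / (2 * h) for j in range(2)]
           for i in range(2)]

max_error = max(abs(closed_form[i][j] - numeric[i][j])
                for i in range(2) for j in range(2))
```

Note that the order of the factors matters: $A^{-1}$ and $\partial A/\partial\alpha$ do not commute in general, so the derivative must be sandwiched between the two inverses.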