1 Linear Algebra
1.1 Basic Concepts
- Vector: unless stated otherwise, every vector is taken to be a column vector;
$$\vec{x}=(x_1,\dots,x_n)^T=\begin{bmatrix}x_1\\\vdots\\x_n\end{bmatrix}\in\mathbb{R}^n$$
- Matrix: may also be written as $(x_{i,j})_{n\times d}$ or $[x_{i,j}]_{n\times d}$;
$$X=\begin{bmatrix} x_{1,1} & \cdots & x_{1,d} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,d} \end{bmatrix}\in\mathbb{R}^{n\times d}$$
1.2 Operations
- Dot product: the product of two vectors; two vectors whose dot product is 0 are orthogonal;
$$\vec{x}^T\cdot \vec{y}=\sum_ix_iy_i$$
- Matrix multiplication:
$$X_{n\times d}\cdot Y_{d\times c}=\begin{bmatrix} \sum_{i=1}^d x_{1,i}y_{i,1} & \cdots & \sum_{i=1}^d x_{1,i}y_{i,c} \\ \vdots & \ddots & \vdots \\ \sum_{i=1}^d x_{n,i}y_{i,1} & \cdots & \sum_{i=1}^d x_{n,i}y_{i,c} \end{bmatrix}$$
- Trace: $tr(X)$; the sum of the diagonal elements of a square matrix;
$$tr(X)=\sum_ix_{i,i}$$
- Norm: the case $F=1$ is called the $L_1$ norm, $F=2$ the $L_2$ norm, and so on; the $L_2$ norm of a vector is also called its magnitude, may be abbreviated as $||\vec{x}||$, and represents the length of the vector;
$$||X||_F=\Big(\sum_{i,j}|x_{i,j}|^F\Big)^{\frac{1}{F}}$$
- Determinant: written $det(X)$ or $|X|$; defined only for square matrices; it can be computed by cofactor expansion along any row $i$ (the cofactors $A_{i,j}$ are defined below):
$$det(X)=\sum_{j=1}^{n}x_{i,j}A_{i,j}=\sum_{j=1}^{n}(-1)^{i+j}x_{i,j}M_{i,j}$$
- Minor and cofactor: the minor $M_{i,j}$ is the determinant of the matrix obtained by deleting row $i$ and column $j$ from $X$; the corresponding cofactor is
$$A_{i,j}=(-1)^{i+j}M_{i,j}$$
- Adjugate matrix (classical adjoint): the transpose of the matrix of cofactors of $X$;
$$X^*=\begin{bmatrix} A_{1,1} & \cdots & A_{n,1} \\ \vdots & \ddots & \vdots \\ A_{1,n} & \cdots & A_{n,n} \end{bmatrix}$$
- Inverse matrix: $X^{-1}$, satisfying $X^{-1}X=I$; it exists only when $det(X)\neq 0$;
$$X^{-1}=\frac{X^*}{det(X)}$$
- Hadamard product: also called the element-wise product;
$$X\circ Y=\begin{bmatrix} x_{1,1}y_{1,1} & \cdots & x_{1,d}y_{1,d} \\ \vdots & \ddots & \vdots \\ x_{n,1}y_{n,1} & \cdots & x_{n,d}y_{n,d} \end{bmatrix}$$
- Kronecker product:
$$X\otimes Y=\begin{bmatrix} x_{1,1}Y & \cdots & x_{1,d}Y \\ \vdots & \ddots & \vdots \\ x_{n,1}Y & \cdots & x_{n,d}Y \end{bmatrix}$$
- Cartesian product:
$$A\times B=\{(a,b)\mid a\in A,b\in B\}$$
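As a quick reference, the following minimal NumPy sketch (my own illustration; the matrices and vectors are arbitrary) shows how these operations map onto standard library calls:

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.array([[0.0, 1.0], [1.0, 0.0]])
x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

print(x @ y)                      # dot product: sum_i x_i * y_i
print(X @ Y)                      # matrix multiplication
print(np.trace(X))                # trace: sum of diagonal elements
print(np.linalg.norm(x))          # L2 norm (length) of a vector
print(np.linalg.norm(X, "fro"))   # Frobenius norm of a matrix (F = 2)
print(np.linalg.det(X))           # determinant (square matrices only)
print(np.linalg.inv(X))           # inverse, requires det(X) != 0
print(X * Y)                      # Hadamard (element-wise) product
print(np.kron(X, Y))              # Kronecker product
```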
1.3 Derivatives
- First order, with respect to a vector (a numerical check of the second identity is sketched after this list):
$$\frac{\partial (\vec{a}^T\vec{x})}{\partial \vec{x}}=\vec{a}$$
$$\frac{\partial (\vec{x}^TA\vec{x})}{\partial \vec{x}}=(A+A^T)\vec{x}$$
$$\frac{\partial [(A\vec{x}+\vec{a})^TC(B\vec{x}+\vec{b})]}{\partial \vec{x}}=A^TC(B\vec{x}+\vec{b})+B^TC(A\vec{x}+\vec{a})$$
- First order, with respect to a matrix:
$$\frac{\partial (\vec{a}^TX\vec{b})}{\partial X}=\vec{a}\vec{b}^T=\vec{a} \otimes \vec{b}$$
$$\frac{\partial (\vec{a}^TX^T\vec{b})}{\partial X}=\vec{b}\vec{a}^T=\vec{b} \otimes \vec{a}$$
$$\frac{\partial (\vec{a}^TX^TX\vec{b})}{\partial X}=X(\vec{a}\vec{b}^T+\vec{b}\vec{a}^T)$$
$$\frac{\partial (\vec{b}^TX^TAX\vec{c})}{\partial X}=A^TX\vec{b}\vec{c}^T+AX\vec{c}\vec{b}^T$$
$$\frac{\partial [(X\vec{b}+\vec{c})^TA(X\vec{b}+\vec{c})]}{\partial X}=(A+A^T)(X\vec{b}+\vec{c})\vec{b}^T$$
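Identities like these are easy to sanity-check with finite differences; the sketch below (an illustration with an arbitrary random matrix) verifies $\partial(\vec{x}^TA\vec{x})/\partial\vec{x}=(A+A^T)\vec{x}$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
x = rng.normal(size=3)

def f(v):
    return v @ A @ v                      # scalar function x^T A x

# numerical gradient via central differences
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(3)])

analytic_grad = (A + A.T) @ x             # closed-form gradient
print(np.allclose(num_grad, analytic_grad, atol=1e-6))   # True
```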
1.4 Partial Derivatives
Only a subset of the layouts, for objects of at most two dimensions (scalars, vectors, matrices), is shown below;
- First order, scalar with respect to scalar:
$$\frac{\partial z}{\partial x}$$
- First order, scalar with respect to vector:
$$\frac{\partial z}{\partial \vec{x}}=\Big(\frac{\partial z}{\partial x_1},\dots,\frac{\partial z}{\partial x_n}\Big)^T$$
- First order, scalar with respect to matrix:
$$\frac{\partial z}{\partial X}=\begin{bmatrix} \frac{\partial z}{\partial x_{1,1}} & \cdots & \frac{\partial z}{\partial x_{1,d}} \\ \vdots & \ddots & \vdots \\ \frac{\partial z}{\partial x_{n,1}} & \cdots & \frac{\partial z}{\partial x_{n,d}} \end{bmatrix}$$
- First order, vector with respect to scalar:
$$\frac{\partial \vec{z}}{\partial x}=\Big(\frac{\partial z_1}{\partial x},\dots,\frac{\partial z_n}{\partial x}\Big)^T$$
- First order, vector with respect to vector: the Jacobian matrix;
$$\frac{\partial \vec{z}}{\partial \vec{x}}=\begin{bmatrix} \frac{\partial z_1}{\partial x_{1}} & \cdots & \frac{\partial z_1}{\partial x_{d}} \\ \vdots & \ddots & \vdots \\ \frac{\partial z_n}{\partial x_{1}} & \cdots & \frac{\partial z_n}{\partial x_{d}} \end{bmatrix}$$
- First order, matrix with respect to scalar:
$$\frac{\partial Z}{\partial x}=\begin{bmatrix} \frac{\partial z_{1,1}}{\partial x} & \cdots & \frac{\partial z_{1,d}}{\partial x} \\ \vdots & \ddots & \vdots \\ \frac{\partial z_{n,1}}{\partial x} & \cdots & \frac{\partial z_{n,d}}{\partial x} \end{bmatrix}$$
- Second order, scalar with respect to vector: the Hessian matrix;
$$\frac{\partial^2 z}{\partial \vec{x}\partial \vec{x}^T}=\begin{bmatrix} \frac{\partial^2 z}{\partial x_{1}\partial x_{1}} & \cdots & \frac{\partial^2 z}{\partial x_{1}\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 z}{\partial x_{n}\partial x_{1}} & \cdots & \frac{\partial^2 z}{\partial x_{n}\partial x_{n}} \end{bmatrix}$$
1.5 Matrix Decompositions
- Eigendecomposition: extracts from a square matrix $A$ the pairs $\lambda$, $\vec{x}$ satisfying $\lambda \vec{x}=A\vec{x}$; $\lambda$ is called an eigenvalue and $\vec{x}$ an eigenvector of $A$; $\Sigma$ is the diagonal matrix of eigenvalues, and the $i$-th column of $W$ and the $i$-th diagonal entry of $\Sigma$ are the $i$-th eigenvector and eigenvalue of $A$; for a real symmetric $A$ the eigenvectors can be chosen mutually orthogonal, so that $WW^T=I$ and
$$A=W\Sigma W^{T}$$
(for a general diagonalizable matrix, $A=W\Sigma W^{-1}$);
- Singular value decomposition: applies to any matrix $A$, square or not, and is closely related to the eigendecompositions of $A^TA$ and $AA^T$; $U$ is the left singular matrix of $A$ (its columns, the left singular vectors, are eigenvectors of $AA^T$), $V$ is the right singular matrix (its columns, the right singular vectors, are eigenvectors of $A^TA$), and $\Sigma$ is the singular value matrix whose diagonal holds, in decreasing order, the square roots of the eigenvalues of $A^TA$; $U^TU=I$ and $V^TV=I$;
$$A=U\Sigma V^T,\qquad A^TA=V\Sigma^T\Sigma V^T,\qquad AA^T=U\Sigma\Sigma^T U^T$$
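A small NumPy illustration of both decompositions (my own example; the symmetric matrix is built as $BB^T$ so that the orthogonality properties above hold):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = B @ B.T                                     # real symmetric matrix

# eigendecomposition of a symmetric matrix: A = W diag(lam) W^T
lam, W = np.linalg.eigh(A)
print(np.allclose(A, W @ np.diag(lam) @ W.T))   # True
print(np.allclose(W @ W.T, np.eye(4)))          # eigenvectors are orthonormal

# SVD of an arbitrary (non-square) matrix: M = U diag(s) V^T
M = rng.normal(size=(5, 3))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(np.allclose(M, U @ np.diag(s) @ Vt))      # True
# squared singular values = eigenvalues of M^T M (in decreasing order)
print(np.allclose(s**2, np.sort(np.linalg.eigvalsh(M.T @ M))[::-1]))
```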
1.6 Positive Definite Matrices
For a symmetric matrix $A$ (a small eigenvalue-based check is sketched below):
- Positive semi-definite: $x^TAx\ge 0$ for every $x$; all eigenvalues $\lambda_i\ge 0$.
- Positive definite: $x^TAx>0$ for every $x\neq 0$; all eigenvalues $\lambda_i>0$.
- Negative semi-definite: $x^TAx\le 0$ for every $x$; all eigenvalues $\lambda_i\le 0$.
- Negative definite: $x^TAx<0$ for every $x\neq 0$; all eigenvalues $\lambda_i<0$.
- Indefinite: neither positive semi-definite nor negative semi-definite; the eigenvalues have both signs.
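A minimal eigenvalue-based classifier for symmetric matrices (illustrative sketch; the tolerance and function name are my own choices):

```python
import numpy as np

def definiteness(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    lam = np.linalg.eigvalsh(A)
    if np.all(lam > tol):
        return "positive definite"
    if np.all(lam >= -tol):
        return "positive semi-definite"
    if np.all(lam < -tol):
        return "negative definite"
    if np.all(lam <= tol):
        return "negative semi-definite"
    return "indefinite"

print(definiteness(np.array([[2.0, 0.0], [0.0, 1.0]])))   # positive definite
print(definiteness(np.array([[1.0, 0.0], [0.0, -1.0]])))  # indefinite
```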
1.7 Similarity Measures
- Cosine similarity: the cosine of the angle between two vectors; measures whether their directions agree;
$$cos(\vec{x},\vec{y})=\frac{\vec{x}\cdot \vec{y}}{||\vec{x}||\cdot ||\vec{y}||}$$
- Euclidean distance: the straight-line distance between two points; a combined measure of the length and direction of the difference;
$$\text{Euclidean}(\vec{x},\vec{y})=||\vec{x}-\vec{y}||=\Big(\sum_i|x_i-y_i|^2\Big)^{\frac{1}{2}}$$
- Manhattan distance: the "city block" distance between two points; particularly useful in certain settings;
$$\text{Manhattan}(\vec{x},\vec{y})=\sum_i|x_i-y_i|$$
- Minkowski distance: a generalization of the Euclidean and Manhattan distances;
$$\text{Minkowski}(\vec{x},\vec{y})=\Big(\sum_i|x_i-y_i|^p\Big)^{\frac{1}{p}}$$
- Correlation: a standard statistical quantity; measures how the elements of two comparable vectors co-vary;
$$\rho(\vec{x},\vec{y})=\frac{Cov(\vec{x},\vec{y})}{\sigma_{\vec{x}} \cdot \sigma_{\vec{y}}}=\frac{E(xy)-E(x)E(y)}{\sqrt{Var(x)\cdot Var(y)}}$$
- KL divergence (Kullback-Leibler divergence): also called relative entropy; an important quantity in information theory; an asymmetric measure of how one probability distribution differs from another;
$$\text{KL Divergence}(p||q)=\sum_{x}p(x)\log\frac{p(x)}{q(x)}$$
- Jaccard similarity: the size of the intersection divided by the size of the union; measures the similarity of two sets or boolean sequences;
$$\text{Jaccard Similarity}(A,B)=\frac{|A\cap B|}{|A\cup B|}$$
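The measures above reduce to short one-liners in NumPy; a sketch with made-up inputs:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 1.0])

cosine    = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
euclidean = np.linalg.norm(x - y)
manhattan = np.sum(np.abs(x - y))
minkowski = np.sum(np.abs(x - y) ** 3) ** (1 / 3)       # p = 3
corr      = np.corrcoef(x, y)[0, 1]

# KL divergence works on probability distributions, Jaccard on sets
p, q = np.array([0.4, 0.6]), np.array([0.5, 0.5])
kl = np.sum(p * np.log(p / q))
A, B = {1, 2, 3}, {2, 3, 4}
jaccard = len(A & B) / len(A | B)

print(cosine, euclidean, manhattan, minkowski, corr, kl, jaccard)
```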
2 Probability Theory
2.1 Basic Concepts
Throughout, $X$ denotes a discrete random variable and $x$ a continuous one;
- Probability mass function (PMF): $P(X)$;
$$\sum_iP(X_i)=1$$
- Probability density function (PDF): $p(x)$;
$$\int p(x)dx=1$$
- Conditional probability:
$$P(Y|X)$$
- Joint probability:
$$P(X,Y)=P(Y|X)P(X)$$
- Total probability:
$$P(Y)=\sum_iP(Y|X_i)P(X_i)$$
- Chain rule:
$$P(x_1,\dots,x_n)=P(x_1)\prod_{i=2}^nP(x_i|x_{i-1},\dots,x_1)$$
- Independent events: written $X\perp Y$;
$$P(X,Y)=P(X)P(Y)$$
- Conditionally independent events: written $X\perp Y\mid Z$;
$$P(X,Y|Z)=P(X|Z)P(Y|Z)$$
- Joint probability distribution: $p(x,y)$; the marginal densities are obtained by integrating out the other variable;
$$P(X\le a,Y\le b)=\int_{-\infty}^a\int_{-\infty}^b p(x,y)\,dy\,dx,\qquad p_x(x)=\int_{-\infty}^{\infty}p(x,y)\,dy,\qquad p_y(y)=\int_{-\infty}^{\infty}p(x,y)\,dx$$
- Prior probability:
$$P(X)$$
- Posterior probability:
$$P(X|Y)$$
- Bayes' formula:
$$P(X_i|Y)=\frac{P(Y|X_i)P(X_i)}{\sum_iP(Y|X_i)P(X_i)}$$
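A tiny numeric illustration of the total probability and Bayes' formulas (the numbers are made up, e.g. a rare condition $X_1$ and a noisy test $Y$):

```python
# priors over the two hypotheses X1 and X2
p_x1, p_x2 = 0.01, 0.99
# likelihoods of observing Y under each hypothesis
p_y_given_x1, p_y_given_x2 = 0.95, 0.05

p_y = p_y_given_x1 * p_x1 + p_y_given_x2 * p_x2   # total probability P(Y)
posterior = p_y_given_x1 * p_x1 / p_y             # Bayes' formula P(X1 | Y)
print(posterior)                                  # ~0.16: still far from certain
```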
2.2 Expectation and Variance
- Expectation (mean): usually written $\mu$; if the defining series/integral does not converge, the expectation does not exist;
$$\mathbb{E}[x]=\int p(x)x\,dx$$
- Variance: usually written $\sigma^2$, the square of the standard deviation $\sigma$;
$$Var[x]=\mathbb{E}[(x-\mu)^2]=\mathbb{E}[x^2]-(\mathbb{E}[x])^2$$
- Covariance:
$$Cov[x,y]=\mathbb{E}[(x-\mathbb{E}[x])(y-\mathbb{E}[y])]=\mathbb{E}[xy]-\mathbb{E}[x]\mathbb{E}[y]$$
- Correlation coefficient: usually written $\rho$;
$$Corr[x,y]=\frac{Cov[x,y]}{\sigma_x\sigma_y}\in[-1,1]$$
- Other properties (one of these is spot-checked numerically below):
$$\mathbb{E}[xy]=\mathbb{E}[x]\mathbb{E}[y]\quad\text{(when $x$ and $y$ are independent)}$$
$$\mathbb{E}[kx+y]=k\mathbb{E}[x]+\mathbb{E}[y]$$
$$Var[kx+y]=k^2Var[x]+Var[y]+2Cov[kx,y]$$
$$Cov[kx_1+x_2,y]=kCov[x_1,y]+Cov[x_2,y]$$
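These identities can be spot-checked by sampling; a minimal sketch (the correlated samples are generated only for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 1_000_000, 3.0
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)      # deliberately correlated with x

cov_xy = np.cov(x, y, bias=True)[0, 1]
lhs = np.var(k * x + y)
rhs = k**2 * np.var(x) + np.var(y) + 2 * k * cov_xy
print(np.isclose(lhs, rhs))           # Var[kx+y] = k^2 Var[x] + Var[y] + 2 Cov[kx, y]
```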
2.3 Probability Distributions
- Uniform distribution: $\mu=\frac{a+b}{2}$, $\sigma^2=\frac{(b-a)^2}{12}$;
$$p(x)=\frac{1}{b-a},~~~~x\in[a,b]$$
- Binomial distribution: the Bernoulli distribution is the special case $n=1$; $\mu=n\phi$, $\sigma^2=n\phi(1-\phi)$;
$$p(x)= \dbinom{n}{x}\phi^x(1-\phi)^{n-x},~~~~x\in \{0,1,\dots,n\}$$
- Normal distribution (Gaussian distribution): written $x\sim N(\mu,\sigma^2)$;
$$p(x)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}},~~~~x\in\mathbb{R}$$
- Log-normal distribution: the logarithm of $x$ is normally distributed, i.e. $\log(x)\sim N(\mu,\sigma^2)$;
$$p(x)=\frac{1}{\sqrt{2\pi}\sigma x}e^{-\frac{(\log(x)-\mu)^2}{2\sigma^2}},~~~~x\in\mathbb{R^+}$$
- Multivariate normal distribution: the bivariate case is shown;
$$p(x_1,x_2)=\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}e^{-\frac{1}{2(1-\rho^2)}\big[\frac{(x_1-\mu_1)^2}{\sigma_1^2}-2\rho\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}+\frac{(x_2-\mu_2)^2}{\sigma_2^2}\big]},~~~~x_1,x_2\in\mathbb{R}$$
- Exponential distribution: $\mu=\frac{1}{\lambda}$, $\sigma^2=\frac{1}{\lambda^2}$;
$$p(x)=\lambda e^{-\lambda x},~~~~x\in\mathbb{R^+}$$
- Poisson distribution: $\mu=\lambda$, $\sigma^2=\lambda$;
$$p(x)=\frac{\lambda^x}{x!}e^{-\lambda},~~~~x\in\{0,1,2,\dots\}$$
- Laplace distribution: $\sigma^2=2\gamma^2$;
$$p(x)=\frac{1}{2\gamma}e^{-\frac{|x-\mu|}{\gamma}},~~~~x\in\mathbb{R}$$
- Beta distribution: written $x\sim B(\alpha,\beta)$; $\mu=\frac{\alpha}{\alpha+\beta}$; the Gamma function is $\Gamma(x)=\int_0^\infty t^{x-1}e^{-t}dt$, which equals $(x-1)!$ for positive integers $x$;
$$p(x)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1},~~~~x\in[0,1]$$
- Dirichlet distribution: the generalization of the Beta distribution beyond two dimensions;
$$\text{Dirichlet}(x_1,\dots,x_k)=\frac{\Gamma(n_1+\dots+n_k)}{\Gamma(n_1)\dots\Gamma(n_k)}x_1^{n_1-1}\dots x_k^{n_k-1},~~~~x_i\ge 0,~\sum_ix_i=1$$
- Mixture distribution:
$$p(x)=\sum_iP(c_i)p(x|c_i)$$
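The stated means and variances can be checked empirically by sampling; an illustrative sketch with NumPy's random generators (parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

u = rng.uniform(2.0, 5.0, size=n)        # uniform on [a, b] = [2, 5]
print(u.mean(), u.var())                 # ~3.5 and ~0.75 = (b-a)^2 / 12

b = rng.binomial(10, 0.3, size=n)        # binomial, n=10, phi=0.3
print(b.mean(), b.var())                 # ~3.0 = n*phi and ~2.1 = n*phi*(1-phi)

e = rng.exponential(scale=0.5, size=n)   # exponential, lambda=2 (NumPy takes scale = 1/lambda)
print(e.mean(), e.var())                 # ~0.5 = 1/lambda and ~0.25 = 1/lambda^2

p = rng.poisson(4.0, size=n)             # Poisson, lambda=4
print(p.mean(), p.var())                 # both ~4.0 = lambda
```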
2.4 Information Theory
- Entropy: measures the uncertainty (disorder) of a probability distribution;
$$H(X)=-\sum_{x\in X}p(x)\log p(x)\ge 0$$
- Conditional entropy: the remaining uncertainty of $Y$ given $X$;
$$H(Y|X)=-\sum_{x\in X}p(x)\sum_{y\in Y}p(y|x)\log p(y|x)$$
- Mutual information: the reduction in the uncertainty of one distribution due to knowledge of another;
$$I(X;Y)=\sum_{x\in X}\sum_{y\in Y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$$
- KL divergence (Kullback-Leibler divergence): also called relative entropy; an asymmetric measure of the difference between two probability distributions $p$ and $q$;
$$\text{KL Divergence}(p||q)=\sum_{x\in X}p(x)\log\frac{p(x)}{q(x)}$$
- Cross entropy: measures how well a predicted distribution $q$ matches the true label distribution $p$, $H(p,q)=-\sum_xp(x)\log q(x)$; the binary case commonly used as a loss is
$$\text{Cross Entropy}(p)=-y\log p(x)-(1-y)\log [1-p(x)],~~~~y\in\{0,1\}$$
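For discrete distributions these quantities reduce to short sums; a sketch with made-up distributions $p$ and $q$:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])    # "true" distribution
q = np.array([0.5, 0.3, 0.2])    # approximating distribution

entropy = -np.sum(p * np.log(p))            # H(p)
kl = np.sum(p * np.log(p / q))              # KL(p || q)
cross_entropy = -np.sum(p * np.log(q))      # H(p, q)
print(np.isclose(cross_entropy, entropy + kl))   # True: H(p, q) = H(p) + KL(p || q)

# binary cross-entropy loss for a label y and a predicted probability p_hat
y, p_hat = 1, 0.8
bce = -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(bce)
```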
3 Optimization
3.1 Basic Concepts
In machine learning, the goal of optimization is to find the parameters $\theta$ that minimize a loss function:
$$\min_\theta L(\theta)\qquad s.t.\begin{cases} h_i(\theta)=0 \\ g_j(\theta)\le 0 \end{cases}$$
The loss function $L$ is also called the objective function or cost function; $h$ and $g$ are constraints: those of the form $h$ are equality constraints, and those of the form $g$ are inequality constraints. Related notions:
- Unconstrained optimization problem: no constraints are present;
- Constrained optimization problem: constraints are present;
- Convex optimization: $L$ and $g$ are convex functions and $h$ is affine;
- Linear programming: $L$, $h$, and $g$ are all linear functions;
- Nonlinear programming: at least one of $L$, $h$, $g$ is nonlinear;
- Quadratic programming: $L$ is quadratic while $h$ and $g$ are linear;
- Multi-objective programming: the output of $L$ is a vector.
Some equivalent characterizations from convex analysis:
| | Convex | Concave |
|---|---|---|
| Definition | $\lambda f(x)+(1-\lambda) f(y)\ge f[\lambda x+(1-\lambda)y]~~\forall x,y~\text{in}~X,~\lambda\in[0,1]$ | $\lambda f(x)+(1-\lambda) f(y)\le f[\lambda x+(1-\lambda)y]~~\forall x,y~\text{in}~X,~\lambda\in[0,1]$ |
| Property | The Hessian matrix $\nabla^2 f(x)$ is positive semi-definite | The Hessian matrix $\nabla^2 f(x)$ is negative semi-definite |
3.2 Lagrange Multipliers
Constrained optimization problems are commonly handled with the method of Lagrange multipliers, which folds the constraints into the objective. Take the following problem as an example:
$$\min_xf(x)\qquad s.t.\begin{cases}h(x)=0\\g(x)\le 0\end{cases}$$
Introduce Lagrange multipliers $\lambda_1,\lambda_2$ and convert the original problem into an unconstrained min-max problem over the Lagrangian (swapping the min and max yields the dual problem):
$$\min_{x}\max_{\lambda_1,\lambda_2}L=f(x)+\lambda_1h(x)+\lambda_2g(x)$$
The KKT conditions (Karush-Kuhn-Tucker conditions) are then:
$$\begin{cases} \frac{\partial L}{\partial x}=0 \\ h(x)=0 \\ g(x)\le 0 \\ \lambda_2\ge 0 \\ \lambda_2\,g(x)=0 \end{cases}$$
Solving these conditions for $x$ and substituting the solution into the objective $f(x)$ gives the optimum.
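As a minimal worked illustration (my own example, not from the original notes): minimize $f(x)=x^2$ subject to $g(x)=1-x\le 0$. The Lagrangian is $L=x^2+\lambda_2(1-x)$, and the KKT conditions read
$$2x-\lambda_2=0,\qquad \lambda_2(1-x)=0,\qquad \lambda_2\ge 0,\qquad 1-x\le 0$$
If $\lambda_2=0$, stationarity gives $x=0$, which violates $1-x\le 0$; so the constraint must be active, $x^*=1$, $\lambda_2=2\ge 0$, and the optimum is $f(x^*)=1$.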
3.3 Convex Optimization
- Newton's method: start from the second-order Taylor expansion
$$f(x)=f(x_0)+f'(x_0)(x-x_0)+\frac{f''(x_0)}{2!}(x-x_0)^2+O((x-x_0)^3)$$
Using the parameters as the input of the expansion and the loss as the output,
$$L=f(W_{t-1}+\Delta W)\approx f(W_{t-1})+f'(W_{t-1})\Delta W+\frac{f''(W_{t-1})}{2!}\Delta W^2$$
which can be rewritten as
$$L(\Delta W)=\frac{f''(W_{t-1})}{2!}\Delta W^2+f'(W_{t-1})\Delta W+f(W_{t-1})$$
To solve for $\Delta W$, set $L'(\Delta W)=0$:
$$f''(W_{t-1})\Delta W+f'(W_{t-1})=0$$
so the parameter update is
$$\Delta W=-f''(W_{t-1})^{-1}f'(W_{t-1})$$
If the loss $L$ is a scalar and $W$ is a vector, $f''(W_{t-1})$ is the Hessian matrix, so the update can also be written as
$$\Delta W=-H_{t-1}^{-1}f'(W_{t-1})$$
Finally, the parameters are updated with
$$W_t\leftarrow W_{t-1}+\alpha \Delta W$$
- Quasi-Newton methods: to avoid the inefficiency of computing the inverse Hessian directly in Newton's method, they build an approximation of the Hessian (or its inverse) instead; well-known quasi-Newton methods include DFP and BFGS, which interested readers can look up on their own.
- SGD (stochastic gradient descent): gradient descent applied to mini-batches of data; the principle is the same as full-batch gradient descent. Given the loss of the model output $L=f(Z^{(k)},Y)$ and the last-layer parameters $W^{(k)}$, the gradient of $W^{(k)}$ is
$$\nabla W^{(k)}=\frac{\partial L}{\partial W^{(k)}}=\frac{\partial L}{\partial Z^{(k)}}\frac{\partial Z^{(k)}}{\partial W^{(k)}}$$
For parameters $W^{(i)}$ ($i<k$) further upstream in the model, the gradient is back-propagated according to the chain rule:
$$\nabla W^{(i)}=\frac{\partial L}{\partial W^{(i)}}=\frac{\partial L}{\partial Z^{(k)}}\frac{\partial Z^{(k)}}{\partial Z^{(k-1)}}\cdots\frac{\partial Z^{(i)}}{\partial W^{(i)}}$$
Once the gradients are computed, the parameters are updated along the negative gradient, where $\alpha$ is the learning rate:
$$W_t^{(i)}\leftarrow W_{t-1}^{(i)}+(-\alpha \nabla W^{(i)})$$
- Momentum: to damp oscillation in the update direction, a momentum factor $\beta$ is introduced (here $\Delta W$ denotes the update step, e.g. the negative gradient $-\nabla W$):
$$V_t=\beta V_{t-1}+(1-\beta)\Delta W$$
$$W_t\leftarrow W_{t-1}+\alpha V_t$$
- RMSProp: to keep the effective learning rate from becoming too large, the step size is adapted per parameter:
$$S_t=\beta S_{t-1} + (1-\beta )\Delta W^2$$
$$W_t\leftarrow W_{t-1}+\alpha \frac{\Delta W}{\sqrt{S_t}+\varepsilon}$$
- Adam: combines the ideas of Momentum and RMSProp (a sketch of these update rules on a toy problem follows this list):
$$V_t=\beta_1 V_{t-1}+(1-\beta_1)\Delta W$$
$$S_t=\beta_2 S_{t-1} + (1-\beta_2 )\Delta W^2$$
$$\hat{V_t}=\frac{V_t}{1-\beta_1^t}\qquad \hat{S_t}=\frac{S_t}{1-\beta_2^t}$$
$$W_t\leftarrow W_{t-1}+\alpha \frac{\hat{V_t}}{\sqrt{\hat{S_t}}+\varepsilon}$$
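To make the update rules above concrete, here is a self-contained sketch on a toy quadratic objective $f(W)=\frac{1}{2}W^THW$ (my own example; following the Newton derivation above, $\Delta W$ is taken to be the descent step, i.e. the negative gradient, for the first-order methods):

```python
import numpy as np

H = np.array([[3.0, 0.2], [0.2, 1.0]])   # toy objective f(w) = 0.5 * w^T H w
def grad(w):
    return H @ w                         # its gradient; H is also the Hessian

def newton(w, steps=10, alpha=1.0):
    for _ in range(steps):
        delta = -np.linalg.solve(H, grad(w))      # Delta W = -H^{-1} f'(W)
        w = w + alpha * delta
    return w

def sgd(w, steps=300, alpha=0.1):
    for _ in range(steps):
        w = w - alpha * grad(w)                   # W <- W - alpha * grad
    return w

def momentum(w, steps=300, alpha=0.5, beta=0.9):
    v = np.zeros_like(w)
    for _ in range(steps):
        delta = -grad(w)
        v = beta * v + (1 - beta) * delta
        w = w + alpha * v
    return w

def rmsprop(w, steps=500, alpha=0.05, beta=0.9, eps=1e-8):
    s = np.zeros_like(w)
    for _ in range(steps):
        delta = -grad(w)
        s = beta * s + (1 - beta) * delta**2
        w = w + alpha * delta / (np.sqrt(s) + eps)
    return w

def adam(w, steps=500, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    v, s = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        delta = -grad(w)
        v = beta1 * v + (1 - beta1) * delta
        s = beta2 * s + (1 - beta2) * delta**2
        v_hat = v / (1 - beta1**t)                # bias correction
        s_hat = s / (1 - beta2**t)
        w = w + alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w

w0 = np.array([5.0, -3.0])
for opt in (newton, sgd, momentum, rmsprop, adam):
    print(opt.__name__, opt(w0.copy()))           # each should approach the minimizer w = 0
```

With exact gradients on a quadratic, Newton's method converges essentially in one step; the first-order methods trade per-step cost for robustness and are the ones used in practice for large models.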