Mathematical Foundations of Machine Learning

1 Linear Algebra

1.1 Basic Concepts

  • Vector: unless stated otherwise, all vectors are column vectors by default;
    $\vec{x}=(x_1,\dots,x_n)^T=\begin{bmatrix}x_1\\\vdots\\x_n\end{bmatrix}\in\mathbb{R}^n$

  • Matrix: can equivalently be written as $(x_{i,j})_{n\times d}$ or $[x_{i,j}]_{n\times d}$;
    $X=\begin{bmatrix} x_{1,1} & \cdots & x_{1,d} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,d} \end{bmatrix}\in\mathbb{R}^{n\times d}$

1.2 Operations

  • Dot product: the inner product of two vectors; two vectors whose dot product is 0 are orthogonal;
    $\vec{x}^T\vec{y}=\sum_i x_i y_i$

  • Matrix multiplication: as the name suggests (see the NumPy sketch after this list);
    $X_{n\times d}\cdot Y_{d\times c}=\begin{bmatrix} \sum_i^d x_{1,i}y_{i,1} & \cdots & \sum_i^d x_{1,i}y_{i,c} \\ \vdots & \ddots & \vdots \\ \sum_i^d x_{n,i}y_{i,1} & \cdots & \sum_i^d x_{n,i}y_{i,c} \end{bmatrix}$

  • Trace: $tr(X)$; the sum of all diagonal elements;
    $tr(X)=\sum_i x_{i,i}$

  • Norm: called the $L_1$ norm when $F$ equals 1, the $L_2$ norm when $F$ equals 2, and so on; the $L_2$ norm of a vector, often abbreviated $||\vec{x}||$, is also called its magnitude and represents the vector's length;
    $||X||_F=\Big(\sum_{i,j}|x_{i,j}|^F\Big)^{\frac{1}{F}}$

  • Determinant: written $det(X)$ or $|X|$; defined only for square matrices; the "diagonal product" rule shown in many texts only works for $n\le 3$, so the general definition (the Leibniz formula) is given here:
    $det(X)=\sum_{\sigma\in S_n}\mathrm{sgn}(\sigma)\prod_{i=1}^{n}x_{i,\sigma(i)}$

  • Minor and cofactor: the minor $M_{i,j}$ is the determinant of the matrix obtained by deleting row $i$ and column $j$ from $X$; the corresponding cofactor is
    $A_{i,j}=(-1)^{i+j}M_{i,j}$

  • Adjugate matrix: the transpose of the matrix of cofactors;
    $X^*=\begin{bmatrix} A_{1,1} & \cdots & A_{n,1} \\ \vdots & \ddots & \vdots \\ A_{1,n} & \cdots & A_{n,n} \end{bmatrix}$

  • Inverse matrix: $X^{-1}$, satisfying $X^{-1}X=I$;
    $X^{-1}=\frac{X^*}{det(X)}$

  • Hadamard product: also called the element-wise product;
    $X\circ Y=\begin{bmatrix} x_{1,1}y_{1,1} & \cdots & x_{1,d}y_{1,d} \\ \vdots & \ddots & \vdots \\ x_{n,1}y_{n,1} & \cdots & x_{n,d}y_{n,d} \end{bmatrix}$

  • Kronecker product:
    $X\otimes Y=\begin{bmatrix} x_{1,1}Y & \cdots & x_{1,d}Y \\ \vdots & \ddots & \vdots \\ x_{n,1}Y & \cdots & x_{n,d}Y \end{bmatrix}$

  • Cartesian product:
    $A\times B=\{(a,b)\mid a\in A,b\in B\}$
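
To ground the operations above, here is a minimal NumPy sketch (the array values are arbitrary illustrations, not part of the notes themselves):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])           # column vector, stored as a 1-D array
y = np.array([4.0, 5.0, 6.0])
X = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2 x 2 matrix
Y = np.array([[0.0, 1.0], [1.0, 0.0]])

print(x @ y)                                        # dot product x^T y
print(X @ Y)                                        # matrix multiplication
print(np.trace(X))                                  # trace: sum of diagonal elements
print(np.linalg.norm(x, 1), np.linalg.norm(x, 2))   # L1 and L2 norms
print(np.linalg.det(X))                             # determinant (square matrices only)
print(np.linalg.inv(X))                             # inverse, equal to adj(X) / det(X)
print(X * Y)                                        # Hadamard (element-wise) product
print(np.kron(X, Y))                                # Kronecker product
```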

1.3 Differentiation

  • First order, with respect to a vector (a numerical check of one of these identities follows this list)
    $\frac{\partial (\vec{a}^T\vec{x})}{\partial \vec{x}}=\vec{a}$

$\frac{\partial (\vec{x}^TA\vec{x})}{\partial \vec{x}}=(A+A^T)\vec{x}$

$\frac{\partial [(A\vec{x}+\vec{a})^TC(B\vec{x}+\vec{b})]}{\partial \vec{x}}=A^TC(B\vec{x}+\vec{b})+B^TC^T(A\vec{x}+\vec{a})$

  • First order, with respect to a matrix
    $\frac{\partial (\vec{a}^TX\vec{b})}{\partial X}=\vec{a}\vec{b}^T$ (the outer product of $\vec{a}$ and $\vec{b}$)

$\frac{\partial (\vec{a}^TX^T\vec{b})}{\partial X}=\vec{b}\vec{a}^T$

$\frac{\partial (\vec{a}^TX^TX\vec{b})}{\partial X}=X(\vec{a}\vec{b}^T+\vec{b}\vec{a}^T)$

$\frac{\partial (\vec{b}^TX^TAX\vec{c})}{\partial X}=A^TX\vec{b}\vec{c}^T+AX\vec{c}\vec{b}^T$

$\frac{\partial [(X\vec{b}+\vec{c})^TA(X\vec{b}+\vec{c})]}{\partial X}=(A+A^T)(X\vec{b}+\vec{c})\vec{b}^T$
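
These identities are easy to check numerically. Below is a minimal sketch verifying $\frac{\partial(\vec{x}^TA\vec{x})}{\partial\vec{x}}=(A+A^T)\vec{x}$ with central finite differences (the matrix and vector values are random examples):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)

f = lambda v: v @ A @ v                      # scalar-valued f(x) = x^T A x

# central finite-difference gradient
eps = 1e-6
num_grad = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])

analytic_grad = (A + A.T) @ x                # closed-form identity
print(np.allclose(num_grad, analytic_grad, atol=1e-5))   # True
```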

1.4 Partial Derivatives

Only the layouts for objects of at most two dimensions (scalars, vectors, and matrices) are listed below; a numerical sketch follows the list.

  • First order, scalar with respect to scalar
    $\frac{\partial z}{\partial x}$

  • First order, scalar with respect to vector
    $\frac{\partial z}{\partial \vec{x}}=\big(\frac{\partial z}{\partial x_1},\dots,\frac{\partial z}{\partial x_n}\big)^T$

  • First order, scalar with respect to matrix
    $\frac{\partial z}{\partial X}=\begin{bmatrix} \frac{\partial z}{\partial x_{1,1}} & \cdots & \frac{\partial z}{\partial x_{1,d}} \\ \vdots & \ddots & \vdots \\ \frac{\partial z}{\partial x_{n,1}} & \cdots & \frac{\partial z}{\partial x_{n,d}} \end{bmatrix}$

  • First order, vector with respect to scalar
    $\frac{\partial \vec{z}}{\partial x}=\big(\frac{\partial z_1}{\partial x},\dots,\frac{\partial z_n}{\partial x}\big)^T$

  • First order, vector with respect to vector: the Jacobian matrix;
    $\frac{\partial \vec{z}}{\partial \vec{x}}=\begin{bmatrix} \frac{\partial z_1}{\partial x_{1}} & \cdots & \frac{\partial z_1}{\partial x_{d}} \\ \vdots & \ddots & \vdots \\ \frac{\partial z_n}{\partial x_{1}} & \cdots & \frac{\partial z_n}{\partial x_{d}} \end{bmatrix}$

  • First order, matrix with respect to scalar
    $\frac{\partial Z}{\partial x}=\begin{bmatrix} \frac{\partial z_{1,1}}{\partial x} & \cdots & \frac{\partial z_{1,d}}{\partial x} \\ \vdots & \ddots & \vdots \\ \frac{\partial z_{n,1}}{\partial x} & \cdots & \frac{\partial z_{n,d}}{\partial x} \end{bmatrix}$

  • Second order, scalar with respect to vector: the Hessian matrix;
    $\frac{\partial^2 z}{\partial \vec{x}\,\partial \vec{x}^T}=\begin{bmatrix} \frac{\partial^2 z}{\partial x_{1}\partial x_{1}} & \cdots & \frac{\partial^2 z}{\partial x_{1}\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 z}{\partial x_{n}\partial x_{1}} & \cdots & \frac{\partial^2 z}{\partial x_{n}\partial x_{n}} \end{bmatrix}$
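
As a rough illustration of the Jacobian and Hessian layouts, the sketch below approximates both with forward finite differences (the functions `g` and `scalar_f` are arbitrary examples chosen for this note):

```python
import numpy as np

def g(v):
    # example vector-valued function R^2 -> R^2
    return np.array([v[0] ** 2 + v[1], np.sin(v[0]) * v[1]])

def scalar_f(v):
    # example scalar-valued function R^2 -> R
    return v[0] ** 2 * v[1] + v[1] ** 3

def jacobian(func, v, eps=1e-6):
    # rows index the output components, columns the input components
    base = func(v)
    cols = [(func(v + eps * e) - base) / eps for e in np.eye(len(v))]
    return np.stack(cols, axis=1)

def hessian(func, v, eps=1e-5):
    # the Hessian is the Jacobian of the gradient
    grad = lambda u: jacobian(lambda w: np.array([func(w)]), u, eps)[0]
    return jacobian(grad, v, eps)

v = np.array([1.0, 2.0])
print(jacobian(g, v))        # 2 x 2 Jacobian
print(hessian(scalar_f, v))  # 2 x 2 (approximately symmetric) Hessian
```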

1.5 Matrix Decomposition

  • Eigendecomposition: for a square matrix $A$, extract the $\lambda$ and $\vec{x}$ satisfying $\lambda \vec{x}=A\vec{x}$; $\lambda$ is called an eigenvalue and $\vec{x}$ an eigenvector of $A$; $\Sigma$ is the diagonal matrix of eigenvalues, and the $i$-th column of $W$ and the $i$-th diagonal entry of $\Sigma$ are the $i$-th eigenvector and eigenvalue of $A$; in general $A=W\Sigma W^{-1}$, and when $A$ is real symmetric the eigenvectors can be chosen mutually orthogonal, so that $WW^T=I$ and
    $A=W\Sigma W^{T}$

  • Singular value decomposition: a factorization that also applies when $A$ is not square; the columns of $U$, the left singular matrix of $A$ (its left singular vectors), are the eigenvectors of $AA^T$; the columns of $V$, the right singular matrix (its right singular vectors), are the eigenvectors of $A^TA$; $\Sigma$ is the singular value matrix, whose diagonal holds the square roots of the eigenvalues of $A^TA$ in descending order; $U^TU=I$ and $V^TV=I$ (a NumPy sketch follows this list);
    $A=U\Sigma V^T$
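
A minimal NumPy sketch of both decompositions, checking the orthogonality and singular-value claims above (the matrices are random examples):

```python
import numpy as np

rng = np.random.default_rng(0)

# eigendecomposition of a real symmetric matrix: A = W diag(eigvals) W^T
B = rng.normal(size=(3, 3))
A = B + B.T                                          # make it symmetric
eigvals, W = np.linalg.eigh(A)
print(np.allclose(A, W @ np.diag(eigvals) @ W.T))    # True
print(np.allclose(W @ W.T, np.eye(3)))               # eigenvectors are orthonormal

# SVD of a non-square matrix: M = U diag(s) V^T
M = rng.normal(size=(4, 3))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(np.allclose(M, U @ np.diag(s) @ Vt))           # True
# singular values are the square roots of the eigenvalues of M^T M
print(np.allclose(np.sort(s**2), np.sort(np.linalg.eigvalsh(M.T @ M))))
```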

1.6 Definiteness

  • Positive semi-definite: $x^TAx\ge 0$ for every $x$; all eigenvalues $\lambda_i\ge 0$;
  • Positive definite: $x^TAx>0$ for every $x\neq 0$; all eigenvalues $\lambda_i>0$;
  • Negative semi-definite: $x^TAx\le 0$ for every $x$; all eigenvalues $\lambda_i\le 0$;
  • Negative definite: $x^TAx<0$ for every $x\neq 0$; all eigenvalues $\lambda_i<0$;
  • Indefinite: neither positive semi-definite nor negative semi-definite; the eigenvalues have both signs (a quick eigenvalue check follows this list).
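
Since definiteness can be read off the eigenvalues of a symmetric matrix, a quick check might look like the following sketch (the `classify` helper and the example matrices are illustrative, not a standard API):

```python
import numpy as np

def classify(A, tol=1e-10):
    # assumes A is symmetric; classify by the signs of its eigenvalues
    lam = np.linalg.eigvalsh(A)
    if np.all(lam > tol):
        return "positive definite"
    if np.all(lam >= -tol):
        return "positive semi-definite"
    if np.all(lam < -tol):
        return "negative definite"
    if np.all(lam <= tol):
        return "negative semi-definite"
    return "indefinite"

print(classify(np.array([[2.0, 0.0], [0.0, 1.0]])))    # positive definite
print(classify(np.array([[1.0, 0.0], [0.0, -1.0]])))   # indefinite
```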

1.7 Similarity Measures

  • Cosine similarity: the cosine of the angle between two vectors, measuring whether their directions agree;
    $cos(\vec{x},\vec{y})=\frac{\vec{x}\cdot \vec{y}}{||\vec{x}||\cdot ||\vec{y}||}$
  • Euclidean distance: the straight-line distance between two points, a combined measure of both length and direction;
    $\text{Euclidean}(\vec{x},\vec{y})=||\vec{x}-\vec{y}||=\big(\sum_i|x_i-y_i|^2\big)^{\frac{1}{2}}$
  • Manhattan distance: the city-block distance between two points, particularly effective in certain settings;
    $\text{Manhattan}(\vec{x},\vec{y})=\sum_i|x_i-y_i|$
  • Minkowski distance: a generalization of the Euclidean and Manhattan distances;
    $\text{Minkowski}(\vec{x},\vec{y})=\big(\sum_i|x_i-y_i|^p\big)^{\frac{1}{p}}$
  • Correlation: a standard statistical quantity, measuring how the elements of two comparable vectors vary together;
    $\rho(\vec{x},\vec{y})=\frac{Cov(\vec{x},\vec{y})}{\sigma_{\vec{x}}\cdot\sigma_{\vec{y}}}=\frac{E(xy)-E(x)E(y)}{\sqrt{Var(x)\cdot Var(y)}}$
  • KL divergence (Kullback-Leibler divergence): also called relative entropy, an important concept in information theory; an asymmetric measure of how one probability distribution differs from another;
    $\text{KL Divergence}(p||q)=\sum_{x}p(x)\log\frac{p(x)}{q(x)}$
  • Jaccard similarity: the size of the intersection divided by the size of the union, measuring the similarity of two sets or boolean sequences (a NumPy sketch of these measures follows this list);
    $\text{Jaccard Similarity}(A,B)=\frac{|A\cap B|}{|A\cup B|}$
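
A NumPy sketch of the measures above (the vectors, distributions, and boolean sets are arbitrary examples):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 1.0])

cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
euclidean = np.linalg.norm(x - y)
manhattan = np.sum(np.abs(x - y))
minkowski = lambda p: np.sum(np.abs(x - y) ** p) ** (1 / p)
corr = np.corrcoef(x, y)[0, 1]              # Pearson correlation coefficient

# KL divergence between two discrete distributions p and q
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])
kl = np.sum(p * np.log(p / q))

# Jaccard similarity of two boolean vectors
a = np.array([1, 1, 0, 1], dtype=bool)
b = np.array([1, 0, 0, 1], dtype=bool)
jaccard = np.sum(a & b) / np.sum(a | b)

print(cosine, euclidean, manhattan, minkowski(3), corr, kl, jaccard)
```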

2 Probability Theory

2.1 Basic Concepts

Assume $X$ denotes a discrete random variable and $x$ a continuous one;

  • Probability mass function (PMF): $P(X)$;
    $\sum_iP(X_i)=1$

  • Probability density function (PDF): $p(x)$;
    $\int p(x)dx=1$

  • Conditional probability:
    $P(Y|X)$

  • Joint probability:
    $P(X,Y)=P(Y|X)P(X)$

  • Total probability:
    $P(Y)=\sum_iP(Y|X_i)P(X_i)$

  • Chain rule:
    $P(x_1,...,x_n)=P(x_1)\prod_{i=2}^nP(x_i|x_{i-1},...,x_1)$

  • Independent events: written $X\perp Y$;
    $P(X,Y)=P(X)P(Y)$

  • Conditionally independent events: written $X\perp Y\mid Z$;
    $P(X,Y|Z)=P(X|Z)P(Y|Z)$

  • Joint probability distribution: with joint density $p(x,y)$ and marginal densities $p_x(x)=\int p(x,y)dy$, $p_y(y)=\int p(x,y)dx$;
    $P(X\le a,Y\le b)=\int_{-\infty}^a\int_{-\infty}^b p(x,y)\,dy\,dx$

  • Prior probability:
    $P(X)$

  • Posterior probability:
    $P(X|Y)$

  • Bayes formula (a small numeric example follows this list):
    $P(X_i|Y)=\frac{P(Y|X_i)P(X_i)}{\sum_iP(Y|X_i)P(X_i)}$
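
A tiny numeric illustration of the Bayes formula (the prior and likelihood values are invented for this example):

```python
# hypothetical two-class example: P(X_1) = 0.3, P(X_2) = 0.7
prior = [0.3, 0.7]
# likelihood of observing Y under each class: P(Y | X_i)
likelihood = [0.8, 0.1]

evidence = sum(p * l for p, l in zip(prior, likelihood))   # total probability P(Y)
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]
print(posterior)   # P(X_i | Y), sums to 1
```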

2.2 Expectation and Variance

  • Expectation (mean): usually written $\mu$; if the defining series/integral does not converge, the expectation does not exist;
    $\mathbb{E}[x]=\int p(x)x\,dx$

  • Variance: usually written $\sigma^2$, i.e. the square of the standard deviation;
    $Var[x]=\mathbb{E}[(x-\mu)^2]=\mathbb{E}[x^2]-(\mathbb{E}[x])^2$

  • Covariance:
    $Cov[x,y]=\mathbb{E}[(x-\mathbb{E}[x])(y-\mathbb{E}[y])]=\mathbb{E}[xy]-\mathbb{E}[x]\mathbb{E}[y]$

  • Correlation: usually written $\rho$;
    $Corr[x,y]=\frac{Cov[x,y]}{\sigma_x\sigma_y}\in[-1,1]$

  • Other properties (a Monte Carlo check follows this list)

$\mathbb{E}[xy]=\mathbb{E}[x]\mathbb{E}[y]$ (when $x$ and $y$ are independent)

$\mathbb{E}[kx+y]=k\mathbb{E}[x]+\mathbb{E}[y]$

$Var[kx+y]=k^2Var[x]+Var[y]+2Cov[kx,y]$

$Cov[kx_1+x_2,y]=kCov[x_1,y]+Cov[x_2,y]$
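
A quick Monte Carlo check of the variance identity above (the distributions and the constant $k$ are arbitrary; for independent $x$ and $y$ the covariance term is approximately zero):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)
y = rng.exponential(scale=3.0, size=1_000_000)
k = 2.5

lhs = np.var(k * x + y)
rhs = k**2 * np.var(x) + np.var(y) + 2 * np.cov(k * x, y)[0, 1]
print(lhs, rhs)    # approximately equal
```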

2.3 Probability Distributions


  • Uniform distribution: $\mu=\frac{a+b}{2}$, $\sigma^2=\frac{(b-a)^2}{12}$;
    $p(x)=\frac{1}{b-a},\quad x\in[a,b]$

  • Binomial distribution: the Bernoulli distribution is its special case with $n=1$; $\mu=n\phi$, $\sigma^2=n\phi(1-\phi)$;
    $p(x)=\dbinom{n}{x}\phi^x(1-\phi)^{n-x},\quad x\in\{0,1,\dots,n\}$

  • Normal distribution: also called the Gaussian distribution, written $x\sim N(\mu,\sigma^2)$;
    $p(x)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}},\quad x\in\mathbb{R}$

  • Log-normal distribution: the logarithm of $x$ follows a normal distribution, i.e. $\log(x)\sim N(\mu,\sigma^2)$;
    $p(x)=\frac{1}{x\sqrt{2\pi}\sigma}e^{-\frac{(\log(x)-\mu)^2}{2\sigma^2}},\quad x\in\mathbb{R}^+$

  • Multivariate normal distribution: shown here in the two-dimensional case;
    $p(x_1,x_2)=\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}e^{\frac{-1}{2(1-\rho^2)}[\frac{(x_1-\mu_1)^2}{\sigma_1^2}-2\rho\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}+\frac{(x_2-\mu_2)^2}{\sigma_2^2}]},\quad x_1,x_2\in\mathbb{R}$

  • Exponential distribution: $\mu=\frac{1}{\lambda}$, $\sigma^2=\frac{1}{\lambda^2}$;
    $p(x)=\lambda e^{-\lambda x},\quad x\in\mathbb{R}^+$

  • Poisson distribution: $\mu=\lambda$, $\sigma^2=\lambda$;
    $p(x)=\frac{\lambda^x}{x!}e^{-\lambda},\quad x\in\{0,1,2,\dots\}$

  • Laplace distribution: $\sigma^2=2\gamma^2$;
    $p(x)=\frac{1}{2\gamma}e^{-\frac{|x-\mu|}{\gamma}},\quad x\in\mathbb{R}$

  • Beta distribution: written $x\sim B(\alpha,\beta)$; $\mu=\frac{\alpha}{\alpha+\beta}$; the Gamma function is $\Gamma(x)=\int_0^\infty t^{x-1}e^{-t}dt$, which equals $(x-1)!$ for positive integers $x$;
    $p(x)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1},\quad x\in[0,1]$

  • Dirichlet distribution: the generalization of the Beta distribution to more than two dimensions;
    $\text{Dirichlet}(x_1,...,x_k)=\frac{\Gamma(n_1+...+n_k)}{\Gamma(n_1)...\Gamma(n_k)}x_1^{n_1-1}...x_k^{n_k-1},\quad x_1,...,x_k\in[0,1],\ \sum_ix_i=1$

  • Mixture distribution (a scipy.stats check of several of the moments above follows this list):
    $p(x)=\sum_iP(c_i)p(x|c_i)$
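
As referenced above, a short scipy.stats sketch checking a few of the quoted moments and densities (the parameter values are arbitrary):

```python
import numpy as np
from scipy import stats

# binomial: mean = n*phi, variance = n*phi*(1-phi)
n, phi = 10, 0.3
print(stats.binom(n, phi).mean(), n * phi)
print(stats.binom(n, phi).var(), n * phi * (1 - phi))

# exponential with rate lambda: mean = 1/lambda
lam = 2.0
print(stats.expon(scale=1 / lam).mean(), 1 / lam)

# Poisson: mean = variance = lambda
print(stats.poisson(mu=4.0).mean(), stats.poisson(mu=4.0).var())

# normal pdf at a point matches the closed-form density
mu, sigma, x = 0.5, 1.5, 1.0
print(stats.norm(mu, sigma).pdf(x),
      np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma))
```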

2.4 Information Theory

  • Entropy: describes the uncertainty of a probability distribution (a NumPy sketch of these quantities follows this list);
    $H(X)=-\sum_{x\in X}p(x)\log p(x)\in\mathbb{R}^+$

  • Conditional entropy: the remaining uncertainty of the conditional distribution;
    $H(Y|X)=-\sum_{x\in X}p(x)\sum_{y\in Y}p(y|x)\log p(y|x)$

  • Mutual information: the reduction in the uncertainty of one distribution obtained from knowing another;
    $I(X;Y)=\sum_{x\in X}\sum_{y\in Y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$

  • KL divergence (Kullback-Leibler divergence): also called relative entropy, an asymmetric measure of how one probability distribution differs from another;
    $\text{KL Divergence}(p||q)=\sum_{x}p(x)\log\frac{p(x)}{q(x)}$

  • Cross entropy: measures how accurate predicted probabilities are against the labels (shown here in its binary form);
    $\text{Cross Entropy}(p)=-y\log p(x)-(1-y)\log [1-p(x)],\quad y\in\{0,1\}$
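
A NumPy sketch of entropy, KL divergence, and cross entropy for small discrete distributions (the distributions, label, and predicted probability are arbitrary examples):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])

entropy = -np.sum(p * np.log(p))          # H(p)
kl = np.sum(p * np.log(p / q))            # KL(p || q), always >= 0
cross_entropy = -np.sum(p * np.log(q))    # H(p, q) = H(p) + KL(p || q)
print(entropy, kl, cross_entropy, entropy + kl)

# binary cross entropy for a single label y and predicted probability p_hat
y, p_hat = 1, 0.8
bce = -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(bce)
```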

3 Optimization

3.1 Basic Concepts

In machine learning, the goal of optimization is to find the parameters $\theta$ that minimize a loss function:
$\min_\theta L(\theta)\quad s.t.\ \begin{cases} h_i(\theta)=0 \\ g_j(\theta)\le 0 \end{cases}$

The loss function $L$ is also called the objective function or cost function; $h$ and $g$ are constraints: constraints of the form of $h$ are equality constraints, and constraints of the form of $g$ are inequality constraints. Other concepts:

  • Unconstrained optimization problem: an optimization problem with no constraints;
  • Constrained optimization problem: an optimization problem with constraints;
  • Convex optimization: $L$ and the $g_j$ are convex functions and the $h_i$ are affine (linear);
  • Linear programming: $L$, $h$, and $g$ are all linear functions;
  • Nonlinear programming: any one of $L$, $h$, or $g$ is a nonlinear function;
  • Quadratic programming: $L$ is a quadratic function while $h$ and $g$ are linear;
  • Multi-objective programming: the output of $L$ is a vector.

Some equivalent characterizations from convex analysis:

| | Convex | Concave |
| --- | --- | --- |
| Definition | $\lambda f(x)+(1-\lambda) f(y)\ge f[\lambda x+(1-\lambda)y]\ \ \forall x,y\in X,\ \lambda\in[0,1]$ | $\lambda f(x)+(1-\lambda) f(y)\le f[\lambda x+(1-\lambda)y]\ \ \forall x,y\in X,\ \lambda\in[0,1]$ |
| Property | The Hessian matrix $\nabla^2 f(x)$ is positive semi-definite | The Hessian matrix $\nabla^2 f(x)$ is negative semi-definite |

3.2 Lagrange Multipliers

When solving a constrained optimization problem, the method of Lagrange multipliers is commonly used to fold the constraints into the objective. Take the following problem as an example:
$\min_xf(x)\quad s.t.\ \begin{cases}h(x)=0\\g(x)\le 0\end{cases}$

Introduce Lagrange multipliers $\lambda_1,\lambda_2$ and rewrite the original problem as an unconstrained min-max problem over the Lagrangian (swapping the min and max gives the dual problem):
$\min_{x}\max_{\lambda_1,\lambda_2}L=f(x)+\lambda_1h(x)+\lambda_2g(x)$

Set up the KKT conditions (Karush-Kuhn-Tucker conditions): stationarity, primal feasibility, dual feasibility, and complementary slackness:
$\begin{cases} \frac{\partial L}{\partial x}=0 \\ h(x)=0 \\ g(x)\le0 \\ \lambda_2\ge 0 \\ \lambda_2g(x)=0 \end{cases}$

Solve for the $x$ that satisfies these conditions and substitute it into the objective $f(x)$ to obtain the optimum. A numeric sketch follows.
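
As a numeric sketch, the same kind of constrained problem can be handed to scipy.optimize.minimize with the SLSQP solver; the objective and constraints below are invented for illustration, and note that scipy expects inequality constraints in the form fun(x) ≥ 0, so $g(x)\le 0$ is passed as $-g(x)\ge 0$:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2) ** 2 + (x[1] - 1) ** 2   # objective
h = lambda x: x[0] + x[1] - 2                     # equality constraint h(x) = 0
g = lambda x: x[0] - 1.5                          # inequality constraint g(x) <= 0

res = minimize(
    f, x0=np.zeros(2), method="SLSQP",
    constraints=[{"type": "eq", "fun": h},
                 {"type": "ineq", "fun": lambda x: -g(x)}],   # -g(x) >= 0
)
print(res.x, res.fun)   # optimum on the feasible set
```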

3.3 Convex Optimization

  • Newton's method: start from the second-order Taylor expansion:
    $f(x)=f(x_0)+f'(x_0)(x-x_0)+\frac{f''(x_0)}{2!}(x-x_0)^2+O((x-x_0)^3)$
    Take the parameters as the input of the expansion and the loss as the output:
    $L=f(W_{t-1}+\Delta W)\approx f(W_{t-1})+f'(W_{t-1})\Delta W+\frac{f''(W_{t-1})}{2!}\Delta W^2$
    which can be rewritten as
    $L(\Delta W)=\frac{f''(W_{t-1})}{2!}\Delta W^2+f'(W_{t-1})\Delta W+f(W_{t-1})$
    To solve for $\Delta W$, set $L'(\Delta W)=0$:
    $f''(W_{t-1})\Delta W+f'(W_{t-1})=0$
    so the update to the parameters is
    $\Delta W=-f''(W_{t-1})^{-1}f'(W_{t-1})$
    If the loss $L$ is a scalar and $W$ is a vector, then $f''(W_{t-1})$ is the Hessian matrix, and the update can also be written as
    $\Delta W=-H_{t-1}^{-1}f'(W_{t-1})$
    With $\Delta W$ computed, update the original parameters:
    $W_t\leftarrow W_{t-1}+\alpha \Delta W$

  • Quasi-Newton methods: to avoid the cost of inverting the Hessian directly as Newton's method does, these methods build an approximation of the (inverse) Hessian instead; well-known examples include the DFP and BFGS algorithms, which interested readers can explore on their own.

  • SGD (stochastic gradient descent): gradient descent applied to mini-batches of data; the principle is the same as full-batch gradient descent. Given the loss $L=f(Z^{(k)},Y)$ of the model output and the last layer's parameters $W^{(k)}$, the gradient of $W^{(k)}$ is:
    $\nabla W^{(k)}=\frac{\partial L}{\partial W^{(k)}}=\frac{\partial L}{\partial Z^{(k)}}\frac{\partial Z^{(k)}}{\partial W^{(k)}}$
    For parameters $W^{(i)}\ (i<k)$ further upstream in the model, the gradient propagates backwards following the chain rule:
    $\nabla W^{(i)}=\frac{\partial L}{\partial W^{(i)}}=\frac{\partial L}{\partial Z^{(k)}}\frac{\partial Z^{(k)}}{\partial Z^{(k-1)}}...\frac{\partial Z^{(i)}}{\partial W^{(i)}}$
    Once the gradient is computed, update the parameters with the negative gradient, where $\alpha$ is the learning rate:
    $W_t^{(i)}\leftarrow W_{t-1}^{(i)}+(-\alpha \nabla W^{(i)})$

  • Momentum: to damp oscillation in the update direction, introduce a momentum factor $\beta$;
    $V_t=\beta V_{t-1}+(1-\beta)\Delta W$
    $W_t\leftarrow W_{t-1}+\alpha V_t$

  • RMSProp: to keep the effective learning rate from growing too large, introduce an adaptive learning-rate adjustment;
    $S_t=\beta S_{t-1}+(1-\beta)\Delta W^2$
    $W_t\leftarrow W_{t-1}+\alpha \frac{\Delta W}{\sqrt{S_t}+\varepsilon}$

  • Adam: combines the ideas of Momentum and RMSProp; note that the bias corrections use $\beta^t$, where $t$ is the step count (the update rules are sketched in code after this list);
    $V_t=\beta_1 V_{t-1}+(1-\beta_1)\Delta W$
    $S_t=\beta_2 S_{t-1}+(1-\beta_2)\Delta W^2$
    $\hat{V_t}=\frac{V_t}{1-\beta_1^t}$
    $\hat{S_t}=\frac{S_t}{1-\beta_2^t}$
    $W_t\leftarrow W_{t-1}+\alpha \frac{\hat{V_t}}{\sqrt{\hat{S_t}}+\varepsilon}$
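
A compact sketch of the update rules above on a toy quadratic loss (hyperparameters are arbitrary; `dw` denotes the negative gradient $\Delta W$, matching the sign convention used in the formulas):

```python
import numpy as np

grad = lambda w: 2 * (w - 3.0)        # gradient of the toy loss L(w) = (w - 3)^2

def sgd(w, lr=0.1, steps=100):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum(w, lr=0.1, beta=0.9, steps=100):
    v = 0.0
    for _ in range(steps):
        v = beta * v + (1 - beta) * (-grad(w))
        w = w + lr * v
    return w

def rmsprop(w, lr=0.1, beta=0.9, eps=1e-8, steps=100):
    s = 0.0
    for _ in range(steps):
        dw = -grad(w)
        s = beta * s + (1 - beta) * dw ** 2
        w = w + lr * dw / (np.sqrt(s) + eps)
    return w

def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    v, s = 0.0, 0.0
    for t in range(1, steps + 1):
        dw = -grad(w)
        v = beta1 * v + (1 - beta1) * dw
        s = beta2 * s + (1 - beta2) * dw ** 2
        v_hat = v / (1 - beta1 ** t)          # bias correction uses beta^t
        s_hat = s / (1 - beta2 ** t)
        w = w + lr * v_hat / (np.sqrt(s_hat) + eps)
    return w

print(sgd(0.0), momentum(0.0), rmsprop(0.0), adam(0.0))   # all approach 3.0
```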
