The decision function of logistic regression (LR) is
$$h(\boldsymbol x)=\sigma(\boldsymbol \theta^T \boldsymbol x)=\frac{1}{1+e^{-\boldsymbol \theta^T \boldsymbol x}} \tag{1}$$
where $\sigma(z)=\frac{1}{1+e^{-z}}$ is called the sigmoid function.
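As a quick illustration, here is a minimal NumPy sketch of the sigmoid and of decision function (1); the names `sigmoid` and `predict_proba` are our own, not from any particular library:

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))."""
    # Clipping z keeps exp() from overflowing for very negative inputs.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def predict_proba(theta, X):
    """h(x) = sigma(theta^T x), applied row-wise to the data matrix X."""
    return sigmoid(X @ theta)
```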
Let $h(\boldsymbol x)$ denote the probability that the sample is positive, and interpret it as an estimate of the class posterior $p(y=1|\boldsymbol x;\boldsymbol \theta)$; then:
$$p(y=1|\boldsymbol x;\boldsymbol \theta)=h(\boldsymbol x) \tag{2}$$
$$p(y=0|\boldsymbol x;\boldsymbol \theta)=1-h(\boldsymbol x) \tag{3}$$
Combining equations (2) and (3) into a single expression (substituting $y=1$ recovers (2), and $y=0$ recovers (3)) gives
$$p(y|\boldsymbol x;\boldsymbol \theta)=h(\boldsymbol x)^y\,(1-h(\boldsymbol x))^{1-y} \tag{4}$$
We can estimate the parameter $\boldsymbol \theta$ by maximum likelihood. The likelihood function is
$$L(\boldsymbol \theta)=\prod_{i=1}^m p(y^{(i)}|\boldsymbol x^{(i)};\boldsymbol \theta)=\prod_{i=1}^m h(\boldsymbol x^{(i)})^{y^{(i)}}\,(1-h(\boldsymbol x^{(i)}))^{1-y^{(i)}} \tag{5}$$
where $m$ is the number of samples in the dataset.
Since the logarithm preserves monotonicity and avoids numerical problems (a product of many probabilities quickly underflows), we take logs to obtain
$$\log L(\boldsymbol \theta)= \sum_{i=1}^m \left[\, y^{(i)}\log(h(\boldsymbol x^{(i)})) + (1-y^{(i)})\log(1-h(\boldsymbol x^{(i)})) \,\right] \tag{6}$$
Maximizing (6) is equivalent to minimizing the following loss function, which is exactly the cross-entropy loss:
$$J(\boldsymbol \theta)= -\frac1m\sum_{i=1}^m \left[\, y^{(i)}\log(h(\boldsymbol x^{(i)})) + (1-y^{(i)})\log(1-h(\boldsymbol x^{(i)})) \,\right] \tag{7}$$
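Equation (7) translates directly into code. The sketch below reuses the hypothetical `predict_proba` above; the `eps` clipping that guards against $\log 0$ is our addition:

```python
def cross_entropy_loss(theta, X, y, eps=1e-12):
    """J(theta) from equation (7): average binary cross-entropy."""
    h = predict_proba(theta, X)     # h^{(i)} for every sample
    h = np.clip(h, eps, 1 - eps)    # keep log() away from 0
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
```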
To simplify the derivation, let $J_i$ denote the $i$-th term of $J(\boldsymbol \theta)$, corresponding to the $i$-th sample, i.e.
$$J(\boldsymbol \theta)= -\frac1m\sum_{i=1}^m J_i(\boldsymbol \theta) \tag{8}$$
$$J_i(\boldsymbol \theta)=y^{(i)}\log(h(\boldsymbol x^{(i)})) + (1-y^{(i)})\log(1-h(\boldsymbol x^{(i)})) \tag{9}$$
We first derive $\frac{\partial J_i}{\partial \boldsymbol \theta}$. Dropping the superscript $(i)$ on $\boldsymbol x^{(i)}$, $y^{(i)}$, and $h^{(i)}$, and writing $h=\sigma(z)$ with $z=\boldsymbol \theta^T \boldsymbol x$, we have:
$$\begin{aligned} \frac{\partial J_i(\boldsymbol \theta)}{\partial \boldsymbol \theta} &=y\frac{\partial \log h}{\partial \boldsymbol \theta} + (1-y)\frac{\partial \log (1-h)}{\partial \boldsymbol \theta} \\ &=\frac yh \frac{\partial h}{\partial \boldsymbol \theta} +\frac{1-y}{1-h}\frac{\partial(1-h)}{\partial \boldsymbol \theta} \\ &=\frac{y-h}{h(1-h)} \frac{\partial h}{\partial \boldsymbol \theta} \\ &=\frac{y-h}{h(1-h)} \frac{\partial \sigma(z)}{\partial \boldsymbol \theta} \\ &=\frac{y-h}{h(1-h)} \frac{\partial \sigma(z)}{\partial z} \frac{\partial z}{\partial \boldsymbol \theta} \\ &=\frac{y-h}{h(1-h)}\, h(1-h)\, \frac{\partial \boldsymbol \theta^T \boldsymbol x}{\partial \boldsymbol \theta} \\ &=(y-h)\,\boldsymbol x \end{aligned}$$
Here the second-to-last step uses the identity $\sigma'(z)=\sigma(z)(1-\sigma(z))=h(1-h)$. Restoring the superscripts $(i)$, this is:
$$\frac{\partial J_i}{\partial \boldsymbol \theta}=(y^{(i)}-h^{(i)})\,\boldsymbol x^{(i)} \tag{10}$$
From equations (8) and (10), we obtain
$$\frac{\partial J}{\partial \boldsymbol \theta}=-\frac1m\sum_{i=1}^m \frac{\partial J_i}{\partial \boldsymbol \theta}=\frac1m\sum_{i=1}^m (h^{(i)}-y^{(i)})\,\boldsymbol x^{(i)} \tag{11}$$
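As a sanity check on equation (11), the analytic gradient can be compared against a central finite-difference approximation of the loss; this sketch assumes the `cross_entropy_loss` and `predict_proba` defined above, with random data purely for illustration:

```python
def analytic_grad(theta, X, y):
    """Equation (11) in vectorized form: (1/m) * X^T (h - y)."""
    return X.T @ (predict_proba(theta, X) - y) / len(y)

def numeric_grad(theta, X, y, delta=1e-6):
    """Central finite differences of J(theta), one coordinate at a time."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = delta
        g[j] = (cross_entropy_loss(theta + e, X, y)
                - cross_entropy_loss(theta - e, X, y)) / (2 * delta)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
theta = rng.normal(size=3)
assert np.allclose(analytic_grad(theta, X, y), numeric_grad(theta, X, y))
```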
Hence the gradient-descent update rule is
$$\boldsymbol \theta \leftarrow \boldsymbol \theta-\alpha\, \frac 1 m\sum_{i=1}^m (h^{(i)}-y^{(i)})\,\boldsymbol x^{(i)} \tag{12}$$
where $\alpha$ is the learning rate.
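Putting the pieces together, a minimal batch gradient-descent loop implementing update (12) might look as follows; the learning rate and iteration count are arbitrary illustrative choices:

```python
def fit_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    """Train theta by batch gradient descent on J(theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (predict_proba(theta, X) - y) / len(y)  # equation (11)
        theta -= alpha * grad                                # equation (12)
    return theta
```

To learn a bias term, append a constant column of ones to `X` so that one component of $\boldsymbol \theta$ acts as the intercept.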