Chapter 3: Logistic Regression
Learning Task 12: The Binary Classification Problem
Learning Task 13: The Logistic Function
$$f(x) = \frac{1}{1 + e^{-x}}$$

At the two extremes:

$$\lim_{x \to -\infty} f(x) = \frac{1}{1 + e^{\infty}} = \frac{1}{1 + \infty} = 0$$

$$\lim_{x \to \infty} f(x) = \frac{1}{1 + e^{-\infty}} = \frac{1}{1 + 0} = 1$$
```python
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-6, 6, 100)
y = sigmoid(x)
mark = 0.5 * np.ones(x.shape)  # the 0.5 decision threshold

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, y)
ax.plot(x, mark, ":")
ax.set_xlabel("$x$")
ax.set_ylabel("$f(x)$")
ax.grid()
plt.show()
```
Learning Task 14: Exponentials and Logarithms; Logistic Regression
- Exponentials and logarithms
```python
def exp(x):
    return np.exp(x)

def ln(x):
    return np.log(x)  # natural logarithm

def lin(x):
    return x

x = np.linspace(-4, 4, 100)
y_exp = exp(x)
y_ln = ln(x[np.nonzero(x > 0)])  # ln is only defined for x > 0
y_lin = lin(x)

fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(111)
ax.plot(x, y_exp, label="$y = e^{x}$")
ax.plot(x[np.nonzero(x > 0)], y_ln, label=r"$y = \ln(x)$")
ax.plot(x, y_lin, label="$y = x$")
ax.set_xlabel("$x$")
ax.set_ylabel("$f(x)$")
ax.set_ylim(-4, 4)
ax.grid()
ax.legend()
plt.show()
```
- Logistic regression

Solves the binary (0/1) classification problem.
$$P(y = 1 \mid \mathbf{x}; \mathbf{\theta}) = f(\mathbf{x}; \mathbf{\theta}) = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$$

$$\mathbf{\theta}^{\mathrm{T}} \mathbf{x} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots$$

$$\mathbf{\theta} = \left[ \theta_0, \theta_1, \theta_2, \cdots \right]$$

$$\mathbf{x} = \left[ 1, x_1, x_2, \cdots \right]$$

If $P(y = 1 \mid \mathbf{x}) > 0.5$, predict class 1; otherwise predict class 0.
- Key identities of logistic regression

Probability of class 1: $P = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$

Probability of class 0: $1 - P = \frac{e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}} = \frac{1}{1 + e^{\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$

Odds of class 1 versus class 0: $\frac{P}{1 - P} = e^{\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}$

Natural logarithm of the odds: $\ln \frac{P}{1 - P} = \mathbf{\theta}^{\mathrm{T}} \mathbf{x}$

These identities can be checked numerically, as in the sketch below.
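A minimal numeric sanity check of the four identities; the particular values of `theta` and `x` are arbitrary illustrative choices:

```python
import numpy as np

# Arbitrary parameter vector and feature vector (leading 1 for the intercept).
theta = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 0.3, 0.7])

z = theta @ x               # theta^T x
P = 1 / (1 + np.exp(-z))    # probability of class 1

print(np.isclose(1 - P, 1 / (1 + np.exp(z))))   # probability of class 0
print(np.isclose(P / (1 - P), np.exp(z)))       # odds
print(np.isclose(np.log(P / (1 - P)), z))       # log-odds = theta^T x
```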
Learning Task 15: A Logistic Regression Example
| Age ($x_1$) | Annual income ($x_2$, 10k CNY) | Buys a car (1: yes; 0: no) |
| --- | --- | --- |
| 20 | 3 | 0 |
| 23 | 7 | 1 |
| 31 | 10 | 1 |
| 42 | 13 | 1 |
| 50 | 7 | 0 |
| 60 | 5 | 0 |
| 28 | 8 | ? |

The last row is the test sample whose label we want to predict.
```python
from sklearn import linear_model

X = [[20, 3],
     [23, 7],
     [31, 10],
     [42, 13],
     [50, 7],
     [60, 5]]
y = [0, 1, 1, 1, 0, 0]

lr = linear_model.LogisticRegression()
lr.fit(X, y)

testX = [[28, 8]]
label = lr.predict(testX)
print("predicted label = {}".format(label))
prob = lr.predict_proba(testX)
print("probability = {}".format(prob))
print("theta_0 = {0[0]}, theta_1 = {1[0][0]}, theta_2 = {1[0][1]}".format(lr.intercept_, lr.coef_))
```
```
predicted label = [1]
probability = [[0.14694811 0.85305189]]
theta_0 = -0.04131837596993478, theta_1 = -0.1973000136829152, theta_2 = 0.915557452347983
```
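To tie the fitted coefficients back to the formula, the probability reported by `predict_proba` can be reproduced by hand; this sketch simply continues from the `lr` fitted above:

```python
import numpy as np

# P(y = 1 | x) = 1 / (1 + exp(-(theta_0 + theta_1 * x_1 + theta_2 * x_2)))
z = lr.intercept_[0] + lr.coef_[0][0] * 28 + lr.coef_[0][1] * 8
print("P(y=1 | [28, 8]) = {}".format(1 / (1 + np.exp(-z))))
# Matches the second entry of predict_proba, ~0.853.
```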
Learning Task 16: The Loss Function
Probability of class 1:

$$P(y = 1 \mid \mathbf{x}; \mathbf{\theta}) = f(\mathbf{x}; \mathbf{\theta}) = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$$

Loss function (the negative log-likelihood, derived in Learning Task 17):

$$J(\mathbf{\theta}) = - \sum_{i=1}^{N} \left[ y^{(i)} \ln P(Y = 1 \mid \mathbf{X} = \mathbf{x}^{(i)}; \mathbf{\theta}) + \left( 1 - y^{(i)} \right) \ln \left( 1 - P(Y = 1 \mid \mathbf{X} = \mathbf{x}^{(i)}; \mathbf{\theta}) \right) \right]$$

Gradient of the loss function:

$$\begin{aligned} \nabla_{\mathbf{\theta}} J(\mathbf{\theta}) &= \sum_{i=1}^{N} \left( P(Y = 1 \mid \mathbf{X} = \mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right) \mathbf{x}^{(i)} \\ &= \sum_{i=1}^{N} \mathbf{x}^{(i)} \left( f(\mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right) \end{aligned}$$
Learning Task 17: Deriving the Loss Function
- Differentiation (product rule)

$$\left( f(x)g(x) \right)^{\prime} = f^{\prime}(x)g(x) + f(x)g^{\prime}(x)$$

- Logarithms

$$\log(xy) = \log(x) + \log(y)$$

$$\log^{\prime}(x) = \frac{1}{x}$$

- Chain rule

$$\begin{aligned} z &= f(y) \\ y &= g(x) \end{aligned} \quad \Rightarrow \quad \frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$$

- Sigmoid

$$f(x) = \frac{1}{1 + e^{-x}} \quad \Rightarrow \quad \begin{aligned} f^{\prime}(x) &= (-1) \frac{(-1) e^{-x}}{\left( 1 + e^{-x} \right)^2} \\ &= \frac{e^{-x}}{1 + e^{-x}} \cdot \frac{1}{1 + e^{-x}} \\ &= f(x) \left( 1 - f(x) \right) \end{aligned}$$

Composing with $z = \theta x$, the chain rule gives:

$$f(z) = \frac{1}{1 + e^{-z}}, \quad z = \theta x \quad \Rightarrow \quad \frac{df}{dx} = f(z) \left( 1 - f(z) \right) \theta$$
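The identity $f^{\prime}(x) = f(x)(1 - f(x))$ is easy to confirm with a central finite difference; a minimal sketch reusing the `sigmoid` function defined earlier:

```python
x0 = 0.7   # arbitrary test point
h = 1e-6   # finite-difference step

numeric = (sigmoid(x0 + h) - sigmoid(x0 - h)) / (2 * h)
analytic = sigmoid(x0) * (1 - sigmoid(x0))
print(numeric, analytic)   # the two values agree to ~10 decimal places
```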
- Loss function

Training set $\{ (\mathbf{x}_i, y_i) \}$, $i \in \{1, 2, \cdots, N\}$, $\mathbf{x}_i \in \mathbb{R}^m$, $y_i \in \{0, 1\}$.

The logistic function gives the probability that the classifier assigns $y_i = 1$ to a given sample $\mathbf{x}_i$:

$$\begin{aligned} P_i &= P\left( y_i = 1 \mid \mathbf{x}_i; \mathbf{\theta} \right) \\ &= f(\mathbf{\theta}^{\mathrm{T}} \mathbf{x}_i) \end{aligned}$$

Likelihood function:

$$L(\mathbf{\theta}) = \prod_{i \mid y_i = 1} P_i \cdot \prod_{i \mid y_i = 0} \left( 1 - P_i \right)$$

The goal is the $\mathbf{\theta}$ that maximizes $L(\mathbf{\theta})$:

$$\mathbf{\theta} = \arg \max_{\mathbf{\theta}} L(\mathbf{\theta})$$

Log-likelihood function (the logarithm turns the products into sums):

$$\begin{aligned} l(\mathbf{\theta}) = \log L(\mathbf{\theta}) &= \log \left[ \prod_{i \mid y_i = 1} P_i \cdot \prod_{i \mid y_i = 0} \left( 1 - P_i \right) \right] \\ &= \sum_{i \mid y_i = 1} \log P_i + \sum_{i \mid y_i = 0} \log \left( 1 - P_i \right) \\ &= \sum_{i = 1}^{N} \left[ y_i \log P_i + \left( 1 - y_i \right) \log \left( 1 - P_i \right) \right] \end{aligned}$$

Differentiating, using $\frac{d P_i}{d \mathbf{\theta}} = P_i \left( 1 - P_i \right) \mathbf{x}_i$:

$$\begin{aligned} \frac{d l(\mathbf{\theta})}{d \mathbf{\theta}} &= \sum_{i = 1}^{N} \left[ y_i \frac{d \log P_i}{d \mathbf{\theta}} + \left( 1 - y_i \right) \frac{d \log \left( 1 - P_i \right)}{d \mathbf{\theta}} \right] \\ &= \sum_{i = 1}^{N} \left[ y_i \frac{P_i \left( 1 - P_i \right)}{P_i} \mathbf{x}_i + \left( 1 - y_i \right) \frac{(-1) P_i \left( 1 - P_i \right)}{1 - P_i} \mathbf{x}_i \right] \\ &= \sum_{i = 1}^{N} \left[ y_i \left( 1 - P_i \right) \mathbf{x}_i - \left( 1 - y_i \right) P_i \mathbf{x}_i \right] \\ &= \sum_{i = 1}^{N} \left( y_i - P_i \right) \mathbf{x}_i \end{aligned}$$

Maximizing $l(\mathbf{\theta}) = \log L(\mathbf{\theta})$ maximizes $L(\mathbf{\theta})$, so define the loss function as its negative:

$$\mathrm{loss}(\mathbf{\theta}) = - l(\mathbf{\theta})$$

Then:

$$\frac{d\, \mathrm{loss}(\mathbf{\theta})}{d \mathbf{\theta}} = \sum_{i = 1}^{N} \left( P_i - y_i \right) \mathbf{x}_i$$
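The closed-form gradient $\sum_i (P_i - y_i)\mathbf{x}_i$ can be verified against finite differences of $\mathrm{loss}(\mathbf{\theta})$; a sketch on synthetic data (the dataset and test point are arbitrary):

```python
import numpy as np

def loss(theta, X, y):
    """Negative log-likelihood; rows of X carry a leading 1 for the intercept."""
    P = 1 / (1 + np.exp(-X @ theta))
    return -np.sum(y * np.log(P) + (1 - y) * np.log(1 - P))

def grad(theta, X, y):
    """Closed-form gradient: sum_i (P_i - y_i) x_i."""
    P = 1 / (1 + np.exp(-X @ theta))
    return X.T @ (P - y)

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = rng.integers(0, 2, size=20).astype(float)
theta = rng.normal(size=3)

# Central differences, one coordinate at a time.
h = 1e-6
numeric = np.array([(loss(theta + h * e, X, y) - loss(theta - h * e, X, y)) / (2 * h)
                    for e in np.eye(3)])
print(np.allclose(numeric, grad(theta, X, y), atol=1e-5))
```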
Learning Task 18: Gradient Descent

$$f(\mathbf{x}; \mathbf{\theta}) = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}}$$

$$\mathbf{\theta} \leftarrow \mathbf{\theta} - \alpha \nabla_{\mathbf{\theta}} J(\mathbf{\theta}) = \mathbf{\theta} - \alpha \sum_{i=1}^{N} \mathbf{x}^{(i)} \left( f(\mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right)$$
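As an illustration of the update rule, here is a minimal batch gradient descent on the car data from Learning Task 15. This is a sketch, not scikit-learn's actual solver; the learning rate, iteration count, and feature standardization are illustrative choices (without standardization, the raw age values would demand a much smaller step size):

```python
import numpy as np

X_raw = np.array([[20, 3], [23, 7], [31, 10], [42, 13], [50, 7], [60, 5]], dtype=float)
y = np.array([0, 1, 1, 1, 0, 0], dtype=float)

# Standardize the features so one learning rate suits both columns,
# then prepend the constant-1 column for the intercept theta_0.
mu, sigma = X_raw.mean(axis=0), X_raw.std(axis=0)
X = np.hstack([np.ones((len(X_raw), 1)), (X_raw - mu) / sigma])

theta = np.zeros(3)
alpha = 0.1                            # learning rate (illustrative)
for _ in range(1000):                  # iteration count (illustrative)
    f = 1 / (1 + np.exp(-X @ theta))   # f(x^(i); theta) for every sample
    theta -= alpha * X.T @ (f - y)     # theta <- theta - alpha * sum_i x_i (f_i - y_i)

# Predict the held-out sample [28, 8], standardized the same way.
x_test = np.concatenate([[1.0], (np.array([28, 8]) - mu) / sigma])
print("P(buy | age=28, income=8) =", 1 / (1 + np.exp(-x_test @ theta)))
# Note: this toy dataset is linearly separable, so without regularization
# theta keeps growing slowly and the probability drifts toward 0 or 1.
```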
- Interpreting the coefficients

Odds: $\mathrm{odds} = \frac{P}{1 - P} = e^{\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}$

A coefficient $\theta_j$ means: if the original odds are $\lambda_1$, and increasing the corresponding feature $x_j$ by 1 yields new odds $\lambda_2$, then $\frac{\lambda_2}{\lambda_1} \equiv e^{\theta_j}$.
```python
theta_0 = lr.intercept_
theta_1 = lr.coef_[0][0]
theta_2 = lr.coef_[0][1]
print("theta_0 = {0[0]}, theta_1 = {1}, theta_2 = {2}".format(theta_0, theta_1, theta_2))

# Odds for the original test sample.
testX = [[28, 8]]
prob = lr.predict_proba(testX)
ratio = prob[0][1] / prob[0][0]

# Increase income (x_2) by 1 and recompute the odds.
testX = [[28, 9]]
prob_new = lr.predict_proba(testX)
ratio_new = prob_new[0][1] / prob_new[0][0]

# The ratio of the two odds should equal e^{theta_2}.
ratio_of_ratio = ratio_new / ratio
print("ratio of ratio = {0}".format(ratio_of_ratio))

import math
theta2_e = math.exp(theta_2)
print("theta2 e = {}".format(theta2_e))
```
```
theta_0 = -0.04131837596993478, theta_1 = -0.1973000136829152, theta_2 = 0.915557452347983
ratio of ratio = 2.4981674731438943
theta2 e = 2.4981674731438948
```
$\theta_2 = 0.92$ means: if annual income increases by 10k CNY, the odds of buying a car versus not buying are multiplied by $e^{0.92} = 2.5$ relative to the previous odds.
$\theta_1 = -0.20$ means: if age increases by 1 year, the odds of buying a car versus not buying are multiplied by $e^{-0.20} = 0.82$, i.e. they decrease.
Learning Task 19: Application
```python
import pandas as pd
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer

# SMS spam dataset: column 0 is the label ('ham'/'spam'), column 1 is the message text.
df = pd.read_csv("./data/SMSSpamCollection.csv", delimiter=',', header=None)
y, X_train = df[0], df[1]

# Turn each message into a TF-IDF feature vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X_train)

lr = linear_model.LogisticRegression()
lr.fit(X, y)

testX = vectorizer.transform(["URGENT! Your mobile No. 1234 was awarded a Prize.",
                              "Hey honey, what's up?"])
predictions = lr.predict(testX)
print(predictions)
```
```
['spam' 'ham']
```
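To judge the classifier rather than eyeball two messages, a held-out test set gives an accuracy estimate; a sketch continuing with the same DataFrame (the split fraction and random seed are illustrative):

```python
from sklearn.model_selection import train_test_split

# Split the raw messages first, so the vectorizer sees only training text.
X_tr, X_te, y_tr, y_te = train_test_split(df[1], df[0], test_size=0.25, random_state=0)

vec = TfidfVectorizer()
clf = linear_model.LogisticRegression()
clf.fit(vec.fit_transform(X_tr), y_tr)
print("test accuracy = {}".format(clf.score(vec.transform(X_te), y_te)))
```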
PS: The Hessian matrix of the loss function $J(\mathbf{\theta})$ with respect to $\mathbf{\theta}$.

- Loss function:

$$J(\mathbf{\theta}) = - \sum_{i=1}^{N} \left[ y^{(i)} \ln f(\mathbf{x}^{(i)}; \mathbf{\theta}) + \left( 1 - y^{(i)} \right) \ln \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) \right]$$

where

$$f(\mathbf{x}; \mathbf{\theta}) = \frac{1}{1 + e^{-\mathbf{\theta}^{\mathrm{T}} \mathbf{x}}},$$

$$\mathbf{x} = \left[ 1, x_1, x_2, \cdots, x_n \right]^{\mathrm{T}},$$

$$\mathbf{\theta} = \left[ \theta_0, \theta_1, \theta_2, \cdots, \theta_n \right]^{\mathrm{T}},$$

and $\mathbf{x}^{(i)}$ is the column vector representing the $i$-th sample.

- Gradient of $J(\mathbf{\theta})$ with respect to $\mathbf{\theta}$:

$$\nabla_{\mathbf{\theta}} J(\mathbf{\theta}) = \sum_{i=1}^{N} \mathbf{x}^{(i)} \left( f(\mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right)$$

- Hessian of $J(\mathbf{\theta})$ with respect to $\mathbf{\theta}$:

The first-order partial derivative of $J(\mathbf{\theta})$ with respect to $\theta_p$ is:

$$\frac{\partial J(\mathbf{\theta})}{\partial \theta_p} = \sum_{i=1}^{N} x^{(i)}_p \left( f(\mathbf{x}^{(i)}; \mathbf{\theta}) - y^{(i)} \right)$$

The second-order partial derivative of $J(\mathbf{\theta})$ with respect to $\theta_p$ and $\theta_q$ is:

$$\begin{aligned} \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_p \partial \theta_q} &= \sum_{i=1}^{N} x^{(i)}_p \frac{\partial f(\mathbf{x}^{(i)}; \mathbf{\theta})}{\partial \theta_q} \\ &= \sum_{i=1}^{N} x^{(i)}_p f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) x^{(i)}_q \\ &= \sum_{i=1}^{N} f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) x^{(i)}_p x^{(i)}_q \end{aligned}$$

Note: $f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right)$ is a scalar, and it is strictly positive.

$$\begin{aligned} H \left( J(\mathbf{\theta}) \right) &= \begin{bmatrix} \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_1 \partial \theta_1} & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_1 \partial \theta_n} \\ \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_2 \partial \theta_2} & \cdots & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_2 \partial \theta_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_n \partial \theta_1} & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_n \partial \theta_2} & \cdots & \frac{\partial^2 J(\mathbf{\theta})}{\partial \theta_n \partial \theta_n} \end{bmatrix} \\ &= \sum_{i=1}^{N} \left( f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) \begin{bmatrix} x^{(i)}_1 x^{(i)}_1 & x^{(i)}_1 x^{(i)}_2 & \cdots & x^{(i)}_1 x^{(i)}_n \\ x^{(i)}_2 x^{(i)}_1 & x^{(i)}_2 x^{(i)}_2 & \cdots & x^{(i)}_2 x^{(i)}_n \\ \vdots & \vdots & \ddots & \vdots \\ x^{(i)}_n x^{(i)}_1 & x^{(i)}_n x^{(i)}_2 & \cdots & x^{(i)}_n x^{(i)}_n \end{bmatrix} \right) \\ &= \sum_{i=1}^{N} f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^{\mathrm{T}} \end{aligned}$$
- Positive-definiteness of the Hessian

$$H \left( J(\mathbf{\theta}) \right) = \sum_{i=1}^{m} f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^{\mathrm{T}}$$

where $m$ is the number of samples and $n$ the number of features. Two observations:

(1) $f(\mathbf{x}^{(i)}; \mathbf{\theta}) \left( 1 - f(\mathbf{x}^{(i)}; \mathbf{\theta}) \right) > 0$

(2) $H \left( J(\mathbf{\theta}) \right)$ has the form of the autocorrelation matrix of a random vector.

When $m \gg 0$:

$$\mathrm{E}\left[ x_j x_k \right] \approx \frac{1}{m} \sum_{i=1}^{m} x^{(i)}_j x^{(i)}_k$$

When the components $x_j$ of $\mathbf{x}^{(i)}$ are mutually independent:

$$\mathrm{E}\left[ x_j x_k \right] \begin{cases} = 0, & \text{if } j \neq k \\ > 0, & \text{if } j = k \end{cases}$$

When $m \gg n$, $\mathrm{E}\left[ \mathbf{x} \mathbf{x}^{\mathrm{T}} \right]$ is a full-rank diagonal matrix with strictly positive diagonal entries, so $H \left( J(\mathbf{\theta}) \right)$ is positive definite; otherwise $H \left( J(\mathbf{\theta}) \right)$ is positive semi-definite.

When $H \left( J(\mathbf{\theta}) \right)$ satisfies the positive-definite condition ($m \gg n$), $J(\mathbf{\theta})$ is strictly convex with a global optimum, so batch gradient descent is guaranteed to converge to the global minimum of $J(\mathbf{\theta})$; when $H \left( J(\mathbf{\theta}) \right)$ only satisfies the positive semi-definite condition ($m < n$), mini-batch gradient descent or stochastic gradient descent may leave $J(\mathbf{\theta})$ stuck in a local minimum. A numeric illustration of the two regimes follows.
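The two regimes can be probed numerically by building $H = \sum_i f_i (1 - f_i)\, \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^{\mathrm{T}}$ for synthetic data and inspecting the eigenvalues; a sketch with $m \gg n$ versus $m < n$ (the data and $\mathbf{\theta}$ are random):

```python
import numpy as np

def hessian(theta, X):
    """H = sum_i f_i (1 - f_i) x_i x_i^T, computed over all rows of X at once."""
    f = 1 / (1 + np.exp(-X @ theta))
    w = f * (1 - f)                  # strictly positive scalar weight per sample
    return (X * w[:, None]).T @ X

rng = np.random.default_rng(0)
n = 5
theta = rng.normal(size=n)

for m in (200, 3):                   # m >> n, then m < n
    X = rng.normal(size=(m, n))
    eig = np.linalg.eigvalsh(hessian(theta, X))
    print("m = {:3d}: min eigenvalue = {:.3e}".format(m, eig.min()))
# With m >> n every eigenvalue is strictly positive (positive definite);
# with m < n the rank is at most m, so some eigenvalues are ~0 (semi-definite).
```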