I'm still a beginner, so corrections are welcome, but please be kind.
In linear regression, the model to be solved is

$$y=\mathbf{w}^T\mathbf{x}+b \tag{1}$$
With a slight modification, the linear regression model can also be used for other problems. Consider, for example, the model

$$y=e^{\mathbf{w}^T\mathbf{x}+b} \tag{2}$$
Taking the logarithm of both sides gives

$$\ln y=\mathbf{w}^T\mathbf{x}+b \tag{3}$$
This is still linear regression in form, but it is actually a nonlinear mapping from the input space to the output space. More generally, consider a monotonic differentiable function $g(\cdot)$ and let

$$y=g^{-1}(\mathbf{w}^T\mathbf{x}+b) \tag{4}$$

The resulting model is called a generalized linear model, and the function $g(\cdot)$ is called the link function.
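As a small illustration of models (2) and (3), the sketch below (my own toy example, not from the original post) fits the exponential model by running ordinary least squares on $\ln y$ instead of $y$; np.polyfit is used here only as a minimal stand-in for a full linear-regression routine.

import numpy as np

# Toy data generated from y = exp(2x + 1) with small multiplicative noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.exp(2 * x + 1) * np.exp(rng.normal(0, 0.05, size=x.shape))

# Linear regression on ln(y): ln(y) = w*x + b, i.e. model (3)
w, b = np.polyfit(x, np.log(y), deg=1)
print(w, b)                   # should be close to 2 and 1
y_pred = np.exp(w * x + b)    # back on the original scale, i.e. model (2)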
The generalized linear model can then be used to solve classification problems; this leads to logistic regression (log-odds regression).
In a binary classification task the output label is $y \in \{0,1\}$, while the linear regression model $z=\mathbf{w}^T\mathbf{x}+b$ produces a real-valued prediction. The real value $z$ therefore has to be converted to a 0/1 value, and the ideal choice would be the unit step function (also known as the Heaviside function)

$$y=\begin{cases} 0, & z < 0 \\ 0.5, & z=0 \\ 1, & z>0 \end{cases} \tag{5}$$
That is, a prediction greater than 0 is classified as positive, less than 0 as negative, and exactly 0 may be assigned to either class. However, the unit step function is not continuous, so it cannot be used as $g^{-1}(\cdot)$; instead a surrogate function is used:

$$y=\frac{1}{1+e^{-z}} \tag{6}$$
This is a sigmoid function, i.e. an S-shaped function. The sigmoid maps $z$ to a value $y$ between 0 and 1, and its output changes steeply around $z=0$, so it approximates the unit step function well.
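A quick numerical check (my own snippet, assuming only numpy) shows how fast the sigmoid saturates toward 0 and 1 away from $z=0$:

import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

for z in [-10, -2, 0, 2, 10]:
    print(z, sigmoid(z))
# -10 -> ~0.000045, -2 -> ~0.12, 0 -> 0.5, 2 -> ~0.88, 10 -> ~0.99995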
Using the sigmoid as $g^{-1}(\cdot)$ and substituting it into (4) gives

$$y=\frac{1}{1+e^{-(\mathbf{w}^T\mathbf{x}+b)}} \tag{7}$$
If $y \geq 0.5$ the sample is predicted positive, otherwise negative. The equation above can be rearranged into

$$\ln\frac{y}{1-y}=\mathbf{w}^T\mathbf{x}+b \tag{8}$$
If $y$ is interpreted as the probability that sample $\mathbf{x}$ is positive, then $1-y$ is the probability that it is negative. Their ratio

$$\frac{y}{1-y} \tag{9}$$

is called the odds and reflects the relative likelihood that $\mathbf{x}$ is a positive example. Taking the logarithm gives the log-odds (logit)

$$\ln\frac{y}{1-y} \tag{10}$$

This is where the name "log-odds regression" comes from. Besides the classification result, logistic regression also yields an approximate probability prediction.
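For a concrete (hypothetical) number: if a sample's predicted probability is y = 0.8, then the odds are 0.8/0.2 = 4 and the log-odds are ln 4 ≈ 1.386. In code:

import numpy as np

y = 0.8                      # predicted probability of the positive class
odds = y / (1 - y)           # Eq. (9): 4.0
log_odds = np.log(odds)      # Eq. (10): about 1.386
print(odds, log_odds)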
Next we look at how to determine $\mathbf{w}$ and $b$. If the $y$ in (8) is interpreted as the class posterior probability $p(y=1 \mid \mathbf{x})$, then (8) can be rewritten as

$$\ln \frac{p(y=1 \mid \mathbf{x})}{p(y=0 \mid \mathbf{x})}=\mathbf{w}^T\mathbf{x}+b \tag{11}$$
It follows that

$$p(y=1 \mid \mathbf{x})=\frac{e^{\mathbf{w}^T\mathbf{x}+b}}{1+e^{\mathbf{w}^T\mathbf{x}+b}} \tag{12}$$

$$p(y=0 \mid \mathbf{x})=\frac{1}{1+e^{\mathbf{w}^T\mathbf{x}+b}} \tag{13}$$
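A minimal numeric sketch of (12) and (13) (my own, assuming only numpy): given a linear score z = w^T x + b, the two posteriors sum to 1 and their log-ratio recovers z, which is exactly (11).

import numpy as np

z = 1.7                                  # a hypothetical value of w^T x + b
p1 = np.exp(z) / (1 + np.exp(z))         # Eq. (12): p(y=1 | x)
p0 = 1 / (1 + np.exp(z))                 # Eq. (13): p(y=0 | x)
print(p1 + p0)                           # 1.0
print(np.log(p1 / p0))                   # recovers z, as in Eq. (11)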
$\mathbf{w}$ and $b$ can then be estimated by maximum likelihood. Given a data set $\{(\mathbf{x}_i, y_i)\}^m_{i=1}$, the logistic regression model maximizes the log-likelihood

$$l(\mathbf{w}, b)=\sum_{i=1}^{m}\ln p(y_i \mid \mathbf{x}_i;\mathbf{w},b) \tag{14}$$
For convenience, write $\pmb{\beta}=(\mathbf{w},b)^T$ and $\hat{\mathbf{x}}=(\mathbf{x},1)^T$, so that $\mathbf{w}^T\mathbf{x}+b$ can be abbreviated as $\pmb{\beta}^T\hat{\mathbf{x}}$. Also let $p_1(\hat{\mathbf{x}};\pmb{\beta})=p(y=1 \mid \hat{\mathbf{x}};\pmb{\beta})$ and $p_0(\hat{\mathbf{x}};\pmb{\beta})=p(y=0 \mid \hat{\mathbf{x}};\pmb{\beta})=1-p_1(\hat{\mathbf{x}};\pmb{\beta})$. The likelihood term can now be simplified.
The term $p(y_i \mid \mathbf{x}_i;\mathbf{w},b)$ can be written as

$$p(y_i \mid \hat{\mathbf{x}}_i;\pmb{\beta})=\begin{cases} p_1(\hat{\mathbf{x}}_i;\pmb{\beta}), & y_i=1\\ p_0(\hat{\mathbf{x}}_i;\pmb{\beta}), & y_i=0 \end{cases}$$
Taking the logarithm:

$$\ln p(y_i \mid \hat{\mathbf{x}}_i;\pmb{\beta})=y_i\ln p_1(\hat{\mathbf{x}}_i;\pmb{\beta})+(1-y_i)\ln p_0(\hat{\mathbf{x}}_i;\pmb{\beta})$$
Exponentiating again gives the compact form

$$p(y_i \mid \hat{\mathbf{x}}_i;\pmb{\beta})=p_1(\hat{\mathbf{x}}_i;\pmb{\beta})^{y_i}\cdot p_0(\hat{\mathbf{x}}_i;\pmb{\beta})^{1-y_i}$$
Substituting into (14) gives

$$\begin{aligned} l(\pmb{\beta})&=\sum_{i=1}^{m}\ln\left(p_1(\hat{\mathbf{x}}_i;\pmb{\beta})^{y_i}\cdot p_0(\hat{\mathbf{x}}_i;\pmb{\beta})^{1-y_i}\right)\\ &=\sum_{i=1}^{m}\left(y_i\ln p_1(\hat{\mathbf{x}}_i;\pmb{\beta})+(1-y_i)\ln p_0(\hat{\mathbf{x}}_i;\pmb{\beta})\right)\\ &=\sum_{i=1}^{m}\left(y_i\left(\ln p_1(\hat{\mathbf{x}}_i;\pmb{\beta})-\ln p_0(\hat{\mathbf{x}}_i;\pmb{\beta})\right)+\ln p_0(\hat{\mathbf{x}}_i;\pmb{\beta})\right)\\ &=\sum_{i=1}^{m}\left(y_i\ln\frac{p_1(\hat{\mathbf{x}}_i;\pmb{\beta})}{p_0(\hat{\mathbf{x}}_i;\pmb{\beta})}+\ln p_0(\hat{\mathbf{x}}_i;\pmb{\beta})\right)\\ &=\sum_{i=1}^{m}\left(y_i\pmb{\beta}^T\hat{\mathbf{x}}_i+\ln\frac{1}{1+e^{\pmb{\beta}^T\hat{\mathbf{x}}_i}}\right)\\ &=\sum_{i=1}^{m}\left(y_i\pmb{\beta}^T\hat{\mathbf{x}}_i-\ln\left(1+e^{\pmb{\beta}^T\hat{\mathbf{x}}_i}\right)\right) \end{aligned}$$
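The last step of the derivation can be spot-checked numerically (a sketch of my own, not from the original post): for random data, the per-sample quantity $-y_i\ln p_1 - (1-y_i)\ln p_0$ equals $-y_i\pmb{\beta}^T\hat{\mathbf{x}}_i+\ln(1+e^{\pmb{\beta}^T\hat{\mathbf{x}}_i})$.

import numpy as np

rng = np.random.default_rng(1)
m, d = 5, 3
X_hat = np.hstack([rng.normal(size=(m, d)), np.ones((m, 1))])  # x_hat = (x, 1)
beta = rng.normal(size=d + 1)                                  # beta = (w, b)
y = rng.integers(0, 2, size=m)

z = X_hat.dot(beta)
p1 = np.exp(z) / (1 + np.exp(z))
p0 = 1 - p1

lhs = -(y * np.log(p1) + (1 - y) * np.log(p0))   # negative per-sample log-likelihood
rhs = -y * z + np.log(1 + np.exp(z))             # closed form used in Eq. (15)
print(np.allclose(lhs, rhs))                     # True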
So the objective to maximize is

$$l(\pmb{\beta})=\sum_{i=1}^{m}\left(y_i\pmb{\beta}^T\hat{\mathbf{x}}_i-\ln\left(1+e^{\pmb{\beta}^T\hat{\mathbf{x}}_i}\right)\right)$$
which is equivalent to minimizing

$$l(\pmb{\beta})=\sum_{i=1}^{m}\left(-y_i\pmb{\beta}^T\hat{\mathbf{x}}_i+\ln\left(1+e^{\pmb{\beta}^T\hat{\mathbf{x}}_i}\right)\right) \tag{15}$$
This is a higher-order differentiable convex function of $\pmb{\beta}$, so it can be solved with gradient descent. Following the scikit-learn estimator style, the code is as follows:
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import preprocessing
import numpy as np


# Logistic regression (log-odds regression)
class LogitRegression:
    def __init__(self):
        self._beta = None
        self.coef_ = None
        self.interception_ = None

    def _sigmoid(self, x):
        return 1. / (1. + np.exp(-x))

    def fit(self, X, y, eta=0.01, n_iters=1e4, eps=1e-8):
        """Fit by batch gradient descent on the loss of Eq. (15), averaged over samples."""
        assert X.shape[0] == y.shape[0]

        def L(beta, X_b, y):
            # Average of Eq. (15): -y * beta^T x_hat + ln(1 + exp(beta^T x_hat))
            try:
                vec = beta.dot(X_b.T)
                return np.sum(-vec * y + np.log(1 + np.exp(vec))) / len(X_b)
            except Exception:
                return float('inf')

        def dL(beta, X_b, y):
            # Gradient of the averaged loss: X_b^T (sigmoid(X_b beta) - y) / m
            try:
                vec = beta.dot(X_b.T)
                coef = np.exp(vec) / (1 + np.exp(vec))
                return (-y.dot(X_b) + X_b.T.dot(coef)) / len(X_b)
            except Exception:
                return float('inf')

        def gradient_decent(X_b, y, initial_beta, eta, n_iters, eps):
            i_iters = 0
            beta = initial_beta
            while i_iters < n_iters:
                gradient = dL(beta, X_b, y)
                last_beta = beta
                beta = beta - eta * gradient
                # Stop when the loss no longer changes noticeably
                if abs(L(beta, X_b, y) - L(last_beta, X_b, y)) < eps:
                    break
                i_iters += 1
            return beta

        # Prepend a column of ones so the intercept b is absorbed into beta
        X_b = np.hstack([np.ones((len(X), 1)), X])
        initial_beta = np.zeros(X_b.shape[1])
        self._beta = gradient_decent(X_b, y, initial_beta, eta, n_iters, eps)
        self.coef_ = self._beta[1:]
        self.interception_ = self._beta[0]
        return self

    def fit_sgd(self, X, y, n_iters=5, t0=5, t1=50):
        """Fit by stochastic gradient descent with a decaying learning rate."""
        assert X.shape[0] == y.shape[0]
        assert n_iters >= 1

        def dL_sgd(beta, X_b_i, y_i):
            # Gradient contributed by a single sample
            coef = np.exp(beta.dot(X_b_i)) / (1 + np.exp(beta.dot(X_b_i)))
            return (-y_i + coef) * X_b_i

        def gradient_decent_sgd(X_b, y, initial_beta, n_iters, t0, t1):
            def learning_rate(t):
                return t0 / (t1 + t)

            beta = initial_beta
            m = len(X_b)
            for i in range(n_iters):
                # Shuffle the samples at the start of each epoch
                indexes = np.random.permutation(m)
                X_b_new = X_b[indexes]
                y_new = y[indexes]
                for j in range(m):
                    gradient = dL_sgd(beta, X_b_new[j], y_new[j])
                    beta = beta - learning_rate(i * m + j) * gradient
            return beta

        X_b = np.hstack([np.ones((len(X), 1)), X])
        initial_beta = np.zeros(X_b.shape[1])
        self._beta = gradient_decent_sgd(X_b, y, initial_beta, n_iters, t0, t1)
        self.coef_ = self._beta[1:]
        self.interception_ = self._beta[0]
        return self

    def predict_probability(self, X_predict):
        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return self._sigmoid(X_b.dot(self._beta))

    def predict(self, X_predict):
        assert self.coef_ is not None and self.interception_ is not None
        assert X_predict.shape[1] == len(self.coef_)
        probability = self.predict_probability(X_predict)
        return np.array(probability >= 0.5, dtype='int')

    def score(self, X_test, y_test):
        # Classification accuracy on a test set
        y_predict = self.predict(X_test)
        return np.sum(y_predict == y_test) / len(y_test)

    def __repr__(self):
        return "LogitRegression()"


if __name__ == "__main__":
    breast_cancer = datasets.load_breast_cancer()
    y = breast_cancer.target
    X = breast_cancer.data
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=123)
    # Standardize features; fit the scaler on the training set only
    scaler = preprocessing.StandardScaler()
    train_X = scaler.fit(train_X).transform(train_X)
    test_X = scaler.transform(test_X)
    logit = LogitRegression()
    logit.fit(train_X, train_y)
    print(logit.score(test_X, test_y))
Running the script prints the accuracy on the test set.
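As a sanity check (my own addition, not part of the original post), the same split can be fed to scikit-learn's LogisticRegression; its accuracy should be in the same ballpark. The snippet below reuses the variables from the script above.

from sklearn.linear_model import LogisticRegression

# Assumes train_X, train_y, test_X, test_y from the script above are in scope
clf = LogisticRegression()
clf.fit(train_X, train_y)
print(clf.score(test_X, test_y))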
Extension: multiclass logistic regression
When $y$ takes values in $\{1,2,\dots,K\}$, the multinomial logistic regression model is

$$\begin{aligned} P(y=k\mid\boldsymbol{x})&=\frac{\exp(\boldsymbol{\beta}_k^T\hat{\boldsymbol{x}})}{1+\sum_{j=1}^{K-1}\exp(\boldsymbol{\beta}_j^T\hat{\boldsymbol{x}})},\quad k=1,2,\dots,K-1\\ P(y=K\mid\boldsymbol{x})&=\frac{1}{1+\sum_{j=1}^{K-1}\exp(\boldsymbol{\beta}_j^T\hat{\boldsymbol{x}})} \end{aligned}$$
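The two formulas above can be written out in a few lines of numpy (my own sketch; the name `betas`, holding the $K-1$ parameter vectors as rows, is an assumption for illustration):

import numpy as np

def multinomial_probs(x_hat, betas):
    """Class probabilities under the K-1 parameterization above.

    x_hat : augmented sample (x, 1), shape (d+1,)
    betas : parameters beta_1 ... beta_{K-1}, shape (K-1, d+1)
    """
    scores = np.exp(betas.dot(x_hat))            # exp(beta_k^T x_hat), k = 1..K-1
    denom = 1 + scores.sum()
    return np.append(scores / denom, 1 / denom)  # last entry is P(y=K | x)

# Tiny example: 3 classes, 2 features (plus the constant 1)
betas = np.array([[1.0, -0.5, 0.2],
                  [0.3,  0.8, -0.1]])
x_hat = np.array([0.5, 1.5, 1.0])
p = multinomial_probs(x_hat, betas)
print(p, p.sum())   # probabilities for classes 1..K, summing to 1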
In multinomial logistic regression the output label of each sample is one-hot encoded as a vector in $\mathbb{R}^K$ with components

$$y_k=\begin{cases} 1, & \boldsymbol{x}\ \text{belongs to class}\ k\\ 0, & \text{otherwise} \end{cases}$$
The likelihood function is then

$$\prod_{i=1}^{N}\prod_{k=1}^{K}P(y_i^k=1\mid\hat{\boldsymbol{x}}_i)^{y_i^k}$$
and the log-likelihood is

$$\begin{aligned} L(\boldsymbol{\beta}) &=\sum_{i=1}^{N}\sum_{k=1}^{K}y_i^k\log P(y_i^k=1\mid\hat{\boldsymbol{x}}_i)\\ &=\sum_{i=1}^{N}\left\{\sum_{k=1}^{K-1}y_i^k\log\frac{\exp(\boldsymbol{\beta}_k^T\hat{\boldsymbol{x}}_i)}{1+\sum_{j=1}^{K-1}\exp(\boldsymbol{\beta}_j^T\hat{\boldsymbol{x}}_i)}+y_i^K\log\frac{1}{1+\sum_{j=1}^{K-1}\exp(\boldsymbol{\beta}_j^T\hat{\boldsymbol{x}}_i)}\right\}\\ &=\sum_{i=1}^{N}\left\{\sum_{k=1}^{K-1}y_i^k\boldsymbol{\beta}_k^T\hat{\boldsymbol{x}}_i-\log\left(1+\sum_{j=1}^{K-1}\exp(\boldsymbol{\beta}_j^T\hat{\boldsymbol{x}}_i)\right)\right\} \end{aligned}$$
where the constraint $\sum_{k=1}^{K}y_i^k=1$ has been used. Taking the partial derivative gives
$$\frac{\partial L(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}_k}=\sum_{i=1}^N\left\{ y_i^k\hat{\boldsymbol{x}}_i-\frac{\exp(\boldsymbol{\beta}_k^T\hat{\boldsymbol{x}}_i)}{1+\sum_{j=1}^{K-1}\exp(\boldsymbol{\beta}_j^T\hat{\boldsymbol{x}}_i)}\hat{\boldsymbol{x}}_i \right\}$$
The parameters can then be solved for with gradient descent (or gradient ascent on the log-likelihood), for example as sketched below.
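A minimal batch gradient-ascent sketch for the multinomial model (my own code, mirroring the gradient above; the names `fit_multinomial`, `Y` for one-hot labels of shape (N, K), and `betas` of shape (K-1, d+1) are illustrative assumptions):

import numpy as np

def fit_multinomial(X, Y, eta=0.1, n_iters=2000):
    """Maximize L(beta) by gradient ascent.

    X : data matrix, shape (N, d)
    Y : one-hot labels, shape (N, K); the last column is the reference class K
    """
    N, d = X.shape
    K = Y.shape[1]
    X_hat = np.hstack([X, np.ones((N, 1))])            # x_hat = (x, 1)
    betas = np.zeros((K - 1, d + 1))                   # beta_1 ... beta_{K-1}
    for _ in range(n_iters):
        scores = np.exp(X_hat.dot(betas.T))            # exp(beta_k^T x_hat_i), shape (N, K-1)
        probs = scores / (1 + scores.sum(axis=1, keepdims=True))
        # Gradient for beta_k: sum_i (y_i^k - P(y=k | x_i)) * x_hat_i
        grad = (Y[:, :K - 1] - probs).T.dot(X_hat)
        betas += eta * grad / N                        # averaged ascent step
    return betas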