Logistic Regression

Logistic Regression: Theory and Practice
This article gives a detailed introduction to logistic regression, covering the derivation of the formulas, a simulated implementation, stochastic gradient ascent (SGA), and an application to a real problem. It discusses the method's strengths and weaknesses and walks through the data-handling and training steps from Machine Learning in Action. Through an analysis of the Horse Colic dataset, it shows how stochastic gradient ascent can be used to find the best regression parameters.

Logistic regression (对数几率回归, also commonly called 逻辑回归 in Chinese) is used for classification tasks.

Advantages: computationally cheap, easy to understand and implement.

Disadvantages: prone to underfitting; classification accuracy may be low.

When a linear model is used for a binary classification task, we need a monotonic, differentiable function that links the true label y of the classification task with the prediction of the linear regression model. In binary classification the output is either 0 or 1, while a linear model produces arbitrary continuous values, so the two must be mapped onto each other. A first candidate is the Heaviside (unit) step function:

$$f(x) = \begin{cases} 0 & x<0 \\ 0.5 & x=0 \\ 1 & x>0 \end{cases}$$

However, this function is not continuous, which makes it awkward to work with later, so we use another monotonic, differentiable function that approximates the step function: the logistic function (sigmoid function).

$$y = \frac{1}{1+e^{-z}}$$

Feed the output z of the linear regression model into the sigmoid to get g(z); the sample is classified as 1 when g(z) > 0.5 (equivalently z > 0), and as 0 otherwise.
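As a quick illustration of this decision rule, here is a minimal sketch (not code from the book; the parameter vector and sample below are made-up values):

import numpy as np

def sigmoid(z):
    # logistic function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical parameters theta = [bias, w1, w2] and sample x = [1, x1, x2]
theta = np.array([-0.5, 1.2, -0.8])
x = np.array([1.0, 0.9, 0.3])

z = np.dot(theta, x)              # linear model output
prob = sigmoid(z)                 # g(z): probability of class 1
label = 1 if prob > 0.5 else 0    # decision rule: g(z) > 0.5  <=>  z > 0
print(z, prob, label)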

Derivation

Assume $z = \theta^{T}x$ is the linear regression model. Then

$$g(z) = \frac{1}{1+e^{-\theta^{T}x}} \Longrightarrow \frac{1-g(z)}{g(z)} = e^{-\theta^{T}x}$$

Treat $g(z)$ as the probability that the sample belongs to class 1. Then:

$$P(y = 1 \mid x;\theta) = \frac{1}{1+e^{-\theta^{T}x}} = h_\theta(x), \qquad P(y = 0 \mid x;\theta) = \frac{e^{-\theta^{T}x}}{1+e^{-\theta^{T}x}} = 1-h_\theta(x)$$

The likelihood of a single sample is:

$$L(h_\theta(x),y) = h_\theta(x)^{y}\,(1-h_\theta(x))^{1-y}$$

By the method of maximum likelihood, we want to maximize the likelihood of the whole training set:

$$\prod_{i=1}^{n} h_\theta(x^{i})^{y^{i}}\,(1-h_\theta(x^{i}))^{1-y^{i}}$$

Taking the logarithm gives the log-likelihood:

$$J(\theta) = \sum_{i=1}^{n}\left[y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))\right]$$

Gradient ascent is used to find the maximum-likelihood estimate of $\theta$:

$$\theta_j := \theta_j + \alpha\,\frac{\partial J(\theta)}{\partial \theta_j}$$

Applying the chain rule:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial J(\theta)}{\partial h_\theta(x)}\cdot\frac{\partial h_\theta(x)}{\partial \theta^{T}x}\cdot\frac{\partial \theta^{T}x}{\partial \theta_j}$$

$$\frac{\partial J(\theta)}{\partial h_\theta(x)} = y\cdot\frac{1}{h_\theta(x)} + (y-1)\,\frac{1}{1-h_\theta(x)}$$

$$\frac{\partial h_\theta(x)}{\partial \theta^{T}x} = \frac{e^{-\theta^{T}x}}{(1+e^{-\theta^{T}x})^{2}} = h_\theta(x)\,(1-h_\theta(x))$$

$$\frac{\partial \theta^{T}x}{\partial \theta_j} = x_j$$

Multiplying these factors together gives:

$$\frac{\partial J(\theta)}{\partial \theta_j} = (y-h_\theta(x))\,x_j, \qquad \theta_j := \theta_j + \alpha\sum_{i=1}^{m}(y^{i}-h_\theta(x^{i}))\,x_j^{i}$$

Iterating this update yields the best regression parameters.
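For the whole training set, this per-coordinate update can be written as a single matrix operation, which is the form used by the code below ($X$ is the $m \times n$ data matrix, $y$ the label vector, and $h_\theta(X)$ the element-wise sigmoid of $X\theta$):

$$\theta := \theta + \alpha\,X^{T}\left(y - h_\theta(X)\right)$$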

Simulating the Algorithm

Here we again use the example from Machine Learning in Action. The implementation turns the parameter update into the matrix operation above, which greatly simplifies the computation:

# -*- coding:utf-8 -*-
"""
@author liuning
@date 2019/10/26 16:13
@File logRegres.py 
@Desciption
"""

from numpy import mat, shape, ones, exp, array, arange


def loadDataSet():
    """Load testSet.txt: each row gives two features plus an integer label; 1.0 is prepended as the bias term."""
    dataMat = []; labelMat = []
    with open('testSet.txt','r') as f:
        for line in f.readlines():
            lineArr = line.strip().split()
            dataMat.append([1.0,float(lineArr[0]),float(lineArr[1])])
            labelMat.append(int(lineArr[2]))
        return dataMat,labelMat
def sigmoid(inX):
    """
    Logistic (sigmoid) function
    :param inX: scalar or numpy array/matrix
    :return: 1/(1+exp(-inX))
    """
    return 1/(1+exp(-inX))
def gradAscent(dataMat,labelMat):
    """
    Batch gradient ascent: theta := theta + alpha * X^T (y - h)
    """
    dataMat = mat(dataMat)
    labelMat = mat(labelMat).transpose()
    m, n = shape(dataMat)            # m samples, n features
    alpha = 0.001                    # step size (learning rate)
    generations = 500                # number of iterations
    weights = ones((n, 1))
    # update rule from the derivation above (maximum-likelihood gradient ascent)
    for k in range(generations):
        h = sigmoid(dataMat * weights)              # predictions for all samples
        error = labelMat - h                        # y - h
        weights = weights + alpha * dataMat.transpose() * error
    return weights
def plotBestFit(wei):
    """
    Plot the samples and the fitted decision boundary
    """
    import matplotlib.pyplot as plt
    weights = wei.getA()             # convert matrix to ndarray
    dataMat,labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2])
        else:
            xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1,ycord1,s=30,c='red',marker='s')
    ax.scatter(xcord2,ycord2,s=30,c='green')
    x = arange(-3.0,3.0,0.1)
    y = (-weights[0]-weights[1]*x)/weights[2]   # decision boundary: w0 + w1*x1 + w2*x2 = 0
    ax.plot(x,y)
    plt.xlabel('X1');plt.ylabel('X2')
    plt.show()
if __name__ == '__main__':
    inMat,labelMat = loadDataSet()
    weights  = gradAscent(inMat,labelMat)
    print(weights)
    plotBestFit(weights)

Stochastic Gradient Ascent (SGA)

When the dataset is very large, computing over all samples at once is too expensive. Stochastic gradient ascent updates the parameters using only one sample at a time; because the parameters can be updated incrementally as new data arrives, it is an "online learning" algorithm.

A related variant is mini-batch gradient ascent (Mini-batch Gradient Ascent), which updates the parameters with batch_size samples at a time. It is a compromise between the batch and stochastic versions and is widely used in deep learning (a sketch is given at the end of this section).

import random   # random.uniform is used to pick sample indices

def stocGradAscent0(dataMatrix, classLabel, numIter=150):
    """
    Stochastic gradient ascent
    :param dataMatrix: list of feature vectors
    :param classLabel: list of labels
    :param numIter: number of passes over the data
    :return: weights nd.array
    """
    dataMatrix = array(dataMatrix)
    m,n = shape(dataMatrix)
    weights = ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4/(1.0+j+i) + 0.01                           # step size decays with iterations
            randIndex = int(random.uniform(0, len(dataIndex)))   # randomly pick a remaining sample
            sampleIndex = dataIndex[randIndex]
            h = sigmoid(sum(dataMatrix[sampleIndex] * weights))
            error = classLabel[sampleIndex] - h
            weights = weights + alpha * error * dataMatrix[sampleIndex]  # update weights
            del(dataIndex[randIndex])                            # sample without replacement within a pass
    return weights

Here alpha decreases as the iterations go on, so each later update has a smaller and smaller effect on the parameters; because of the constant term 0.01 it never reaches 0. The constant should be tuned to the actual scenario: if the problem being modeled changes over time, the constant term can be increased appropriately so that new samples have a larger influence on the regression coefficients.

Picking the samples in random order reduces periodic fluctuations in the coefficients.
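As a complement to the SGA code above, here is a minimal sketch of the mini-batch variant mentioned earlier. It is not from the book; batchSize, the shuffling scheme, and the fixed alpha are my own choices, and it reuses sigmoid, random, and the numpy imports from the code above:

def miniBatchGradAscent(dataMatrix, classLabel, batchSize=16, numIter=150, alpha=0.01):
    """Mini-batch gradient ascent: each update uses batchSize samples."""
    dataMatrix = array(dataMatrix)
    classLabel = array(classLabel)
    m, n = shape(dataMatrix)
    weights = ones(n)
    for j in range(numIter):
        indices = list(range(m))
        random.shuffle(indices)                         # shuffle once per pass
        for start in range(0, m, batchSize):
            batch = indices[start:start + batchSize]
            X = dataMatrix[batch]                       # (batchSize, n)
            y = classLabel[batch]
            h = sigmoid(X.dot(weights))                 # predictions for the batch
            error = y - h
            weights = weights + alpha * X.T.dot(error)  # theta := theta + alpha * X^T (y - h)
    return weights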

Practice

The case study is taken from Chapter 5 of Machine Learning in Action.

Collect the data

https://archive.ics.uci.edu/ml/datasets.php

The Horse Colic dataset

Prepare the data

Fill in missing feature values in the training data; remove incomplete records from the test set.
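The horseColicTraining.txt / horseColicTest.txt files used below are already preprocessed. If you start from the raw records, one common convention (the one Machine Learning in Action uses) is to replace a missing feature value with 0, since a zero feature leaves the corresponding weight unchanged during the update, and to discard any record whose class label is missing. A minimal sketch, assuming a whitespace-separated record whose last field is the label and whose missing values are marked with '?' (preprocessLine is a hypothetical helper, not from the book):

def preprocessLine(line):
    """Replace missing feature values ('?') with 0.0; drop records with a missing label."""
    fields = line.strip().split()
    if fields[-1] == '?':
        return None                                        # no label -> discard the record
    features = [0.0 if x == '?' else float(x) for x in fields[:-1]]
    label = float(fields[-1])
    return features, label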

Analyze the data

Visualize and inspect the data (a small plotting sketch follows the loading functions below).

def loadData():
    """
    Load the training data (horseColicTraining.txt); 1.0 is prepended as the bias term
    :return: dataMat, labelMat
    """
    dataMat = []; labelMat = []
    with open("horseColicTraining.txt",'r') as f:
        for line in f.readlines():
            lineArr = line.strip().split()
            lineArr = [float(x) for x in lineArr]
            dataMat.append([1.0]+lineArr[0:21])
            labelMat.append(lineArr[21])
    return dataMat,labelMat
def loadTestData():
    """
    Load the test data (horseColicTest.txt); 1.0 is prepended as the bias term
    :return: dataMat, labelMat
    """
    dataMat = []; labelMat = []
    with open("horseColicTest.txt",'r') as f:
        for line in f.readlines():
            lineArr = line.strip().split()
            lineArr = [float(x) for x in lineArr]
            dataMat.append([1.0] + lineArr[0:21])
            labelMat.append(lineArr[21])
    return dataMat,labelMat
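The book does not prescribe a specific plot for this step; as one possible quick check (my own choice, not from the book), the sketch below overlays histograms of a single feature for the two classes using loadData above. Column 0 is the bias term added in loadData, so feature indices start at 1:

import matplotlib.pyplot as plt
from numpy import array

def plotFeatureByClass(featureIdx=1):
    """Overlay class-wise histograms of one feature from the training data."""
    dataMat, labelMat = loadData()
    data = array(dataMat)
    labels = array(labelMat)
    plt.hist(data[labels == 1.0, featureIdx], bins=20, alpha=0.5, label='class 1')
    plt.hist(data[labels == 0.0, featureIdx], bins=20, alpha=0.5, label='class 0')
    plt.xlabel('feature %d' % featureIdx)
    plt.ylabel('count')
    plt.legend()
    plt.show()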

Train the algorithm

Use the stochastic gradient ascent algorithm to find the best regression parameters.

import random
from numpy import array, ones, shape, exp

def sigmoid(x):
    return 1/(1+exp(-x))

def stocGradAscent(inMat,labels,numIter=150):
    """
    Stochastic gradient ascent
    :param inMat: feature vectors
    :param labels: class labels
    :param numIter: number of passes over the data
    :return: weights
    """
    inMat = array(inMat)
    m,n = shape(inMat)
    weights = ones(n)
    for i in range(numIter):
        dataIndex = list(range(m))
        for j in range(m):
            alpha = 4.0/(1.0+i+j) + 0.01                         # decaying step size
            randIndex = int(random.uniform(0, len(dataIndex)))   # pick a remaining sample at random
            sampleIndex = dataIndex[randIndex]
            h = sigmoid(sum(inMat[sampleIndex]*weights))
            error = labels[sampleIndex] - h
            weights = weights + alpha * error * inMat[sampleIndex]
            del(dataIndex[randIndex])                            # sample without replacement within a pass
    return weights

Test the algorithm

if __name__ == '__main__':
    inMat,labelMat = loadData()
    weights = stocGradAscent(inMat,labelMat,500)
    testMat,labelTest = loadTestData()
    testMat = array(testMat)
    m,n = shape(testMat)
    result = []
    for i in range(m):
        # classify as 1 when sigmoid(theta^T x) > 0.5
        if sigmoid(sum(testMat[i]*weights)) > 0.5:
            result.append(1.0)
        else:
            result.append(0.0)
    error = 0
    for i in range(m):
        if result[i] != labelTest[i]:
            error += 1
    print("errorRate:"+str(error*100/m)+"%")

As a bonus, here is a solution to an assignment from Andrew Ng's course (Logistic Regression with a Neural Network Mindset):

# -*- coding:utf-8 -*-
"""

"""
import numpy as np
import matplotlib.pyplot as plt
import h5py

from andrew.logistic.lr_utils import load_dataset

train_set_x_orig, train_set_y, test_set_x_orig, test_set_y, classes = load_dataset()

img_width, img_height = train_set_x_orig.shape[1],train_set_x_orig.shape[2]
print("图片像素:"+ str(img_width)+"*"+str(img_height)+"*3")
num_pics_train = train_set_y.shape[1]
num_pics_test = test_set_y.shape[1]
print("训练图片:"+ str(num_pics_train))
print("测试图片:"+str(num_pics_test))

train_set_x_flatten = train_set_x_orig.reshape(train_set_x_orig.shape[0],-1).T # -1 lets reshape fold the remaining dimensions into one (flatten each image)
test_set_x_flatten = test_set_x_orig.reshape(test_set_x_orig.shape[0],-1).T

print("训练集维度:" + str(train_set_x_flatten.shape))
print("测试集维度:" + str(test_set_x_flatten.shape))

# normalize pixel values to [0, 1]
train_set_x = train_set_x_flatten / 255
test_set_x = test_set_x_flatten / 255

print("训练结果集维度:" + str(train_set_y.shape))
def sigmoid(x):
    return 1/(1+np.exp(-x))

def gradient(learning_rate=0.5,iteration=1000):
    w = np.zeros(shape=(train_set_x.shape[0],1))
    b = 0
    costs = []
    epsilon = 1e-5                                  # avoid log(0)
    m = train_set_x.shape[1]                        # number of training samples
    for i in range(iteration):
        A = sigmoid(np.dot(w.T, train_set_x) + b)   # forward pass
        if(i%100 == 0):
            # cross-entropy cost, recorded every 100 iterations
            costs.append(-(1/m)*np.sum(np.dot(train_set_y,np.log(A+epsilon).T)
                                       + np.dot((1-train_set_y),np.log(1-A+epsilon).T)))
        dw = (1/m) * np.dot(train_set_x, (A - train_set_y).T)   # gradient w.r.t. w
        db = (1/m) * np.sum(A - train_set_y)                    # gradient w.r.t. b
        w = w - learning_rate * dw
        b = b - learning_rate * db
    return w,b,costs

def predict(w,b,X):
    m = X.shape[1]
    Y_prediction = np.zeros((1,m))
    # w = w.reshape(X.shape[0],1)
    A = sigmoid(np.dot(w.T,X)+b)
    for i in range(A.shape[1]):
        Y_prediction[0, i] = 1 if A[0, i] > 0.5 else 0
    return Y_prediction

w,b,costs = gradient(learning_rate=0.1,iteration=2000)
Y_predict_test = predict(w,b,test_set_x)
Y_predict_train = predict(w,b,train_set_x)
print(Y_predict_test)
print("训练集准确性:", format(100 - np.mean(np.abs(Y_predict_train - train_set_y)) * 100), "%")
print("测试集准确性:", format(100 - np.mean(np.abs(Y_predict_test - test_set_y)) * 100), "%")

print(costs)
costs = np.squeeze(costs)
plt.plot(costs)
plt.ylabel('cost')
plt.xlabel('iterations (per hundreds)')
plt.show()




References:

Machine Learning in Action, Chapter 5
Machine Learning, Section 3.3
Machine Learning in Action tutorial (6)
Gradient ascent algorithm
Logistic Regression with a Neural Network Mindset
