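Concepts and applications:
Logistic regression is mainly used for classification. A k-class problem can be handled by converting it into k binary problems (one class versus the rest).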
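As a rough illustration of that reduction, the sketch below turns a k-class label vector into k binary label vectors, one per class. The helper name `one_vs_rest_labels` and the toy labels are hypothetical and do not come from the original post.

```python
import numpy as np

def one_vs_rest_labels(y):
    """Return the sorted classes and a (k, n) 0/1 matrix; row j marks the samples of class j."""
    classes = np.unique(y)
    # Broadcasting compares every class against every label
    binary = (classes[:, None] == np.asarray(y)[None, :]).astype(int)
    return classes, binary

y_toy = np.array([0, 2, 1, 2, 0])
classes, binary = one_vs_rest_labels(y_toy)
print(classes)
print(binary)
```

Each row can serve as the target of one binary logistic regression. At prediction time, a common choice is to pick the class whose model reports the highest probability.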
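Logistic regression applies the logistic (sigmoid) curve to a linear combination of the explanatory variables. The parameters are estimated by maximum likelihood: the negative log-likelihood, which is the cross-entropy of the binomial (Bernoulli) distribution, serves as the objective function and is minimized with an optimization method such as Newton's method or gradient descent.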
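For reference, the quantities involved can be written out as follows. The notation is an assumption made here, since the post gives no formulas: labels y_i in {0, 1}, feature rows x_i stacked into the matrix X, and a weight vector w.

```latex
p_i = \sigma(x_i^\top w) = \frac{1}{1 + e^{-x_i^\top w}}

\ell(w) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]

J(w) = -\ell(w)

\nabla J(w) = X^\top (p - y)

\nabla^2 J(w) = X^\top \operatorname{diag}\big(p_i (1 - p_i)\big) X
```

Gradient descent updates w with a step along the negative of \nabla J(w), while Newton's method additionally rescales the step with the inverse of \nabla^2 J(w).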
Python implementation
Loading and splitting the data
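import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target
# Keep only classes 0 (setosa) and 1 (versicolor) to get a binary problem
X = X[y != 2]
y = y[y != 2]
# 70/30 train/test split with a fixed seed
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.3, random_state=42)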
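Helper functions
Tip: exp(x) grows exponentially and can easily overflow a float, so the range of x can be restricted to prevent overflow.
def sigmoid(z):
    # Clip z to avoid "RuntimeWarning: overflow encountered in exp";
    # only the lower bound matters, because exp(-z) overflows for very negative z
    return 1 / (1.0 + np.exp(-np.clip(z, -100, 10000)))

def f(x, w):  # x is n*k, w is k*1
    return sigmoid(x @ w)

def predict(x, w):
    # Round the predicted probability to a hard 0/1 label
    return np.round(f(x, w))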
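An alternative to clipping is to pick the algebraic form of the sigmoid according to the sign of the input, so that exp is only ever applied to non-positive numbers. The following is a minimal sketch of that idea; the name `stable_sigmoid` is hypothetical and this is not the approach used in the post.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)), where exp(z) < 1
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

z_demo = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
p_demo = stable_sigmoid(z_demo)
print(p_demo)
```

Both forms are mathematically identical, so the result matches the clipped version wherever clipping is inactive.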
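Solving with stochastic gradient descent
# The loss is the cross-entropy between two Bernoulli distributions, derived from maximum likelihood estimation
def cross_entropy_loss(y_pred, y_label):
    y_pred = np.ravel(y_pred)
    y_label = np.ravel(y_label)
    cross_loss = -np.dot(y_label, np.log(y_pred)) - np.dot(1 - y_label, np.log(1 - y_pred))
    return cross_loss

def gradient(x, y, w):
    # Use the predicted probabilities f(x, w), not the rounded labels from predict
    y_pred = f(x, w)
    w_grad = np.matmul(x.T, y_pred - np.reshape(y, (-1, 1)))
    return w_grad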
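A quick way to gain confidence in a hand-derived gradient is to compare it with central finite differences of the loss. The sketch below does this on random toy data. It redefines minimal versions of the sigmoid, loss, and gradient with 1-D weights so that it runs on its own; all names and the toy data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    # Summed cross-entropy for labels y in {0, 1}
    p = sigmoid(x @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(w, x, y):
    # Gradient of the summed cross-entropy: X^T (p - y)
    return x.T @ (sigmoid(x @ w) - y)

def numeric_grad(w, x, y, eps=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        g[j] = (loss(w + step, x, y) - loss(w - step, x, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x_toy = rng.normal(size=(20, 3))
y_toy = rng.integers(0, 2, size=20).astype(float)
w_toy = rng.normal(size=3)
max_diff = np.max(np.abs(analytic_grad(w_toy, x_toy, y_toy)
                         - numeric_grad(w_toy, x_toy, y_toy)))
print(max_diff)
```

If the two gradients disagree by more than a small tolerance, the usual suspects are a sign error or the use of rounded labels instead of probabilities.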
# Iterate with stochastic gradient descent
def training(x, y_label, alpha):
    dim = x.shape[1]
    w = np.random.rand(dim, 1)
    for i in range(10):  # 10 passes over the training data
        for index in range(0, len(y_label)):
            xi = np.array(x[index, :], ndmin=2)  # one sample as a 1*k row
            y_pred = f(xi, w)
            grad = xi.T @ (y_pred - y_label[index])
            w -= alpha * grad
    return w
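The training loop above visits the samples in a fixed order and has no intercept term. The sketch below shows one possible variation, which shuffles the samples every epoch and appends a constant column, run on synthetic two-class data. The function name `sgd_train`, the learning rate, the epoch count, and the toy data are all assumptions for illustration, not settings taken from the post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -100, 100)))

def sgd_train(x, y, alpha=0.1, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    # Append a column of ones so the last weight acts as an intercept
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    w = np.zeros(xb.shape[1])
    for _ in range(epochs):
        # Visit the samples in a fresh random order each epoch
        for i in rng.permutation(len(y)):
            p = sigmoid(xb[i] @ w)
            w -= alpha * (p - y[i]) * xb[i]
    return w

rng = np.random.default_rng(1)
# Two well-separated Gaussian blobs as toy data
x_toy = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
                   rng.normal(2.0, 1.0, size=(100, 2))])
y_toy = np.concatenate([np.zeros(100), np.ones(100)])
w_toy = sgd_train(x_toy, y_toy)
xb_toy = np.hstack([x_toy, np.ones((200, 1))])
acc = np.mean((sigmoid(xb_toy @ w_toy) >= 0.5) == y_toy)
print(acc)
```

Shuffling avoids a systematic bias from the sample order, and the intercept lets the decision boundary move away from the origin.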
Prediction
w=training(xtrain,ytrain,0.001)
y_train_pred=predict(xtrain,w)
y_test_pred=predict(xtest,w)
Evaluation
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(ytrain, y_train_pred))
print(classification_report(ytest, y_test_pred))
print(confusion_matrix(ytrain, y_train_pred))
print(confusion_matrix(ytest, y_test_pred))
Output:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        33
           1       1.00      1.00      1.00        37

    accuracy                           1.00        70
   macro avg       1.00      1.00      1.00        70
weighted avg       1.00      1.00      1.00        70

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        17
           1       1.00      1.00      1.00        13

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

[[33  0]
 [ 0 37]]
[[17  0]
 [ 0 13]]
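To make the columns of the report concrete, the metrics for class 1 can be computed by hand from the confusion matrix. This is a minimal sketch on toy labels; the helper `binary_report` is hypothetical and assumes that both classes occur among the true and the predicted labels, so no denominator is zero.

```python
import numpy as np

def binary_report(y_true, y_pred):
    y_true = np.ravel(y_true).astype(int)
    y_pred = np.ravel(y_pred).astype(int)
    # cm[i, j] counts samples with true label i and predicted label j
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / cm.sum()
    return cm, precision, recall, f1, accuracy

y_true_toy = np.array([0, 0, 1, 1, 1, 0])
y_pred_toy = np.array([0, 1, 1, 1, 0, 0])
cm, precision, recall, f1, accuracy = binary_report(y_true_toy, y_pred_toy)
print(cm)
print(precision, recall, f1, accuracy)
```

The row/column layout follows the convention of `confusion_matrix` in scikit-learn, where rows are true labels and columns are predicted labels.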
R implementation
data(iris)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
ir<-iris[- which(iris$Species == 'setosa'),]
summary(ir)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.000 Min. :3.000 Min. :1.000
## 1st Qu.:5.800 1st Qu.:2.700 1st Qu.:4.375 1st Qu.:1.300
## Median :6.300 Median :2.900 Median :4.900 Median :1.600
## Mean :6.262 Mean :2.872 Mean :4.906 Mean :1.676
## 3rd Qu.:6.700 3rd Qu.:3.025 3rd Qu.:5.525 3rd Qu.:2.000
## Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
## Species
## setosa : 0
## versicolor:50
## virginica :50
##
##
##
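# levels: the original classes; labels: the new names assigned to them
ir$Species<-factor(ir$Species, levels = c( 'versicolor', 'virginica'), labels = c(0,1))
# as.numeric() on a factor returns the level index, so versicolor becomes 1 and virginica becomes 2
ir<-as.data.frame(lapply(ir,as.numeric))
#ir$Species
# Recode virginica (2) as 0, leaving versicolor as 1
ir$Species[ir$Species == 2] <-0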
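Standardize the data, using three columns as x
x <- ir[,2:4]            # Sepal.Width, Petal.Length, Petal.Width
y <- ir$Species
m <- dim(x)[1]           # number of rows
n <- dim(x)[2] + 1       # number of features plus the constant term
x <- data.frame(scale(x))
x$constant <- 1          # intercept column
x <- as.matrix(x)        # 100*4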
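For readers following along in Python, the same preprocessing can be sketched with NumPy. R's `scale()` centers each column and divides by the sample standard deviation (n - 1 in the denominator), which corresponds to `ddof=1` here. The function name `standardize_with_constant` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def standardize_with_constant(x):
    x = np.asarray(x, dtype=float)
    # ddof=1 gives the sample standard deviation, as used by R's scale()
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Append the constant column used for the intercept
    return np.hstack([z, np.ones((z.shape[0], 1))])

x_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
x_std = standardize_with_constant(x_toy)
print(x_std)
```

Note that NumPy's default is `ddof=0`, so omitting the argument would give slightly different values from the R code.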
Parameter estimation routine:
# params:
#   m: number of data rows
#   n: data dimension (including the constant column)
mle <- function(x, y, n, m, max_iter){
  theta = matrix(data=0.001, nrow = n, ncol = 1)  # 4*1
  thred = 0.001   # stop once the relative change in theta falls below this threshold
  iters = 1
  G = matrix(data=0, nrow = n, ncol = 1)
  H = matrix(data=0, nrow = n, ncol = n)
  a = 1
  while( (iters<=max_iter) && (a>=thred)){
    print(iters)
    iters = iters + 1
    z = x%*%theta                   # 100*1
    #print(z)
    h = 1 - 1/(1 + exp(z))          # 100*1, equal to the sigmoid of z
    dif = y - h                     # 100*1
    G = t(x)%*%dif                  # 4*1 (t(x) is 4*100): gradient of the log-likelihood
    const_sum = h*(1-h)             # 100*1
    H = t(x)%*%(c(const_sum) * x)   # 4*4: negative Hessian of the log-likelihood
    theta_pre = theta
    theta = theta + solve(H)%*%G    # Newton step
    a = sum((theta-theta_pre)**2)/sum(theta_pre**2)
    accuracy <- 1-sum(abs(round(1- 1/(1 + exp(x%*%theta)))-y))/length(y)
    print('accuracy')
    print(accuracy)
  }
  return(theta)
}
theta=mle(x,y,n,m,100)
## [1] 1
## [1] "accuracy"
## [1] 0.96
## [1] 2
## [1] "accuracy"
## [1] 0.95
## [1] 3
## [1] "accuracy"
## [1] 0.95
## [1] 4
## [1] "accuracy"
## [1] 0.97
## [1] 5
## [1] "accuracy"
## [1] 0.97
## [1] 6
## [1] "accuracy"
## [1] 0.97
## [1] 7
## [1] "accuracy"
## [1] 0.97
## [1] 8
## [1] "accuracy"
## [1] 0.97
## [1] 9
## [1] "accuracy"
## [1] 0.97
print('Estimated parameters:')
## [1] "Estimated parameters:"
print(theta)
## [,1]
## Sepal.Width 2.78698357
## Petal.Length -6.50073323
## Petal.Width -9.10210495
## constant 0.03436432
y_pred=round(1- 1/(1 + exp(x%*%theta)))
print('Result:')
## [1] "Result:"
table(y_pred,y)
## y
## y_pred 0 1
## 0 49 2
## 1 1 48
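For comparison with the Python section, the Newton iteration used in `mle` can be sketched in NumPy as well. The version below is an illustration on synthetic data rather than a port of the author's code: the name `newton_logistic`, the stopping rule based on the step size, and the toy data are assumptions. The two classes are made to overlap on purpose, because on perfectly separable data the maximum likelihood estimate does not exist and the iteration diverges.

```python
import numpy as np

def newton_logistic(x, y, max_iter=50, tol=1e-10):
    theta = np.zeros(x.shape[1])
    for _ in range(max_iter):
        h = 1.0 / (1.0 + np.exp(-(x @ theta)))    # predicted probabilities
        G = x.T @ (y - h)                          # gradient of the log-likelihood
        H = x.T @ (x * (h * (1.0 - h))[:, None])   # negative Hessian of the log-likelihood
        step = np.linalg.solve(H, G)               # Newton step without forming the inverse
        theta = theta + step
        if np.sum(step ** 2) < tol ** 2:
            break
    return theta

rng = np.random.default_rng(0)
# Two overlapping classes, so the maximum likelihood estimate is finite
feats = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
                   rng.normal(1.0, 1.0, size=(100, 2))])
x_toy = np.hstack([feats, np.ones((200, 1))])      # last column is the constant term
y_toy = np.concatenate([np.zeros(100), np.ones(100)])
theta_toy = newton_logistic(x_toy, y_toy)
h_toy = 1.0 / (1.0 + np.exp(-(x_toy @ theta_toy)))
grad_norm = np.linalg.norm(x_toy.T @ (y_toy - h_toy))
print(theta_toy, grad_norm)
```

At the maximum likelihood estimate the gradient vanishes, so a near-zero `grad_norm` is a simple convergence check. Using `np.linalg.solve` instead of an explicit inverse is generally the more stable choice.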
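This post introduced logistic regression for classification and showed how to build and train the model in both Python and R. The Python version minimizes the cross-entropy loss with stochastic gradient descent; the R version computes the maximum likelihood estimate with Newton's method. In the reported runs, the Python model classifies every training and test sample correctly, and the R model classifies 97 of the 100 training samples correctly.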




