
Article link: https://blog.youkuaiyun.com/jstxzhangrui/article/details/79836444

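This article walks through a machine learning workflow in practice: getting the data, preprocessing it, training a model and optimizing it. Using a linear model, it shows how to compute the loss function, evaluate accuracy, choose a threshold, analyze the ROC curve and validate the model. It also discusses ridge regression and Lasso regression as ways to improve the model, and stresses the importance of feature engineering in machine learning.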


Machine Learning in Practice and Optimization

Imports

import pandas as pd
import numpy as np
import random
from sklearn import datasets,linear_model,model_selection
from sklearn.metrics import roc_curve,auc
import matplotlib.pyplot as plt

1. Getting the data

target_url = ("https://archive.ics.uci.edu/ml/machine-learning-databases"
              "/undocumented/connectionist-bench/sonar/sonar.all-data")

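data = pd.read_csv(target_url,header=None)
# The `prefix` argument was removed in pandas 2.0, so name the columns F0..F60 explicitly
data.columns = ["F%d"%i for i in range(data.shape[1])]
data.head()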

output:

F0	F1	F2	F3	F4	F5	…	F56	F57	F58	F59	F60
0	0.0200	0.0371	0.0428	0.0207	0.0954	0.0986	…	0.0180	0.0084	0.0090	0.0032
1	0.0453	0.0523	0.0843	0.0689	0.1183	0.2583	…	0.0140	0.0049	0.0052	0.0044
2	0.0262	0.0582	0.1099	0.1083	0.0974	0.2280	…	0.0316	0.0164	0.0095	0.0078
3	0.0100	0.0171	0.0623	0.0205	0.0205	0.0368	…	0.0050	0.0044	0.0040	0.0117
4	0.0762	0.0666	0.0481	0.0394	0.0590	0.0649	…	0.0072	0.0048	0.0107	0.0094

5 rows × 61 columns

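2. Data preprocessing

Note that the last column of the data is a string with two possible values, R and M (the rows shown above are only the head of the data).

Attributes are either numeric variables (int, float, double) or categorical variables (male/female, R/M). Categorical values have no order or magnitude, and many machine learning algorithms (SVM, k-nearest neighbors, etc.) cannot handle categorical variables directly, only numeric ones. Categorical attributes must therefore be converted in a preprocessing step.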

# Convert the string label to a number: R -> 1, M -> -1
label_col = data.columns[-1]
data[label_col] = np.where(data[label_col]=='R',1,-1)
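
The same encoding can also be written with an explicit mapping, which has the advantage that an unexpected label shows up as NaN instead of silently becoming -1. The sketch below uses a small made-up frame (`toy`, with hypothetical values) in place of the sonar data.

```python
import pandas as pd

# Toy frame standing in for the sonar data: two features and a string label.
# The values are made up for illustration.
toy = pd.DataFrame({
    "F0": [0.02, 0.05, 0.03],
    "F1": [0.04, 0.05, 0.06],
    "F2": ["R", "M", "R"],
})

label_col = toy.columns[-1]
# Map each category to a number; a label missing from the dict would become NaN.
toy[label_col] = toy[label_col].map({"R": 1, "M": -1})
print(toy[label_col].tolist())  # [1, -1, 1]
```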

output:

F0	F1	F2	F3	F4	F5	…	F56	F57	F58	F59	F60
0	0.0200	0.0371	0.0428	0.0207	0.0954	0.0986	…	0.0180	0.0084	0.0090	0.0032
1	0.0453	0.0523	0.0843	0.0689	0.1183	0.2583	…	0.0140	0.0049	0.0052	0.0044
2	0.0262	0.0582	0.1099	0.1083	0.0974	0.2280	…	0.0316	0.0164	0.0095	0.0078
3	0.0100	0.0171	0.0623	0.0205	0.0205	0.0368	…	0.0050	0.0044	0.0040	0.0117
4	0.0762	0.0666	0.0481	0.0394	0.0590	0.0649	…	0.0072	0.0048	0.0107	0.0094

3. Splitting into test and training sets

Use sklearn's model_selection module to split the data.

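# Separate the features from the label (the last column) as NumPy arrays
features = data.iloc[:,:-1].values
label = data.iloc[:,-1].values.astype(int)
# Randomly select 25% as the test set; the rest is the training set
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    features, label, test_size=0.25, random_state=0)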
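
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on placeholder arrays. The shape (208 rows, 60 features) is that of the UCI sonar dataset; the array contents and the class counts are stand-ins for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the sonar dataset's shape; the values are not real.
X = np.zeros((208, 60))
y = np.array([1] * 97 + [-1] * 111)

x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
print(len(x_tr), len(x_te))  # 156 52

# The same random_state reproduces the same split.
_, _, _, y_te_again = train_test_split(X, y, test_size=0.25, random_state=0)
print(np.array_equal(y_te, y_te_again))  # True
```

A test set of 52 samples matches the 52 predictions printed in section 5.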

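4. Loss function

Performance on the test set is generally worse than on the training set, and the test-set result is more representative of real-world performance (the model's ability to generalize).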

# Loss function for linear regression: half the sum of squared errors (least squares)
def getLinearModelLoss(y_result,y_label):
    return 0.5*np.sum((y_result - y_label)**2)
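
As a quick check of the loss defined above, the following sketch evaluates half the sum of squared errors on three made-up predictions. The function name `half_squared_error` and the numbers are chosen for illustration only.

```python
import numpy as np

def half_squared_error(y_result, y_label):
    """Half the sum of squared differences between predictions and labels."""
    diff = np.asarray(y_result, dtype=float) - np.asarray(y_label, dtype=float)
    return 0.5 * np.sum(diff ** 2)

# Errors are 0, -2 and 1.5, so the loss is 0.5 * (0 + 4 + 2.25) = 3.125
loss = half_squared_error([1.0, -1.0, 0.5], [1, 1, -1])
print(loss)  # 3.125
```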

5. Training the model

A simple linear model is used here.

rocksVMineModel = linear_model.LinearRegression()
rocksVMineModel.fit(x_train,y_train)
y_predictions = rocksVMineModel.predict(x_test)
print(y_predictions)

output

[ 0.9986581  -0.37480244 -1.10530037  1.25840044  0.11199301  0.21296523
 -0.9073503  -0.18223021  0.36710594 -1.16509835 -0.91282058 -0.68954176
  0.62070963 -0.15125346  0.4211044   0.70567205 -1.0496584   0.51670381
  1.26917949 -0.58659294 -0.89844225  0.92787838 -1.34932184  0.33586504
 -0.34785318 -0.86395703 -1.08992104 -0.39488961 -0.25195119 -1.20229847
  1.07364242 -1.00959605 -0.70790643  1.09261336 -0.53620214 -0.45691079
 -0.3194265  -1.08756532 -0.74958315 -1.47561133 -0.56484815  0.63582815
 -0.29872177  0.25879189 -0.81791236  0.6262992  -1.7570181  -1.09187623
  1.1579126   0.11256358 -0.54537382 -1.26992022]

6. Computing the loss

print("Test-set loss: %f"%getLinearModelLoss(np.array(y_predictions),np.array(y_test)))

output

Test-set loss: 23.285343

Minimizing this loss function is the objective of model optimization.

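7. Evaluating accuracy

We first define a confusion matrix made up of four counts: TP, FN, FP and TN.

TP means the actual and predicted values are both positive; FN means the actual value is positive and the predicted value negative; FP means the actual value is negative and the predicted value positive; TN means the actual and predicted values are both negative.

The total number of errors is therefore FP + FN.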

# predicted: predicted values
# actual: true labels
# threshold: decision threshold (a prediction above it counts as positive)
def confusionMatrix(predicted,actual,threshold):
    if(len(predicted)!=len(actual)):
        raise ValueError("predicted and actual must have the same length")
    tp,fp,tn,fn = 0,0,0,0
    for i in range(len(actual)):
        if actual[i] > 0:
            if predicted[i] > threshold:
                tp += 1
            else:
                fn += 1   # actual positive, predicted negative
        else:
            if predicted[i] > threshold:
                fp += 1   # actual negative, predicted positive
            else:
                tn += 1
    return [tp,fp,tn,fn]

confusionResult = confusionMatrix(y_predictions,y_test,0)
print("tp=%d\tfp=%d\ttn=%d\tfn=%d"%(confusionResult[0],
                                  confusionResult[1],confusionResult[2],confusionResult[3]))
print("accuracy:%f"%(1.0*(confusionResult[0]+confusionResult[2])/sum(confusionResult)))

output

tp=15 fp=4 tn=23 fn=10
accuracy:0.730769
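
The four counts can be cross-checked against `sklearn.metrics.confusion_matrix`. The sketch below uses six made-up scores and labels; with `labels=[-1, 1]` the flattened matrix is ordered tn, fp, fn, tp.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and regression scores, for illustration only.
actual = np.array([1, 1, 1, -1, -1, -1])
scores = np.array([0.9, 0.2, -0.4, -0.8, 0.3, -0.1])

# Turn the scores into class predictions with a threshold of 0.
predicted = np.where(scores > 0, 1, -1)

# Rows are actual classes, columns are predicted classes, both ordered [-1, 1].
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[-1, 1]).ravel()
accuracy = (tp + tn) / len(actual)
print(tp, fp, tn, fn)  # 2 1 2 1
```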

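8. Choosing the threshold

Section 7 used a threshold of 0. Accuracy changes with the threshold, and 0 is not always a good choice: the best threshold depends on the class distribution of the dataset. In this dataset the numbers of R and M samples are roughly equal, so 0 is likely to be close to optimal. In practice accuracy is not the only factor in choosing a threshold; the costs of TN, TP, FN and FP outcomes also have to be weighed. In cancer screening, for example, FN should be kept as low as possible, because classifying a cancer patient as healthy is unacceptable.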

# Sweep the threshold from -10 to 9 in steps of 1 and report the accuracy
for i in range(20):
    threshold = (i-10)/1.0
    confusionResult = confusionMatrix(y_predictions,y_test,threshold)
    print("threshold:%f\taccuracy:%f"%(threshold,1.0*(confusionResult[0]+confusionResult[2])/sum(confusionResult)))

output

threshold:-10.000000  accuracy:0.480769
threshold:-9.000000   accuracy:0.480769
threshold:-8.000000   accuracy:0.480769
threshold:-7.000000   accuracy:0.480769
threshold:-6.000000   accuracy:0.480769
threshold:-5.000000   accuracy:0.480769
threshold:-4.000000   accuracy:0.480769
threshold:-3.000000   accuracy:0.480769
threshold:-2.000000   accuracy:0.480769
threshold:-1.000000   accuracy:0.596154
threshold:0.000000    accuracy:0.730769
threshold:1.000000    accuracy:0.615385
threshold:2.000000    accuracy:0.519231
threshold:3.000000    accuracy:0.519231
threshold:4.000000    accuracy:0.519231
threshold:5.000000    accuracy:0.519231
threshold:6.000000    accuracy:0.519231
threshold:7.000000    accuracy:0.519231
threshold:8.000000    accuracy:0.519231
threshold:9.000000    accuracy:0.519231

All predictions lie between about -1.76 and 1.27, so every threshold at or below -2 labels all samples positive and every threshold at or above 2 labels all samples negative. That is why the accuracy is flat at both ends of the sweep.
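
The cost argument from this section can be sketched as follows: instead of maximizing accuracy, pick the threshold that minimizes a weighted count of errors. The cost values (a false negative costing five times a false positive), the function name `expected_cost` and the toy data are all assumptions made for illustration.

```python
import numpy as np

def expected_cost(scores, actual, threshold, cost_fn=5.0, cost_fp=1.0):
    """Weighted error count at a threshold (the weights are hypothetical)."""
    predicted_pos = scores > threshold
    actual_pos = actual > 0
    fn = np.sum(actual_pos & ~predicted_pos)   # positives that were missed
    fp = np.sum(~actual_pos & predicted_pos)   # negatives flagged as positive
    return cost_fn * fn + cost_fp * fp

# Made-up scores and labels, for illustration only.
scores = np.array([0.85, 0.25, -0.35, -0.75, 0.35, -0.15])
actual = np.array([1, 1, 1, -1, -1, -1])

# Thresholds from -1.0 to 1.0 in steps of 0.1.
thresholds = np.arange(-10, 11) / 10.0
costs = [expected_cost(scores, actual, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(best, min(costs))
```

Because missing a positive is expensive here, the cheapest threshold sits below 0: it accepts two false positives in order to avoid any false negative.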

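9. ROC curve

The receiver operating characteristic curve (ROC curve) is a graphical analysis tool used to:

select the best signal-detection model and discard inferior ones;
set the best threshold within a single model.

The ROC curve plots TPR against FPR. TPR is the fraction of positive samples that are correctly classified, and FPR is the fraction of negative samples that are wrongly classified as positive.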

$TPR=\frac{TP}{TP+FN}$

$FPR=\frac{FP}{TN+FP}$

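A small threshold tends to predict every sample as positive, so FN is naturally small and TPR approaches 1 (FPR approaches 1 as well).

A large threshold tends to predict every sample as negative, so FP is naturally small and FPR approaches 0 (TPR approaches 0 as well).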
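
The two limiting cases can be checked by computing TPR and FPR by hand at a few thresholds. The helper `tpr_fpr` and the toy data below are made up for illustration.

```python
import numpy as np

def tpr_fpr(scores, actual, threshold):
    """Return (TPR, FPR) when scores above the threshold count as positive."""
    predicted_pos = scores > threshold
    actual_pos = actual > 0
    tp = np.sum(predicted_pos & actual_pos)
    fn = np.sum(~predicted_pos & actual_pos)
    fp = np.sum(predicted_pos & ~actual_pos)
    tn = np.sum(~predicted_pos & ~actual_pos)
    return tp / (tp + fn), fp / (fp + tn)

# Made-up scores and labels, for illustration only.
scores = np.array([0.85, 0.25, -0.35, -0.75, 0.35, -0.15])
actual = np.array([1, 1, 1, -1, -1, -1])

low = tpr_fpr(scores, actual, -1.0)   # everything predicted positive
mid = tpr_fpr(scores, actual, 0.0)
high = tpr_fpr(scores, actual, 1.0)   # everything predicted negative
print(low, mid, high)
```

Each threshold gives one (FPR, TPR) point; the ROC curve is the set of these points over all thresholds.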

# Compute the ROC curve and the area under it (AUC)
fpr,tpr,thresholds = roc_curve(y_test,y_predictions)
roc_auc = auc(fpr,tpr)
print(roc_auc)

output

0.757037037037037

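AUC is the area under the ROC curve: a value of 0.5 corresponds to random guessing and a value of 1 to a perfect ranking of the samples.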

plt.clf()
plt.plot(fpr,tpr,label='ROC curve(area = %.2f)'%roc_auc)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()   # display the label defined in plt.plot
plt.show()

The ideal ROC curve rises in a straight line from (0,0) to (0,1) and then runs horizontally to (1,1), which corresponds to an AUC of 1.
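
The area under a piecewise-linear ROC curve can be computed with the trapezoid rule. The short sketch below (the helper name `trapezoid_area` is made up) confirms that the ideal curve has area 1 and that the diagonal of a random classifier has area 0.5.

```python
def trapezoid_area(xs, ys):
    """Area under the piecewise-linear curve through the points (xs[i], ys[i])."""
    area = 0.0
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Ideal ROC: (0,0) -> (0,1) -> (1,1)
ideal_auc = trapezoid_area([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])
# Random classifier: the diagonal (0,0) -> (1,1)
random_auc = trapezoid_area([0.0, 1.0], [0.0, 1.0])
print(ideal_auc, random_auc)  # 1.0 0.5
```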

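10. Ways to improve the model

10.1 Choosing a better model

Replace ordinary linear regression with penalized linear regression. Two penalized linear regression models are introduced below, ridge regression and Lasso regression; they differ mainly in the regularization term they use.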
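
For reference, the two objectives in their usual textbook form are sketched below. The notation is an assumption: $h_\theta(x)$ is the linear model's prediction, $m$ the number of samples and $n$ the number of coefficients. The squared-error term matches the loss from section 4.

```latex
\[
J_{\text{ridge}}(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
  + \lambda\sum_{j=1}^{n}\theta_j^2
\]

\[
J_{\text{lasso}}(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
  + \lambda\sum_{j=1}^{n}\left|\theta_j\right|
\]
```

sklearn calls the regularization parameter `alpha` and scales the squared-error term differently in the two models, so its `alpha` values are not directly comparable between Ridge and Lasso.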

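Here λ is the regularization parameter that weights the penalty on the coefficients θ. If λ is too large, all the parameters θ are shrunk toward zero and the model underfits; if λ is too small, overfitting is not adequately controlled. Choosing λ therefore takes some care.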

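The main difference between ridge regression and Lasso regression is that ridge regression adds an L2-norm penalty while Lasso regression adds an L1-norm penalty. Lasso can drive many of the θ to exactly 0, whereas ridge regression shrinks the coefficients but keeps all of them nonzero. This sparsity is Lasso's main advantage over ridge regression: the fitted model uses fewer features, so it is cheaper to evaluate and easier to interpret.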
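
The sparsity difference is easy to see on synthetic data. In the sketch below only the first two of twenty features influence the target; the data, the seed and the `alpha` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, of which only the first two matter.
rng = np.random.RandomState(0)
X = rng.randn(100, 20)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(100)

ridge = Ridge(alpha=0.1).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Count the coefficients that are exactly zero in each model.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("zero coefficients: ridge=%d lasso=%d" % (n_zero_ridge, n_zero_lasso))
```

Ridge keeps small nonzero weights on the irrelevant features, while Lasso sets most of them to exactly zero.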

Taking ridge regression as an example, we optimize the model above:

ridgeRegression = linear_model.Ridge()
ridgeRegression.fit(x_train,y_train)
y_predictions = ridgeRegression.predict(x_test)
confusionResult = confusionMatrix(y_predictions,y_test,0)
print("accuracy:%f"%(1.0*(confusionResult[0]+confusionResult[2])/sum(confusionResult)))

output

accuracy:0.846154

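Accuracy rose by about 11.5 percentage points (from 0.731 to 0.846). Note that this dataset has few features, few samples and a light computational load, so Lasso regression is not a good fit here; in my tests, using Lasso regression in this situation could actually lower the accuracy.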
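
The ridge model above uses sklearn's default regularization strength. One common way to choose it is to let `RidgeCV` compare several candidates by cross-validation. The sketch below runs on synthetic data; the candidate list `alphas` and the data are assumptions, not values from this article.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression data, for illustration only.
rng = np.random.RandomState(0)
X = rng.randn(80, 10)
true_coef = rng.randn(10)
y = X.dot(true_coef) + 0.5 * rng.randn(80)

# Candidate regularization strengths (hypothetical grid).
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
model = RidgeCV(alphas=alphas).fit(X, y)
print("selected alpha:", model.alpha_)
```

The selected value is stored in `model.alpha_`, and the fitted model can then be used with `predict` like the plain `Ridge` above.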

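10.2 Data preprocessing

Feature engineering is a very important part of machine learning. Space is limited, so I will not introduce the concept here; a dedicated article on feature engineering is recommended further reading.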

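11. Model validation: cross-validation

The split into training and test sets above was a single random assignment. Another way of holding out data is k-fold cross-validation, the details of which are easy to find online.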

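The original data is divided into train_data, used for training, and test_data, used for measuring accuracy. The result measured on test_data is called the validation_error. To judge an algorithm on a dataset, a single random split into train_data and test_data with a single validation_error is not enough, because the result depends on chance. Instead we split the data several times (k times) and compute a validation_error for each split. In k-fold cross-validation the data is partitioned into k folds and each fold serves as test_data exactly once. The resulting set of validation_error values gives a more reliable measure of how good the algorithm is.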

Cross-validation is a very good way to evaluate performance when data is limited. There are many ways to split the original data into train data and test data, and therefore many variants of cross-validation.
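
The k-fold scheme can be illustrated with sklearn's `KFold` on ten placeholder samples: across the five folds, every sample lands in the test part exactly once. The array `X` and the seed are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten placeholder samples with two features each.
X = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
times_tested = np.zeros(len(X), dtype=int)
fold_sizes = []
for train_idx, test_idx in kfold.split(X):
    times_tested[test_idx] += 1                       # mark the held-out samples
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)             # five folds of 8 training and 2 test samples
print(times_tested.tolist())  # every sample is tested exactly once
```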

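The most important cross-validation function in sklearn is:

sklearn.model_selection.cross_val_score

(Older sklearn versions provided it in sklearn.cross_validation, a module that was removed in version 0.20.)

It is called as

scores = model_selection.cross_val_score(model, raw_data, raw_target, cv=5)

# 5-fold cross-validation; the scorer returns the negated MSE, so flip the sign
scores = -model_selection.cross_val_score(rocksVMineModel, features, label, cv=5, scoring='neg_mean_squared_error')
print(scores)
# MSE of the single random split: the loss is half the sum of squared errors, so MSE = 2*loss/n
print("%f"%(2*getLinearModelLoss(rocksVMineModel.predict(x_test),np.array(y_test))/len(y_test)))

output

[2.21673781 1.53710234 1.9283356  2.48700644 1.0995107 ]
0.895590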

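For comparison, we also printed the mean squared error of the earlier single random split. Cross-validation gives a reasonably sound assessment of a model when the dataset is limited.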

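References

Michael Bowles, 《Python机器学习》 (Machine Learning in Python)

Machine Learning Summary (1): Linear Regression, Ridge Regression, Lasso Regression

Machine Learning: Cross Validation in the sklearn Library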