Article link: https://blog.youkuaiyun.com/pytorch8learner/article/details/149549779

Machine Learning Concepts: An In-Depth Look

1. Training/Test Set Splitting and Cross-Validation

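In machine learning, splitting data into training and test sets and cross-validation are foundational concepts, and they mark a notable difference between purely statistical methods and machine learning methods. A statistical modeling task might involve regression, parametric or non-parametric tests, and similar operations. In machine learning, the algorithmic approach is combined with iterative evaluation of the results and stepwise improvement of the model.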

1.1 Splitting Data into Training and Test Sets

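Every machine learning modeling process usually starts with data cleaning; the next step is to split the data into a training set and a test set. Typically, a random subset of rows is selected to build the model, and the rows that were not selected are used to test the final model. The training share is commonly 70 to 80% of the data. With an 80-20 split, for example, 80% of the data is used to build the model and 20% is used to test it.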
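To make the random row selection concrete, here is a minimal base R sketch of an 80-20 split. The data frame `toy` and its columns are made up purely for illustration and are not part of the original example; the seed value is arbitrary.

```r
# Minimal base R sketch of an 80/20 random split.
# `toy` is a hypothetical data frame used only for illustration.
set.seed(42)  # arbitrary seed so the split is reproducible

toy <- data.frame(
  x = rnorm(100),
  y = factor(rep(c("neg", "pos"), times = 50))
)

n_train <- floor(0.80 * nrow(toy))                        # number of training rows
train_rows <- sample(seq_len(nrow(toy)), size = n_train)  # random row indices

toy_train <- toy[train_rows, ]   # rows used to build the model
toy_test  <- toy[-train_rows, ]  # remaining rows, held out for testing

nrow(toy_train)
nrow(toy_test)
```

Note that plain `sample` picks rows without regard to the outcome column, so the class proportions in the two sets can drift apart on small or imbalanced data. A split that samples within each outcome class avoids this.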

The createDataPartition function can be used to perform the split. Example code:

library(caret) # createDataPartition comes from the caret package
training_index <- createDataPartition(diab$diabetes, p = 0.80, list = FALSE, times = 1)
length(training_index) # Number of rows selected for the training set
nrow(diab) # Total number of rows in the dataset
# Create the training set: this is the data used to build the model
diab_train <- diab[training_index,