Machine Learning for Beginners
This article is intended to serve as a guide for anyone interested in exploring the fascinating field of Machine Learning (ML) but with no clue where to begin. In this article, I will cover some basic concepts that every beginner should understand before taking a deep dive into ML. Grab a coffee and let's get started!
Introduction to Machine Learning
Today, almost every department in academia wants to integrate ML into its research. In the business world, the demand for ML experts cannot be overstated. Why is ML so highly regarded? The simple answer is that more data is available than ever before, and we want to make sense of it. ML is really just about making decisions based on algorithms built (trained) with data. Over the years, ML algorithms have achieved great success in a wide variety of fields. Their success stories include disease diagnostics, image recognition, self-driving cars, spam detection, and handwritten digit recognition, to mention but a few.
The first and most important aspect of any machine learning task is the dataset (D_n), where n represents the number of observations. The dataset is divided into:
- The outcome (Y) we are interested in predicting, also known as the dependent variable, and
- The feature(s) (X_i's) that will be used for predicting the outcome, also known as the independent variable(s).
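For instance, using R's built-in iris dataset (which we will return to later), one way to set this up is (a minimal sketch, with Species taken as the outcome):

data(iris)            # the dataset D_n
n <- nrow(iris)       # n = 150 observations
Y <- iris$Species     # the outcome (dependent variable)
X <- iris[, 1:4]      # the features (independent variables)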
The ultimate goal of ML is to estimate (learn) the relationship pattern between the feature(s) and the outcome variable using some algorithm/mathematical mapping (f: X→Y) such that when we feed the algorithm the independent variable(s), it produces a prediction of the unknown outcome.
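As a concrete, if simplistic, illustration in R, linear regression estimates such a mapping from a feature to a continuous outcome (the choice of variables here is purely illustrative):

# Learn f: X -> Y (here, predict petal width from petal length)
f <- lm(Petal.Width ~ Petal.Length, data = iris)

# Feed the algorithm a new feature value; it produces a prediction
predict(f, newdata = data.frame(Petal.Length = 4.0))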
Machine Learning Algorithms
At the most basic level, ML algorithms come in two forms:
Supervised Learning: this refers to tasks where we have an outcome to predict. That is, every observation of the features has a corresponding outcome. An example of a supervised learning task is predicting customer churn based on some influencing features.
Unsupervised Learning: this refers to tasks where we have no outcome to predict. Here, rather than predicting an outcome, we seek to understand the relationships between the features or between the observations, or to detect anomalous observations.
Considering the example above, suppose the churn variable did not exist: we would instead seek to understand the relationships between the features, or between the customers based on those features, as sketched below.
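To make the contrast concrete, here is a sketch with a small hypothetical customer data frame (the column names and values are invented for illustration):

# Hypothetical customer data
customers <- data.frame(
  tenure  = c(2, 45, 13, 30, 5),     # months as a customer
  monthly = c(70, 20, 55, 35, 90),   # monthly bill
  churn   = factor(c("yes", "no", "no", "no", "yes"))
)

# Supervised: churn is the outcome we train a model to predict,
# e.g. glm(churn ~ tenure + monthly, data = customers, family = binomial)

# Unsupervised: drop the outcome and look for structure in the features alone
clusters <- kmeans(customers[, c("tenure", "monthly")], centers = 2)
clusters$cluster   # group membership for each customer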
It is worth noting that a variable can be either categorical or continuous. For now, let's focus on the nature of the outcome variable. In Supervised Learning parlance, if the outcome variable is categorical we have a classification task, and if it is continuous we have a regression task. Categorical implies that the variable is made up of distinct categories (e.g. gender has two categories: Male and Female), while continuous implies that the variable is measured on a numeric scale (e.g. salary). For the rest of this article, we will focus on Supervised Learning tasks.
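In R terms, a categorical variable is typically stored as a factor and a continuous one as a numeric vector; a quick sketch with made-up values:

gender <- factor(c("Male", "Female", "Female"))   # categorical -> classification target
salary <- c(52000, 61000, 58500)                  # continuous  -> regression target
is.factor(gender)   # TRUE
is.numeric(salary)  # TRUE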
Train and Test Sets
For the algorithm to learn the relationship pattern between the feature(s) and the outcome variable, it has to be exposed to examples. The dataset containing the examples for training a learning machine is called the train set (D^tr). On the other hand, the accuracy of an algorithm is measured by how well it predicts the outcome of observations it has not seen before. The dataset containing the observations not used in training the ML algorithm is called the test set (D^te). So in practice, we divide our dataset into train and test sets, train the algorithm on the train set, and evaluate its performance on the test set.
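A common way to do this split in R (the 70/30 proportion and the seed are arbitrary choices for illustration):

set.seed(42)                                   # for reproducibility
n <- nrow(iris)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_set <- iris[train_idx, ]                 # D^tr: used for training
test_set  <- iris[-train_idx, ]                # D^te: held out for evaluation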
Performance Evaluation
Each time we estimate the true outcome (Y) with a trained ML algorithm (f(X)), the discrepancy between the observed and predicted values (which might be zero, especially for classification tasks) must be quantified. The great question is: how do we quantify this discrepancy? This brings in the notion of a loss function. A loss function L(⋅,⋅) is a bivariate function that quantifies the loss (error) we sustain from predicting Y with f(X). Put another way, the loss function quantifies how close the prediction f(X) is to the ground truth Y.
Regression Loss Function: L(Y, f(X)) = (Y − f(X))²
This is popularly known as the squared error loss, and it is simply the square of the difference between the observed and predicted values. The losses are squared so that positive and negative errors do not cancel out when aggregated over the entire dataset.
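For a single observation, this is one line of R (the values are made up):

y     <- 3.0     # observed outcome Y
y_hat <- 2.5     # prediction f(X)
(y - y_hat)^2    # squared error loss = 0.25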
Classification Loss Function: L(Y, f(X)) = I(Y ≠ f(X))
The function on the right-hand side is an indicator function that returns 1 if the predicted and observed values are different, and 0 otherwise. In practice, it is known as the zero-one loss. The idea here is that loss is only incurred when our algorithm misclassifies an observation.
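This is equally simple in R, since comparing two values yields TRUE/FALSE, which coerces to 1/0:

y     <- "diabetic"       # observed class Y
y_hat <- "not diabetic"   # predicted class f(X)
as.numeric(y != y_hat)    # zero-one loss = 1 (a misclassification)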
It is worth noting that the loss function as defined above corresponds to a single observation. In practice, however, we want to quantify the loss over the entire dataset, and this is where the notion of empirical risk comes in. The loss quantified over the entire dataset is called the empirical risk. Our goal in ML is to develop an algorithm such that the empirical risk is as small as possible. The empirical risk is also called the cost function or the objective function we want to minimize.
Empirical Risk
R̂_n(f) = (1/n) Σ_{i=1}^n L(Y_i, f(X_i))
In the regression sense, R̂_n(f) is the mean squared error (the mean of the squared error loss), and in the classification sense, it is 1 − accuracy.
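In R, both versions of the empirical risk are one-liners over vectors of observed and predicted values (illustrative numbers):

# Regression: mean squared error
y     <- c(3.0, 1.5, 2.2)
y_hat <- c(2.5, 1.7, 2.0)
mean((y - y_hat)^2)

# Classification: misclassification rate, i.e. 1 - accuracy
cls     <- c(1, 0, 1, 1)
cls_hat <- c(1, 0, 0, 1)
mean(cls != cls_hat)   # 0.25 = 1 - 0.75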

It turns out that the empirical risk in the classification sense is simply the misclassification probability, P[Y ≠ f(X)]. Take an example of binary classification where our interest is in classifying whether a patient is diabetic (1) or not diabetic (0). Suppose, for illustration, that a trained classifier produced the following results for ten patients:

Patient:          1  2  3  4  5  6  7  8  9  10
Observed Y:       1  0  1  1  0  0  1  0  1  0
Predicted f(X):   1  0  0  1  0  0  1  1  1  0
In this case, the predicted class differs from the observed class only for patients 3 and 8, so the zero-one loss is 1 for those two observations and 0 for the rest:

R̂_n(f) = (1/10) Σ_{i=1}^{10} I(Y_i ≠ f(X_i))
To get the misclassification probability, we simply divide the number of incorrectly classified observations by the total number of observations in the sample, which gives 2/10 = 0.2. In probability language:
P[Y ≠ f(X)] = (1/10) ⋅ 2 = 0.2
With this example, we have verified that R̂_n(f) = P[Y ≠ f(X)] in the classification sense. It is therefore intuitive to seek a learning machine with as small a misclassification probability (empirical risk) as possible.
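We can verify the arithmetic in R with the ten hypothetical outcomes above:

y     <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)   # observed Y
y_hat <- c(1, 0, 0, 1, 0, 0, 1, 1, 1, 0)   # predicted f(X)
mean(y != y_hat)                            # 0.2, the empirical risk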
The way the empirical risk is minimized in practice differs from algorithm to algorithm. Exploring empirical risk minimization will open up doors to concepts like the bias-variance tradeoff, underfitting, overfitting, resampling, et cetera. These concepts will be explored in future articles; for now, let us build our first learning machine in R. We will explore a binary classification task using the k-Nearest Neighbours (kNN) learning machine. We use the popular iris dataset, which is freely available in R, and our task is to predict whether an observation belongs to the versicolor or the virginica class.
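A minimal sketch of these steps in R (the split proportion, random seed, and choice of k = 5 here are illustrative assumptions, so the exact accuracy obtained may differ slightly from the figure reported below):

library(class)                                   # provides knn()

# Keep only the versicolor and virginica classes (binary task)
data(iris)
binary_iris <- droplevels(subset(iris, Species != "setosa"))

# Train-test split
set.seed(1)
n <- nrow(binary_iris)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_set <- binary_iris[train_idx, ]
test_set  <- binary_iris[-train_idx, ]

# Train kNN on the train set and predict the test set
pred <- knn(train = train_set[, 1:4],
            test  = test_set[, 1:4],
            cl    = train_set$Species,
            k     = 5)

# Evaluate: accuracy on the test set
mean(pred == test_set$Species)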
This is a simple demonstration of the concepts discussed above. We got our data, did the train-test split, developed a kNN algorithm on the train set, evaluated it on the test set, and finally computed the accuracy of the model. We achieved an overall accuracy of 93.75% on the test set, which means our algorithm correctly classified 93.75% of the observations in the test set. Put another way, the probability of our machine correctly classifying an observation it has not seen before is 0.9375. Another convenient interpretation would be: the probability of our machine misclassifying an observation it has not seen before is 0.0625 (1 − 0.9375). The details of kNN have been left out because they are not the point of this article.
Translated from: https://medium.com/zacrac/machine-learning-for-beginners-6e4ec98631b3