Aalto-Machine Learning Supervised Method overview (unfinished project; have a look if you are interested)

Dadaist_

Article link: https://blog.youkuaiyun.com/Dadaist_/article/details/122954727

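This article outlines the basic theory and the different types of machine learning, including the viewpoints of statistical learning, Bayesian statistics, and information theory. It discusses supervised and unsupervised learning tasks such as classification, regression, and clustering. It introduces the importance of model selection, in particular the trade-off between model complexity and generalization ability. Linear classifiers such as the perceptron and logistic regression are mentioned, with particular emphasis on support vector machines and how they differ from the perceptron. It also covers model evaluation tools such as ROC curves and risk measures, and presents regularization and structural risk minimization as ways to optimize learning algorithms.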

Chapter 1 Introduction

Different theoretical paradigms for machine learning:

Statistical learning theory: assumes the data come from an unknown distribution P(x), but does not estimate the distribution
Bayesian statistics: assumes prior information on P(x) and estimates posterior probabilities
Information-theoretic learning: estimates distributions, but does not assume a prior on P(x)

Two different types of machine learning:

Supervised machine learning
• training data contains inputs and outputs (=supervision)
• goal is to predict outputs for new inputs
• typical tasks: classification, regression, ranking
Unsupervised machine learning
• training data does not contain outputs
• goal is to describe and interpret the data
• typical tasks: clustering, association analysis, dimensionality reduction, pattern discovery

Regression deals with output variables which are numeric

Model or hypothesis h : X → Y that we use to predict outputs given the inputs x

Empirical risk: measure an approximation of the error of the model by computing the average of the losses on individual instances
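The averaging described above can be sketched in a few lines of Python. This is only an illustration: the zero-one loss, the threshold classifier `h`, and the four data points are assumptions made up for the example, not part of the original notes.

```python
def zero_one_loss(y_true, y_pred):
    """Return 1 if the prediction is wrong, 0 otherwise."""
    return 0 if y_true == y_pred else 1


def empirical_risk(h, xs, ys, loss=zero_one_loss):
    """Average loss of hypothesis h over the sample (xs, ys)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    return sum(loss(y, h(x)) for x, y in zip(xs, ys)) / len(xs)


def h(x):
    """Hypothetical threshold classifier used only for this example."""
    return 1 if x >= 0.5 else -1


xs = [0.1, 0.4, 0.6, 0.9]
ys = [-1, 1, 1, 1]
risk = empirical_risk(h, xs, ys)  # one of the four points is misclassified
```

Any other per-instance loss can be passed through the `loss` argument.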

Hamming loss is used in multilabel learning: it measures how many of the individual labels are predicted incorrectly.
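A minimal sketch of the Hamming loss, assuming each example carries a fixed-length binary label vector and that the loss is normalized by the number of labels (some texts use the unnormalized count instead). The label vectors are made up for illustration.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label positions where the two label vectors disagree."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label vectors must be non-empty and of equal length")
    return sum(a != b for a, b in zip(y_true, y_pred)) / len(y_true)


# One of the four labels is predicted incorrectly.
example_loss = hamming_loss([1, 0, 1, 0], [1, 1, 1, 0])
```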

Generalization error, or the true risk, is the expected loss on future examples, assuming they are independently drawn from the same distribution D that generated the training examples.
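In symbols, one standard way to write the true risk next to the empirical risk is the following. The notation (loss L, sample size m, R and \hat{R}) is an assumption chosen for this sketch.

```latex
R(h) = \mathbb{E}_{(x,y) \sim D}\left[ L(h(x), y) \right],
\qquad
\hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i)
```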

Different hypothesis classes or model families in machine learning:

Linear models such as logistic regression and perceptron
Neural networks: compute non-linear input-output mappings through a network of simple computation units
Kernel methods: implicitly compute non-linear mappings into high-dimensional feature spaces (e.g. SVMs)
Ensemble methods: combine simpler models into powerful combined models (e.g. Random Forests)

Consistent hypothesis: a hypothesis that correctly classifies all training examples
Version space: the set of all consistent hypotheses of the hypothesis class
Margin is the minimum distance between the decision boundary and a training point

Confusion matrix
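As a rough illustration, the four cells of a binary confusion matrix can be counted as below. The labels in {-1, +1} and the toy predictions are assumptions for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, 1, -1, -1, 1, -1, -1, 1]
counts = confusion_counts(y_true, y_pred)
tpr = counts["TP"] / (counts["TP"] + counts["FN"])  # true positive rate
fpr = counts["FP"] / (counts["FP"] + counts["TN"])  # false positive rate
```

The two rates computed at the end are the quantities that an ROC curve plots against each other.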

Receiver Operating Characteristic (ROC)
ROC curves summarize the trade-off between the true positive rate and the false positive rate for a predictive model using different probability thresholds. Consider a system which returns an estimate of the class probability P̂(y|x) or any score that correlates with it.
The higher the ROC curve goes, that is, the closer it gets to the top-left corner, the better the algorithm or model.
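A small sketch of how ROC points can be obtained by sweeping a threshold over the scores. The scores and labels are invented for illustration, and the area under the curve is computed with the trapezoidal rule as one common summary.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by thresholding the scores."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each step predicts more positives.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y != 1)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [1, 1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
area = auc(points)
```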

Chapter 2 Statistical learning theory

Probably Approximately Correct (PAC) learning framework
Probably approximately correct learning is a theoretical framework for analysing the generalization performance of machine learning algorithms.
PAC theory is concerned with upper bounding the probability δ of "bad" events, those of high generalization error, that is, R(h) > ε.
ε sets the level of generalization error that is of interest to us; say we are content with predicting incorrectly 10% of the new data points: ε = 0.1

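An equivalent generalization error bound for a finite hypothesis class, consistent case: with probability at least 1 − δ, $R(h) \le \frac{1}{m}\left(\log|H| + \log\frac{1}{\delta}\right)$.
Here |H| is the number of distinct hypotheses in the class and m is the number of training examples. The more hypotheses there are in H, the more training examples are needed.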
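To make the dependence on the number of hypotheses concrete, here is a small sketch. It assumes the standard consistent-case bound for a finite class, R(h) ≤ (ln|H| + ln(1/δ)) / m with probability at least 1 − δ, and solves it for m. The function names and the example numbers are hypothetical.

```python
import math


def pac_error_bound(num_hypotheses, m, delta):
    """Upper bound on the true risk of a consistent hypothesis (finite class)."""
    return (math.log(num_hypotheses) + math.log(1 / delta)) / m


def pac_sample_size(num_hypotheses, epsilon, delta):
    """Smallest m for which the bound above is at most epsilon."""
    return math.ceil((math.log(num_hypotheses) + math.log(1 / delta)) / epsilon)


# epsilon = 0.1 (10% error) and delta = 0.05 (95% confidence)
m_small = pac_sample_size(1000, 0.1, 0.05)
m_large = pac_sample_size(10**6, 0.1, 0.05)
```

Multiplying the class size by a thousand raises the required sample size only additively, because |H| enters through its logarithm.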
Generalization error bound for a finite hypothesis class, inconsistent case
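The figure with the bound is missing from these notes. A commonly stated form for a finite class, assuming a loss bounded in [0, 1] and holding with probability at least 1 − δ for all h in H, is:

```latex
R(h) \le \hat{R}(h) + \sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}
```

Compared with the consistent case, the empirical risk \hat{R}(h) now appears explicitly and the complexity term decays as 1/\sqrt{m} rather than 1/m.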

Chapter 3 Learning with infinite hypothesis classes

VCdim(H) = size of the largest training set for which we can find a consistent classifier for all labelings in Y^m
A finite hypothesis class has VC dimension VCdim(H) ≤ log_2 |H|
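The notion of realizing all labelings (shattering) can be checked by brute force on small point sets. The sketch below uses intervals on the real line, h(x) = +1 iff a ≤ x ≤ b, as an assumed example class; it is not taken from the original notes.

```python
from itertools import product


def interval_labelings(points):
    """All labelings of `points` realizable by h(x) = +1 iff a <= x <= b."""
    pts = sorted(points)
    # The empty interval labels everything -1; otherwise it suffices to
    # try intervals whose endpoints are data points.
    labelings = {tuple(-1 for _ in points)}
    for a in pts:
        for b in pts:
            if a <= b:
                labelings.add(tuple(1 if a <= x <= b else -1 for x in points))
    return labelings


def is_shattered(points, realizable):
    """True if every labeling in {-1, +1}^m is realizable on the points."""
    achieved = realizable(points)
    return all(lab in achieved for lab in product((-1, 1), repeat=len(points)))


two = is_shattered([0.0, 1.0], interval_labelings)
three = is_shattered([0.0, 1.0, 2.0], interval_labelings)
```

Two points are shattered, while for these three points the labeling (+1, −1, +1) cannot be produced by a single interval. The same ordering argument applies to any three points on the line, which is why intervals have VC dimension 2.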

Convex polygons have VC dimension = ∞

Generalization bound based on the VC-dimension
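The figure with the bound is missing here. One textbook form, assuming binary classification with zero-one loss, VC dimension d, and sample size m, holding with probability at least 1 − δ for all h in H, is:

```latex
R(h) \le \hat{R}(h) + \sqrt{\frac{2 d \log\frac{e m}{d}}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}
```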

Rademacher complexity
Rademacher complexity defines complexity as the capacity of a hypothesis class to fit random noise
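The idea of fitting random noise can be simulated directly. The sketch below estimates the empirical Rademacher complexity of a finite set of hypotheses by Monte Carlo, assuming the definition E_σ[sup_h (1/m) Σ σ_i h(x_i)] with predictions in {-1, +1}; some presentations include an extra factor of 1/2. The hypotheses are represented only by their prediction vectors, which is a simplification for this example.

```python
import random
from itertools import product


def empirical_rademacher(predictions, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    predictions: list of prediction vectors, one per hypothesis, each with
    entries in {-1, +1} for the m training points.
    """
    rng = random.Random(seed)
    m = len(predictions[0])
    total = 0.0
    for _ in range(n_draws):
        # Random labels (Rademacher variables), uniform on {-1, +1}.
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        # Best correlation any hypothesis achieves with the random labels.
        total += max(
            sum(s * p for s, p in zip(sigma, preds)) / m for preds in predictions
        )
    return total / n_draws


# A single hypothesis cannot adapt to noise; a class realizing all
# labelings of 4 points fits every noise vector perfectly.
single = empirical_rademacher([[1, 1, -1, -1]])
rich = empirical_rademacher([list(p) for p in product((-1, 1), repeat=4)])
```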

Generalization bound with Rademacher complexity
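The figure with the bound is missing here. One standard statement, assuming binary classifiers with values in {-1, +1}, zero-one loss, and the empirical Rademacher complexity \hat{\mathfrak{R}}_S(H) defined without an extra factor of 1/2, holding with probability at least 1 − δ for all h in H, is:

```latex
R(h) \le \hat{R}_S(h) + \hat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}
```

The constants depend on the normalization used in the definition, so they may differ from the course material.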
The differences between Rademacher complexity and VC dimension:

VC dimension is independent of any training sample or distribution generating the data: it measures the worst case, where the data is generated in a bad way for the learner
Rademacher complexity depends on the training sample and thus on the data-generating distribution
VC dimension focuses on the extreme case of realizing all labelings of the data
Rademacher complexity measures smoothly the ability to realize random labelings
Generalization bounds based on Rademacher complexity are applicable to any binary classifier (SVM, neural network, decision tree)
Rademacher complexity motivates state-of-the-art learning algorithms such as support vector machines, but computing it might be hard if we need to train a large number of classifiers
Vapnik-Chervonenkis dimension (VCdim) is an alternative that is usually easier to derive analytically

In a nutshell

Vapnik-Chervonenkis dimension lets us study the learnability of infinite hypothesis classes through the concept of shattering
Rademacher complexity is a practical alternative to VC dimension, typically giving sharper bounds (but requiring a lot of simulations to be run)

Chapter 4 Model selection

Stochastic scenario
The analysis so far assumed that the labels are deterministic functions of the input. The stochastic scenario relaxes this assumption by assuming the output is a probabilistic function of the input. The input and output are generated by a joint probability distribution D over X × Y. This setup covers cases where the same input x can have different labels y.

Bayes error
In the stochastic scenario, there is a minimal error that any hypothesis must incur, which can be non-zero. It is called the Bayes error and denoted R*. A hypothesis that attains R* is called a Bayes classifier.
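In symbols, under the usual convention that the infimum runs over all measurable hypotheses (an assumption about notation, since the original figure is missing):

```latex
R^{*} = \inf_{h \ \text{measurable}} R(h),
\qquad
R(h_{\text{Bayes}}) = R^{*}
```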

We have a trade-off. Increasing the complexity of the hypothesis class:

decreases the approximation error, as the class is more likely to contain a hypothesis with error close to the Bayes error
increases the estimation error, as finding a good hypothesis becomes harder and the generalization bounds become looser
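The two error terms come from a standard decomposition of the excess risk of a hypothesis h in H, written here with R* for the Bayes error (the notation is assumed for this sketch):

```latex
R(h) - R^{*}
= \underbrace{\Big( R(h) - \inf_{h' \in H} R(h') \Big)}_{\text{estimation error}}
+ \underbrace{\Big( \inf_{h' \in H} R(h') - R^{*} \Big)}_{\text{approximation error}}
```

The approximation error depends only on how rich H is, while the estimation error depends on how well the learner can pick a good hypothesis from H given the sample.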

Structural risk minimization
SRM assumes that the hypothesis class is decomposed into a countable union of classes H_k of increasing complexity
SRM aims to minimize the excess risk R(h) − R(h_Bayes) by bounding R(h)
The bound takes both the empirical error and the complexity of the hypothesis class into account (through a penalty term)
The model selection task is to select the optimal index k* and the hypothesis h ∈ H_{k*} that gives the best generalization bound

Generalization bound for SRM:
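The figure with the bound is missing here. As a hedged reconstruction from textbook treatments of SRM, assuming H is the union of the classes H_k, \mathfrak{R}_m denotes Rademacher complexity, and k(h) is the smallest index k with h in H_k, the selection rule and the resulting guarantee (with probability at least 1 − δ) can be written as below. The constants should be checked against the course material.

```latex
h^{\mathrm{SRM}} = \operatorname*{argmin}_{k \ge 1,\; h \in H_k}
\; \hat{R}_S(h) + \mathfrak{R}_m(H_k) + \sqrt{\frac{\log k}{m}}

R(h^{\mathrm{SRM}}) \le \inf_{h \in H}
\left( R(h) + 2\,\mathfrak{R}_m(H_{k(h)}) + \sqrt{\frac{\log k(h)}{m}} \right)
+ \sqrt{\frac{2 \log\frac{3}{\delta}}{m}}
```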

SRM model selection: pros and cons
Structural risk minimization benefits from strong learning guarantees. However, the assumption of a countable decomposition of the hypothesis class is a restrictive one. The computational price to pay is large, especially when a large number of hypothesis classes H_k has to be processed.

Regularization-based algorithms

• The larger the training set, the better the generalization error will be (e.g. by PAC theory)
• The larger the validation set, the less variance there is in the test error estimate.
• When the dataset is small, the training set is generally taken to be as large as possible, typically 90% or more of the total
• When the dataset is large, training set size is often taken as big as the computational resources allow

Stratification
Stratification is a process that tries to ensure similar class distributions across the different sets
A simple stratification approach is to divide each class separately into training and validation parts, and then merge the class-specific training sets into a global training set and the class-specific validation sets into a global validation set.
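The per-class split and merge can be sketched as follows. The function name, the 25% validation fraction, and the toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, validation_fraction=0.25, seed=0):
    """Return (train_indices, validation_indices) with similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, validation = [], []
    # Split every class on its own, then merge the class-specific parts.
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * validation_fraction)
        validation.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(validation)


labels = [1] * 8 + [-1] * 4
train_idx, val_idx = stratified_split(labels)
```

Both resulting sets keep the 2:1 class ratio of the full data.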

Cross-validation
n-fold cross-validation gives us a well-founded way to do model selection. However, using only a single test set may result in unwanted variation.
Nested cross-validation solves this problem by using two cross-validation loops: an inner loop for selecting the model and an outer loop for estimating its performance.
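A compact sketch of the two loops, under several simplifying assumptions: folds are contiguous and unshuffled, `fit(train_indices, param)` and `score(model, indices)` are hypothetical callables supplied by the user, and the toy "model" is just a threshold whose value is the hyperparameter being selected.

```python
def kfold_indices(n, n_folds):
    """Split range(n) into n_folds contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(n_folds):
        size = n // n_folds + (1 if f < n % n_folds else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cv_score(indices, n_folds, param, fit, score):
    """Mean validation score of `param` under n-fold CV on `indices`."""
    total = 0.0
    for fold in kfold_indices(len(indices), n_folds):
        val = [indices[i] for i in fold]
        val_set = set(val)
        train = [i for i in indices if i not in val_set]
        total += score(fit(train, param), val)
    return total / n_folds


def nested_cv(n, params, fit, score, outer_folds=5, inner_folds=3):
    """Outer loop estimates performance; inner loop selects the parameter."""
    outer_scores = []
    for fold in kfold_indices(n, outer_folds):
        test_set = set(fold)
        rest = [i for i in range(n) if i not in test_set]
        best = max(params, key=lambda p: cv_score(rest, inner_folds, p, fit, score))
        outer_scores.append(score(fit(rest, best), fold))
    return sum(outer_scores) / outer_folds


# Toy data: the label is +1 exactly when x >= 0.5.
xs = [i / 20 for i in range(20)]
ys = [1 if x >= 0.5 else -1 for x in xs]


def fit(train_indices, threshold):
    # Toy "training": the model is fully determined by the hyperparameter.
    return lambda x: 1 if x >= threshold else -1


def score(model, indices):
    return sum(model(xs[i]) == ys[i] for i in indices) / len(indices)


estimate = nested_cv(len(xs), [0.2, 0.5, 0.8], fit, score)
```

The test fold of each outer iteration is never seen by the inner loop, so the outer scores are not biased by the hyperparameter search. In practice the data would be shuffled (and possibly stratified) before forming the folds.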

Chapter 5 Linear classification

Linear classifiers

Perceptron
• At each step, it finds a training example x_i that is incorrectly classified by the current model
• It updates the model by adding the example to the current weight vector together with the label: w^(t+1) ← w^(t) + y_i x_i
• This process is continued until no incorrectly predicted training examples are found

The perceptron algorithm can be shown to eventually converge to a consistent hyperplane if the two classes are linearly separable, that is, if there exists a hyperplane that separates the two classes.
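The update rule can be sketched in plain Python as follows. The bias is handled by appending a constant feature 1.0 to every input, the `max_epochs` cap is a safeguard added for this example, and the four training points are made up.

```python
def perceptron(X, y, max_epochs=100):
    """Train a perceptron; returns the weight vector w."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            # An example is misclassified when y_i * <w, x_i> is not positive.
            if y_i * sum(w_j * x_j for w_j, x_j in zip(w, x_i)) <= 0:
                w = [w_j + y_i * x_j for w_j, x_j in zip(w, x_i)]
                mistakes += 1
        if mistakes == 0:  # consistent hyperplane found
            break
    return w


# Linearly separable toy data; the last feature is the constant bias input.
X = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]]
y = [1, 1, -1, -1]
w = perceptron(X, y)
```

Without the epoch cap the loop would not terminate on data that is not linearly separable.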

The main source of difficulty is the "step function" shape of the zero-one loss function. It is non-differentiable, so it cannot be optimized with gradient-based approaches. It is non-convex, so an optimizer is susceptible to getting stuck in local minima.
Surrogate loss functions for classification
There are multiple surrogate losses that are convex upper bounds to zero-one loss and are differentiable everywhere or almost everywhere:
• Squared loss - used for regression, not optimal for classification
• Hinge loss - used in Support vector machines (Lecture 6)
• Exponential loss - used in Boosting
• Logistic loss - used in Logistic regression
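The four surrogates can be written as functions of the margin z = y f(x), with labels in {-1, +1}. This parameterization is an assumption made for the sketch, as is counting z = 0 as an error and using the base-2 logarithm for the logistic loss (with the natural logarithm it would not upper-bound the zero-one loss at z = 0).

```python
import math


def zero_one(z):
    return 1.0 if z <= 0 else 0.0


def squared(z):
    # (y - f(x))^2 equals (1 - z)^2 when y is -1 or +1.
    return (1 - z) ** 2


def hinge(z):
    return max(0.0, 1 - z)


def exponential(z):
    return math.exp(-z)


def logistic(z):
    return math.log2(1 + math.exp(-z))


surrogates = {
    "squared": squared,
    "hinge": hinge,
    "exponential": exponential,
    "logistic": logistic,
}
margins = [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]
table = {name: [f(z) for z in margins] for name, f in surrogates.items()}
```

At every margin in the list each surrogate is at least as large as the zero-one loss, which is what makes minimizing a surrogate a sensible proxy.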
Logistic regression
logistic loss:
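The formula is missing from these notes. A standard form, assuming labels y in {-1, +1} and a linear model w^T x, is given below together with the class probability model it corresponds to. The base of the logarithm is a convention; base 2 makes the loss an upper bound of the zero-one loss.

```latex
L_{\text{logistic}}(y, w^{T}x) = \log\left(1 + \exp(-y\, w^{T}x)\right),
\qquad
P(y \mid x) = \frac{1}{1 + \exp(-y\, w^{T}x)}
```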

Logistic regression optimization problem
We will use stochastic gradient descent to incrementally step towards the direction where the objective decreases fastest, the negative gradient
• The stepsize parameter η, also called the learning rate, is critical for convergence to the optimum value
• If one uses a small constant stepsize, the initial convergence may be unnecessarily slow
• Too large a stepsize may cause the method to continually overshoot the optimum
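A minimal sketch of stochastic gradient descent on the logistic loss, assuming labels in {-1, +1}, the natural logarithm, a constant stepsize `eta`, and no regularization. The toy data and the default parameter values are made up for illustration.

```python
import math
import random


def sgd_logistic(X, y, eta=0.1, epochs=50, seed=0):
    """Minimize the average logistic loss with stochastic gradient descent."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * sum(w_j * x_j for w_j, x_j in zip(w, X[i]))
            # Gradient of log(1 + exp(-margin)) w.r.t. w is
            # -y_i * x_i / (1 + exp(margin)); step in the opposite direction.
            coeff = y[i] / (1.0 + math.exp(margin))
            w = [w_j + eta * coeff * x_j for w_j, x_j in zip(w, X[i])]
    return w


# Toy data; the last feature is a constant bias input.
X_toy = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]]
y_toy = [1, 1, -1, -1]
w_lr = sgd_logistic(X_toy, y_toy)
margins_lr = [
    y_i * sum(a * b for a, b in zip(w_lr, x_i)) for x_i, y_i in zip(X_toy, y_toy)
]
```

Unlike the perceptron, every example contributes to the update, with a weight that shrinks as its margin grows.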

Logistic regression is a classification method that can be interpreted as maximizing odds ratios of conditional class probabilities

Chapter 6 Support vector machines

regularized learning problem
• First term minimizes a loss function on training data
• Second term, called the regularizer, controls the complexity of the model
• The parameter λ = 1/C controls the balance between the two terms
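The figure with the objective is missing here. One common way to write a regularized learning problem of this shape for the soft-margin SVM, assuming hinge loss, a linear model without an explicit bias term, and m training examples, is the following. The exact scaling of the two terms may differ from the course material.

```latex
\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} \max\left(0,\; 1 - y_i\, w^{T}x_i\right) + \frac{\lambda}{2} \lVert w \rVert^{2}
```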

Difference between SVM and perceptron update
Both share the idea of adding to w the training example multiplied by the label, y_i x_i. The SVM does this for all examples that have too small a margin, not only for misclassified ones.
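The contrast can be sketched as two single-example update functions. The SVM step below is a stochastic subgradient step on hinge loss plus (lam/2)·||w||², which is one common way to train a linear SVM and is assumed here rather than taken from the notes; `eta` and `lam` are hypothetical parameter names.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def perceptron_step(w, x_i, y_i):
    """Update only when the example is misclassified (margin <= 0)."""
    if y_i * dot(w, x_i) <= 0:
        w = [w_j + y_i * x_j for w_j, x_j in zip(w, x_i)]
    return w


def svm_sgd_step(w, x_i, y_i, eta=0.1, lam=0.01):
    """One stochastic subgradient step on hinge loss + (lam/2)*||w||^2."""
    margin = y_i * dot(w, x_i)  # computed with the weights before the update
    w = [(1 - eta * lam) * w_j for w_j in w]  # shrinkage from the regularizer
    if margin < 1:  # too small a margin, including misclassified examples
        w = [w_j + eta * y_i * x_j for w_j, x_j in zip(w, x_i)]
    return w


# Correctly classified example whose margin (0.5) is below 1.
w0 = [1.0, 0.0]
x, label = [0.5, 0.0], 1
w_perceptron = perceptron_step(w0, x, label)
w_svm = svm_sgd_step(w0, x, label)
```

The perceptron leaves the weights untouched because the example is on the correct side, while the SVM step still moves w towards y_i x_i because the margin is smaller than 1.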