Chapter 1 Introduction
Different theoretical paradigms for machine learning:
- Statistical learning theory: assumes the data is drawn from an unknown distribution P(x); does not estimate the distribution
- Bayesian statistics: assumes prior information on P(x); estimates posterior probabilities
- Information-theoretic learning: estimates distributions, but does not assume a prior on P(x)
Two different types of machine learning:
- Supervised machine learning
• training data contains inputs and outputs (= supervision)
• goal is to predict outputs for new inputs
• typical tasks: classification, regression, ranking
- Unsupervised machine learning
• training data does not contain outputs
• goal is to describe and interpret the data
• typical tasks: clustering, association analysis, dimensionality reduction, pattern discovery
Regression deals with output variables which are numeric
Model or hypothesis: a function h : X → Y that we use to predict outputs given the inputs x
Empirical risk: approximates the error of the model by averaging the losses on the individual training instances
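The empirical risk described above can be sketched in a few lines. This is a minimal illustration, assuming the zero-one loss and a hypothetical one-dimensional toy sample; the names `zero_one_loss`, `empirical_risk`, `xs`, `ys` and `h` are not from the notes.

```python
def zero_one_loss(y_true, y_pred):
    """Loss on one instance: 1 for a mistake, 0 for a correct prediction."""
    return 0 if y_true == y_pred else 1


def empirical_risk(h, inputs, outputs, loss=zero_one_loss):
    """Average loss of hypothesis h over the training sample."""
    losses = [loss(y, h(x)) for x, y in zip(inputs, outputs)]
    return sum(losses) / len(losses)


# Hypothetical toy data and hypothesis, for illustration only.
xs = [-2.0, -1.0, 0.5, 3.0]
ys = [-1, 1, 1, 1]
h = lambda x: 1 if x > 0 else -1

risk = empirical_risk(h, xs, ys)  # one mistake out of four examples
```

Any other per-instance loss can be passed through the `loss` argument without changing the averaging step.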
Hamming loss is used in multilabel learning:
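The formula is not reproduced in these notes. As a hedged sketch, assuming the usual definition (the number of labels on which the predicted label vector disagrees with the true one, optionally divided by the number of labels), it could be computed as follows. The function names are illustrative.

```python
def hamming_loss(y_true, y_pred):
    """Number of label positions where the two label vectors disagree."""
    if len(y_true) != len(y_pred):
        raise ValueError("label vectors must have the same length")
    return sum(1 for a, b in zip(y_true, y_pred) if a != b)


def normalized_hamming_loss(y_true, y_pred):
    """Variant that divides by the number of labels, giving a value in [0, 1]."""
    return hamming_loss(y_true, y_pred) / len(y_true)


# Hypothetical multilabel example with four binary labels.
true_labels = [1, 0, 1, 1]
predicted_labels = [1, 1, 0, 1]
count = hamming_loss(true_labels, predicted_labels)
fraction = normalized_hamming_loss(true_labels, predicted_labels)
```

Which of the two variants the course uses is an assumption here; they differ only by a constant factor.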
Generalization error or the true risk (assuming future examples are independently drawn from the same distribution D that generated the training examples):
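The formula itself is missing from these notes. The standard definition, given here as an assumption about what was intended, is the expected loss of h on a fresh example drawn from D, where L denotes the loss function:

```latex
R(h) = \mathbb{E}_{(x, y) \sim D}\left[ L\bigl(h(x), y\bigr) \right]
```

The empirical risk is the sample average that approximates this expectation.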
Different hypothesis classes or model families in machine learning:
- Linear models such as logistic regression and perceptron
- Neural networks: compute non-linear input-output mappings through a network of simple computation units
- Kernel methods: implicitly compute non-linear mappings into high-dimensional feature spaces (e.g. SVMs)
- Ensemble methods: combine simpler models into powerful combined models (e.g. Random Forests)
Consistent hypothesis: a hypothesis that correctly classifies all training examples
Version space: the set of all consistent hypotheses of the hypothesis class
Margin is the minimum distance between the decision boundary and a training point
Confusion matrix
Receiver Operating Characteristic (ROC)
ROC curves summarize the trade-off between the true positive rate and the false positive rate of a predictive model across different probability thresholds. Consider a system which returns an estimate of the class probability P̂(y|x) or any score that correlates with it.
The higher the ROC curve goes (the closer it gets to the top-left corner), the better the algorithm or model
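A minimal sketch of how the points of a ROC curve could be computed from scores: for each threshold, the true positives and false positives are the corresponding confusion-matrix counts. The function names, the toy scores and the use of the area under the curve as a one-number summary are illustrative assumptions, not taken from the notes.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    labels are 0/1; an example is predicted positive when score >= threshold.
    Both classes must be present in labels.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical scores for four examples.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
```

On this toy data three of the four positive-negative pairs are ranked correctly, which matches the computed area of 0.75.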
Chapter 2 Statistical learning theory
Probably Approximately Correct (PAC) learning framework
Probably approximately correct learning is a theoretical framework
for analysing the generalization performance of machine learning
algorithms.
PAC theory is concerned with upper bounding the probability δ of
"bad" events, those of high generalization error (greater than ε)
An equivalent generalization error bound for a finite hypothesis class - consistent case
Here |H| is the number of different hypotheses in the class H. The more hypotheses there are in H, the more training examples are needed.
Generalization error bound for finite hypothesis class - inconsistent case
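The bounds themselves are not reproduced in these notes. The sketch below assumes the standard textbook forms for a finite hypothesis class with m training examples and confidence 1 − δ: R(h) ≤ (ln|H| + ln(1/δ)) / m in the consistent case, and R(h) ≤ R̂(h) + sqrt((ln|H| + ln(2/δ)) / (2m)) in the inconsistent case. The function names are illustrative.

```python
import math


def bound_consistent(h_size, m, delta):
    """Upper bound on R(h) for a consistent h from a finite class (assumed form)."""
    return (math.log(h_size) + math.log(1 / delta)) / m


def sample_size_consistent(h_size, epsilon, delta):
    """Number of examples that makes the consistent-case bound at most epsilon."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)


def bound_inconsistent(empirical_risk, h_size, m, delta):
    """Upper bound on R(h) when h makes training errors (assumed form)."""
    slack = math.sqrt((math.log(h_size) + math.log(2 / delta)) / (2 * m))
    return empirical_risk + slack


# Example: |H| = 1000 hypotheses, error at most 0.1 with probability 0.95.
needed = sample_size_consistent(1000, epsilon=0.1, delta=0.05)
```

Under these forms the required sample size grows only logarithmically with |H|, and the inconsistent-case slack shrinks at the slower rate 1/sqrt(m).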
Chapter 3 Learning with infinite hypothesis classes
VCdim(H) = size m of the largest training set for which we can find a
consistent classifier for all labelings in Y^m (that is, the largest set that H shatters)
A finite hypothesis class has VC dimension VCdim(H) ≤ log_2 |H|
Convex polygons have VC dimension = ∞
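The definition of the VC dimension can be checked by brute force on a small class. The sketch below uses closed intervals [a, b] on the real line, a class chosen here purely for illustration (it is not discussed in these notes): a point is labeled 1 if it lies inside the interval and 0 otherwise.

```python
def interval_labelings(points):
    """All labelings of distinct points that some interval [a, b] can produce."""
    labelings = {tuple(0 for _ in points)}  # an interval containing no point
    for a in points:
        for b in points:
            if a <= b:
                labelings.add(tuple(1 if a <= x <= b else 0 for x in points))
    return labelings


def shatters(points):
    """True if intervals realize every one of the 2^m labelings of the points."""
    return len(interval_labelings(points)) == 2 ** len(points)


two_points_shattered = shatters([1.0, 2.0])
three_points_shattered = shatters([1.0, 2.0, 3.0])
```

Two points can be shattered, but for three points the labeling (1, 0, 1) is never realized, which is consistent with intervals having VC dimension 2.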
Generalization bound based on the VC-dimension
Rademacher complexity
Rademacher complexity defines complexity as the capacity of
a hypothesis class to fit random noise
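The idea of fitting random noise can be made concrete with a Monte Carlo estimate. The sketch below assumes the usual definition of the empirical Rademacher complexity, the expectation over random labels σ_i ∈ {−1, +1} of max_h (1/m) Σ_i σ_i h(x_i), and a finite hypothesis class so that the maximum can be taken by enumeration. The three toy classes and all names are illustrative.

```python
import itertools
import random


def empirical_rademacher(hypotheses, sample, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ max_h (1/m) * sum_i sigma_i * h(x_i) ]."""
    rng = random.Random(seed)
    m = len(sample)
    predictions = [[h(x) for x in sample] for h in hypotheses]
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]  # random noise labels
        total += max(
            sum(s * p for s, p in zip(sigma, preds)) / m for preds in predictions
        )
    return total / n_draws


sample = [0.0, 1.0, 2.0, 3.0]
# A single constant hypothesis: cannot adapt to the noise at all.
constant = [lambda x: 1]
# Threshold classifiers: can fit some noise patterns but not all.
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in (-0.5, 0.5, 1.5, 2.5, 3.5)]
# Every possible labeling of the sample: fits any noise perfectly.
all_labelings = [
    lambda x, table=dict(zip(sample, labels)): table[x]
    for labels in itertools.product((-1, 1), repeat=len(sample))
]

r_constant = empirical_rademacher(constant, sample)
r_thresholds = empirical_rademacher(thresholds, sample)
r_all = empirical_rademacher(all_labelings, sample)
```

The estimate is close to 0 for the single hypothesis, equals 1 for the class that realizes every labeling, and lies in between for the thresholds.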
Generalization bound with Rademacher complexity
The differences between Rademacher complexity and VC dimension
- VC dimension is independent of any training sample or distribution generating the data: it measures the worst case, where the data is generated in a bad way for the learner
- Rademacher complexity depends on the training sample and thus on the data-generating distribution
- VC dimension focuses on the extreme case of realizing all labelings of the data
- Rademacher complexity measures smoothly the ability to realize random labelings
- Generalization bounds based on Rademacher complexity are applicable to any binary classifier (SVM, neural network, decision tree)
- Rademacher complexity motivates state-of-the-art learning algorithms such as support vector machines. But computing it might be hard if we need to train a large number of classifiers
- Vapnik-Chervonenkis dimension (VCdim) is an alternative that is usually easier to derive analytically
In a nutshell
- Vapnik-Chervonenkis dimension lets us study the learnability of infinite hypothesis classes through the concept of shattering
- Rademacher complexity is a practical alternative to VC dimension, typically giving sharper bounds (but requires a lot of simulations to be run)
Chapter 4 Model selection
Stochastic scenario
The analysis so far assumed that the labels are deterministic functions of the input. The stochastic scenario relaxes this assumption by assuming that the output is a probabilistic function of the input. The input and output are generated by a joint probability distribution D over X × Y. This setup covers cases where the same input x can have different labels y.
Bayes error
In the stochastic scenario, there is a minimal, generally non-zero error that any hypothesis must incur, called the Bayes error R*. A hypothesis h with R(h) = R* is called a Bayes classifier.
We have a trade-off. Increasing the complexity of the hypothesis class:
- decreases the approximation error as the class is more likely to contain a hypothesis with error close to the Bayes error
- increases the estimation error as finding a good hypothesis becomes harder and the generalization bounds become looser
Structural risk minimization
SRM aims to minimize the excess risk R(h) − R(h_Bayes) by bounding R(h)
The bound takes both the empirical error and the complexity of the hypothesis class into account (through a penalty term)
The model selection task is to select the optimal index k* and the hypothesis h ∈ H_{k*} that gives the best generalization bound
Generalization bound for SRM:
SRM model selection: pros and cons
Structural risk minimization benefits from strong learning guarantees. However, the assumption of a countable decomposition of the hypothesis class is a restrictive one. The computational price to
pay is large, especially when a large number of hypothesis classes H_k has to be processed.
Regularization-based algorithms
• The larger the training set, the better the generalization error will be (e.g. by PAC theory)
• The larger the validation set, the less variance there is in the test error estimate.
• When the dataset is small, the training set is generally taken to be as large as possible, typically 90% or more of the total
• When the dataset is large, the training set is often taken to be as big as the computational resources allow
Stratification
Stratification is a process that tries to ensure similar class distributions across the different sets
A simple stratification approach is to divide each class separately into training and validation sets, and then merge the class-specific training sets into a global training set and the class-specific validation sets into a global validation set.
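The per-class split and merge described above can be sketched as follows. This is a minimal illustration that works on example indices; the function name `stratified_split`, the `val_fraction` parameter and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split example indices so each class keeps roughly the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random split within the class
        n_val = round(len(indices) * val_fraction)
        val.extend(indices[:n_val])                # class-specific validation part
        train.extend(indices[n_val:])              # class-specific training part
    return sorted(train), sorted(val)


# Hypothetical imbalanced dataset: 80 examples of class "a", 20 of class "b".
labels = ["a"] * 80 + ["b"] * 20
train_idx, val_idx = stratified_split(labels, val_fraction=0.25)
val_b = sum(1 for i in val_idx if labels[i] == "b")
```

Both sets keep the original 80/20 class ratio, which a plain random split does not guarantee for small datasets.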
Cross-validation
n-fold cross-validation gives us a well-founded way to do
model selection. However, using only a single test set may result in unwanted variation
Nested cross-validation solves this problem by using two cross-validation loops
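The two loops can be sketched as follows: the inner loop selects a hyperparameter using only the outer training part, and the outer loop estimates the test error of the whole selection procedure. The model (a one-dimensional k-nearest-neighbour classifier), the synthetic data and all names are assumptions chosen to keep the example self-contained.

```python
import random


def k_folds(indices, n_folds):
    """Split a list of indices into n_folds nearly equal parts."""
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train, x, k):
    """Majority vote among the k nearest training points; labels are -1 or +1."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return 1 if sum(y for _, y in nearest) >= 0 else -1


def error_rate(train, test, k):
    return sum(1 for x, y in test if knn_predict(train, x, k) != y) / len(test)


def cv_error(data, k, n_folds):
    """Plain n-fold cross-validation error of k-NN on data."""
    errors = []
    for held_out in k_folds(list(range(len(data))), n_folds):
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        test = [data[i] for i in held_out]
        errors.append(error_rate(train, test, k))
    return sum(errors) / len(errors)


def nested_cv(data, candidate_ks, n_outer=5, n_inner=4):
    outer_errors = []
    for held_out in k_folds(list(range(len(data))), n_outer):
        held = set(held_out)
        outer_train = [data[i] for i in range(len(data)) if i not in held]
        outer_test = [data[i] for i in held_out]
        # Inner loop: model selection sees the outer training part only.
        best_k = min(candidate_ks, key=lambda k: cv_error(outer_train, k, n_inner))
        # Outer loop: evaluate the selected model on data it has never seen.
        outer_errors.append(error_rate(outer_train, outer_test, best_k))
    return sum(outer_errors) / len(outer_errors)


# Synthetic data: label is the sign of x, with 10% of the labels flipped.
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(-1, 1)
    y = 1 if x > 0 else -1
    if rng.random() < 0.1:
        y = -y
    data.append((x, y))

estimate = nested_cv(data, candidate_ks=[1, 5, 15])
```

Because each outer test fold is never used for choosing k, the averaged outer error is not biased by the model selection step.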
Chapter 5 Linear classification
Linear classifiers
Perceptron
• At each step, it finds a training example x_i that is incorrectly classified by the current model
• It updates the model by adding the example, multiplied by its label, to the current weight vector: w^(t+1) ← w^(t) + y_i x_i
• This process is continued until no incorrectly predicted training examples remain
The perceptron algorithm can be shown to eventually converge to a consistent hyperplane if the two classes are linearly separable, that is, if there exists a hyperplane that separates the two classes
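The update rule above can be sketched as a short training loop. This version cycles through the examples and updates on every mistake, stopping after a full pass without mistakes; the `max_epochs` cap, the constant first feature acting as a bias term, and the toy data are assumptions added for the example.

```python
def perceptron(examples, max_epochs=1000):
    """examples: list of (x, y) with x a list of floats and y in {-1, +1}."""
    w = [0.0] * len(examples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            score = sum(wj * xj for wj, xj in zip(w, x))
            if y * score <= 0:  # misclassified, or exactly on the boundary
                w = [wj + y * xj for wj, xj in zip(w, x)]  # w <- w + y_i x_i
                mistakes += 1
        if mistakes == 0:  # consistent hyperplane found
            return w
    return w  # not guaranteed consistent if the classes are not separable


# Hypothetical linearly separable data; the first feature is a constant 1 (bias).
examples = [
    ([1.0, 2.0, 1.0], 1),
    ([1.0, 1.5, 2.0], 1),
    ([1.0, -1.0, -1.5], -1),
    ([1.0, -2.0, -0.5], -1),
]
w = perceptron(examples)
```

On this toy data a single update is enough, and the returned weight vector classifies all four examples correctly.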
The main source of difficulty in minimizing the zero-one loss directly is its "step function" shape. It is non-differentiable, so it cannot be optimized using gradient-based approaches. It is non-convex, so an optimizer is susceptible to getting stuck in local minima.
Surrogate loss functions for classification
There are multiple surrogate losses that are convex upper bounds to the zero-one loss and are differentiable (almost everywhere, in the case of the hinge loss):
• Squared loss - used for regression, not optimal for classification
• Hinge loss - used in Support vector machines (Lecture 6)
• Exponential loss - used in Boosting
• Logistic loss - used in Logistic regression
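The four surrogate losses can be written as functions of the margin z = y f(x) with y ∈ {−1, +1}. The formulas below are the usual textbook forms and are an assumption about what the course uses; in particular the logistic loss is taken with a base-2 logarithm so that it upper bounds the zero-one loss, and a margin of exactly 0 is counted as an error.

```python
import math


def zero_one(z):
    """Zero-one loss as a function of the margin z = y * f(x)."""
    return 1.0 if z <= 0 else 0.0


def squared(z):
    """Squared loss (y - f(x))^2, which equals (1 - z)^2 for y in {-1, +1}."""
    return (1 - z) ** 2


def hinge(z):
    return max(0.0, 1 - z)


def exponential(z):
    return math.exp(-z)


def logistic(z):
    return math.log2(1 + math.exp(-z))


margins = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
surrogates = [squared, hinge, exponential, logistic]
upper_bounds_hold = all(
    loss(z) >= zero_one(z) for loss in surrogates for z in margins
)
```

Note that the squared loss also penalizes confidently correct predictions (z > 1), which is one reason it is not ideal for classification.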
Logistic regression
logistic loss:
Logistic regression optimization problem
We will use stochastic gradient descent to incrementally step towards the direction where the objective decreases fastest, the negative gradient
• The stepsize parameter η, also called the learning rate, is critical for convergence to the optimum value
• If one uses a small constant stepsize, the initial convergence may be unnecessarily slow
• Too large a stepsize may cause the method to continually overshoot the optimum.
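A minimal sketch of stochastic gradient descent for logistic regression, assuming labels y ∈ {−1, +1}, the unregularized logistic loss log(1 + exp(−y w·x)) with natural logarithm, and a constant stepsize η. The exact objective of the course is not reproduced in these notes, so these choices, the function names and the toy data are assumptions.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1 / (1 + math.exp(-t))
    e = math.exp(t)
    return e / (1 + e)


def average_logistic_loss(w, examples):
    total = 0.0
    for x, y in examples:
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        # log(1 + exp(-margin)) computed without overflow
        total += max(0.0, -margin) + math.log1p(math.exp(-abs(margin)))
    return total / len(examples)


def sgd_logistic(examples, eta=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    order = list(range(len(examples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in random order
        for i in order:
            x, y = examples[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # Gradient of the loss on this example is -sigmoid(-margin) * y * x,
            # so a step along the negative gradient adds eta * sigmoid(-margin) * y * x.
            step = eta * sigmoid(-margin) * y
            w = [wj + step * xj for wj, xj in zip(w, x)]
    return w


# Hypothetical separable data; the first feature is a constant 1 (bias).
examples = [
    ([1.0, 2.0, 1.0], 1),
    ([1.0, 1.5, 2.0], 1),
    ([1.0, -1.0, -1.5], -1),
    ([1.0, -2.0, -0.5], -1),
]
w = sgd_logistic(examples)
final_loss = average_logistic_loss(w, examples)
```

Unlike the perceptron, every example contributes to the update, weighted by sigmoid(−margin): confidently correct examples move w very little.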
Logistic regression is a classification method that can be interpreted as maximizing odds ratios of conditional class probabilities
Chapter 6 Support vector machines
Regularized learning problem
• First term minimizes a loss function on training data
• Second term, called the regularizer, controls the complexity of the model
• The parameter λ = 1/C controls the balance between the two terms
Difference between SVM and perceptron update
Both share the idea of adding to w the training example multiplied by its label, y_i x_i. The SVM does this for all examples that have too small a margin, not only for the misclassified ones.
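The difference can be seen by placing the two updates side by side. The SVM step below is a stochastic subgradient step on (λ/2)‖w‖² + max(0, 1 − y w·x), which is an assumed form of the regularized hinge-loss objective (the notes do not reproduce it); the stepsize `eta`, the value of `lam` and the toy example are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def perceptron_step(w, x, y):
    """Update only when the example is misclassified."""
    if y * dot(w, x) <= 0:
        return [wj + y * xj for wj, xj in zip(w, x)]
    return list(w)


def svm_sgd_step(w, x, y, eta=0.1, lam=0.01):
    """Subgradient step on lam/2 * ||w||^2 + max(0, 1 - y * w.x)."""
    shrunk = [(1 - eta * lam) * wj for wj in w]  # effect of the regularizer
    if y * dot(w, x) < 1:                        # margin too small
        return [sj + eta * y * xj for sj, xj in zip(shrunk, x)]
    return shrunk


# Correctly classified example whose margin (0.5) is below 1.
w = [1.0, 0.0]
x, y = [0.5, 0.0], 1
after_perceptron = perceptron_step(w, x, y)
after_svm = svm_sgd_step(w, x, y)
```

The perceptron leaves w unchanged on this example, while the SVM step still moves w towards y x because the margin is smaller than 1.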