ML Note 2 - Learning Theory


Bias / Variance Tradeoff

In the diagram below, we model the training set with 2, 3, and 7 parameters.

(figure: tradeoff — three fits of the same training set)

As we can see, $h_1$ underfits the training set. No matter how large the training set grows, the model cannot capture the structure of the data. This model is said to have high bias.

On the contrary, $h_6$ overfits the training set. The model is too sensitive to random factors that we do not want to include in our model. This model is said to have high variance.
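To make the tradeoff concrete, here is a minimal numpy sketch in the spirit of the figure above; the quadratic ground truth, noise level, and sample size are illustrative assumptions, not the original figure's data.

```python
import numpy as np

# Illustrative data: a noisy quadratic trend (assumed; not the figure's actual data)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(scale=0.1, size=x.shape)

# Fit models with 2, 3, and 7 parameters (polynomial degrees 1, 2, and 6)
for n_params in (2, 3, 7):
    coeffs = np.polyfit(x, y, deg=n_params - 1)
    train_mse = float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    print(f"{n_params} parameters: training MSE = {train_mse:.5f}")

# The 2-parameter fit misses the curvature (high bias); the 7-parameter fit
# chases the noise and drives training error toward zero (high variance).
```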

ERM

For simplicity, consider binary classification with

  • labels $y \in \{0, 1\}$
  • training set $S = \{(x^{(i)}, y^{(i)}) \mid i = 1, \dots, m\}$
  • a new sample $(x, y)$

As one of the PAC (probably approximately correct) assumptions, assume that there exists some distribution $D$ such that

$$(x, y), (x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)}) \overset{i.i.d.}{\sim} D$$

For a hypothesis $h$, define

$$Z = I\{h(x) \neq y\}, \qquad Z_i = I\{h(x^{(i)}) \neq y^{(i)}\}$$

Denote the training error, or empirical risk, as

$$\hat\epsilon(h) = \frac{1}{m} \sum_{i=1}^m Z_i$$

and the generalization error as

$$\epsilon(h) = P(Z)$$

Then it is clear that

$$Z, Z_1, Z_2, \dots, Z_m \overset{i.i.d.}{\sim} \text{Bern}(\epsilon(h))$$

Think of the problem as picking $h$ from a hypothesis class, for instance

$$H = \{h_\theta \mid \theta \in \mathbb{R}^{n+1}\}$$

Our goal is empirical risk minimization:

$$\hat h = \arg\min_{h \in H} \hat\epsilon(h)$$

Define the theoretically best hypothesis in $H$ as

$$h^* = \arg\min_{h \in H} \epsilon(h)$$
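A tiny sketch of ERM over a finite hypothesis class may help fix ideas. The class of 1-D threshold classifiers, the true threshold, and the noise rate are all assumptions for illustration.

```python
import numpy as np

def empirical_risk(h, X, y):
    """epsilon-hat(h): average 0-1 loss on the training set."""
    return float(np.mean(h(X) != y))

# Assumed toy setup: a finite class of 1-D threshold classifiers
thresholds = np.linspace(0.0, 1.0, 11)
H = [lambda X, t=t: (X >= t).astype(int) for t in thresholds]

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=100)
y = (X >= 0.62).astype(int)                      # true concept (assumed)
y ^= (rng.uniform(size=100) < 0.05).astype(int)  # 5% label noise

# ERM: pick the hypothesis with the smallest empirical risk
risks = [empirical_risk(h, X, y) for h in H]
best = int(np.argmin(risks))
print(f"h-hat is the threshold at {thresholds[best]:.1f}, training error {risks[best]:.2f}")
```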

Uniform Convergence

Suppose $H = \{h_1, \dots, h_k\}$ is a finite set. For any $h_i \in H$, denote

$$A_i = I\{|\epsilon(h_i) - \hat\epsilon(h_i)| > \gamma\}$$

The Hoeffding inequality gives that

$$P(A_i) \le 2\exp(-2\gamma^2 m)$$

Using the union bound, we have that

$$P(\exists h_i \in H,\ |\epsilon(h_i) - \hat\epsilon(h_i)| > \gamma) = P\Big(\bigcup_{i=1}^k A_i\Big) \le \sum_{i=1}^k P(A_i) \le 2k\exp(-2\gamma^2 m)$$

Therefore

$$P(\forall h \in H,\ |\epsilon(h) - \hat\epsilon(h)| \le \gamma) \ge 1 - 2k\exp(-2\gamma^2 m)$$

This is the uniform convergence result: with high probability, the empirical risk is close to the generalization error simultaneously for every hypothesis in $H$.
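A quick Monte-Carlo check (with illustrative values of $m$, $\epsilon(h)$, and $\gamma$) shows the Hoeffding bound in action for a single hypothesis:

```python
import numpy as np

# Assumed values: true error eps, deviation gamma, training set size m
m, eps, gamma, trials = 100, 0.3, 0.1, 100_000
rng = np.random.default_rng(0)

# Z_i ~ Bern(eps): simulate `trials` training sets and measure how often
# the empirical risk deviates from eps by more than gamma
Z = rng.random((trials, m)) < eps
deviates = np.abs(Z.mean(axis=1) - eps) > gamma
print("empirical P(|eps-hat - eps| > gamma):", deviates.mean())
print("Hoeffding bound 2 exp(-2 gamma^2 m):  ", 2 * np.exp(-2 * gamma ** 2 * m))
```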

Error Bound

Given $m$ and $\delta > 0$, with probability at least $1 - \delta$ we have that

$$\forall h \in H,\ |\epsilon(h) - \hat\epsilon(h)| \le \sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$$

Since $\hat\epsilon(\hat h) \le \hat\epsilon(h^*)$ by the definition of $\hat h$, applying the bound above twice gives

$$\epsilon(\hat h) \le \hat\epsilon(\hat h) + \gamma \le \hat\epsilon(h^*) + \gamma \le \epsilon(h^*) + 2\gamma$$

where $\gamma = \sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$. That is,

$$\epsilon(\hat h) \le \Big(\min_{h \in H}\epsilon(h)\Big) + 2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$$

When we expand $H$ to some superset $H' \supset H$, the first term $\min_{h \in H}\epsilon(h)$ can only decrease, while the second term $\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$ can only increase (as $k$ grows). This loosely corresponds to the bias-variance tradeoff.

Sample Complexity Bound

Given $\gamma$ and $\delta > 0$, in order for

$$\epsilon(\hat h) \le \epsilon(h^*) + 2\gamma$$

to hold with probability at least $1 - \delta$, it suffices that

$$m \ge \frac{1}{2\gamma^2}\log\frac{2k}{\delta} = O\Big(\frac{1}{\gamma^2}\log\frac{k}{\delta}\Big)$$
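Plugging illustrative numbers into these two bounds (the values of $k$, $\delta$, $m$, and the target $\gamma$ below are assumptions):

```python
import numpy as np

# Assumed values for k = |H|, confidence delta, and sample size m
k, delta, m = 10_000, 0.05, 5_000

# Error bound: margin gamma guaranteed by m samples
gamma = np.sqrt(np.log(2 * k / delta) / (2 * m))
print(f"m = {m} gives |eps(h) - eps-hat(h)| <= {gamma:.4f} for all h, w.p. {1 - delta}")

# Sample complexity bound: m needed for a target margin
target_gamma = 0.05
m_needed = int(np.ceil(np.log(2 * k / delta) / (2 * target_gamma ** 2)))
print(f"gamma = {target_gamma} requires m >= {m_needed}")
```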

VC Dimension

Given a set $S = \{x^{(1)}, \dots, x^{(d)}\}$, we say that $H$ shatters $S$ if

$$\forall (y^{(1)}, \dots, y^{(d)}) \in \{0, 1\}^d,\ \exists h \in H\ \text{s.t.}\ \forall i \in \{1, \dots, d\},\ h(x^{(i)}) = y^{(i)}$$

Define the Vapnik-Chervonenkis dimension $VC(H)$ to be the size of the largest set that is shattered by $H$. It can be shown that if $H$ contains all linear classifiers in $n$ dimensions, then $VC(H) = n + 1$.
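Shattering can be checked by brute force for small classes. Here is a sketch using 1-D threshold classifiers $h_t(x) = I\{x \ge t\}$, an assumed toy class whose VC dimension is 1:

```python
from itertools import product

import numpy as np

def shatters(H, points):
    """True iff the class H realizes every labeling of `points`."""
    realized = {tuple(h(x) for x in points) for h in H}
    return all(lab in realized for lab in product((0, 1), repeat=len(points)))

# Assumed toy class: 1-D thresholds h_t(x) = 1{x >= t}
thresholds = np.linspace(-2.0, 2.0, 401)
H = [lambda x, t=t: int(x >= t) for t in thresholds]

print(shatters(H, [0.5]))       # True: one point is shattered, so VC(H) >= 1
print(shatters(H, [0.2, 0.8]))  # False: labeling (1, 0) is unrealizable, so VC(H) = 1
```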

For SVMs with kernels, the effective VC dimension is usually small. If

$$\exists R\ \text{s.t.}\ \forall i \in \{1, \dots, m\},\ \|x^{(i)}\| \le R$$

then the class of linear classifiers with margin at least $\gamma$ satisfies

$$VC(H) \le \Big\lceil\frac{R^2}{4\gamma^2}\Big\rceil + 1$$


Let $H$ be given and let $d = VC(H) < \infty$. With probability at least $1 - \delta$, we have that

$$\forall h \in H,\ |\hat\epsilon(h) - \epsilon(h)| \le O\Bigg(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\Bigg)$$

Thus

$$\epsilon(\hat h) \le \Big(\min_{h \in H}\epsilon(h)\Big) + O\Bigg(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\Bigg)$$

Moreover, for $\epsilon(\hat h) \le \epsilon(h^*) + 2\gamma$ to hold with probability at least $1 - \delta$, it suffices that $m = O_{\gamma, \delta}(d)$, i.e. the number of training examples needed is linear in the VC dimension of $H$.

Model Selection

Suppose we have a finite set of models $M = \{M_1, \dots, M_d\}$ that we are trying to select among. There are several techniques that deal with the bias-variance tradeoff automatically.

Cross Validation

In hold-out cross validation, or simple cross validation, we

$$\begin{aligned} & \text{randomly split } S \text{ into } S_{\text{train}} \text{ and } S_{\text{cv}} \text{ (say } 70\% \text{ to } 30\%)\\ & \text{for } i \text{ in } 1 \dots d\\ & \qquad \text{train } M_i \text{ on } S_{\text{train}}\\ & \qquad \text{test } M_i \text{ on } S_{\text{cv}} \text{ to get } \epsilon_i\\ & \text{choose } M = \arg\min_{M_i} \epsilon_i\\ & (\text{optional})\text{ retrain } M \text{ on } S \end{aligned}$$

The main disadvantage of this algorithm is that it wastes data. To hold out less data each time, we can use k-fold cross validation (a code sketch follows after the pseudocode).

$$\begin{aligned} & \text{randomly split } S \text{ into } k \text{ disjoint subsets } S_1, \dots, S_k\\ & \text{for } i \text{ in } 1 \dots d\\ & \qquad \text{for } j \text{ in } 1 \dots k\\ & \qquad\qquad \text{train } M_i \text{ on } S \setminus S_j\\ & \qquad\qquad \text{test } M_i \text{ on } S_j \text{ to get } \epsilon_{ij}\\ & \qquad \epsilon_i = \frac{1}{k}\sum_{j=1}^k \epsilon_{ij}\\ & \text{choose } M = M_{\arg\min_i \epsilon_i}\\ & (\text{optional})\text{ retrain } M \text{ on } S \end{aligned}$$

A typical choice for $k$ is $10$. For problems with really scarce data, we can use $k = m$, which leads to leave-one-out cross validation.
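A minimal numpy sketch of k-fold cross validation as in the pseudocode above; the two toy models and the data-generating process are assumptions for illustration.

```python
import numpy as np

def k_fold_cv(fit_fns, X, y, k=10, seed=0):
    """k-fold CV over model-fitting functions; mirrors the pseudocode above.

    Each element of fit_fns maps (X_train, y_train) to a predict function;
    eps[i] averages the held-out 0-1 loss of model i over the k folds.
    """
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    eps = np.zeros(len(fit_fns))
    for j in range(k):
        train = np.concatenate([f for l, f in enumerate(folds) if l != j])
        for i, fit in enumerate(fit_fns):
            predict = fit(X[train], y[train])
            eps[i] += np.mean(predict(X[folds[j]]) != y[folds[j]]) / k
    return int(np.argmin(eps)), eps

# Assumed toy usage: a majority-vote model vs. a 1-D threshold model
fit_fns = [
    lambda Xt, yt: (lambda X: np.full(len(X), np.bincount(yt).argmax())),
    lambda Xt, yt: (lambda X: (X >= Xt.mean()).astype(int)),
]
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=200)
y = (X >= 0.5).astype(int)
best, eps = k_fold_cv(fit_fns, X, y)
print("chosen model:", best, "cv errors:", eps)
```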

Feature Selection

Suppose you have a supervised learning problem where n n n is very large, but you suspect that only a small number of features are relevant to the task. In such a setting, you can apply some heuristic algorithms to rank every feature and choose the top- k k k.

Wrapper Model Feature Selection

In a forward search, we use $F$ to record the currently selected features:

$$\begin{aligned} & F := \emptyset\\ & \text{repeat } \{\\ & \qquad \text{for } i \text{ in } 1 \dots n\ \{\\ & \qquad\qquad \text{if } (i \in F)\ \text{continue}\\ & \qquad\qquad F_i := F \cup \{i\}\\ & \qquad \}\\ & \qquad \text{cross validate over } \{M_i \mid M_i \text{ depends only on } F_i\}\\ & \qquad F := \text{best } F_i\\ & \} \end{aligned}$$

Similarly, a backward search starts from the full feature set and removes one feature at a time:

$$\begin{aligned} & F := \{1, \dots, n\}\\ & \text{repeat } \{\\ & \qquad \text{for } i \text{ in } 1 \dots n\ \{\\ & \qquad\qquad \text{if } (i \notin F)\ \text{continue}\\ & \qquad\qquad F_i := F - \{i\}\\ & \qquad \}\\ & \qquad \text{cross validate over } \{M_i \mid M_i \text{ depends only on } F_i\}\\ & \qquad F := \text{best } F_i\\ & \} \end{aligned}$$

Wrapper feature selection algorithms often work quite well, but can be computationally expensive; a forward-search sketch follows below.
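A sketch of forward search, assuming the caller supplies an `evaluate(F)` function that returns the cross-validation error for feature subset $F$. The stopping rule (halt when no added feature improves the error) is one reasonable choice, since the pseudocode leaves termination unspecified.

```python
def forward_search(n_features, evaluate, max_size=None):
    """Greedy forward search over feature subsets, as in the pseudocode above.

    `evaluate(F)` is assumed to return the cross-validation error of the model
    restricted to the feature subset F. We stop when no single added feature
    improves the error.
    """
    F, best_err = set(), evaluate(set())
    max_size = n_features if max_size is None else max_size
    while len(F) < max_size:
        # Try every feature not yet in F; keep the one with the lowest CV error
        err, i = min((evaluate(F | {i}), i) for i in range(n_features) if i not in F)
        if err >= best_err:
            break
        F, best_err = F | {i}, err
    return F, best_err

# Assumed toy evaluate: features 0 and 2 are the truly relevant ones
evaluate = lambda F: 1.0 - 0.4 * (0 in F) - 0.4 * (2 in F) + 0.01 * len(F)
print(forward_search(5, evaluate))  # -> ({0, 2}, ~0.22)
```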

Filter Feature Selection

Compute some simple score $S(i)$ to measure how informative each feature $x_i$ is. Then simply pick the $k$ features with the largest scores $S(i)$.

It is common to choose the mutual information as $S$:

$$S(i) = MI(x_i, y) = \sum_{x_i} \sum_{y} p(x_i, y) \log\frac{p(x_i, y)}{p(x_i)p(y)}$$

Note that

$$MI(x_i, y) = KL\big(p(x_i, y)\ \|\ p(x_i)p(y)\big)$$

so the score is large exactly when $x_i$ and $y$ are far from independent.
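A sketch of the filter approach: estimate $MI(x_i, y)$ from empirical counts and rank features by it. The synthetic features below (copies of the label flipped with increasing noise) are an assumption for illustration.

```python
import numpy as np

def mi_score(x, y):
    """Empirical mutual information MI(x_i, y) for discrete x and y."""
    xv, xi = np.unique(x, return_inverse=True)
    yv, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xv), len(yv)))
    np.add.at(joint, (xi, yi), 1.0)          # empirical joint counts
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0                          # 0 log 0 = 0 by convention
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

# Assumed synthetic features: y with label-flip noise 0.1, 0.3, 0.5
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
X = np.stack([y ^ (rng.random(1000) < p) for p in (0.1, 0.3, 0.5)], axis=1)
scores = [mi_score(X[:, i], y) for i in range(X.shape[1])]
print(np.argsort(scores)[::-1])  # least noisy feature ranks first: [0 1 2]
```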

Online Learning

So far, we have been considering batch learning, in which we are first given a training set to learn from. Online learning, on the contrary, requires the algorithm to make predictions continuously, even while it is learning.

In this setting, the algorithm is first given $x^{(i)}$ and asked to predict $h(x^{(i)})$. Then the true $y^{(i)}$ is revealed to the model, and the process repeats. The total online error is the total number of errors made by the algorithm during this process.

Assume that the class labels $y \in \{-1, 1\}$. In the perceptron algorithm with parameters $\theta \in \mathbb{R}^{n+1}$,

$$h(x) = g(\theta^T x)$$

where

$$g(z) = \left\{\begin{array}{ll} 1 & \text{if } z \ge 0\\ -1 & \text{otherwise} \end{array}\right.$$

Given a training example $(x, y)$, the parameters update only if $h(x) \neq y$:

$$\theta := \theta + yx$$

Suppose that

$$\begin{aligned} &\exists D\ \text{s.t.}\ \forall i,\ \|x^{(i)}\| \le D\\ &\exists u\ (\|u\| = 1)\ \text{s.t.}\ \forall i,\ y^{(i)} \cdot (u^T x^{(i)}) \ge \gamma \end{aligned}$$

Block and Novikoff showed that the total number of mistakes the perceptron algorithm makes is at most $(D/\gamma)^2$.
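A minimal implementation of the online perceptron with a mistake counter; the separable data stream, margin $0.1$, and direction $u$ below are illustrative assumptions (chosen so the Block-Novikoff bound applies).

```python
import numpy as np

def perceptron_online(stream, n):
    """Online perceptron over a stream of (x, y) with y in {-1, +1}.

    x is assumed to already include the intercept coordinate, so theta
    lives in R^{n+1} as in the text. Returns theta and the total online error.
    """
    theta, mistakes = np.zeros(n + 1), 0
    for x, y in stream:
        g = 1 if theta @ x >= 0 else -1  # h(x) = g(theta^T x)
        if g != y:                       # update only when a mistake is made
            theta += y * x
            mistakes += 1
    return theta, mistakes

# Assumed separable stream: margin 0.1 along u, points in the unit square
rng = np.random.default_rng(0)
u = np.array([0.6, 0.8])  # ||u|| = 1
stream = [
    (np.append(x, 1.0), 1 if u @ x > 0 else -1)
    for x in rng.uniform(-1.0, 1.0, size=(500, 2))
    if abs(u @ x) >= 0.1
]
theta, mistakes = perceptron_online(stream, n=2)
print("total online error:", mistakes)  # bounded by (D / gamma)^2
```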
