ML Note 2 - Learning Theory


Bias / Variance Tradeoff

In the diagram below, we model the training set with 2, 3, and 7 parameters.

(figure: tradeoff — three fits of the same training set)

As we can see, $h_1$ underfits the training set. No matter how large the training set grows, the model cannot capture the structure of the data. This model is said to have high bias.

On the contrary, $h_6$ overfits the training set. The model is too sensitive to random factors that we do not want to include in our model. This model is said to have high variance.
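To make the tradeoff concrete, here is a minimal numpy sketch in the spirit of the figure above; the quadratic ground truth, noise level, and sample size are illustrative assumptions, not the original figure's data.

```python
import numpy as np

# Illustrative data: a noisy quadratic trend (assumed; not the figure's actual data)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(scale=0.1, size=x.shape)

# Fit models with 2, 3, and 7 parameters (polynomial degrees 1, 2, and 6)
for n_params in (2, 3, 7):
    coeffs = np.polyfit(x, y, deg=n_params - 1)
    train_mse = float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    print(f"{n_params} parameters: training MSE = {train_mse:.5f}")

# The 2-parameter fit misses the curvature (high bias); the 7-parameter fit
# chases the noise and drives training error toward zero (high variance).
```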

ERM

For simplicity, consider binary classification with

  • labels $y \in \{0, 1\}$
  • training set $S = \{(x^{(i)}, y^{(i)}) \mid i = 1, \dots, m\}$
  • a new sample $(x, y)$

As one of the PAC (probably approximately correct) assumptions, assume that there exists some distribution $D$ such that

$$(x, y), (x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)}) \overset{i.i.d.}{\sim} D$$

For a hypothesis $h$, define

$$Z = I\{h(x) \neq y\}, \qquad Z_i = I\{h(x^{(i)}) \neq y^{(i)}\}$$

Denote the training error, or empirical risk, as

$$\hat\epsilon(h) = \frac{1}{m} \sum_{i=1}^m Z_i$$

and the generalization error as

$$\epsilon(h) = P(Z)$$

Then it is clear that

$$Z, Z_1, Z_2, \dots, Z_m \overset{i.i.d.}{\sim} \text{Bern}(\epsilon(h))$$

Think of the problem as picking $h$ from a hypothesis class, for instance

$$H = \{h_\theta \mid \theta \in \mathbb{R}^{n+1}\}$$

Our goal is empirical risk minimization:

$$\hat h = \arg\min_{h \in H} \hat\epsilon(h)$$

Define the theoretically best hypothesis in $H$ as

$$h^* = \arg\min_{h \in H} \epsilon(h)$$
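A tiny sketch of ERM over a finite hypothesis class may help fix ideas. The class of 1-D threshold classifiers, the true threshold, and the noise rate are all assumptions for illustration.

```python
import numpy as np

def empirical_risk(h, X, y):
    """epsilon-hat(h): average 0-1 loss on the training set."""
    return float(np.mean(h(X) != y))

# Assumed toy setup: a finite class of 1-D threshold classifiers
thresholds = np.linspace(0.0, 1.0, 11)
H = [lambda X, t=t: (X >= t).astype(int) for t in thresholds]

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=100)
y = (X >= 0.62).astype(int)                      # true concept (assumed)
y ^= (rng.uniform(size=100) < 0.05).astype(int)  # 5% label noise

# ERM: pick the hypothesis with the smallest empirical risk
risks = [empirical_risk(h, X, y) for h in H]
best = int(np.argmin(risks))
print(f"h-hat is the threshold at {thresholds[best]:.1f}, training error {risks[best]:.2f}")
```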

Uniform Convergence

Suppose $H = \{h_1, \dots, h_k\}$ is a finite set. For any $h_i \in H$, denote

$$A_i = I\{|\epsilon(h_i) - \hat\epsilon(h_i)| > \gamma\}$$

The Hoeffding inequality gives that

$$P(A_i) \le 2\exp(-2\gamma^2 m)$$

Using the union bound, we have that

$$P(\exists h_i \in H,\ |\epsilon(h_i) - \hat\epsilon(h_i)| > \gamma) = P\Big(\bigcup_{i=1}^k A_i\Big) \le \sum_{i=1}^k P(A_i) \le 2k\exp(-2\gamma^2 m)$$

Therefore

$$P(\forall h \in H,\ |\epsilon(h) - \hat\epsilon(h)| \le \gamma) \ge 1 - 2k\exp(-2\gamma^2 m)$$

This is the uniform convergence result: with high probability, the empirical risk is close to the generalization error simultaneously for every hypothesis in $H$.
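A quick Monte-Carlo check (with illustrative values of $m$, $\epsilon(h)$, and $\gamma$) shows the Hoeffding bound in action for a single hypothesis:

```python
import numpy as np

# Assumed values: true error eps, deviation gamma, training set size m
m, eps, gamma, trials = 100, 0.3, 0.1, 100_000
rng = np.random.default_rng(0)

# Z_i ~ Bern(eps): simulate `trials` training sets and measure how often
# the empirical risk deviates from eps by more than gamma
Z = rng.random((trials, m)) < eps
deviates = np.abs(Z.mean(axis=1) - eps) > gamma
print("empirical P(|eps-hat - eps| > gamma):", deviates.mean())
print("Hoeffding bound 2 exp(-2 gamma^2 m):  ", 2 * np.exp(-2 * gamma ** 2 * m))
```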

Error Bound

Given $m$ and $\delta > 0$, with probability at least $1 - \delta$ we have that

$$\forall h \in H,\ |\epsilon(h) - \hat\epsilon(h)| \le \sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$$

Since $\hat\epsilon(\hat h) \le \hat\epsilon(h^*)$ by the definition of $\hat h$, applying the bound above twice gives

$$\epsilon(\hat h) \le \hat\epsilon(\hat h) + \gamma \le \hat\epsilon(h^*) + \gamma \le \epsilon(h^*) + 2\gamma$$

where $\gamma = \sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$. That is,

$$\epsilon(\hat h) \le \Big(\min_{h \in H}\epsilon(h)\Big) + 2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$$

When we expand $H$ to some superset $H' \supset H$, the first term $\min_{h \in H}\epsilon(h)$ can only decrease, while the second term $\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$ can only increase (as $k$ grows). This loosely corresponds to the bias-variance tradeoff.

Sample Complexity Bound

Given $\gamma$ and $\delta > 0$, in order for

$$\epsilon(\hat h) \le \epsilon(h^*) + 2\gamma$$

to hold with probability at least $1 - \delta$, it suffices that

$$m \ge \frac{1}{2\gamma^2}\log\frac{2k}{\delta} = O\Big(\frac{1}{\gamma^2}\log\frac{k}{\delta}\Big)$$
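Plugging illustrative numbers into these two bounds (the values of $k$, $\delta$, $m$, and the target $\gamma$ below are assumptions):

```python
import numpy as np

# Assumed values for k = |H|, confidence delta, and sample size m
k, delta, m = 10_000, 0.05, 5_000

# Error bound: margin gamma guaranteed by m samples
gamma = np.sqrt(np.log(2 * k / delta) / (2 * m))
print(f"m = {m} gives |eps(h) - eps-hat(h)| <= {gamma:.4f} for all h, w.p. {1 - delta}")

# Sample complexity bound: m needed for a target margin
target_gamma = 0.05
m_needed = int(np.ceil(np.log(2 * k / delta) / (2 * target_gamma ** 2)))
print(f"gamma = {target_gamma} requires m >= {m_needed}")
```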

VC Dimension

Given a set $S = \{x^{(1)}, \dots, x^{(d)}\}$, we say that $H$ shatters $S$ if

$$\forall (y^{(1)}, \dots, y^{(d)}) \in \{0, 1\}^d,\ \exists h \in H\ \text{s.t.}\ \forall i \in \{1, \dots, d\},\ h(x^{(i)}) = y^{(i)}$$

Define the Vapnik-Chervonenkis dimension $VC(H)$ to be the size of the largest set that is shattered by $H$. It can be shown that if $H$ contains all linear classifiers in $n$ dimensions, then $VC(H) = n + 1$.
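Shattering can be checked by brute force for small classes. Here is a sketch using 1-D threshold classifiers $h_t(x) = I\{x \ge t\}$, an assumed toy class whose VC dimension is 1:

```python
from itertools import product

import numpy as np

def shatters(H, points):
    """True iff the class H realizes every labeling of `points`."""
    realized = {tuple(h(x) for x in points) for h in H}
    return all(lab in realized for lab in product((0, 1), repeat=len(points)))

# Assumed toy class: 1-D thresholds h_t(x) = 1{x >= t}
thresholds = np.linspace(-2.0, 2.0, 401)
H = [lambda x, t=t: int(x >= t) for t in thresholds]

print(shatters(H, [0.5]))       # True: one point is shattered, so VC(H) >= 1
print(shatters(H, [0.2, 0.8]))  # False: labeling (1, 0) is unrealizable, so VC(H) = 1
```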

For SVMs with kernels, the effective VC dimension is usually small. If

$$\exists R\ \text{s.t.}\ \forall i \in \{1, \dots, m\},\ \|x^{(i)}\| \le R$$

then the class of linear classifiers with margin at least $\gamma$ satisfies

$$VC(H) \le \Big\lceil\frac{R^2}{4\gamma^2}\Big\rceil + 1$$


Let $H$ be given and let $d = VC(H) < \infty$. With probability at least $1 - \delta$, we have that

$$\forall h \in H,\ |\hat\epsilon(h) - \epsilon(h)| \le O\Bigg(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\Bigg)$$

Thus

$$\epsilon(\hat h) \le \Big(\min_{h \in H}\epsilon(h)\Big) + O\Bigg(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\Bigg)$$

Moreover, for $\epsilon(\hat h) \le \epsilon(h^*) + 2\gamma$ to hold with probability at least $1 - \delta$, it suffices that $m = O_{\gamma, \delta}(d)$, i.e. the number of training examples needed is linear in the VC dimension of $H$.

Model Selection

Suppose we have a finite set of models $M = \{M_1, \dots, M_d\}$ that we are trying to select among. There are several techniques that deal with the bias-variance tradeoff automatically.

Cross Validation

In hold-out cross validation, or simple cross validation, we

$$\begin{aligned} & \text{randomly split } S \text{ into } S_{\text{train}} \text{ and } S_{\text{cv}} \text{ (say } 70\% \text{ to } 30\%)\\ & \text{for } i \text{ in } 1 \dots d\\ & \qquad \text{train } M_i \text{ on } S_{\text{train}}\\ & \qquad \text{test } M_i \text{ on } S_{\text{cv}} \text{ to get } \epsilon_i\\ & \text{choose } M = \arg\min_{M_i} \epsilon_i\\ & (\text{optional})\text{ retrain } M \text{ on } S \end{aligned}$$

The main disadvantage of this algorithm is that it wastes data. To hold out less data each time, we can use k-fold cross validation (a code sketch follows after the pseudocode).

$$\begin{aligned} & \text{randomly split } S \text{ into } k \text{ disjoint subsets } S_1, \dots, S_k\\ & \text{for } i \text{ in } 1 \dots d\\ & \qquad \text{for } j \text{ in } 1 \dots k\\ & \qquad\qquad \text{train } M_i \text{ on } S \setminus S_j\\ & \qquad\qquad \text{test } M_i \text{ on } S_j \text{ to get } \epsilon_{ij}\\ & \qquad \epsilon_i = \frac{1}{k}\sum_{j=1}^k \epsilon_{ij}\\ & \text{choose } M = M_{\arg\min_i \epsilon_i}\\ & (\text{optional})\text{ retrain } M \text{ on } S \end{aligned}$$

A typical choice for $k$ is $10$. For problems with really scarce data, we can use $k = m$, which leads to leave-one-out cross validation.
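A minimal numpy sketch of k-fold cross validation as in the pseudocode above; the two toy models and the data-generating process are assumptions for illustration.

```python
import numpy as np

def k_fold_cv(fit_fns, X, y, k=10, seed=0):
    """k-fold CV over model-fitting functions; mirrors the pseudocode above.

    Each element of fit_fns maps (X_train, y_train) to a predict function;
    eps[i] averages the held-out 0-1 loss of model i over the k folds.
    """
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    eps = np.zeros(len(fit_fns))
    for j in range(k):
        train = np.concatenate([f for l, f in enumerate(folds) if l != j])
        for i, fit in enumerate(fit_fns):
            predict = fit(X[train], y[train])
            eps[i] += np.mean(predict(X[folds[j]]) != y[folds[j]]) / k
    return int(np.argmin(eps)), eps

# Assumed toy usage: a majority-vote model vs. a 1-D threshold model
fit_fns = [
    lambda Xt, yt: (lambda X: np.full(len(X), np.bincount(yt).argmax())),
    lambda Xt, yt: (lambda X: (X >= Xt.mean()).astype(int)),
]
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=200)
y = (X >= 0.5).astype(int)
best, eps = k_fold_cv(fit_fns, X, y)
print("chosen model:", best, "cv errors:", eps)
```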

Feature Selection

Suppose you have a supervised learning problem where n n n is very large, but you suspect that only a small number of features are relevant to the task. In such a setting, you can apply some heuristic algorithms to rank every feature and choose the top- k k k.

Wrapper Model Feature Selection

In a forward search, we use $F$ to record the currently selected features:

$$\begin{aligned} & F := \emptyset\\ & \text{repeat } \{\\ & \qquad \text{for } i \text{ in } 1 \dots n\ \{\\ & \qquad\qquad \text{if } (i \in F)\ \text{continue}\\ & \qquad\qquad F_i := F \cup \{i\}\\ & \qquad \}\\ & \qquad \text{cross validate over } \{M_i \mid M_i \text{ depends only on } F_i\}\\ & \qquad F := \text{best } F_i\\ & \} \end{aligned}$$

Similarly, a backward search starts from the full feature set and removes one feature at a time:

$$\begin{aligned} & F := \{1, \dots, n\}\\ & \text{repeat } \{\\ & \qquad \text{for } i \text{ in } 1 \dots n\ \{\\ & \qquad\qquad \text{if } (i \notin F)\ \text{continue}\\ & \qquad\qquad F_i := F - \{i\}\\ & \qquad \}\\ & \qquad \text{cross validate over } \{M_i \mid M_i \text{ depends only on } F_i\}\\ & \qquad F := \text{best } F_i\\ & \} \end{aligned}$$

Wrapper feature selection algorithms often work quite well, but can be computationally expensive; a forward-search sketch follows below.
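A sketch of forward search, assuming the caller supplies an `evaluate(F)` function that returns the cross-validation error for feature subset $F$. The stopping rule (halt when no added feature improves the error) is one reasonable choice, since the pseudocode leaves termination unspecified.

```python
def forward_search(n_features, evaluate, max_size=None):
    """Greedy forward search over feature subsets, as in the pseudocode above.

    `evaluate(F)` is assumed to return the cross-validation error of the model
    restricted to the feature subset F. We stop when no single added feature
    improves the error.
    """
    F, best_err = set(), evaluate(set())
    max_size = n_features if max_size is None else max_size
    while len(F) < max_size:
        # Try every feature not yet in F; keep the one with the lowest CV error
        err, i = min((evaluate(F | {i}), i) for i in range(n_features) if i not in F)
        if err >= best_err:
            break
        F, best_err = F | {i}, err
    return F, best_err

# Assumed toy evaluate: features 0 and 2 are the truly relevant ones
evaluate = lambda F: 1.0 - 0.4 * (0 in F) - 0.4 * (2 in F) + 0.01 * len(F)
print(forward_search(5, evaluate))  # -> ({0, 2}, ~0.22)
```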

Filter Feature Selection

Compute some simple score $S(i)$ to measure how informative each feature $x_i$ is. Then simply pick the $k$ features with the largest scores $S(i)$.

It is common to choose the mutual information as $S$:

$$S(i) = MI(x_i, y) = \sum_{x_i} \sum_{y} p(x_i, y) \log\frac{p(x_i, y)}{p(x_i)p(y)}$$

Note that

$$MI(x_i, y) = KL\big(p(x_i, y)\ \|\ p(x_i)p(y)\big)$$

so the score is large exactly when $x_i$ and $y$ are far from independent.
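A sketch of the filter approach: estimate $MI(x_i, y)$ from empirical counts and rank features by it. The synthetic features below (copies of the label flipped with increasing noise) are an assumption for illustration.

```python
import numpy as np

def mi_score(x, y):
    """Empirical mutual information MI(x_i, y) for discrete x and y."""
    xv, xi = np.unique(x, return_inverse=True)
    yv, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xv), len(yv)))
    np.add.at(joint, (xi, yi), 1.0)          # empirical joint counts
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0                          # 0 log 0 = 0 by convention
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

# Assumed synthetic features: y with label-flip noise 0.1, 0.3, 0.5
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
X = np.stack([y ^ (rng.random(1000) < p) for p in (0.1, 0.3, 0.5)], axis=1)
scores = [mi_score(X[:, i], y) for i in range(X.shape[1])]
print(np.argsort(scores)[::-1])  # least noisy feature ranks first: [0 1 2]
```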

Online Learning

So far, we have been considering batch learning, in which we are first given a training set to learn from. Online learning, on the contrary, requires the algorithm to make predictions continuously, even while it is learning.

In this setting, the algorithm is first given $x^{(i)}$ and asked to predict $h(x^{(i)})$. Then the true $y^{(i)}$ is revealed to the model, and the process repeats. The total online error is the total number of errors made by the algorithm during this process.

Assume that the class labels $y \in \{-1, 1\}$. In the perceptron algorithm with parameters $\theta \in \mathbb{R}^{n+1}$,

$$h(x) = g(\theta^T x)$$

where

$$g(z) = \left\{\begin{array}{ll} 1 & \text{if } z \ge 0\\ -1 & \text{otherwise} \end{array}\right.$$

Given a training example $(x, y)$, the parameters update only if $h(x) \neq y$:

$$\theta := \theta + yx$$

Suppose that

$$\begin{aligned} &\exists D\ \text{s.t.}\ \forall i,\ \|x^{(i)}\| \le D\\ &\exists u\ (\|u\| = 1)\ \text{s.t.}\ \forall i,\ y^{(i)} \cdot (u^T x^{(i)}) \ge \gamma \end{aligned}$$

Block and Novikoff showed that the total number of mistakes the perceptron algorithm makes is at most $(D/\gamma)^2$.
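A minimal implementation of the online perceptron with a mistake counter; the separable data stream, margin $0.1$, and direction $u$ below are illustrative assumptions (chosen so the Block-Novikoff bound applies).

```python
import numpy as np

def perceptron_online(stream, n):
    """Online perceptron over a stream of (x, y) with y in {-1, +1}.

    x is assumed to already include the intercept coordinate, so theta
    lives in R^{n+1} as in the text. Returns theta and the total online error.
    """
    theta, mistakes = np.zeros(n + 1), 0
    for x, y in stream:
        g = 1 if theta @ x >= 0 else -1  # h(x) = g(theta^T x)
        if g != y:                       # update only when a mistake is made
            theta += y * x
            mistakes += 1
    return theta, mistakes

# Assumed separable stream: margin 0.1 along u, points in the unit square
rng = np.random.default_rng(0)
u = np.array([0.6, 0.8])  # ||u|| = 1
stream = [
    (np.append(x, 1.0), 1 if u @ x > 0 else -1)
    for x in rng.uniform(-1.0, 1.0, size=(500, 2))
    if abs(u @ x) >= 0.1
]
theta, mistakes = perceptron_online(stream, n=2)
print("total online error:", mistakes)  # bounded by (D / gamma)^2
```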
