李宏毅 Machine Learning Study Notes 1: Regression

These notes give an accessible introduction to the basic concepts and steps of regression, including model selection, defining the loss function, and applying gradient descent, and discuss overfitting and how to address it.


1 Regression

Regression: output a scalar

What is regression? A task whose output is a numeric value is regression.

Step 1: Model (function set)

A set of functions:

$f_1: y = 10.0 + 9.0 \cdot x_{cp}$

$f_2: y = 9.8 + 9.2 \cdot x_{cp}$

$f_3: y = -0.8 - 1.2 \cdot x_{cp}$

Linear Model

$x_i$: an attribute of the input $x$ (a feature)
$w_i$: weight, $b$: bias

$y = b + \sum_i w_i x_i$

Step 2 : Goodness of function

$y = b + w \cdot x_{cp}$

$\hat{y}$ denotes the true (correct) value.
A superscript indexes a complete example in the data;
a subscript indexes one attribute of that example.

To measure how good a function is, we need a loss function.

Loss function

$L(f) = \sum_{n=1}^{10} \left(\hat{y}^n - f(x^n_{cp})\right)^2$

The loss function is a function of functions.

$L(f) \rightarrow L(w, b)$

$L(f) = L(w, b) = \sum_{n=1}^{10} \left(\hat{y}^n - (b + w \cdot x^n_{cp})\right)^2$
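
To make this concrete, here is a minimal Python sketch of computing $L(w, b)$; the ten $(x_{cp}, \hat{y})$ pairs below are made-up placeholders, not the actual Pokémon data from the lecture.

```python
# Squared-error loss L(w, b) over training pairs (x_cp^n, y_hat^n).
# The data below are made-up placeholders, not the lecture's real CP values.
x_cp = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
y_hat = [105.0, 208.0, 310.0, 397.0, 505.0, 612.0, 698.0, 801.0, 905.0, 1000.0]

def loss(w, b):
    """L(w, b) = sum_n (y_hat^n - (b + w * x_cp^n))^2"""
    return sum((y - (b + w * x)) ** 2 for x, y in zip(x_cp, y_hat))

print(loss(10.0, 5.0))   # a function close to the data has a small loss
print(loss(-1.2, -0.8))  # a clearly worse function has a much larger loss
```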

Step 3: Best function

Pick the best function:

$f^* = \arg\min_{f} L(f)$

$w^*, b^* = \arg\min_{w,b} L(w, b)$

$= \arg\min_{w,b} \sum_{n=1}^{10} \left(\hat{y}^n - (b + w \cdot x^n_{cp})\right)^2$

Step 3 (cont.): Gradient Descent

One parameter

$w^* = \arg\min_w L(w)$

  • Pick an initial value $w_0$
  • Compute

$\left.\frac{dL}{dw}\right|_{w=w_0}$

$w_1 \leftarrow w_0 - \eta \left.\frac{dL}{dw}\right|_{w=w_0}$

  • Compute

$\left.\frac{dL}{dw}\right|_{w=w_1}$

$w_2 \leftarrow w_1 - \eta \left.\frac{dL}{dw}\right|_{w=w_1}$
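
A minimal sketch of this single-parameter update loop; the loss $L(w) = (w - 3)^2$ and the learning rate are illustrative assumptions, not the regression loss from the lecture.

```python
# Gradient descent on a single parameter w:
#   w_{t+1} = w_t - eta * dL/dw evaluated at w_t
# L(w) = (w - 3)^2 is an illustrative convex loss with its minimum at w = 3.
def dL_dw(w):
    return 2 * (w - 3.0)

eta = 0.1   # learning rate
w = 0.0     # initial value w_0
for step in range(100):
    w = w - eta * dL_dw(w)
print(w)    # converges towards 3.0, the minimizer of L(w)
```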

Two parameters

$w^*, b^* = \arg\min_{w,b} L(w, b)$

  • Pick initial values $w_0$ and $b_0$
  • Compute the partial derivatives (recall from calculus how to take a partial derivative)

$\left.\frac{\partial L}{\partial w}\right|_{w=w_0,\,b=b_0}, \quad \left.\frac{\partial L}{\partial b}\right|_{w=w_0,\,b=b_0}$

$w_1 \leftarrow w_0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w_0,\,b=b_0}$

$b_1 \leftarrow b_0 - \eta \left.\frac{\partial L}{\partial b}\right|_{w=w_0,\,b=b_0}$

  • Compute

$\left.\frac{\partial L}{\partial w}\right|_{w=w_1,\,b=b_1}, \quad \left.\frac{\partial L}{\partial b}\right|_{w=w_1,\,b=b_1}$

$w_2 \leftarrow w_1 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w_1,\,b=b_1}$

$b_2 \leftarrow b_1 - \eta \left.\frac{\partial L}{\partial b}\right|_{w=w_1,\,b=b_1}$
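
The same rule written out for the $L(w, b)$ of Step 2, with the partial derivatives computed analytically: $\frac{\partial L}{\partial w} = \sum_n 2(\hat{y}^n - (b + w x^n))(-x^n)$ and $\frac{\partial L}{\partial b} = \sum_n 2(\hat{y}^n - (b + w x^n))(-1)$. The data and learning rate below are illustrative assumptions.

```python
# Gradient descent on both parameters of y = b + w * x.
# Data are made-up placeholders whose true relation is y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

w, b = 0.0, 0.0   # initial values w_0, b_0
eta = 0.01        # learning rate
for step in range(1000):
    grad_w = sum(2 * (yn - (b + w * xn)) * (-xn) for xn, yn in zip(x, y))
    grad_b = sum(2 * (yn - (b + w * xn)) * (-1.0) for xn, yn in zip(x, y))
    w = w - eta * grad_w
    b = b - eta * grad_b
print(w, b)  # approaches w = 2, b = 1
```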

Problem

Gradient descent may fail to reach the global minimum:

  • stuck at local minima
  • stuck at saddle point
  • very slow at the plateau

The loss function of linear regression is convex, so we do not need to worry about local minima.

Learning Rate

The learning rate $\eta$ controls the step size, i.e. how fast the parameters are updated.

Another linear model

$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2$

$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3$

$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3 + w_4 \cdot (x_{cp})^4$

$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3 + w_4 \cdot (x_{cp})^4 + w_5 \cdot (x_{cp})^5$

Whether a model is "linear" refers to whether its output is linear in the parameters.
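
For example, a degree-3 polynomial in $x_{cp}$ is still linear in $(b, w_1, w_2, w_3)$, so it can be fit as an ordinary linear least-squares problem after expanding the input into powers. A minimal sketch with made-up data, using numpy's least-squares solver for brevity instead of the gradient descent used elsewhere in these notes:

```python
import numpy as np

# Degree-3 model y = b + w1*x + w2*x^2 + w3*x^3: still linear in the parameters.
# Data are made-up placeholders.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.8, 10.1, 17.9, 28.2, 40.8])

# Design matrix with columns [1, x, x^2, x^3]; one column per parameter.
X = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)
params, *_ = np.linalg.lstsq(X, y, rcond=None)
b, w1, w2, w3 = params
print(b, w1, w2, w3)
```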

  • A more complex model yields lower error on the training data,
    provided we can truly find the best function in its function set.

Model Selection

model | Training error | Testing error
1     | 31.9           | 35.0
2     | 15.4           | 18.4
3     | 15.3           | 18.1
4     | 14.9           | 28.2
5     | 12.8           | 232.1
  • A more complex model does not always lead to better performance on testing data.
  • This is overfitting.

The model space of a more complex model contains the model space of a simpler one, so its error on the training data is smaller; that does not mean its error on the testing data is also smaller. If the model is too complex, overfitting occurs.

What are the hidden factors?

Consider the effect of the Pokémon's species on its CP value.

Back to step 1: Redesign the Model

if $x_s$ = Pidgey: $y = b_1 + w_1 \cdot x_{cp}$

if $x_s$ = Weedle: $y = b_2 + w_2 \cdot x_{cp}$

if $x_s$ = Caterpie: $y = b_3 + w_3 \cdot x_{cp}$

if $x_s$ = Eevee: $y = b_4 + w_4 \cdot x_{cp}$

$\downarrow$

$y = b_1 \cdot \delta(x_s = \text{Pidgey}) + w_1 \cdot \delta(x_s = \text{Pidgey}) \cdot x_{cp}$

$+\, b_2 \cdot \delta(x_s = \text{Weedle}) + w_2 \cdot \delta(x_s = \text{Weedle}) \cdot x_{cp}$

$+\, b_3 \cdot \delta(x_s = \text{Caterpie}) + w_3 \cdot \delta(x_s = \text{Caterpie}) \cdot x_{cp}$

$+\, b_4 \cdot \delta(x_s = \text{Eevee}) + w_4 \cdot \delta(x_s = \text{Eevee}) \cdot x_{cp}$

Training error = 3.8, testing error = 14.3.
This model performs better on the testing data.
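
The $\delta(x_s = \cdot)$ terms act as one-hot indicator features, so the piecewise model is still a single linear model. A minimal sketch of the prediction function under that reading; the parameter values are placeholders, not the fitted values from the lecture.

```python
# Piecewise model written as one linear function of indicator features:
#   y = sum_s delta(x_s == s) * (b_s + w_s * x_cp)
# Parameter values are placeholders, not the lecture's fitted values.
params = {
    "Pidgey":   (10.0, 2.0),   # (b_1, w_1)
    "Weedle":   (5.0, 1.5),    # (b_2, w_2)
    "Caterpie": (8.0, 1.8),    # (b_3, w_3)
    "Eevee":    (90.0, 2.7),   # (b_4, w_4)
}

def predict(species, x_cp):
    # delta(x_s == s) selects exactly one pair (b_s, w_s)
    b, w = params[species]
    return b + w * x_cp

print(predict("Pidgey", 100.0))  # 10.0 + 2.0 * 100.0 = 210.0
```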

Are there any other hidden factors?

The effect of HP, weight, and height on the CP value.

Back to step 1: Redesign the Model Again

if $x_s$ = Pidgey: $y' = b_1 + w_1 \cdot x_{cp} + w_5 \cdot (x_{cp})^2$

if $x_s$ = Weedle: $y' = b_2 + w_2 \cdot x_{cp} + w_6 \cdot (x_{cp})^2$

if $x_s$ = Caterpie: $y' = b_3 + w_3 \cdot x_{cp} + w_7 \cdot (x_{cp})^2$

if $x_s$ = Eevee: $y' = b_4 + w_4 \cdot x_{cp} + w_8 \cdot (x_{cp})^2$

$\downarrow$

$y = y' + w_9 \cdot x_{hp} + w_{10} \cdot (x_{hp})^2 + w_{11} \cdot x_h + w_{12} \cdot (x_h)^2 + w_{13} \cdot x_w + w_{14} \cdot (x_w)^2$

Training error = 1.9, testing error = 102.3: overfitting.
If we also take the Pokémon's other attributes into account and choose a very complex model, the result is overfitting.

Back to step 2: Regularization

A method that is generally useful across many different tasks: regularization.

$L(f) = L(w, b) = \sum_{n} \left(\hat{y}^n - (b + \sum_i w_i \cdot x_i)\right)^2 + \lambda \sum_i (w_i)^2$

The regularization term also pushes the weights $w$ to be small, which means the function is smoother.

$y = b + \sum_i w_i x_i$

$y + \sum_i w_i \Delta x_i = b + \sum_i w_i (x_i + \Delta x_i)$

If the $w_i$ are small, the function is smooth: a small change $\Delta x_i$ in the input causes only a small change in the output.

lambda | Training error | Testing error
0      | 1.9            | 102.3
1      | 2.3            | 68.7
10     | 3.5            | 25.7
100    | 4.1            | 11.1
1000   | 5.6            | 12.8
10000  | 6.3            | 18.7
100000 | 8.5            | 26.8

As λ increases, we find a smoother function: the larger λ is, the less the training error is taken into account. Tune λ and choose the value that gives the smallest testing error.
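
A sketch of that λ sweep, assuming a simple one-feature model $y = b + w \cdot x$ fit by gradient descent on the regularized loss; the data are made-up placeholders, and the bias $b$ is left out of the regularization term, matching the formula above.

```python
# Regularized loss: L(w, b) = sum_n (y^n - (b + w*x^n))^2 + lam * w^2
# (the bias b is not regularized, matching the formula above).
# Training and testing data are made-up placeholders.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [3.2, 4.8, 7.1, 9.2, 10.8]
test_x = [6.0, 7.0]
test_y = [13.1, 14.9]

def fit(lam, eta=0.005, steps=5000):
    """Gradient descent on the regularized loss for a given lambda."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = sum(2 * (yn - (b + w * xn)) * (-xn) for xn, yn in zip(train_x, train_y))
        grad_w += 2 * lam * w  # gradient of the lam * w^2 penalty
        grad_b = sum(2 * (yn - (b + w * xn)) * (-1.0) for xn, yn in zip(train_x, train_y))
        w, b = w - eta * grad_w, b - eta * grad_b
    return w, b

def sq_error(w, b, xs, ys):
    return sum((yn - (b + w * xn)) ** 2 for xn, yn in zip(xs, ys))

# Larger lambda -> smaller w (smoother function) -> higher training error;
# the testing error may first drop and then rise again.
for lam in [0.0, 1.0, 10.0, 100.0]:
    w, b = fit(lam)
    print(lam, sq_error(w, b, train_x, train_y), sq_error(w, b, test_x, test_y))
```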
