# Machine Learning
Tags (space-separated): ML
Introduction
Supervised Learning
Supervised Learning
: “right answers” given
Classification
: Discrete valued output (0 or 1)
Regression
: Predict continuous valued output
Unsupervised Learning
Model and Cost Function
Linear regression with one variable
Univariate linear regression
Hypothesis
: hθ(x) = θ0 + θ1x
Parameters: θ0, θ1
Idea
: Choose θ0, θ1 so that hθ(x) is close to y for our training examples.
Cost Function:
J(θ0,θ1) = (1/2m) ∑i=1..m (hθ(x(i)) − y(i))²
We call this the squared error function.
hθ(x) (for fixed θ0,θ1, this is a function of x)
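As a rough Octave sketch (the function name computeCost and the layout of X are choices made here, not fixed by these notes): given an m × 2 design matrix X whose first column is all ones, the squared error cost can be computed as:

```octave
function J = computeCost(X, y, theta)
  % X: m x 2 design matrix (first column all ones), y: m x 1 labels, theta: 2 x 1
  m = length(y);
  predictions = X * theta;             % h_theta(x) for every training example
  sqErrors = (predictions - y) .^ 2;   % squared error per example
  J = (1 / (2 * m)) * sum(sqErrors);   % average and halve, as in the formula above
end
```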
Matrix
Dimension of matrix: number of rows × number of columns
Gradient descent algorithm
repeat until convergence {
θj := θj − α ∂/∂θj J(θ0,θ1)    (for j = 0 and j = 1)
}
":=" is assignment, and α is the learning rate. The subtlety is in how you implement the gradient descent update.
Correct: simultaneous update:
temp0 := θ0 − α ∂J(θ0,θ1)/∂θ0
temp1 := θ1 − α ∂J(θ0,θ1)/∂θ1
θ0 := temp0
θ1 := temp1
Gradient descent can converge to a local minimum, even with the learning rate α held fixed.
As we approach a local minimum, gradient descent will automatically take smaller steps, so there is no need to decrease α over time.
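A minimal Octave sketch of the simultaneous update, assuming the computeCost sketch above and a design matrix X with a leading column of ones (gradientDescent is just a name chosen here):

```octave
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    errors = X * theta - y;                        % h_theta(x) - y for all examples
    theta = theta - (alpha / m) * (X' * errors);   % simultaneous update of all theta_j
    J_history(iter) = computeCost(X, y, theta);    % record cost to check convergence
  end
end
```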
WEEK 2
Linear Regression with multiple variables: Multiple Features
x(i) = input (features) of the i-th training example.
xj(i) = value of feature j in the i-th training example.
Hypothesis
hθ(x)=θ0+θ1x1+θ2x2+⋯+θnxn
For convenience of notation, define x0=1
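In Octave this just means prepending a column of ones to the feature matrix, e.g. (assuming X is an m × n matrix of raw features):

```octave
X = [ones(size(X, 1), 1), X];   % add the x0 = 1 intercept column
```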
Gradient Descent for Multiple Variables
Hypothesis: hθ(x) = θTx = θ0x0 + θ1x1 + θ2x2 + ⋯ + θnxn
Parameters:θ0,θ1,…,θn
Cost function: J(θ) = (1/2m) ∑i=1..m (hθ(x(i)) − y(i))²
Gradient Descent
repeat until convergence {
θj := θj − α (1/m) ∑i=1..m (hθ(x(i)) − y(i)) xj(i)    (simultaneously update θj for j = 0, …, n)
}
Feature Scaling
: get every feature into approximately a −1 ≤ xi ≤ 1 range.
Mean Normalization
: xi := (xi − μi)/si, where μi is the mean of feature i and si is its range (max − min) or standard deviation.
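A sketch of mean normalization in Octave, using the standard deviation as the scale si (featureNormalize is just a name chosen here):

```octave
function [X_norm, mu, sigma] = featureNormalize(X)
  mu = mean(X);                  % per-feature mean (row vector)
  sigma = std(X);                % per-feature standard deviation used as s_i
  X_norm = (X - mu) ./ sigma;    % broadcasting subtracts/divides column-wise
end
```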
Learning rate
- “Debugging”: how to make sure gradient descent is working correctly (see the plotting sketch below).
- How to choose the learning rate α.
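One common debugging check, sketched here with the J_history returned by the gradientDescent sketch above: plot J(θ) against the iteration number; for a well-chosen α it should decrease on every iteration.

```octave
plot(1:numel(J_history), J_history, '-');
xlabel('Number of iterations');
ylabel('Cost J(\theta)');
```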
Housing prices prediction
: define new features, e.g. area = frontage × depth instead of using frontage and depth separately.
Polynomial regression
: e.g. hθ(x) = θ0 + θ1(size) + θ2(size)² + θ3(size)³, obtained by defining x1 = size, x2 = size², x3 = size³.
Choice of features
: with powers of a feature as new features, feature scaling becomes very important.
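For example, a rough Octave sketch that builds size, size², size³ as features (assuming the raw size is the first column of X and reusing the featureNormalize sketch above):

```octave
size_col = X(:, 1);                               % the raw "size" feature
X_poly = [size_col, size_col.^2, size_col.^3];    % polynomial features
[X_poly, mu, sigma] = featureNormalize(X_poly);   % scaling matters for high powers
X_poly = [ones(size(X_poly, 1), 1), X_poly];      % add the x0 = 1 column
```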
Computing Parameters Analytically
Normal Equation
Method to solve for θ analytically: θ = (XTX)−1 XTy.
Advantages:
+ No need to choose α.
+ Don’t need to iterate.
Disadvantages:
+ Need to compute (XTX)−1.
+ Slow if n is very large.
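In Octave the normal equation is a one-liner (assuming X already contains the x0 = 1 column):

```octave
theta = pinv(X' * X) * X' * y;   % pinv also copes with a non-invertible X'X
```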
Octave
Moving Data Around
load('featuresX.dat')   % load data from a file (function syntax)
load featuresX.dat      % same thing, command syntax
who                     % variables in the current scope
whos                    % like who, but also shows size and type
clear                   % remove all variables from the workspace
save hello.mat v        % save variable v in binary .mat format
save hello.txt v -ascii % save as text (ASCII)
A(2,:)                      % ':' means every element along that row
A([1 3],:)                  % rows 1 and 3, all columns
A = [A,[100;200;300]];      % append another column vector on the right
A(:)                        % put all elements of A into a single column vector
Computing on Data
A.*B    % element-wise multiplication
A*B     % matrix multiplication
A.^2    % element-wise squaring
v=[1;2;3]
1 ./ v                  % element-wise reciprocal
log(v)                  % element-wise natural log
exp(v)                  % element-wise exponential
abs(v)                  % element-wise absolute value
-v                      % negation, same as -1*v
v + ones(length(v),1)   % add 1 to each element, same as v + 1
a = [1 2 3 4]
[val,ind]=max(a)   % maximum value and its index
a < 1              % element-wise comparison, returns a 0/1 vector
find(a<3)          % indices of elements satisfying the condition
A = magic(3)
[r,c] = find(A>=7)    % row and column indices of elements >= 7
sum(a)                % sum of all elements
prod(a)               % product of all elements
floor(a)              % round each element down
ceil(a)               % round each element up
max(rand(3),rand(3))  % element-wise max of two random 3x3 matrices
max(A,[],1)           % column-wise maximum
max(A,[],2)           % row-wise maximum
max(max(A))           % maximum over the whole matrix, same as max(A(:))
A.*eye(3)             % keep only the diagonal of A
pinv(A)               % pseudo-inverse of A
Plotting Data
plot
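A small plotting sketch in the same spirit as the Octave tutorial:

```octave
t = 0:0.01:0.98;
y1 = sin(2*pi*4*t);
y2 = cos(2*pi*4*t);
plot(t, y1);            % first curve
hold on;                % keep it while plotting the next one
plot(t, y2, 'r');       % second curve in red
xlabel('time'); ylabel('value');
legend('sin', 'cos');
title('my plot');
```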
Vectorization
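The point of vectorization: replace explicit loops with matrix/vector operations. A sketch comparing the two ways of computing hθ(x) = θTx (the example values for θ and x are arbitrary):

```octave
theta = [-40; 0.25];
x = [1; 2104];                  % x0 = 1 plus one feature
% Unvectorized:
prediction = 0.0;
for j = 1:length(x)
  prediction = prediction + theta(j) * x(j);
end
% Vectorized:
prediction = theta' * x;
```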
Week 3
Classification and Representation
Classification
Logistic Regression: 0≤hθ(x)≤1
Hypothesis Representation
Logistic Regression Model
want 0 ≤ hθ(x) ≤ 1
hθ(x) = g(θTx), where g(z) = 1/(1 + e−z)
Sigmoid function or logistic function
Interpretation of Hypothesis Output
hθ(x)= estimated probability that y=1 on input x
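A sketch of the sigmoid and of the logistic hypothesis in Octave (sigmoid is just the name chosen here):

```octave
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));   % works element-wise on vectors and matrices
end
% h = sigmoid(X * theta) then gives the estimated P(y = 1 | x; theta) per example.
```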
Cost Function
The squared error cost is non-convex for logistic regression, so we use a cost function that is convex:
Logistic regression cost function
Cost(hθ(x), y) = −log(hθ(x)) if y = 1, and −log(1 − hθ(x)) if y = 0
Note: y = 0 or 1 always
Simplified Cost Function
Cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))
J(θ) = −(1/m) ∑i=1..m [ y(i) log hθ(x(i)) + (1 − y(i)) log(1 − hθ(x(i))) ]
Want minθ J(θ):
Repeat{
θj := θj − α (1/m) ∑i=1..m (hθ(x(i)) − y(i)) xj(i)
(simultaneously update all θj)
}
Algorithm looks identical to linear regression!
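A sketch of the cost and gradient in Octave, reusing the sigmoid sketch above (costFunction is just a name chosen here):

```octave
function [J, grad] = costFunction(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);                                    % predictions in (0, 1)
  J = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h));   % logistic cost
  grad = (1 / m) * (X' * (h - y));                           % gradient for every theta_j
end
```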
Regularization
Overfitting
Cost Function
J(θ) = (1/2m) [ ∑i=1..m (hθ(x(i)) − y(i))² + λ ∑j=1..n θj² ]   (θ0 is not regularized)
Gradient descent
Repeat{
θ0 := θ0 − α (1/m) ∑i=1..m (hθ(x(i)) − y(i)) x0(i)
θj := θj − α [ (1/m) ∑i=1..m (hθ(x(i)) − y(i)) xj(i) + (λ/m) θj ]   (j = 1, …, n)
(simultaneously update all θj)
}
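A sketch of the regularized cost and gradient (costFunctionReg is just a name chosen here; note θ0 is not penalized):

```octave
function [J, grad] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  reg = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);             % skip theta_0
  J = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h)) + reg;
  grad = (1 / m) * (X' * (h - y));
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);
end
```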
Non-invertibility (optional/advanced)
Suppose m ≤ n (#examples ≤ #features); then XTX may be non-invertible.
If λ > 0, the regularized normal equation θ = (XTX + λL)−1 XTy is invertible, where L is the (n+1)×(n+1) identity matrix with its top-left entry set to 0.
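A sketch of the regularized normal equation in Octave (assumes a design matrix X with the x0 column, labels y, and some λ > 0):

```octave
n = size(X, 2) - 1;    % number of features (X already has the x0 column)
L = eye(n + 1);
L(1, 1) = 0;           % do not regularize theta_0
theta = (X' * X + lambda * L) \ (X' * y);   % '\' solves the regularized system
```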
Week 4
ai(j) = ”activation” of unit i in layer j
Θ(j)=matrix of weights controlling function mapping from layer j to layer j+1
If a network has sj units in layer j and sj+1 units in layer j+1, then Θ(j) will be of dimension sj+1 × (sj + 1), i.e. (#units in layer j+1) × (#units in layer j, plus 1 for the bias).
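A tiny sketch illustrating this dimension rule for one layer of forward propagation, reusing the sigmoid sketch above (the sizes and values are arbitrary):

```octave
Theta1 = rand(4, 3 + 1);    % maps layer 1 (s1 = 3 units) to layer 2 (s2 = 4 units)
x = [1; 0.5; -1; 2];        % input with the bias unit x0 = 1 prepended
a2 = sigmoid(Theta1 * x);   % activations of layer 2, a 4 x 1 vector
```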