Notes on Machine Learning by Andrew Ng (English Version)


Author: 我是全宇宙ENERGE的总量

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/weixin_43038346/article/details/94580091

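These are detailed notes on Andrew Ng's Machine Learning course, covering supervised learning, unsupervised learning, linear regression, and more. The notes explain in detail how to use the gradient descent algorithm to minimize a cost function, and demonstrate through examples how to implement it in Python.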

Machine Learning by Andrew Ng (English Version)

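These are my lecture notes from Andrew Ng's Machine Learning course. I took them in English, so if anything is recorded inaccurately, please point it out. I have also seen a Chinese version of the notes, but it covers only the first chapter. Having studied the Chinese version, I plan to keep updating these notes with my own understanding as I continue to learn ML.
Some people find ML hard, but I want to make one point. As a first-year student at XDU, I finished the year with an overall grade of 93 (final exam 98) in advanced mathematics and an overall grade of 96 (final exam 93) in linear algebra. If you have a good grasp of these two subjects, the first three chapters are not particularly hard to follow.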

Machine Learning 101

Supervised Learning
Unsupervised Learning

Clustering algorithm
Cocktail party algorithm

Regression and Classification Problems

Regression vs Classification

Instance:

Regression predicts a continuous output, such as a housing price.

Classification predicts a discrete output, such as whether a patient has a tumor.

Octave or MATLAB is used for ML here (Java, C++, and Python require far more code to do the same thing!)

Model Description

m = number of training examples

x = input variable / feature

y = output or target variable

(x, y) = one training example

($x^{(i)}, y^{(i)}$) = the $i^{th}$ training example ($i$ is an index, not an exponent)

Linear Regression

The simplest case: one input variable (univariate)


Hypothesis: $h_{\theta}(x) = \theta_0 + \theta_1x$

$\theta_i$: parameters
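To make the notation concrete, here is a minimal Python sketch of a training set and the hypothesis. The data values and the names `training_set` and `hypothesis` are illustrative assumptions, not part of the course material.

```python
# Hypothetical toy training set of (x, y) pairs; the values are made up.
training_set = [(1.0, 1.5), (2.0, 2.5), (3.0, 3.5)]
m = len(training_set)  # m = number of training examples


def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# The first training example (x^(1), y^(1)); the notes count from 1, Python from 0.
x_1, y_1 = training_set[0]
prediction = hypothesis(0.5, 1.0, x_1)
print(m, x_1, y_1, prediction)
```

With $\theta_0 = 0.5$ and $\theta_1 = 1$, the prediction for the first example matches its $y$ value exactly, since this toy data was chosen to lie on that line.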

Univariate Cost Function

How do we choose the parameters $\theta_0, \theta_1$?

We want $h_{\theta}(x)$ to fit the data well, that is, to be close to $y$. So we look for the $\theta_0, \theta_1$ such that, for every example $(x^{(i)}, y^{(i)})$ in the training set:

$\rightarrow h_\theta(x^{(i)}) - y^{(i)}$ is small,

$\rightarrow{(h_\theta(x^{(i)}) - y^{(i)})}^2$ is small (squaring makes every error non-negative),

and finally, summed over all $m$ training examples and scaled by $\frac{1}{2m}$,

$\rightarrow\frac{1}{2m}\sum_{i=1}^m{(h_\theta(x^{(i)}) - y^{(i)})}^2$ is small.

To be clear, we define this last expression as:

$J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^m{(h_\theta(x^{(i)}) - y^{(i)})}^2$

Now all we need to do is find the minimum of $J(\theta_0,\theta_1)$. We call $J(\theta_0,\theta_1)$ the cost function, or squared error function.
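The cost function translates almost directly into code. The following is a minimal sketch in plain Python; the function names and the toy data are assumptions made for illustration.

```python
def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) = 1/(2m) * sum of squared errors."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Hypothetical data lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

print(cost(0.0, 1.0, xs, ys))  # perfect fit, so the cost is zero
print(cost(0.0, 0.0, xs, ys))  # a worse fit gives a larger cost
```

The better the line fits the data, the smaller the value of `cost`, which is why minimizing $J$ is the same as finding the best-fitting parameters.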

For better understanding, we simplify the hypothesis to $h_\theta(x) = \theta_1 x$ (that is, $\theta_0 = 0$); our goal is then to minimize $J(\theta_1)$.

[Figure: $h_\theta(x)$ vs $J(\theta_1)$]

From the plot of $J(\theta_1)$, the best value is $\theta_1 = 1$.
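The same conclusion can be checked numerically by evaluating $J(\theta_1)$ at a few candidate values. This sketch assumes a toy data set whose points lie exactly on $y = x$; the original figure is not available, so the data is an assumption chosen so that the minimum falls at $\theta_1 = 1$.

```python
def cost_theta1(theta1, xs, ys):
    """J(theta1) for the simplified hypothesis h_theta(x) = theta1 * x."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Assumed data: three points on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

candidates = [0.0, 0.5, 1.0, 1.5, 2.0]
costs = {t: cost_theta1(t, xs, ys) for t in candidates}
for t in candidates:
    print(f"theta1 = {t:.1f}  J = {costs[t]:.4f}")

# Pick the candidate with the lowest cost.
best_theta1 = min(costs, key=costs.get)
print("best theta1:", best_theta1)
```

The costs are symmetric around $\theta_1 = 1$ and grow as $\theta_1$ moves away from it, which is the bowl shape described in these notes.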

Back to $J(\theta_0,\theta_1)$: both $J(\theta_0,\theta_1)$ and $J(\theta_1)$ plot as a bowl shape; the difference is the dimension. We don’t have to draw a 3D surface to show $J(\theta_0,\theta_1)$; we can use contour plots to depict it instead.

We want an algorithm that finds the ideal $\theta_0,\theta_1$ automatically.

Gradient descent

Used to minimize the cost function $J$, and other functions as well.

Idea

Start with some $\theta_0, \theta_1$.
Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we end up at a minimum.

Gradient descent algorithm

repeat until convergence {

$\theta_j := \theta_j - \alpha \frac{\partial J(\theta_0, \theta_1) }{\partial \theta_j}$ (for $j = 0$ and $j = 1$ )

}

$\alpha$: learning rate (how big each step is)

This is an update equation. $\theta_0$ and $\theta_1$ must be updated simultaneously, as follows.

Correct update (simultaneous)

$temp0 := \theta_0 - \alpha \frac{\partial J(\theta_0, \theta_1) }{\partial \theta_0}$

$temp1 := \theta_1 - \alpha \frac{\partial J(\theta_0, \theta_1) }{\partial \theta_1}$

$\theta_0 := temp0$

$\theta_1 := temp1$

Incorrect update (this is not the gradient descent algorithm)

$temp0 := \theta_0 - \alpha \frac{\partial J(\theta_0, \theta_1) }{\partial \theta_0}$

$\theta_0 := temp0$

$temp1 := \theta_1 - \alpha \frac{\partial J(\theta_0, \theta_1) }{\partial \theta_1}$ (this step already uses the new $\theta_0$)

$\theta_1 := temp1$

So we need temporary variables such as temp0 and temp1 to hold all the new values before assigning any of them back to the $\theta$’s.
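The difference between the two update orders shows up after a single step. The sketch below uses a made-up cost $J(\theta_0, \theta_1) = \theta_0^2 + \theta_0\theta_1 + \theta_1^2$, chosen only because each partial derivative depends on both parameters; the function names are likewise assumptions.

```python
def dJ_dtheta0(theta0, theta1):
    """Partial derivative of the assumed J with respect to theta0."""
    return 2 * theta0 + theta1


def dJ_dtheta1(theta0, theta1):
    """Partial derivative of the assumed J with respect to theta1."""
    return theta0 + 2 * theta1


def simultaneous_step(theta0, theta1, alpha):
    """Correct: both derivatives are evaluated at the old parameters."""
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    return temp0, temp1


def sequential_step(theta0, theta1, alpha):
    """Incorrect: theta0 is overwritten before theta1's derivative is computed."""
    theta0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    theta1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)  # sees the new theta0
    return theta0, theta1


correct = simultaneous_step(1.0, 1.0, 0.1)
incorrect = sequential_step(1.0, 1.0, 0.1)
print("simultaneous:", correct)
print("sequential:  ", incorrect)
```

Starting from the same point, the two versions agree on $\theta_0$ but disagree on $\theta_1$, because the sequential version computes the second derivative at a point the algorithm never intended to evaluate.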

The partial derivative term in the equation


The $\alpha$ term in the equation

If $\alpha$ is too small, progress is slow.
If it is too large, we may overshoot (miss) the minimum; gradient descent may fail to converge, or even diverge.
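A small experiment illustrates both failure modes. This sketch assumes the one-parameter cost $J(\theta) = \theta^2$ (derivative $2\theta$, minimum at $\theta = 0$), which is not from the course but is simple enough to reason about by hand; the three learning rates are arbitrary choices.

```python
def run_gradient_descent(alpha, theta=1.0, steps=20):
    """Run a fixed number of gradient descent steps on the assumed J(theta) = theta**2."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # dJ/dtheta = 2 * theta
    return theta


for alpha in (0.001, 0.1, 1.1):
    print(f"alpha = {alpha}: theta after 20 steps = {run_gradient_descent(alpha)}")
```

With the smallest rate, $\theta$ has barely moved after 20 steps; with the middle rate it is close to the minimum; with the largest rate each step overshoots by more than the previous one, so $|\theta|$ grows.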

Q: What if the parameter $\theta_1$ is already at a local minimum? What will one step of gradient descent do?

A: $\theta_1$ won’t change. The derivative term is zero at a local minimum, so the update leaves $\theta_1$ at the local optimum. (This is basic calculus.)

Gradient descent can converge to a local minimum even with the learning rate $\alpha$ fixed. As we approach a local minimum, the derivative shrinks, so gradient descent automatically takes smaller steps. There is no need to decrease $\alpha$ over time.
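The shrinking steps can be observed directly by recording the size of each update. As a sketch, this again assumes the simple cost $J(\theta) = \theta^2$ and an arbitrary fixed learning rate of 0.1.

```python
alpha = 0.1   # fixed learning rate, never changed during the run
theta = 1.0   # arbitrary starting point
step_sizes = []

for _ in range(10):
    step = alpha * 2 * theta  # alpha times dJ/dtheta for the assumed J(theta) = theta**2
    theta -= step
    step_sizes.append(abs(step))

print([round(s, 4) for s in step_sizes])
```

Every step is smaller than the one before it, even though `alpha` stays constant, because the derivative itself gets smaller as $\theta$ approaches the minimum.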

You can use the gradient descent algorithm to minimize any cost function $J$, not only the one defined for linear regression.

Gradient descent for linear regression

Review

Linear Regression Model

$J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^m{(h_\theta(x^{(i)}) - y^{(i)})}^2$

$h_\theta(x) = \theta_0 + \theta_1 x$

Gradient descent algorithm

repeat until convergence {

$\theta_j := \theta_j - \alpha \frac{\partial J(\theta_0, \theta_1) }{\partial \theta_j}$ (for $j = 0$ and $j = 1$ )

}

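Apply gradient descent to minimize the squared error cost function. The key to writing this code is the derivative term.

So we write it down:

$\frac{\partial J(\theta_0, \theta_1) }{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m{(h_{\theta}(x^{(i)})-y^{(i)})}$

$\frac{\partial J(\theta_0, \theta_1) }{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^m{(h_{\theta}(x^{(i)})-y^{(i)}) \cdot x^{(i)}}$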
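The derivative terms can be sanity-checked numerically. The sketch below computes the two partial derivatives of $J$ analytically (the mean error for $\theta_0$, and the mean of error times $x$ for $\theta_1$) and compares them against central finite differences. The data, the evaluation point, and all function names are assumptions chosen for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient(theta0, theta1, xs, ys):
    """Analytic partial derivatives of J with respect to theta0 and theta1."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d_theta0 = sum(errors) / m
    d_theta1 = sum(e * x for e, x in zip(errors, xs)) / m
    return d_theta0, d_theta1


def numerical_gradient(theta0, theta1, xs, ys, eps=1e-6):
    """Central finite-difference approximation of the same two derivatives."""
    d0 = (cost(theta0 + eps, theta1, xs, ys) - cost(theta0 - eps, theta1, xs, ys)) / (2 * eps)
    d1 = (cost(theta0, theta1 + eps, xs, ys) - cost(theta0, theta1 - eps, xs, ys)) / (2 * eps)
    return d0, d1


# Hypothetical data and an arbitrary point at which to compare the two gradients.
xs = [1.0, 2.0, 4.0]
ys = [2.0, 3.0, 7.0]
analytic = gradient(0.5, 0.5, xs, ys)
numeric = numerical_gradient(0.5, 0.5, xs, ys)
print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two results agree to several decimal places, the analytic derivative is very likely correct; a large mismatch usually points to a sign error or a missing factor.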

So the gradient descent algorithm for linear regression is:

repeat until convergence{

$\theta_0 := \theta_0-\alpha\frac{1}{m} \sum_{i=1}^m{(h_{\theta}(x^{(i)})-y^{(i)})}$

$\theta_1 := \theta_1-\alpha\frac{1}{m} \sum_{i=1}^m{(h_{\theta}(x^{(i)})-y^{(i)}) \cdot x^{(i)}}$

}
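Putting the two update rules together gives a complete, if minimal, training loop. This is a sketch in plain Python under a few assumptions: the loop runs a fixed number of iterations instead of testing for convergence, both parameters start at zero, and the data is a made-up set lying exactly on $y = 1 + 2x$ so the expected answer is known in advance.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Batch gradient descent for univariate linear regression.

    A fixed iteration count stands in for "repeat until convergence".
    """
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Errors h_theta(x^(i)) - y^(i), all computed with the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        temp0 = theta0 - alpha * sum(errors) / m
        temp1 = theta1 - alpha * sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: assign only after both new values are computed.
        theta0, theta1 = temp0, temp1
    return theta0, theta1


# Hypothetical data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = gradient_descent(xs, ys)
print(f"theta0 = {theta0:.4f}, theta1 = {theta1:.4f}")
```

On this data the learned parameters approach $\theta_0 = 1$ and $\theta_1 = 2$. The values of `alpha` and `iterations` were picked to work for this small example and would need tuning on real data.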

For linear regression the cost function always has a bowl shape; technically, it is a convex function. So instead of having many local optima, it has only one global optimum.