Week 1 Introduction
Introduction
Welcome
Machine learning is everywhere.
The aim of machine learning is to build machines that are as intelligent as people.
Two goals:
- know the algorithms & math
- implement each algorithm
Why machine learning is successful today:
- grew out of work in AI
- New capability for computers
Examples:
- Database mining
- Applications that can't be programmed by hand
handwriting recognition, NLP, CV
- Self-customizing programs
Recommendations
- understanding human learning (brain, real AI)
What is Machine Learning
- Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
- Tom Mitchell (1998). Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Machine learning algorithms:
- Supervised learning
- Unsupervised learning
Others: Reinforcement learning, recommender systems
Supervised Learning
Examples:
- Housing price prediction
  "right answers" are given
  Regression: predict a continuous valued output (price)
- Breast cancer (malignant, benign)
  Classification: discrete valued output (0 or 1)
  Two input features (Age, Tumor Size) used to predict whether a tumor is malignant or benign
Unsupervised Learning
No labels are given; the algorithm clusters the data by itself.
There is no feedback based on prediction results.
Examples:
- Clustering in 2-D data (see the sketch after this list)
  - Age / Tumor Size data
  - Clustering genes vs. individuals
- Organizing computing clusters
- Social network analysis
- Market segmentation
- Astronomical data analysis
- Cocktail party problem
  More than one person speaking at once; separate the individual voices from the microphone recordings.
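A minimal clustering sketch (not from the lecture), assuming scikit-learn is available: k-means groups unlabeled 2-D points into clusters on its own. The synthetic data and the choice of two clusters are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, unlabeled 2-D data: two blobs (made up for illustration).
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2)),
])

# No labels are provided; k-means groups the points on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(data)

print(cluster_ids[:10])           # cluster index (0 or 1) for the first 10 points
print(kmeans.cluster_centers_)    # learned cluster centers
```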
Week 1 Linear Regression with One Variable
Model and Cost function
Model Representation
Linear regression
- Features: supervised learning, regression
- Notation:
  Training set
  $m$ = number of training examples
  $x$'s = "input" variables / features
  $y$'s = "output" variable / "target" variable
  $(x, y)$: one training example
  $(x^{(i)}, y^{(i)})$: the $i$-th training example
- ML workflow: the learning algorithm takes the training set and outputs a hypothesis $h$, which maps an input $x$ to a predicted output $y$:
  $h_\theta(x) = \theta_0 + \theta_1 x$
For historical reasons, this function h is called a hypothesis. When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.
What machine learning learns is $h$, i.e., some hypothesis. This hypothesis (mathematically, the parameters $\theta$) describes the relationship between the input data and the output: given an input $X$, it tells us what result $y$ to expect.
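A minimal sketch of the hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ as a NumPy function; the parameter values and house sizes below are made up for illustration.

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x, evaluated element-wise on x."""
    return theta0 + theta1 * np.asarray(x)

# Hypothetical parameter values and house sizes, just to show the mapping x -> y.
sizes = np.array([1000.0, 1500.0, 2000.0])
print(hypothesis(50.0, 0.1, sizes))   # predicted prices: [150. 200. 250.]
```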
Cost Function
Using linear regression as an example:
$h_\theta(x) = \theta_0 + \theta_1 x$
Idea: choose $\theta_0, \theta_1$ so that $h_\theta(x)$ is close to $y$ for our training examples $(x, y)$.
We want to $\underset{\theta_0,\theta_1}{\text{minimize}}\ \frac{1}{2m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$,
so the cost function is $J(\theta_0,\theta_1) = \frac{1}{2m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$,
which is half of the average squared error; the factor $\frac{1}{2}$ makes the derivatives in gradient descent cleaner.
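A minimal NumPy sketch of this cost function; the tiny training set below is made up just to exercise the formula.

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)
    predictions = theta0 + theta1 * x      # h_theta(x) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy training set (assumed): y = 2x exactly, so theta = (0, 2) gives zero cost.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cost(0.0, 2.0, x, y))   # 0.0
print(cost(0.0, 1.5, x, y))   # larger, because the fit is worse
```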
Cost Function Intuition
- Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$ (a function of $x$)
- Parameters: $\theta_0, \theta_1$
- Cost function: $J(\theta_0,\theta_1) = \frac{1}{2m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$ (a function of $\theta$)
- Goal: $\underset{\theta_0,\theta_1}{\text{minimize}}\ J(\theta_0,\theta_1)$

Visualizing $J$: with one parameter it is a quadratic (bowl-shaped) curve; with two parameters it is shown as a contour plot.
Parameter Learning
Gradient Descent
Have some function $J(\theta)$
Want $\underset{\theta}{\text{min}}\ J(\theta)$
Outline:
- Start with some $\theta$
- Keep changing $\theta$ to reduce $J(\theta)$ until we hopefully end up at a minimum
Gradient descent algorithm:

repeat until convergence {
$\quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1) \quad (\text{for } j=0 \text{ and } j=1)$
}

$\alpha$ is the learning rate.
Correct: simultaneous update
$temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1)$
$temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1)$
$\theta_0 := temp0$
$\theta_1 := temp1$
Incorrect: not a simultaneous update
$temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1)$
$\theta_0 := temp0$
$temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1)$
$\theta_1 := temp1$
In the incorrect form, $\theta_0$ and $\theta_1$ are not updated simultaneously: $\theta_0$ has already changed before $\theta_1$ is updated, so the second partial derivative is evaluated at the wrong point (see the sketch below).
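A small Python sketch of the update semantics, using a hypothetical gradient function `grad_j` (not from the course) just to make the point: in the correct version both temporaries are computed from the old $(\theta_0, \theta_1)$ before either parameter is assigned.

```python
# Correct: simultaneous update -- both temporaries use the *old* theta0, theta1.
def gradient_step(theta0, theta1, alpha, grad_j):
    """grad_j(j, theta0, theta1) is assumed to return dJ/dtheta_j."""
    temp0 = theta0 - alpha * grad_j(0, theta0, theta1)
    temp1 = theta1 - alpha * grad_j(1, theta0, theta1)
    return temp0, temp1   # assign both only after both are computed

# Incorrect: theta0 is overwritten before theta1's gradient is evaluated,
# so the second derivative is taken at a point that was never intended.
def gradient_step_wrong(theta0, theta1, alpha, grad_j):
    theta0 = theta0 - alpha * grad_j(0, theta0, theta1)
    theta1 = theta1 - alpha * grad_j(1, theta0, theta1)   # uses the updated theta0
    return theta0, theta1
```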
Gradient Descent Intuition
Gradient descent repeatedly takes steps toward a local minimum.
If the learning rate is too small, convergence is very slow; if it is too large, gradient descent can overshoot the minimum and fail to converge, or even diverge.
Gradient Descent For Linear Regression
repeat until convergence {
$\quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1) \quad (\text{for } j=0 \text{ and } j=1)$
}
$h_\theta(x) = \theta_0 + \theta_1 x$
$J(\theta_0,\theta_1) = \frac{1}{2m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$
$\begin{aligned} \frac{\partial}{\partial\theta_j} J(\theta_0,\theta_1) &= \frac{\partial}{\partial\theta_j} \frac{1}{2m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 \\ &= \frac{\partial}{\partial\theta_j} \frac{1}{2m} \displaystyle\sum_{i=1}^m (\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2 \end{aligned}$
$\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial\theta_0} J(\theta_0,\theta_1) = \theta_0 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})$
$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial\theta_1} J(\theta_0,\theta_1) = \theta_1 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}$
The cost function $J$ for linear regression is a convex function, so it has no local optima other than the single global optimum.
“Batch” Gradient Descent
“Batch”: Each step of gradient descent uses all the training examples.
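Putting the update rules above together, a minimal NumPy sketch of batch gradient descent for one-variable linear regression; the toy data, learning rate, and iteration count are assumptions chosen only so the example runs end to end.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=2000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = theta0 + theta1 * x - y        # h_theta(x^(i)) - y^(i) for all i
        # Each step uses all m training examples ("batch"),
        # and both parameters are updated simultaneously via temporaries.
        temp0 = theta0 - alpha * np.sum(error) / m
        temp1 = theta1 - alpha * np.sum(error * x) / m
        theta0, theta1 = temp0, temp1
    return theta0, theta1

# Toy data generated from y = 1 + 2x plus a little noise (an assumption for the demo).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.shape)

print(gradient_descent(x, y))   # approximately (1.0, 2.0)
```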
Week 1 Linear Algebra Review
Linear Algebra Review
Matrices and Vectors
- Matrix
  $A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \in \mathbb{R}^{2 \times 3}$
  $A_{12} = 2$
  The dimension is written as number of rows × number of columns.
- Vector
  An $n \times 1$ matrix.
Addition and Scalar Multiplication
- Matrix addition
  $\begin{bmatrix} 1 & 0 \\ 2 & 5 \\ 3 & 1 \end{bmatrix} + \begin{bmatrix} 4 & 0.5 \\ 2 & 5 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 5 & 0.5 \\ 4 & 10 \\ 3 & 2 \end{bmatrix}$
  Only matrices of the same size can be added.
- Scalar multiplication
  $3 \times \begin{bmatrix} 1 & 0 \\ 2 & 5 \\ 3 & 1 \end{bmatrix} = \begin{bmatrix} 3 & 0 \\ 6 & 15 \\ 9 & 3 \end{bmatrix}$
  $\begin{bmatrix} 4 & 0 \\ 6 & 3 \end{bmatrix} / 4 = \frac{1}{4} \begin{bmatrix} 4 & 0 \\ 6 & 3 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ \frac{3}{2} & \frac{3}{4} \end{bmatrix}$
- Combination of operations (see the NumPy check after this list)
  $3 \times \begin{bmatrix} 1 \\ 4 \\ 2 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix} - \begin{bmatrix} 3 \\ 0 \\ 2 \end{bmatrix} / 3 = \begin{bmatrix} 2 \\ 12 \\ 10\frac{1}{3} \end{bmatrix}$
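A quick NumPy check of the element-wise operations above; a sketch that reuses the same matrices as the examples.

```python
import numpy as np

A = np.array([[1.0, 0.0], [2.0, 5.0], [3.0, 1.0]])
B = np.array([[4.0, 0.5], [2.0, 5.0], [0.0, 1.0]])

print(A + B)                                    # element-wise addition of same-size matrices
print(3 * A)                                    # scalar multiplication
print(np.array([[4.0, 0.0], [6.0, 3.0]]) / 4)   # scalar division
```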
Matrix Vector Multiplication
$\begin{bmatrix} 1 & 3 \\ 4 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 5 \end{bmatrix} = \begin{bmatrix} 16 \\ 4 \\ 7 \end{bmatrix}$
$\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} \begin{bmatrix} g \\ h \end{bmatrix} = \begin{bmatrix} a \times g + b \times h \\ c \times g + d \times h \\ e \times g + f \times h \end{bmatrix}$
Matrix Matrix Multiplication
$\begin{bmatrix} 1 & 3 & 2 \\ 4 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 3 \\ 0 & 1 \\ 5 & 2 \end{bmatrix} = \begin{bmatrix} 11 & 10 \\ 9 & 14 \end{bmatrix}$
$\begin{bmatrix} a & b & c \\ d & e & f \end{bmatrix} \begin{bmatrix} g & h \\ i & j \\ k & l \end{bmatrix} = \begin{bmatrix} a \times g + b \times i + c \times k & a \times h + b \times j + c \times l \\ d \times g + e \times i + f \times k & d \times h + e \times j + f \times l \end{bmatrix}$
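The same products in NumPy using the `@` operator, as a sketch that reproduces the examples above.

```python
import numpy as np

A = np.array([[1, 3], [4, 0], [2, 1]])
v = np.array([1, 5])
print(A @ v)        # matrix-vector product: [16  4  7]

B = np.array([[1, 3, 2], [4, 0, 1]])
C = np.array([[1, 3], [0, 1], [5, 2]])
print(B @ C)        # matrix-matrix product: [[11 10], [ 9 14]]
```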
Matrix Multiplication Properties
- Commutative: $A \times B \neq B \times A$ in general (not commutative; see the NumPy check after this list)
- Associative: $(A \times B) \times C = A \times (B \times C)$
- Identity matrix:
  $I_{n\times n} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$
  For any matrix $A$: $A_{m \times n} I_{n \times n} = I_{m \times m} A_{m \times n} = A_{m \times n}$
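A short NumPy check of these properties; the particular matrices are arbitrary choices for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
C = np.array([[2.0, 0.0], [1.0, 2.0]])
I = np.eye(2)

print(np.allclose(A @ B, B @ A))                          # False: not commutative in general
print(np.allclose((A @ B) @ C, A @ (B @ C)))              # True: associative
print(np.allclose(A @ I, A) and np.allclose(I @ A, A))    # True: identity property
```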
Matrix Inverse and Transpose
- Matrix inverse (see the NumPy sketch below)
  If $A$ is an $m \times m$ (square) matrix and it has an inverse, then
  $A A^{-1} = A^{-1} A = I$
- Matrix transpose
  $B = A^T$ means $B_{ij} = A_{ji}$: the rows of $A$ become the columns of $A^T$.
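A minimal NumPy sketch of the inverse and the transpose; the matrix is an arbitrary invertible example.

```python
import numpy as np

A = np.array([[3.0, 4.0], [2.0, 16.0]])   # an arbitrary invertible (square) matrix

A_inv = np.linalg.inv(A)                  # only square, non-singular matrices have an inverse
print(np.allclose(A @ A_inv, np.eye(2)))  # True: A A^{-1} = I
print(np.allclose(A_inv @ A, np.eye(2)))  # True: A^{-1} A = I

print(A.T)                                # transpose: rows of A become columns of A.T
```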