Week 1 Introduction
Introduction
Welcome
Machine learning is everywhere.
The aim of machine learning is to build machines that are as intelligent as people.
Two goals:
- know the algorithms & math
- implement each algorithm
Why machine learning is successful today:
- grew out of work in AI
- New capability for computers
Examples:
- Database mining
- Applications that can't be programmed by hand
handwriting recognition, NLP, CV
- Self-customizing programs
Recommendations
- understanding human learning (brain, real AI)
What is Machine Learning
- Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
- Tom Mitchell (1998). Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Machine learning algorithms:
- Supervised learning
- Unsupervised learning
Others: Reinforcement learning, recommender systems
Supervised Learning
Examples:
- Housing price prediction
  "right answers" are given
  Regression: predict a continuous valued output (price)
- Breast cancer (malignant, benign)
  Classification: discrete valued output (0 or 1)
  Two input features (Age, Tumor Size) used to predict whether a tumor is malignant or benign
Unsupervised Learning
No labels are given; the algorithm clusters the data by itself.
There is no feedback based on prediction results.
Examples:
- Clustering in 2-D data (see the sketch after this list)
  - Age / Tumor Size data
  - Clustering genes vs. individuals
- Organizing computing clusters
- Social network analysis
- Market segmentation
- Astronomical data analysis
- Cocktail party problem
  More than one person speaking at once; separate the individual voices from the microphone recordings.
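A minimal clustering sketch (not from the lecture), assuming scikit-learn is available: k-means groups unlabeled 2-D points into clusters on its own. The synthetic data and the choice of two clusters are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, unlabeled 2-D data: two blobs (made up for illustration).
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2)),
])

# No labels are provided; k-means groups the points on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(data)

print(cluster_ids[:10])           # cluster index (0 or 1) for the first 10 points
print(kmeans.cluster_centers_)    # learned cluster centers
```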
Week 1 Linear Regression with One Variable
Model and Cost function
Model Representation
Linear regression
- Features: supervised learning, regression
- Notation:
  Training set
  $m$ = number of training examples
  $x$'s = "input" variables / features
  $y$'s = "output" variable / "target" variable
  $(x, y)$: one training example
  $(x^{(i)}, y^{(i)})$: the $i$-th training example
- ML workflow: the learning algorithm takes the training set and outputs a hypothesis $h$, which maps an input $x$ to a predicted output $y$:
  $h_\theta(x) = \theta_0 + \theta_1 x$
For historical reasons, this function h is called a hypothesis. When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.
What machine learning learns is $h$, i.e., some hypothesis. This hypothesis (mathematically, the parameters $\theta$) describes the relationship between the input data and the output: given an input $X$, it tells us what result $y$ to expect.
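A minimal sketch of the hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ as a NumPy function; the parameter values and house sizes below are made up for illustration.

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x, evaluated element-wise on x."""
    return theta0 + theta1 * np.asarray(x)

# Hypothetical parameter values and house sizes, just to show the mapping x -> y.
sizes = np.array([1000.0, 1500.0, 2000.0])
print(hypothesis(50.0, 0.1, sizes))   # predicted prices: [150. 200. 250.]
```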
Cost Function
Using linear regression as an example:
$h_\theta(x) = \theta_0 + \theta_1 x$
Idea: choose $\theta_0, \theta_1$ so that $h_\theta(x)$ is close to $y$ for our training examples $(x, y)$.
We want to $\underset{\theta_0,\theta_1}{\text{minimize}}\ \frac{1}{2m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$,
so the cost function is $J(\theta_0,\theta_1) = \frac{1}{2m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$,
which is half of the average squared error; the factor $\frac{1}{2}$ makes the derivatives in gradient descent cleaner.
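A minimal NumPy sketch of this cost function; the tiny training set below is made up just to exercise the formula.

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)
    predictions = theta0 + theta1 * x      # h_theta(x) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy training set (assumed): y = 2x exactly, so theta = (0, 2) gives zero cost.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cost(0.0, 2.0, x, y))   # 0.0
print(cost(0.0, 1.5, x, y))   # larger, because the fit is worse
```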
Cost Function Intuition
- Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$ (a function of $x$)
- Parameters: $\theta_0, \theta_1$
- Cost function: $J(\theta_0,\theta_1) = \frac{1}{2m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$ (a function of $\theta$)
- Goal: $\underset{\theta_0,\theta_1}{\text{minimize}}\ J(\theta_0,\theta_1)$

Visualizing $J$: with one parameter it is a quadratic (bowl-shaped) curve; with two parameters it is shown as a contour plot.
Parameter Learning
Gradient Descent
Have some function $J(\theta)$
Want $\underset{\theta}{\text{min}}\ J(\theta)$
Outline:
- Start with some $\theta$
- Keep changing $\theta$ to reduce $J(\theta)$ until we hopefully end up at a minimum
Gradient descent algorithm:

repeat until convergence {
$\quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1) \quad (\text{for } j=0 \text{ and } j=1)$
}

$\alpha$ is the learning rate.
Correct: simultaneous update
$temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1)$
$temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1)$
$\theta_0 := temp0$
$\theta_1 := temp1$
Incorrect: not a simultaneous update
$temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1)$
$\theta_0 := temp0$
$temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1)$
$\theta_1 := temp1$
In the incorrect form, $\theta_0$ and $\theta_1$ are not updated simultaneously: $\theta_0$ has already changed before $\theta_1$ is updated, so the second partial derivative is evaluated at the wrong point (see the sketch below).
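A small Python sketch of the update semantics, using a hypothetical gradient function `grad_j` (not from the course) just to make the point: in the correct version both temporaries are computed from the old $(\theta_0, \theta_1)$ before either parameter is assigned.

```python
# Correct: simultaneous update -- both temporaries use the *old* theta0, theta1.
def gradient_step(theta0, theta1, alpha, grad_j):
    """grad_j(j, theta0, theta1) is assumed to return dJ/dtheta_j."""
    temp0 = theta0 - alpha * grad_j(0, theta0, theta1)
    temp1 = theta1 - alpha * grad_j(1, theta0, theta1)
    return temp0, temp1   # assign both only after both are computed

# Incorrect: theta0 is overwritten before theta1's gradient is evaluated,
# so the second derivative is taken at a point that was never intended.
def gradient_step_wrong(theta0, theta1, alpha, grad_j):
    theta0 = theta0 - alpha * grad_j(0, theta0, theta1)
    theta1 = theta1 - alpha * grad_j(1, theta0, theta1)   # uses the updated theta0
    return theta0, theta1
```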
Gradient Descent Intuition
Gradient descent repeatedly takes steps toward a local minimum.
If the learning rate is too small, convergence is very slow; if it is too large, gradient descent can overshoot the minimum and fail to converge, or even diverge.
Gradient Descent For Linear Regression
repeat until convergence {
$\quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1) \quad (\text{for } j=0 \text{ and } j=1)$
}
$h_\theta(x) = \theta_0 + \theta_1 x$
$J(\theta_0,\theta_1) = \frac{1}{2m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$
$\begin{aligned} \frac{\partial}{\partial\theta_j} J(\theta_0,\theta_1) &= \frac{\partial}{\partial\theta_j} \frac{1}{2m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 \\ &= \frac{\partial}{\partial\theta_j} \frac{1}{2m} \displaystyle\sum_{i=1}^m (\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2 \end{aligned}$
$\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial\theta_0} J(\theta_0,\theta_1) = \theta_0 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})$
$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial\theta_1} J(\theta_0,\theta_1) = \theta_1 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}$
The cost function $J$ for linear regression is a convex function, so it has no local optima other than the single global optimum.
“Batch” Gradient Descent
“Batch”: Each step of gradient descent uses all the training examples.
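Putting the update rules above together, a minimal NumPy sketch of batch gradient descent for one-variable linear regression; the toy data, learning rate, and iteration count are assumptions chosen only so the example runs end to end.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=2000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = theta0 + theta1 * x - y        # h_theta(x^(i)) - y^(i) for all i
        # Each step uses all m training examples ("batch"),
        # and both parameters are updated simultaneously via temporaries.
        temp0 = theta0 - alpha * np.sum(error) / m
        temp1 = theta1 - alpha * np.sum(error * x) / m
        theta0, theta1 = temp0, temp1
    return theta0, theta1

# Toy data generated from y = 1 + 2x plus a little noise (an assumption for the demo).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.shape)

print(gradient_descent(x, y))   # approximately (1.0, 2.0)
```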
Week 1 Linear Algebra Review
Linear Algebra Review
Matrices and Vectors
- Matrix
  $A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \in \mathbb{R}^{2 \times 3}$
  $A_{12} = 2$
  The dimension is written as number of rows × number of columns.
- Vector
  An $n \times 1$ matrix.
Addition and Scalar Multiplication
- Matrix addition
  $\begin{bmatrix} 1 & 0 \\ 2 & 5 \\ 3 & 1 \end{bmatrix} + \begin{bmatrix} 4 & 0.5 \\ 2 & 5 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 5 & 0.5 \\ 4 & 10 \\ 3 & 2 \end{bmatrix}$
  Only matrices of the same size can be added.
- Scalar multiplication
  $3 \times \begin{bmatrix} 1 & 0 \\ 2 & 5 \\ 3 & 1 \end{bmatrix} = \begin{bmatrix} 3 & 0 \\ 6 & 15 \\ 9 & 3 \end{bmatrix}$
  $\begin{bmatrix} 4 & 0 \\ 6 & 3 \end{bmatrix} / 4 = \frac{1}{4} \begin{bmatrix} 4 & 0 \\ 6 & 3 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ \frac{3}{2} & \frac{3}{4} \end{bmatrix}$
- Combination of operations (see the NumPy check after this list)
  $3 \times \begin{bmatrix} 1 \\ 4 \\ 2 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix} - \begin{bmatrix} 3 \\ 0 \\ 2 \end{bmatrix} / 3 = \begin{bmatrix} 2 \\ 12 \\ 10\frac{1}{3} \end{bmatrix}$
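A quick NumPy check of the element-wise operations above; a sketch that reuses the same matrices as the examples.

```python
import numpy as np

A = np.array([[1.0, 0.0], [2.0, 5.0], [3.0, 1.0]])
B = np.array([[4.0, 0.5], [2.0, 5.0], [0.0, 1.0]])

print(A + B)                                    # element-wise addition of same-size matrices
print(3 * A)                                    # scalar multiplication
print(np.array([[4.0, 0.0], [6.0, 3.0]]) / 4)   # scalar division
```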
Matrix Vector Multiplication
$\begin{bmatrix} 1 & 3 \\ 4 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 5 \end{bmatrix} = \begin{bmatrix} 16 \\ 4 \\ 7 \end{bmatrix}$
$\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} \begin{bmatrix} g \\ h \end{bmatrix} = \begin{bmatrix} a \times g + b \times h \\ c \times g + d \times h \\ e \times g + f \times h \end{bmatrix}$
Matrix Matrix Multiplication
$\begin{bmatrix} 1 & 3 & 2 \\ 4 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 3 \\ 0 & 1 \\ 5 & 2 \end{bmatrix} = \begin{bmatrix} 11 & 10 \\ 9 & 14 \end{bmatrix}$
$\begin{bmatrix} a & b & c \\ d & e & f \end{bmatrix} \begin{bmatrix} g & h \\ i & j \\ k & l \end{bmatrix} = \begin{bmatrix} a \times g + b \times i + c \times k & a \times h + b \times j + c \times l \\ d \times g + e \times i + f \times k & d \times h + e \times j + f \times l \end{bmatrix}$
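The same products in NumPy using the `@` operator, as a sketch that reproduces the examples above.

```python
import numpy as np

A = np.array([[1, 3], [4, 0], [2, 1]])
v = np.array([1, 5])
print(A @ v)        # matrix-vector product: [16  4  7]

B = np.array([[1, 3, 2], [4, 0, 1]])
C = np.array([[1, 3], [0, 1], [5, 2]])
print(B @ C)        # matrix-matrix product: [[11 10], [ 9 14]]
```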
Matrix Multiplication Properties
- Commutative: $A \times B \neq B \times A$ in general (not commutative; see the NumPy check after this list)
- Associative: $(A \times B) \times C = A \times (B \times C)$
- Identity matrix:
  $I_{n\times n} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$
  For any matrix $A$: $A_{m \times n} I_{n \times n} = I_{m \times m} A_{m \times n} = A_{m \times n}$
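A short NumPy check of these properties; the particular matrices are arbitrary choices for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
C = np.array([[2.0, 0.0], [1.0, 2.0]])
I = np.eye(2)

print(np.allclose(A @ B, B @ A))                          # False: not commutative in general
print(np.allclose((A @ B) @ C, A @ (B @ C)))              # True: associative
print(np.allclose(A @ I, A) and np.allclose(I @ A, A))    # True: identity property
```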
Matrix Inverse and Transpose
- Matrix inverse (see the NumPy sketch below)
  If $A$ is an $m \times m$ (square) matrix and it has an inverse, then
  $A A^{-1} = A^{-1} A = I$
- Matrix transpose
  $B = A^T$ means $B_{ij} = A_{ji}$: the rows of $A$ become the columns of $A^T$.
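A minimal NumPy sketch of the inverse and the transpose; the matrix is an arbitrary invertible example.

```python
import numpy as np

A = np.array([[3.0, 4.0], [2.0, 16.0]])   # an arbitrary invertible (square) matrix

A_inv = np.linalg.inv(A)                  # only square, non-singular matrices have an inverse
print(np.allclose(A @ A_inv, np.eye(2)))  # True: A A^{-1} = I
print(np.allclose(A_inv @ A, np.eye(2)))  # True: A^{-1} A = I

print(A.T)                                # transpose: rows of A become columns of A.T
```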