I recently started Andrew Ng's Stanford Machine Learning course and am taking notes along the way to review and consolidate what I learn.
My knowledge is limited, so please bear with any errors or omissions, and feel free to point them out.
Week 01
Introduction
- Applications of machine learning
- Database mining
- Applications that can't be programmed by hand
- Self-customizing programs
- Definition
- Arthur Samuel. Field of study that gives computers the ability to learn without being explicitly programmed.
- Tom Mitchell. A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T as measured by P improves with experience E.
- Common Types
- Supervised Learning
- Given the “right answer” for each example in the data.
- Regression problem: predict real-valued output.
- Classification problem: predict discrete-valued output.
- Unsupervised Learning
- Unsupervised learning allows us to approach problems with little or no idea what the effect of the variables is.
- Clustering, Non-clustering
Model and Cost Function
- Basic Model Representation
- Number of training examples - m
- Input - x
- Output - y
- Input space - X
- Output space - Y
- Hypothesis - h:X->Y
- Cost Function
- The cost function measures the accuracy of our hypothesis function, for example (see the NumPy sketch after this list):
$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2 = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i) - y_i)^2$$
- Contour plot
- A contour plot is a graph that contains many contour lines. A contour line of a two-variable function has the same constant value at every point along that line.
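To make the formula concrete, here is a minimal NumPy sketch of the hypothesis and the squared-error cost function for univariate linear regression. The function names and the toy training set are my own, invented for illustration, not from the course:

```python
import numpy as np

def h(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta0 + theta1 * x (univariate linear regression)."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = (1 / 2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(y)
    return np.sum((h(theta0, theta1, x) - y) ** 2) / (2 * m)

# Toy training set, invented for illustration: y is roughly 2 * x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])

print(cost(0.0, 2.0, x, y))  # small: this hypothesis nearly fits the data
print(cost(0.0, 0.0, x, y))  # much larger: every prediction is 0
```

Evaluating `cost` over a grid of (θ0, θ1) values is exactly what produces the contour plot described above.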
Parameter Learning
Outline
- Start with some $\theta_0, \theta_1$
- Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum.
Algorithm
repeat until convergence {
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \dots, \theta_n)$ (for j = 0, 1, ..., n)
}
simultaneous update {
$temp_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1, \dots, \theta_n)$
…
$temp_n := \theta_n - \alpha \frac{\partial}{\partial \theta_n} J(\theta_0, \theta_1, \dots, \theta_n)$
$\theta_0 := temp_0$
…
$\theta_n := temp_n$
}
- Comprehension
- It's like walking down a hill along the steepest path: the partial derivative gives us a direction to move in, and $\alpha$ (called the learning rate) controls the size of each step. A small sketch of the simultaneous update follows this list.
- As we approach the bottom of our convex function, the derivative tends toward 0, and at the bottom we have $\theta_1 := \theta_1 - \alpha \times 0$.
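As a sanity check on the update rule, here is a minimal NumPy sketch of one simultaneous gradient-descent step. The quadratic cost `J` and its gradient are hypothetical stand-ins I picked so the example is self-contained; they are not from the course:

```python
import numpy as np

def gradient_descent_step(theta, grad_J, alpha):
    """One simultaneous update: every temp_j is computed from the *old* theta
    before any theta_j is overwritten."""
    grad = grad_J(theta)         # all partial derivatives at the old theta
    temp = theta - alpha * grad  # temp_j := theta_j - alpha * dJ/dtheta_j
    return temp                  # theta_j := temp_j, for all j at once

# Hypothetical convex cost J(theta) = theta_0^2 + 2 * theta_1^2,
# whose gradient is (2 * theta_0, 4 * theta_1).
grad_J = lambda t: np.array([2.0 * t[0], 4.0 * t[1]])

theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = gradient_descent_step(theta, grad_J, alpha=0.1)
print(theta)  # approaches the minimum at (0, 0)
```

Computing the whole gradient before assigning any parameter is what makes the update "simultaneous"; updating θ0 in place and then using the new θ0 to update θ1 would be a different (incorrect) algorithm.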
Gradient descent for linear regression - Algorithm
repeat until convergence (simultaneously update) {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})$
$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)}$
}
- This method looks at every example in the entire training set on every step, and is called batch gradient descent.
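Putting the two update rules together, below is a minimal NumPy sketch of batch gradient descent for univariate linear regression. The toy data, learning rate, and iteration count are hypothetical choices of mine, not values from the course:

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.05, iterations=2000):
    """Fit h_theta(x) = theta0 + theta1 * x by batch gradient descent.

    Each iteration sums over the *entire* training set, which is what
    makes this "batch" gradient descent.
    """
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = theta0 + theta1 * x - y                  # h_theta(x^(i)) - y^(i)
        temp0 = theta0 - alpha * error.sum() / m         # dJ/dtheta0 term
        temp1 = theta1 - alpha * (error * x).sum() / m   # dJ/dtheta1 term
        theta0, theta1 = temp0, temp1                    # simultaneous update
    return theta0, theta1

# Toy data roughly following y = 2x + 1, invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])
print(batch_gradient_descent(x, y))  # close to the least-squares fit, roughly (1.0, 2.0)
```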