I am working through Stanford's Machine Learning course by Andrew Ng and taking notes as I go, to review and consolidate the material.
My knowledge is limited, so please bear with any errors or omissions and do point them out. Fellow learners are very welcome to join the discussion!
Week 02
2.1 Multivariate Linear Regression
2.1.1 Multiple Features
- The multivariable form of the hypothesis function:

$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$$

- Remark: for convenience, assume $x_0^{(i)} = 1$ for $i \in 1, \dots, m$.
- The cost function $J(\theta)$ has the same form as before:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
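As a quick sanity check, here is a minimal NumPy sketch of the vectorized hypothesis and cost function (the toy `X`, `y`, and `theta` values are my own illustration, not from the course):

```python
import numpy as np

def hypothesis(theta, X):
    """h_theta(x) = theta^T x, evaluated for every row of X.

    X is the m x (n+1) design matrix whose first column is all ones,
    following the x_0 = 1 convention in the remark above."""
    return X @ theta

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)
    residuals = hypothesis(theta, X) - y
    return residuals @ residuals / (2 * m)

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # x_0 = 1 prepended
y = np.array([5.0, 7.0, 9.0])
print(cost(np.array([1.0, 2.0]), X, y))  # 0.0: this theta fits exactly
```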
2.1.2 Gradient Descent
- Gradient descent for multivariate linear regression - Algorithm 1’
Repeat {
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

(simultaneously update $\theta_j$ for $j = 0, \dots, n$)
}
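A minimal sketch of Algorithm 1’ in NumPy, assuming the same design-matrix convention as above; the defaults for `alpha` and `num_iters` are arbitrary placeholders, not values prescribed by the course:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    The vectorized expression X^T (X theta - y) / m computes
    (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i) for every j
    at once, so all theta_j are updated simultaneously."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m
        theta -= alpha * gradient
    return theta
```

With the toy `X` and `y` from the previous sketch, `gradient_descent(X, y, alpha=0.1, num_iters=5000)` should land very close to `[1.0, 2.0]`.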
2.1.3 Practical Tricks in GD
- Feature Scaling $s_i$
  - Idea: make sure features are on a similar scale. Gradient descent converges quickly when all features share small, similar ranges; otherwise $\theta$ oscillates inefficiently on its way down to the optimum.
  - Get every feature into approximately a $-1 \le x_i \le 1$ range (the bound 1 is not strict).
  - Remark: the quizzes in this course use the range; the programming exercises use the standard deviation.
- Mean Normalization $\mu_i$
  - Replace $x_i$ with $x_i - \mu_i$ to make features have approximately zero mean (do not apply this to $x_0 = 1$).
  - In general, combining both tricks (see the feature-scaling sketch after this list):

$$x_i := \frac{x_i - \mu_i}{s_i}$$

  where $\mu_i$ is the average of all values of feature $i$ and $s_i$ is either the range of values ($\max - \min$) or the standard deviation.
- Learning Rate Check
  - To debug gradient descent, plot $J(\theta)$ against the number of iterations on the x-axis, and check that $J(\theta)$ decreases steadily toward its minimum (see the sketches after this list):
    - If $\alpha$ is too small: slow convergence.
    - If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration, and may not converge at all.
  - When judging from the plot, try values of the form $1 \times 10^k$ or $3 \times 10^k$ (e.g., $0.001, 0.003, 0.01, 0.03, \dots$).
  - It has been proven that if the learning rate $\alpha$ is sufficiently small, then $J(\theta)$ will decrease on every iteration.
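Here is a minimal sketch of the scaling rule above, using the range for $s_i$ (swap in `X.std(axis=0)` for the standard-deviation variant used in the programming exercises); `mean_normalize` and the feature values are my own illustration:

```python
import numpy as np

def mean_normalize(X):
    """Apply x_i := (x_i - mu_i) / s_i column-wise, with s_i = max - min.

    Apply only to the real features; the constant x_0 = 1 column must be
    excluded (its range is zero). Keep mu and s to rescale new inputs."""
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s

features = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0]])
scaled, mu, s = mean_normalize(features)
print(scaled.mean(axis=0))  # approximately zero in every column
```

Note that the same `mu` and `s` must be reused when scaling later inputs, e.g. at prediction time.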
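And a sketch of the learning-rate check itself: run gradient descent under several candidate $\alpha$ values and record $J(\theta)$ at every iteration (in practice you would plot each history; here only the final costs are printed, and `cost_history` plus the toy data are my own illustration):

```python
import numpy as np

def cost_history(X, y, alpha, num_iters=100):
    """Record J(theta) at every gradient-descent iteration.

    A smoothly decreasing history suggests alpha is acceptable;
    a growing or oscillating one suggests alpha is too large."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(num_iters):
        residuals = X @ theta - y
        history.append(residuals @ residuals / (2 * m))
        theta -= alpha * X.T @ residuals / m
    return history

X = np.array([[1.0, -0.5], [1.0, 0.0], [1.0, 0.5]])  # toy, already scaled
y = np.array([0.0, 1.0, 2.0])
for alpha in (0.001, 0.01, 0.1, 1.0):  # the 1x10^k / 3x10^k grid idea
    print(alpha, cost_history(X, y, alpha)[-1])
```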
2.1.4 Improvement of Linear Regression
- Feature Combination
- Combine several features into one using a variety of methods (e.g., multiply frontage and depth of a house lot into a single area feature).
- Polynomial Regression
$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1^{a_1} + \theta_2 x_2^{a_2} + \cdots + \theta_n x_n^{a_n}$$

- Remark: one important thing to keep in mind is that if you choose your features this way, feature scaling becomes very important (see the sketch below).
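For instance, here is a hypothetical sketch (helper name and values my own) that builds powers of a single raw feature and immediately scales them, since $x$, $x^2$, and $x^3$ live on wildly different ranges:

```python
import numpy as np

def polynomial_features(x, degree):
    """Build columns [x^1, x^2, ..., x^degree] from one raw feature,
    then mean-normalize each column (per the remark above, scaling
    is essential once powers of a feature are mixed together)."""
    cols = np.column_stack([x ** d for d in range(1, degree + 1)])
    mu = cols.mean(axis=0)
    s = cols.max(axis=0) - cols.min(axis=0)
    return (cols - mu) / s

x = np.array([1.0, 2.0, 3.0, 4.0])  # e.g., house size
print(polynomial_features(x, degree=3))
```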
2.2 Another Method: The Normal Equation
2.2.1 Normal Equation
$$x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1} \ (1 \le i \le m), \quad X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ (x^{(3)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \quad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

and

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$$
Then the normal equation formula is given below:

$$\theta = (X^T X)^{-1} X^T y$$

It comes from setting the gradient of the cost to zero: $\nabla_\theta J = \frac{1}{m} X^T (X\theta - y) = 0$ gives $X^T X \theta = X^T y$.
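A minimal NumPy sketch (`normal_equation` is my own helper name); solving the system $X^T X \theta = X^T y$ with `np.linalg.solve` is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form theta = (X^T X)^(-1) X^T y, via a linear solve.

    If X^T X is singular (redundant features, or more features than
    examples), np.linalg.pinv(X) @ y is a common fallback."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```

With the toy data from section 2.1.1, `normal_equation(X, y)` returns `[1.0, 2.0]` in one step, with no $\alpha$ and no iterations.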
2.2.2 Comparison of GD and NE
- Gradient Descent
  - Needs to choose the learning rate $\alpha$
  - Needs many iterations
  - $O(kn^2)$ for $k$ iterations
  - Works well even when $n$ is large
- Normal Equation
  - No need to choose $\alpha$ or to iterate
  - $O(n^3)$, dominated by computing $(X^T X)^{-1}$
  - Slow if $n$ is large
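The trade-off above can be seen on a small synthetic problem (the data below is randomly generated purely for illustration): the normal equation recovers $\theta$ in one solve, while gradient descent needs many cheap iterations to reach the same answer.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.column_stack([np.ones(m), rng.normal(size=(m, 3))])
theta_true = np.array([4.0, 3.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=m)

# Normal equation: one O(n^3) solve, no alpha, no iterations.
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: cheap per-step updates, but many of them.
theta_gd = np.zeros(4)
for _ in range(5000):
    theta_gd -= 0.1 * X.T @ (X @ theta_gd - y) / m

print(np.allclose(theta_ne, theta_gd, atol=1e-4))  # True: both recover theta_true
```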