Foreword
I originally planned to write this after finishing the Week 2 assignment, but then I found I couldn't do it... so I ended up reviewing while writing. It's been a long time since I felt this kind of homework-induced frustration, argh.
Coursera Machine Learning —— Week1
MATLAB Online
Andrew recommends two programs, Octave and MATLAB. Octave is open source and free, while MATLAB costs money, but you can use a free MATLAB Online License to complete the assignments for this course, so I chose MATLAB Online for the homework.
For how to complete and submit assignments on MATLAB Online, see 「Rose Island」's post "Coursera Machine Learning如何提交MATLAB Online作业".
Multiple Features
The multivariable form of the hypothesis function is as follows:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$
For convenience of notation, define $x_0 = 1$ so that the two vectors $\theta$ and $x$ have the same number of elements.
$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$
Notation:
- $x^{(i)}$ = input (features) of the $i^{th}$ training example. It is a vector describing the features of one training example.
- $x^{(i)}_j$ = value of feature $j$ in the $i^{th}$ training example. It is a real number.
- $x^{(i)} = \begin{bmatrix}x_0^{(i)}\\x_1^{(i)}\\x_2^{(i)}\\\vdots\\x_n^{(i)}\end{bmatrix} \in \mathbb{R}^{n+1}$, with $x^{(i)}_0 = 1$
- $\theta = \begin{bmatrix}\theta_0\\\theta_1\\\theta_2\\\vdots\\\theta_n\end{bmatrix} \in \mathbb{R}^{n+1}$
So $h_\theta(x)$ can be written in vector form as follows:
$$h_\theta(x) = \theta^T x$$
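As a quick illustration, here is a minimal MATLAB sketch of this vectorized hypothesis (the numbers and variable names are made up for the example):
theta = [1; 2; 3];    % example parameters theta_0, theta_1, theta_2
x     = [1; 5; 6];    % features of one training example, with x_0 = 1
h     = theta' * x;   % hypothesis h_theta(x) = theta' * x for this single example

X = [1 5 6;           % each row of X is (x^(i))'
     1 3 4];
h_all = X * theta;    % predictions for all m examples at once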
Gradient descent for multiple variables
- Hypothesis: $h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
- Parameters: $\theta = \begin{bmatrix}\theta_0\\\theta_1\\\theta_2\\\vdots\\\theta_n\end{bmatrix} \in \mathbb{R}^{n+1}$
- Cost Function: $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
- Gradient Descent:
$$\text{repeat}\ \{\quad \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta) \quad\}$$
It is the same as:
$$\text{repeat}\ \{\quad \theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(\theta^T x^{(i)} - y^{(i)}\right)x^{(i)}_j \quad\},\qquad x^{(i)}_0 = 1$$
Simultaneously update $\theta_j$ for every $j = 0, 1, \dots, n$ (a vectorized MATLAB sketch follows below).
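A minimal MATLAB sketch of one way to implement this update in vectorized form (the function name gradientDescentMulti mirrors the course assignment, but treat this as an illustrative sketch rather than the official solution):
function theta = gradientDescentMulti(X, y, theta, alpha, num_iters)
% X: m x (n+1) design matrix whose first column is all ones (x_0 = 1)
% y: m x 1 targets, theta: (n+1) x 1 parameters, alpha: learning rate
m = length(y);
for iter = 1:num_iters
    errors = X * theta - y;                        % m x 1 vector of h_theta(x^(i)) - y^(i)
    theta  = theta - (alpha / m) * (X' * errors);  % simultaneous update of every theta_j
end
end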
Feature Scaling & Mean Normalization
Key Point: Make sure features are on a similar scale.
We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
When the values of $x_1$ in the dataset are tens or hundreds of times larger than those of $x_2$, an equal change in $\theta_1$ and $\theta_2$ during gradient descent makes $J$ change much more sharply along the $\theta_1$ direction, so the contour plot becomes steep and elongated.
There are two methods to achieve this quick descent of $\theta$: Feature Scaling and Mean Normalization.
Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.
Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.
- Feature Scaling
$$x_i := \frac{x_i}{\max(x_i)}$$
This method gets every feature into approximately a $-1 \le x_j \le 1$ range.
- Mean Normalization
$$x_i := \frac{x_i - \mu_i}{s_i}$$
where $\mu_i$ is the average of all the values for feature $i$ and $s_i$ is the range of values (max minus min), or alternatively the standard deviation. This method makes features have approximately zero mean (do not apply it to $x_0 = 1$).

e.g. if $x_i$ represents housing prices with a range of 100 to 2000 and a mean value of 1000, then $x_i := \frac{price - 1000}{1900}$.
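A minimal MATLAB sketch of mean normalization applied column by column (the function name featureNormalize and the choice of std as $s_i$ are my own for illustration, not necessarily the assignment's exact code):
function [X_norm, mu, sigma] = featureNormalize(X)
% Normalize each column (feature) of X to roughly zero mean and comparable scale.
mu     = mean(X);             % 1 x n row vector of column means
sigma  = std(X);              % 1 x n row vector of column standard deviations
X_norm = (X - mu) ./ sigma;   % implicit expansion (MATLAB R2016b or newer); older versions need bsxfun
end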
Learning Rate
Debugging gradient descent: make a plot with the number of iterations on the x-axis and the cost function $J(\theta)$ on the y-axis. If $J(\theta)$ ever increases, then you probably need to decrease $\alpha$.
Automatic convergence test: declare convergence if $J(\theta)$ decreases by less than $\varepsilon$ in one iteration, where $\varepsilon$ is some small value such as $10^{-3}$. However, in practice it is difficult to choose this threshold value.
Summary:
- If $\alpha$ is too small: slow convergence.
- If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and thus may not converge.
To choose $\alpha$, try: $\dots, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, \dots$
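A minimal MATLAB sketch of this debugging plot, assuming X, y, theta, alpha, and num_iters are already defined as in the gradient descent sketch above:
m = length(y);                      % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta - (alpha / m) * (X' * (X * theta - y));        % one gradient descent step
    J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2); % record J(theta) after the step
end
plot(1:num_iters, J_history, '-b');
xlabel('Number of iterations');
ylabel('Cost J(\theta)');           % should decrease on every iteration if alpha is chosen well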
Polynomial Regression
We can change the curve of our hypothesis function by making it a quadratic, cubic or square root function, i.e. by creating new features such as $x_2 = x_1^2$ or $x_3 = x_1^3$.
One important thing to keep in mind: if you choose your features this way, then feature scaling becomes very important.
e.g. if $x_1$ has range 1 to 1000, then the range of $x_1^2$ becomes 1 to 1,000,000 and that of $x_1^3$ becomes 1 to 1,000,000,000.
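A minimal MATLAB sketch of constructing polynomial features from a single feature and then normalizing them (variable names such as x1 and X_poly are my own):
x1 = (1:1000)';                                     % example feature with range 1 to 1000
X_poly = [ones(length(x1), 1), x1, x1.^2, x1.^3];   % columns: x_0 = 1, x_1, x_1^2, x_1^3
% Without scaling, the column ranges differ enormously (1e3 vs 1e6 vs 1e9),
% so apply mean normalization to every column except x_0:
for jj = 2:size(X_poly, 2)
    col = X_poly(:, jj);
    X_poly(:, jj) = (col - mean(col)) / (max(col) - min(col));
end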
Normal Equation
It is a method to solve for $\theta$ analytically, and feature scaling doesn't matter here.
$$\theta = (X^T X)^{-1} X^T y$$
For the derivation, see 「zoe9698」's post "正规方程(标准方程)法—笔记".
The code in Matlab:
theta = inv(X'*X)*X'*y
Notations:
- $m$ examples: $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$
- $n$ features
- $x^{(i)} = \begin{bmatrix}x_0^{(i)}\\x_1^{(i)}\\x_2^{(i)}\\\vdots\\x_n^{(i)}\end{bmatrix} \in \mathbb{R}^{n+1}$
- $X = \begin{bmatrix}(x^{(1)})^T\\(x^{(2)})^T\\\vdots\\(x^{(m)})^T\end{bmatrix}$, the $m \times (n+1)$ matrix whose rows are the transposed training examples
- $y = \begin{bmatrix}y^{(1)}\\y^{(2)}\\\vdots\\y^{(m)}\end{bmatrix}$

e.g. If $x^{(i)} = \begin{bmatrix}1\\x_1^{(i)}\end{bmatrix}$, then $X = \begin{bmatrix}1 & x_1^{(1)}\\1 & x_1^{(2)}\\\vdots & \vdots\\1 & x_1^{(m)}\end{bmatrix}$.
Comparison of gradient descent and normal equation:

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| Works well even when $n$ is large | Slow if $n$ is large, because of computing $(X^TX)^{-1}$ |
| Complexity $O(kn^2)$ | Complexity $O(n^3)$ |
The normal equation is only applicable when $X^TX$ is invertible. If $X^TX$ is non-invertible (singular), the common causes are:
- Redundant features, where two features are very closely related (i.e. they are linearly dependent).
- Too many features (e.g. $m \le n$). In this case, delete some features or use "regularization".
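The lectures suggest using pinv (pseudo-inverse) rather than inv, since it still produces a usable $\theta$ even when $X^TX$ is singular. A minimal MATLAB sketch:
theta = pinv(X' * X) * X' * y;   % normal equation via the pseudo-inverse; works even if X'*X is singular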
Basic Codes in Matlab
v = 1:0.1:3;           % row vector from 1 to 3 in steps of 0.1
a = ones(2,3);         % 2x3 matrix of ones
b = zeros(2,3);        % 2x3 matrix of zeros
c = eye(4);            % 4x4 identity matrix
d = rand(1,3);         % 3 random numbers between 0 and 1
e = size(a,2);         % size of dimension 2 (number of columns) of a
f = length(a);         % size of the largest dimension of a (here 3)
g = a(2,:);            % second row of a; ':' means every element along that dimension
h = a + ones(size(a)); % add 1 to every element of a (same as a + 1)
[val,ind] = max(a);    % column-wise maximum values and the row indices where they occur
[r,c] = find(a < 3);   % row and column indices of all elements less than 3
i = magic(3);          % magic square: all rows, columns and diagonals sum to the same value
j = sum(a);            % column-wise sum
k = prod(a);           % column-wise product
l = floor(a);          % rounds down
m = ceil(a);           % rounds up
n = round(a);          % rounds to the nearest integer
o = flipud(eye(9));    % flips the identity matrix upside down, giving a permutation matrix
Conclusion
After writing all of this up, I found I still couldn't quite do the assignment... oh well...
But in the end I stumbled my way through and finally finished it!