Coursera Machine Learning —— Week 2

Foreword

I originally planned to write this post after finishing the Week 2 assignment, but it turned out I couldn't do it... so I ended up reviewing while writing. It has been a long time since I felt this frustration of being at the mercy of homework. Aaaah.

Previous notes: Coursera Machine Learning —— Week1

MATLAB Online

Andrew recommends two tools, Octave and MATLAB. Octave is open source and free, while MATLAB is paid, but you can use the free MATLAB Online Licenses to complete this course's assignments, so I chose MATLAB Online for the homework.

For how to complete and submit assignments on MATLAB Online, see 「Rose Island」's post Coursera Machine Learning如何提交MATLAB Online作业.

Multiple Features

The multivariable form of the hypothesis function is as follows:
$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
For convenience of notation, define $x_0 = 1$ so that the two vectors $\theta$ and $x$ have the same number of elements:
$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
Notation:

  • $x^{(i)}$ = input (features) of the $i^{th}$ training example.

    It is a vector describing the features of one training example.

  • $x^{(i)}_j$ = value of feature $j$ in the $i^{th}$ training example.

    It is a real number.

  • $x^{(i)} = \begin{bmatrix}x_0^{(i)}\\x_1^{(i)}\\x_2^{(i)}\\\vdots\\x_n^{(i)}\end{bmatrix} \in \mathbb{R}^{n+1}$, with $x^{(i)}_0 = 1$.

  • $\theta = \begin{bmatrix}\theta_0\\\theta_1\\\theta_2\\\vdots\\\theta_n\end{bmatrix} \in \mathbb{R}^{n+1}$

So $h_\theta(x)$ can be written in vector form as:
$h_\theta(x) = \theta^T x$
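
In MATLAB, stacking the training examples as rows of a matrix X (each row including the $x_0 = 1$ entry) lets you evaluate the hypothesis for every example with one matrix product. A minimal sketch with made-up numbers:

X = [1 2104 5;
     1 1416 3;
     1 1534 3];          % each row is (x^(i))', with the leading x0 = 1
theta = [89; 0.1; -8];   % hypothetical (n+1) x 1 parameter vector
h = X * theta;           % m x 1 vector; h(i) equals theta' * x^(i)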

Gradient descent for multiple variables

  • Hypothesis:
    $h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

  • Parameters:
    $\theta = \begin{bmatrix}\theta_0\\\theta_1\\\theta_2\\\vdots\\\theta_n\end{bmatrix} \in \mathbb{R}^{n+1}$

  • Cost Function:
    $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$

  • Gradient Descent:
    repeat { $\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$ }
    which is the same as:
    repeat { $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\big(\theta^T x^{(i)} - y^{(i)}\big)x^{(i)}_j$ }, with $x^{(i)}_0 = 1$.
    Simultaneously update $\theta_j$ for every $j = 0, 1, \dots, n$. A vectorized MATLAB sketch of this update follows the list.
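
A minimal vectorized MATLAB sketch of the update rule above; the small dataset is made up purely so the snippet runs end to end:

X = [1 1; 1 2; 1 3; 1 4];        % m x (n+1) design matrix, first column is x0 = 1
y = [2; 4; 6; 8];                % m x 1 targets
theta = zeros(size(X, 2), 1);    % (n+1) x 1 parameters, initialized to zero
alpha = 0.1;                     % learning rate
num_iters = 1500;
m = length(y);
for iter = 1:num_iters
    h = X * theta;                                  % predictions for all m examples at once
    theta = theta - (alpha/m) * (X' * (h - y));     % simultaneous update of every theta_j
end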

Feature Scaling & Mean Normalization

Key Point: Make sure features are on a similar scale.

We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

When the values of $x_1$ in the dataset are tens or hundreds of times larger than those of $x_2$, then during gradient descent, for equal-sized changes in $\theta_1$ and $\theta_2$, $J$ changes much more sharply in the $\theta_1$ direction, and the contour plot is correspondingly steeper in that direction.

[Figure: tz_feature_scaling — contour plots of J(θ) with and without feature scaling]

There are two methods that help gradient descent converge quickly: Feature Scaling and Mean Normalization.

Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.

Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.

  • Feature Scaling
    $x_i := \frac{x_i}{\max(x_i)}$
    This method gets every feature into approximately a $-1 \le x_j \le 1$ range.

    e.g. As shown in the picture above.

  • Mean Normalization
    $x_i := \frac{x_i - \mu_i}{s_i}$
    where $\mu_i$ is the average of all the values for feature $i$ and $s_i$ is the range of values (max - min), or $s_i$ is the standard deviation.

    This method makes features have approximately zero mean (do not apply it to $x_0 = 1$); a MATLAB sketch follows this list.

    e.g. If $x_i$ represents housing prices with a range of 100 to 2000 and a mean value of 1000, then $x_i := \frac{\text{price} - 1000}{1900}$.
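
A minimal MATLAB sketch of mean normalization; the function name featureNormalize mirrors the assignment's file, but this body is my own outline, not the official solution:

function [X_norm, mu, sigma] = featureNormalize(X)
% Subtract each feature's mean and divide by its standard deviation
% (the range max(X) - min(X) could be used instead of std for s_i).
mu = mean(X);                  % 1 x n row of feature means
sigma = std(X);                % 1 x n row of feature standard deviations
X_norm = (X - mu) ./ sigma;    % implicit expansion (R2016b+) applies this column-wise
end

The same mu and sigma must be applied to any new example before predicting with the learned $\theta$.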

Learning Rate

Debugging gradient descent. Make a plot with the number of iterations on the x-axis. Now plot the cost function $J(\theta)$ over the number of iterations of gradient descent. If $J(\theta)$ ever increases, then you probably need to decrease $\alpha$.

Automatic convergence test. Declare convergence if $J(\theta)$ decreases by less than $E$ in one iteration, where $E$ is some small value such as $10^{-3}$. However, in practice it is difficult to choose this threshold value.

Summary:

  • If $\alpha$ is too small: slow convergence.
  • If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and thus may not converge.

To choose $\alpha$, try: $\dots, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, \dots$
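
One practical way to compare these candidates is to run a fixed number of iterations for each one and plot the recorded cost curves together; a good $\alpha$ gives a curve that drops steadily without blowing up. A self-contained MATLAB sketch with tiny made-up, already-scaled data:

X = [1 -1.0; 1 -0.5; 1 0.5; 1 1.0];   % small, already-scaled design matrix
y = [1; 2; 4; 5];
alphas = [0.001 0.003 0.01 0.03 0.1 0.3 1];
num_iters = 50;  m = length(y);
figure; hold on;
for k = 1:length(alphas)
    theta = zeros(size(X, 2), 1);
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        theta = theta - (alphas(k)/m) * (X' * (X*theta - y));
        J_history(iter) = (1/(2*m)) * sum((X*theta - y).^2);   % record J(theta)
    end
    plot(1:num_iters, J_history);      % should decrease on every iteration if alpha is OK
end
xlabel('Number of iterations'); ylabel('J(\theta)');
legend(strsplit(num2str(alphas))); hold off;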

Polynomial Regression

We can make our hypothesis function a quadratic, cubic or square root function.

One important thing to keep in mind is, if you choose your features this way then feature scaling becomes very important.

e.g. If $x_1$ has range 1 to 1000, then the range of $x_1^2$ becomes 1 to 1,000,000 and that of $x_1^3$ becomes 1 to 1,000,000,000.
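
For instance, a cubic hypothesis in a single feature can be built by adding power columns and then scaling them. A sketch with a made-up column vector x1:

x1 = (1:5)';                                        % hypothetical m x 1 feature
X_poly = [ones(length(x1),1), x1, x1.^2, x1.^3];    % columns: x0, x1, x1^2, x1^3
% The higher powers have much larger ranges, so normalize the non-constant columns:
X_poly(:,2:end) = (X_poly(:,2:end) - mean(X_poly(:,2:end))) ./ std(X_poly(:,2:end));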

Normal Equation

It is a method to solve for $\theta$ analytically, and feature scaling does not matter here.
$\theta = (X^TX)^{-1}X^Ty$
For the derivation, see 「zoe9698」正规方程(标准方程)法—笔记.
The code in Matlab:

theta = inv(X'*X)*X'*y;    % pinv(X'*X)*X'*y is more robust when X'*X is close to singular

Notations:

  • $m$ examples: $(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})$

  • $n$ features

  • $x^{(i)} = \begin{bmatrix}x_0^{(i)}\\x_1^{(i)}\\x_2^{(i)}\\\vdots\\x_n^{(i)}\end{bmatrix} \in \mathbb{R}^{n+1}$

  • $X = \begin{bmatrix}(x^{(1)})^T\\(x^{(2)})^T\\\vdots\\(x^{(m)})^T\end{bmatrix}$ (the $m \times (n+1)$ design matrix)

  • $y = \begin{bmatrix}y^{(1)}\\y^{(2)}\\\vdots\\y^{(m)}\end{bmatrix}$

e.g. If $x^{(i)} = \begin{bmatrix}1\\x_1^{(i)}\end{bmatrix}$, then $X = \begin{bmatrix}1 & x_1^{(1)}\\1 & x_1^{(2)}\\\vdots & \vdots\\1 & x_1^{(m)}\end{bmatrix}$.
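
Putting the pieces together for this single-feature case, a sketch with made-up numbers (I use pinv here as the safer choice; the formula is the same as above):

x1 = [2104; 1416; 1534; 852];      % hypothetical feature values
y  = [460; 232; 315; 178];         % hypothetical targets
X  = [ones(length(x1), 1), x1];    % m x 2 design matrix with the x0 = 1 column
theta = pinv(X' * X) * X' * y;     % normal equation: theta(1) intercept, theta(2) slope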

Comparison of gradient descent and the normal equation:

Gradient Descent                   | Normal Equation
Need to choose $\alpha$            | No need to choose $\alpha$
Needs many iterations              | No need to iterate
Works well even when $n$ is large  | Slow if $n$ is large because of $(X^TX)^{-1}$
Complexity $O(kn^2)$               | Complexity $O(n^3)$

The normal equation is usable only when $X^TX$ is invertible. When $X^TX$ is non-invertible, the common causes are:

  • Redundant features, where two features are very closely related (i.e. they are linearly dependent); a short sketch of this case follows the list.
  • Too many features (e.g. $m \le n$). In this case, delete some features or use "regularization".
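
A contrived sketch of the first cause: duplicate a feature so the columns of X are linearly dependent, and $X^TX$ becomes singular; MATLAB's pinv still returns a usable solution where inv would not:

X = [ones(4,1), (1:4)', 2*(1:4)'];   % third column is exactly twice the second: redundant feature
y = [3; 5; 7; 9];
rank(X' * X)                         % prints 2 (< 3): X'*X is singular, so inv(X'*X) is unreliable
theta = pinv(X' * X) * X' * y;       % pinv handles the singular case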

Basic MATLAB Code

v = 1:0.1:3;          % row vector from 1 to 3 in steps of 0.1
a = ones(2,3);        % 2x3 matrix of ones
b = zeros(2,3);       % 2x3 matrix of zeros
c = eye(4);           % 4x4 identity matrix
d = rand(1,3);        % 3 random numbers between 0 and 1
e = size(a,2);        % number of columns of a (size along dimension 2)
f = length(a);        % length of the largest dimension of a (here 3, the number of columns)
g = a(2,:);           % every element of the 2nd row; a(:,2) would give the 2nd column
h = a + ones(size(a));     % element-wise addition of 1 (same as a + 1)
[val,ind] = max(a);   % column-wise maxima and their row indices (use max(a(:)) for the overall max)
[r,c] = find(a < 3);  % row and column indices of elements smaller than 3
i = magic(3);         % 3x3 magic square: all rows, columns and diagonals sum to the same value
j = sum(a);           % column-wise sums
k = prod(a);          % column-wise products
l = floor(a);         % rounds down
m = ceil(a);          % rounds up
n = round(a);         % rounds to the nearest integer
o = flipud(eye(9));   % anti-diagonal permutation matrix (identity flipped upside down)

Conclusion

After writing everything up, I found I still wasn't quite able to do the assignment... oh well...

I finally stumbled my way through the assignment!
[Figure: tz_homework]
