I am working through Stanford's Machine Learning course taught by Andrew Ng and keep notes as I go, so that I can review and consolidate what I learn.
My knowledge is limited, so if you find any errors or omissions, or have suggestions, please bear with me and point them out.
Week 03
3.1 Classification Problem
3.1.1 Hypothesis representation
The hypothesis function for a classification problem takes the form
$$h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}$$
The function $g$ is called the “Logistic Function” or “Sigmoid Function”.
An intuitive view of the logistic function : $y = \frac{1}{1 + e^{-x}}$
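As a quick illustration (my own sketch, not part of the course materials), the curve can be plotted in Octave/Matlab as follows; the variable names are arbitrary :
z = linspace(-10, 10, 200);      % sample the input range
g = 1 ./ (1 + exp(-z));          % element-wise sigmoid
plot(z, g);                      % S-shaped curve rising from 0 to 1
xlabel('z'); ylabel('g(z)');
The output always lies strictly between 0 and 1, which is what lets us interpret $h_\theta(x)$ as a probability.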
3.1.2 Decision boundary
To get a discrete 0/1 classification, we threshold the output of the hypothesis at $b = 0.5$ :
$$h_\theta(x) \ge b \;\Rightarrow\; y = 1, \qquad h_\theta(x) < b \;\Rightarrow\; y = 0$$
Since $g(z) \ge 0.5$ exactly when $z \ge 0$, this is equivalent to thresholding the input of the sigmoid at $a = 0$ :
$$\theta^T x \ge a \;\Rightarrow\; y = 1, \qquad \theta^T x < a \;\Rightarrow\; y = 0$$
Once the parameter vector $\theta$ has been selected, the set of points satisfying $\theta^T x = a$ is the decision boundary.
For a better understanding, see the following examples:
- Case 1 : Linear boundary
We have a dataset below; the hypothesis function is $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ with $b = 0.5$, $a = 0$, and assume we obtain the parameters $\theta_0 = -3$, $\theta_1 = 1$, and $\theta_2 = 1$.
Then $-3 + x_1 + x_2 = 0$, i.e. $x_1 + x_2 = 3$, is the decision boundary (the region where $-3 + x_1 + x_2 \ge 0$ is classified as $y = 1$), which is shown in the image below (pink straight line); a small numerical check appears after these examples.
- Case 2 : Non-linear boundary
We have a dataset below; the hypothesis function is $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$ with $b = 0.5$, $a = 0$, and assume we obtain the parameters $\theta_0 = -1$, $\theta_1 = 0$, $\theta_2 = 0$, $\theta_3 = 1$, $\theta_4 = 1$.
Then $x_1^2 + x_2^2 = 1$ is the decision boundary (the region where $x_1^2 + x_2^2 \ge 1$ is classified as $y = 1$), which is shown in the image below (the pink circle).
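To make Case 1 concrete, here is a small Octave/Matlab sketch (my own illustration, not from the course) that evaluates the hypothesis at a few points and applies the $b = 0.5$ threshold :
theta = [-3; 1; 1];                   % parameters from Case 1
sigmoid = @(z) 1 ./ (1 + exp(-z));    % logistic function
X = [1 1 1; 1 2 2; 1 4 4];            % three examples, first column is the bias x0 = 1
h = sigmoid(X * theta);               % hypothesis values
pred = (h >= 0.5);                    % predictions: 0, 1, 1 (only (1,1) lies below the line x1 + x2 = 3)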
3.1.3 Cost function
For logistic regression, the cost of a single training example is
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
For an intuitive understanding, see the following graphs (function prototypes) of the cases $y = 1$ and $y = 0$ :
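The key behaviour these curves show, restated in equation form: the cost is zero when the prediction agrees with the label and grows without bound as the prediction becomes confidently wrong.
$$y = 1:\quad \mathrm{Cost} = 0 \text{ if } h_\theta(x) = 1, \qquad \mathrm{Cost} \to \infty \text{ as } h_\theta(x) \to 0$$
$$y = 0:\quad \mathrm{Cost} = 0 \text{ if } h_\theta(x) = 0, \qquad \mathrm{Cost} \to \infty \text{ as } h_\theta(x) \to 1$$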
And the overall cost function is
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y^{(i)}\log\big(h_\theta(x^{(i)})\big) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\Big]$$
3.1.4 Gradient Descent
- Gradient descent for logistic regression - Algorithm 2
Repeat {
$$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta) = \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)}$$
(simultaneously update all $\theta_j$)
}
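A minimal vectorized Octave/Matlab sketch of this loop (my own illustration; X, y, theta, alpha and num_iters are assumed to be defined by the caller) :
sigmoid = @(z) 1 ./ (1 + exp(-z));
m = length(y);                        % number of training examples
for iter = 1:num_iters
    h = sigmoid(X * theta);           % m x 1 vector of predictions
    grad = (1 / m) * X' * (h - y);    % all partial derivatives at once
    theta = theta - alpha * grad;     % simultaneous update of every theta_j
end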
3.1.5 Some other methods : Advanced Optimization
- Conjugate gradient
- BFGS
- L-BFGS
They are more sophisticated, faster ways to optimize $\theta$.
Usage in Octave/Matlab :
- (1) Provide a function that evaluates $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$ for a given input $\theta$.
For example :
function [jVal, gradient] = costFunction(theta)
  jVal = [... code to compute J(theta) ...];
  gradient = [... code to compute derivative of J(theta) ...];
end
- (2) Use the optimization function “fminunc()” together with “optimset()” to set its options.
For example :
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
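As a concrete toy example (in the style of the lecture illustration; treat the exact code as my own sketch), a costFunction minimizing $J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2$ could be written as :
function [jVal, gradient] = costFunction(theta)
  % Toy cost J(theta) = (theta(1) - 5)^2 + (theta(2) - 5)^2, minimized at theta = [5; 5]
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end
With this costFunction, the fminunc call above should return optTheta close to [5; 5] and functionVal close to 0.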
3.2 Multiclass Classification
3.2.1 Multiclass classification problem
A classification problem with more than two categories is called a multiclass classification problem.
3.2.2 One-vs-all method
Since $y \in \{0, 1, \cdots, n\}$, we can divide the problem into $n + 1$ binary classification problems.
For example, when $n = 2$, we get three binary classification problems :
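In the one-vs-all method, each binary classifier $h_\theta^{(i)}(x)$ estimates the probability that $y = i$, and a new example is assigned to the class whose classifier is most confident. A minimal Octave/Matlab prediction sketch (my own illustration; all_theta is assumed to hold one row of parameters per class, and X includes the bias column) :
sigmoid = @(z) 1 ./ (1 + exp(-z));
probs = sigmoid(X * all_theta');      % m x num_classes matrix of probabilities
[~, pred] = max(probs, [], 2);        % pick the most confident class for each example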
3.3 Problems and solutions in fitting
3.3.1 Underfitting and overfitting
Underfitting : also called high bias, occurs when the form of the hypothesis function maps poorly to the trend of the data.
It is usually caused by a function that is too simple or uses too few features.
Overfitting : also called high variance, is caused by a hypothesis function that fits the available data well but does not generalize to new data.
It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
For example : (left: underfitting, middle: a good fit, right: overfitting)
3.3.2 Solution
The solution to underfitting :
(1) Use more features.
(2) Increase the number of training iterations.
The solution to overfitting :
(1) Reduce the number of features
- Manually select which features to keep.
- Use a model selection algorithm.
(2) Regularization
- Keep all the features, but reduce the magnitude of the parameters $\theta_j$.
3.3.3 Details of Regularization
Key point :
Regularization works well when we have a lot of slightly useful features.
We can add a suitable penalty term to the cost function for the parameters that need to be small.
A classical form adds a penalty on the squared parameters to the cost :
$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$
Here $\lambda$ is the regularization parameter. With this formulation, we can smooth the output of the hypothesis function and reduce overfitting.
3.3.4 Application
Regularized Linear Regression :
(1) Gradient Descent
The cost function of multivariate linear regression is
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$$
And the regularized cost function is
$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$
So we get the regularized algorithm of gradient descent :
Repeat {
$$\theta_0 := \theta_0 - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_0^{(i)}$$
$$\theta_j := \theta_j\left(1 - \frac{\alpha\lambda}{m}\right) - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)}, \qquad j = 1, 2, \ldots, n$$
}
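One way to read the $\theta_j$ update (restating the observation from the lecture): for typical values of $\alpha$, $\lambda$ and $m$ the first factor satisfies
$$0 < 1 - \frac{\alpha\lambda}{m} < 1,$$
so every iteration shrinks $\theta_j$ slightly before performing the usual (unregularized) gradient step.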
(2) Normal Equation
The regularized form of the normal equation is :
$$\theta = \big(X^T X + \lambda \cdot L\big)^{-1} X^T y, \qquad L = \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$$
where $L$ is an $(n+1) \times (n+1)$ matrix. It can be proved that $(X^T X + \lambda \cdot L)$ is invertible when $\lambda > 0$.
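A minimal Octave/Matlab sketch of this formula (my own illustration; X, y, lambda and the number of features n are assumed to be given) :
L = eye(n + 1);                              % (n+1) x (n+1) identity ...
L(1, 1) = 0;                                 % ... with the bias term left unregularized
theta = (X' * X + lambda * L) \ (X' * y);    % solve the regularized normal equation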
Regularized Logistic Regression :
The cost function of logistic regression is
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y^{(i)}\log\big(h_\theta(x^{(i)})\big) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\Big]$$
And the regularized cost function is
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y^{(i)}\log\big(h_\theta(x^{(i)})\big) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
So we get the regularized algorithm of gradient descent :
Repeat {
$$\theta_0 := \theta_0 - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_0^{(i)}$$
$$\theta_j := \theta_j\left(1 - \frac{\alpha\lambda}{m}\right) - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)}, \qquad j = 1, 2, \ldots, n$$
}
Similarly, when using the advanced optimization algorithms, provide a function that calculates the regularized cost $J(\theta)$ and its partial derivatives $\frac{\partial}{\partial\theta_j}J(\theta)$.
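For example, such a cost function for regularized logistic regression might look like the following sketch (my own illustration; the name costFunctionReg is mine, and X, y and lambda are assumed to be available) :
function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
  % Regularized logistic regression cost and gradient
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                       % sigmoid of X * theta
  reg = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);    % theta_0 is not regularized
  jVal = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h)) + reg;
  gradient = (1 / m) * (X' * (h - y));
  gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end);
end
It can then be minimized with, for example, fminunc(@(t) costFunctionReg(t, X, y, lambda), initialTheta, options).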