classification problems
—— where the variable y that you want to predict takes only discrete values.
negative class ("-", y = 0) vs. positive class ("+", y = 1)
y^(i) -> the label for the i-th training example
way1:
linear regression
threshold the classifier outputs at 0.5
if hθ(x) >= 0.5 -> predict y = 1
else -> predict y = 0
at first this seems to make sense.
however, a single training example way out to the right adds no new information, yet it pulls the fitted line from the magenta line over to the blue line (in the lecture plot) and makes the 0.5 threshold give a worse hypothesis. so applying linear regression to a classification problem is usually a bad idea.
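A minimal sketch of this failure mode (assuming NumPy; the data values and the helper name fit_line are made up for illustration): fit a least-squares line, threshold at 0.5, then add one far-out positive example and watch the threshold shift.

```python
import numpy as np

def fit_line(x, y):
    # Ordinary least squares for h(x) = theta0 + theta1 * x.
    X = np.column_stack([np.ones_like(x), x])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# Six training examples, labels 0 (negative class) / 1 (positive class).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

theta = fit_line(x, y)
print((0.5 - theta[0]) / theta[1])    # x where h(x) = 0.5: ~3.5, separates the classes

# One extreme positive example far to the right adds no new information,
# but it flattens the fitted line and pushes the 0.5 threshold past x = 4,
# so that positive example is now misclassified.
theta2 = fit_line(np.append(x, 30.0), np.append(y, 1.0))
print((0.5 - theta2[0]) / theta2[1])  # ~4.5
```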
way2:
logistic regression
sigmoid function / logistic function: g(z) = 1/(1 + e^(−z))
it looks like an S-shaped curve that rises from 0 to 1 and crosses 0.5 at z = 0
eventually, the hypothesis is hθ(x) = g(θ^T x), so its output always lies between 0 and 1
hθ(x) = the estimated probability that y = 1 on input x
namely, hθ(x)=P(y=1|x;θ)
at the same time, P(y=0|x;θ) = 1 − P(y=1|x;θ)
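A minimal sketch of the hypothesis (assuming NumPy; the θ and x values and the helper names sigmoid / hypothesis are illustrative):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x): estimated probability that y = 1 for input x.
    return sigmoid(theta @ x)

theta = np.array([-3.0, 1.0])   # illustrative parameters
x = np.array([1.0, 4.5])        # x[0] = 1 is the intercept term
p_y1 = hypothesis(theta, x)     # P(y = 1 | x; theta) ~ 0.82
p_y0 = 1.0 - p_y1               # P(y = 0 | x; theta)
print(p_y1, p_y0)
```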
decision boundary
since g(z) ≥ 0.5 exactly when z ≥ 0, obviously:
θ^T x ≥ 0 ⇒ y = 1
θ^T x < 0 ⇒ y = 0
then the line θ^T x = 0 divides the input space into two areas
The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function (not training examples).
Again, the input to the sigmoid function g(z) doesn’t need to be linear, and could be a function that describes a circle or any shape to fit our data.
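A small sketch of the decision rule (assuming NumPy; θ and the feature values are illustrative): predict y = 1 exactly when θ^T x ≥ 0.

```python
import numpy as np

def predict(theta, X):
    # Predict y = 1 exactly when theta^T x >= 0 (i.e. h_theta(x) >= 0.5).
    return (X @ theta >= 0).astype(int)

# Linear boundary: theta = [-3, 1, 1] gives the line x1 + x2 = 3.
theta = np.array([-3.0, 1.0, 1.0])
X = np.array([[1.0, 1.0, 1.0],    # x1 + x2 = 2 < 3  -> predict 0
              [1.0, 3.0, 2.0]])   # x1 + x2 = 5 >= 3 -> predict 1
print(predict(theta, X))          # [0 1]

# Non-linear boundary: with features [1, x1^2, x2^2] and theta = [-1, 1, 1],
# theta^T x >= 0 describes the outside of the circle x1^2 + x2^2 = 1.
```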
cost function
cost function in linear regression:
J(θ) = (1/m) ∑_{i=1}^m Cost(hθ(x^(i)), y^(i))
Cost(hθ(x), y) = (1/2)(hθ(x) − y)^2
for logistic regression, however, this squared-error cost cannot be used: hθ(x) is the nonlinear sigmoid, so J(θ) would be non-convex, with many local optima.
Instead, our cost function for logistic regression looks like
(writing the cost function in this way guarantees that J(θ) is convex for logistic regression):
J(θ) = (1/m) ∑_{i=1}^m Cost(hθ(x^(i)), y^(i))
Cost(hθ(x), y) = −log(hθ(x))        if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))    if y = 0
if we compress the two cases into one expression, then
Cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x))
and the full cost function becomes
J(θ) = −(1/m) ∑_{i=1}^m [ y^(i)·log(hθ(x^(i))) + (1 − y^(i))·log(1 − hθ(x^(i))) ]
A vectorized implementation is:
h = g(Xθ)
J(θ) = (1/m)·(−y^T·log(h) − (1 − y)^T·log(1 − h))
When y = 1, the plot of the cost vs hθ(x) shows a cost of 0 at hθ(x) = 1 that grows to ∞ as hθ(x) → 0, so a confident wrong prediction is penalized heavily.
Similarly, when y = 0, the cost is 0 at hθ(x) = 0 and grows to ∞ as hθ(x) → 1.
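A vectorized sketch of this cost (assuming NumPy; the tiny dataset and the helper names sigmoid / cost are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = (1/m) * (-y^T log(h) - (1 - y)^T log(1 - h)), h = g(X theta)
    m = y.size
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m

# Tiny illustrative dataset; the first column of X is the intercept term.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(cost(np.zeros(2), X, y))    # at theta = 0, h = 0.5 everywhere, so J = log(2) ~ 0.693
```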
Gradient descent
want min_θ J(θ):
repeat{
θ_j := θ_j − (α/m) ∑_{i=1}^m (hθ(x^(i)) − y^(i)) x_j^(i)
(simultaneously update all θ_j)
}
Notice that this algorithm looks identical to the one we used in linear regression; only the definition of hθ(x) has changed (it is now the sigmoid of θ^T x).
A vectorized implementation is:
θ := θ − (α/m)·X^T·(g(Xθ) − y)
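A sketch of the vectorized update loop (assuming NumPy; the dataset, learning rate, and iteration count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        # theta := theta - (alpha/m) * X^T (g(X theta) - y),
        # which updates every theta_j simultaneously.
        theta = theta - (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(gradient_descent(X, y))     # learned theta for this toy dataset
```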
multiclass classification
we have y ∈ {0, 1, …, n}.
using one-vs-all, we divide the problem into n + 1 binary classification problems, one per class: in each, predict the probability that y belongs to that class, then pick the class whose classifier gives the highest probability (see the sketch below).
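A sketch of one-vs-all built on the vectorized gradient step above (assuming NumPy; the helper names fit_binary, one_vs_all, and predict are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, alpha=0.1, iterations=1000):
    # Vectorized gradient descent for one "class c vs. the rest" problem.
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        theta = theta - (alpha / X.shape[0]) * (X.T @ (sigmoid(X @ theta) - y))
    return theta

def one_vs_all(X, y, num_classes):
    # One parameter vector per class, stacked as rows of all_theta.
    return np.array([fit_binary(X, (y == c).astype(float))
                     for c in range(num_classes)])

def predict(all_theta, X):
    # h_theta(x) for every class, then pick the most probable class per example.
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)

# Toy data: first column of X is the intercept term, labels are 0, 1, or 2.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 5.0],
              [1.0, 6.0], [1.0, 9.0], [1.0, 10.0]])
y = np.array([0, 0, 1, 1, 2, 2])
all_theta = one_vs_all(X, y, num_classes=3)
print(predict(all_theta, X))      # predicted class for each training example
```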
Advanced Optimization
“Conjugate gradient”, “BFGS”, and “L-BFGS” are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent: they pick the step size automatically, so there is no learning rate α to tune. Don't implement them yourself; use the libraries.
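A sketch of handing the optimization to a library (assuming NumPy and SciPy; scipy.optimize.minimize with method="BFGS" only needs the cost and its gradient, and the tiny dataset here is illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    # Returns J(theta) and its gradient; the optimizer handles the step size.
    m = y.size
    h = sigmoid(X @ theta)
    J = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0],
              [1.0, 4.0], [1.0, 5.0], [1.0, 6.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])   # deliberately not separable
result = minimize(cost_and_grad, np.zeros(2), args=(X, y),
                  jac=True, method="BFGS")
print(result.x, result.fun)       # optimized theta and the cost it achieves
```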