I am working through Stanford's Machine Learning course taught by Andrew Ng and keep notes as I go, so that I can review and consolidate what I learn.
My knowledge is limited, so if you find any errors or omissions, or have suggestions, please bear with me and point them out.
Week 03
3.1 Classification Problem
3.1.1 Hypothesis representation
The hypothesis function for a classification problem takes the form
$$h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}$$
The function $g$ is called the “Logistic Function” or “Sigmoid Function”.
An intuitive view of the logistic function : $y = \frac{1}{1 + e^{-x}}$
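As a quick illustration (my own sketch, not part of the course materials), the curve can be plotted in Octave/Matlab as follows; the variable names are arbitrary :
z = linspace(-10, 10, 200);      % sample the input range
g = 1 ./ (1 + exp(-z));          % element-wise sigmoid
plot(z, g);                      % S-shaped curve rising from 0 to 1
xlabel('z'); ylabel('g(z)');
The output always lies strictly between 0 and 1, which is what lets us interpret $h_\theta(x)$ as a probability.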
3.1.2 Decision boundary
To get a discrete 0/1 classification, we threshold the output of the hypothesis at $b = 0.5$ :
$$h_\theta(x) \ge b \;\Rightarrow\; y = 1, \qquad h_\theta(x) < b \;\Rightarrow\; y = 0$$
Since $g(z) \ge 0.5$ exactly when $z \ge 0$, this is equivalent to thresholding the input of the sigmoid at $a = 0$ :
$$\theta^T x \ge a \;\Rightarrow\; y = 1, \qquad \theta^T x < a \;\Rightarrow\; y = 0$$
Once the parameter vector $\theta$ has been selected, the set of points satisfying $\theta^T x = a$ is the decision boundary.
For a better understanding, see the following examples:
- Case 1 : Linear boundary
We have a dataset below; the hypothesis function is $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ with $b = 0.5$, $a = 0$, and assume we obtain the parameters $\theta_0 = -3$, $\theta_1 = 1$, and $\theta_2 = 1$.
Then $-3 + x_1 + x_2 = 0$, i.e. $x_1 + x_2 = 3$, is the decision boundary (the region where $-3 + x_1 + x_2 \ge 0$ is classified as $y = 1$), which is shown in the image below (pink straight line); a small numerical check appears after these examples.
- Case 2 : Non-linear boundary
We have a dataset below; the hypothesis function is $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$ with $b = 0.5$, $a = 0$, and assume we obtain the parameters $\theta_0 = -1$, $\theta_1 = 0$, $\theta_2 = 0$, $\theta_3 = 1$, $\theta_4 = 1$.
Then $x_1^2 + x_2^2 = 1$ is the decision boundary (the region where $x_1^2 + x_2^2 \ge 1$ is classified as $y = 1$), which is shown in the image below (the pink circle).
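To make Case 1 concrete, here is a small Octave/Matlab sketch (my own illustration, not from the course) that evaluates the hypothesis at a few points and applies the $b = 0.5$ threshold :
theta = [-3; 1; 1];                   % parameters from Case 1
sigmoid = @(z) 1 ./ (1 + exp(-z));    % logistic function
X = [1 1 1; 1 2 2; 1 4 4];            % three examples, first column is the bias x0 = 1
h = sigmoid(X * theta);               % hypothesis values
pred = (h >= 0.5);                    % predictions: 0, 1, 1 (only (1,1) lies below the line x1 + x2 = 3)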
3.1.3 Cost function
For logistic regression, the cost of a single training example is
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
For an intuitive understanding, see the following graphs (function prototypes) of the cases $y = 1$ and $y = 0$ :
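The key behaviour these curves show, restated in equation form: the cost is zero when the prediction agrees with the label and grows without bound as the prediction becomes confidently wrong.
$$y = 1:\quad \mathrm{Cost} = 0 \text{ if } h_\theta(x) = 1, \qquad \mathrm{Cost} \to \infty \text{ as } h_\theta(x) \to 0$$
$$y = 0:\quad \mathrm{Cost} = 0 \text{ if } h_\theta(x) = 0, \qquad \mathrm{Cost} \to \infty \text{ as } h_\theta(x) \to 1$$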
And the overall cost function is
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y^{(i)}\log\big(h_\theta(x^{(i)})\big) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\Big]$$
3.1.4 Gradient Descent
- Gradient descent for logistic regression - Algorithm 2
Repeat {
$$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta) = \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)}$$
(simultaneously update all $\theta_j$)
}
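A minimal vectorized Octave/Matlab sketch of this loop (my own illustration; X, y, theta, alpha and num_iters are assumed to be defined by the caller) :
sigmoid = @(z) 1 ./ (1 + exp(-z));
m = length(y);                        % number of training examples
for iter = 1:num_iters
    h = sigmoid(X * theta);           % m x 1 vector of predictions
    grad = (1 / m) * X' * (h - y);    % all partial derivatives at once
    theta = theta - alpha * grad;     % simultaneous update of every theta_j
end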
3.1.5 Some other methods : Advanced Optimization
- Conjugate gradient
- BFGS
- L-BFGS
They are more sophisticated, faster ways to optimize $\theta$.
Usage in Octave/Matlab :
- (1) Provide a function that evaluates $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$ for a given input $\theta$.
For example :
function [jVal, gradient] = costFunction(theta)
  jVal = [... code to compute J(theta) ...];
  gradient = [... code to compute derivative of J(theta) ...];
end
- (2) Use the optimization function “fminunc()” together with “optimset()” to set its options.
For example :
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
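As a concrete toy example (in the style of the lecture illustration; treat the exact code as my own sketch), a costFunction minimizing $J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2$ could be written as :
function [jVal, gradient] = costFunction(theta)
  % Toy cost J(theta) = (theta(1) - 5)^2 + (theta(2) - 5)^2, minimized at theta = [5; 5]
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end
With this costFunction, the fminunc call above should return optTheta close to [5; 5] and functionVal close to 0.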
3.2 Multiclass Classification
3.2.1 Multiclass classification problem
A classification problem with more than two categories is called a multiclass classification problem.
3.2.2 One-vs-all method
Since $y \in \{0, 1, \cdots, n\}$, we can divide the problem into $n + 1$ binary classification problems.
For example, when $n = 2$, we get three binary classification problems :
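In the one-vs-all method, each binary classifier $h_\theta^{(i)}(x)$ estimates the probability that $y = i$, and a new example is assigned to the class whose classifier is most confident. A minimal Octave/Matlab prediction sketch (my own illustration; all_theta is assumed to hold one row of parameters per class, and X includes the bias column) :
sigmoid = @(z) 1 ./ (1 + exp(-z));
probs = sigmoid(X * all_theta');      % m x num_classes matrix of probabilities
[~, pred] = max(probs, [], 2);        % pick the most confident class for each example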
3.3 Problems and solutions in fitting
3.3.1 Underfitting and overfitting
Underfitting : also called high bias, occurs when the form of the hypothesis function maps poorly to the trend of the data.
It is usually caused by a function that is too simple or uses too few features.
Overfitting : also called high variance, is caused by a hypothesis function that fits the available data well but does not generalize to new data.
It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
For example : (left: underfitting, middle: a good fit, right: overfitting)
3.3.2 Solution
The solution to underfitting :
(1) Use more features.
(2) Increase the number of training iterations.
The solution to overfitting :
(1) Reduce the number of features
- Manually select which features to keep.
- Use a model selection algorithm.
(2) Regularization
- Keep all the features, but reduce the magnitude of the parameters $\theta_j$.
3.3.3 Details of Regularization
Key point :
Regularization works well when we have a lot of slightly useful features.
We can add a suitable penalty term to the cost function for the parameters that need to be small.
A classical form adds a penalty on the squared parameters to the cost :
$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$
Here $\lambda$ is the regularization parameter. With this formulation, we can smooth the output of the hypothesis function and reduce overfitting.
3.3.4 Application
Regularized Linear Regression :
(1) Gradient Descent
The cost function of multivariate linear regression is
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$$
And the regularized cost function is
$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$
So we get the regularized algorithm of gradient descent :
Repeat {
$$\theta_0 := \theta_0 - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_0^{(i)}$$
$$\theta_j := \theta_j\left(1 - \frac{\alpha\lambda}{m}\right) - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)}, \qquad j = 1, 2, \ldots, n$$
}
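One way to read the $\theta_j$ update (restating the observation from the lecture): for typical values of $\alpha$, $\lambda$ and $m$ the first factor satisfies
$$0 < 1 - \frac{\alpha\lambda}{m} < 1,$$
so every iteration shrinks $\theta_j$ slightly before performing the usual (unregularized) gradient step.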
(2) Normal Equation
The regularized form of the normal equation is :
$$\theta = \big(X^T X + \lambda \cdot L\big)^{-1} X^T y, \qquad L = \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$$
where $L$ is an $(n+1) \times (n+1)$ matrix. It can be proved that $(X^T X + \lambda \cdot L)$ is invertible when $\lambda > 0$.
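A minimal Octave/Matlab sketch of this formula (my own illustration; X, y, lambda and the number of features n are assumed to be given) :
L = eye(n + 1);                              % (n+1) x (n+1) identity ...
L(1, 1) = 0;                                 % ... with the bias term left unregularized
theta = (X' * X + lambda * L) \ (X' * y);    % solve the regularized normal equation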
Regularized Logistic Regression :
The cost function of logistic regression is
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y^{(i)}\log\big(h_\theta(x^{(i)})\big) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\Big]$$
And the regularized cost function is
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y^{(i)}\log\big(h_\theta(x^{(i)})\big) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
So we get the regularized algorithm of gradient descent :
Repeat {
$$\theta_0 := \theta_0 - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_0^{(i)}$$
$$\theta_j := \theta_j\left(1 - \frac{\alpha\lambda}{m}\right) - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)}, \qquad j = 1, 2, \ldots, n$$
}
Similarly, when using the advanced optimization algorithms, provide a function that calculates the regularized cost $J(\theta)$ and its partial derivatives $\frac{\partial}{\partial\theta_j}J(\theta)$.
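For example, such a cost function for regularized logistic regression might look like the following sketch (my own illustration; the name costFunctionReg is mine, and X, y and lambda are assumed to be available) :
function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
  % Regularized logistic regression cost and gradient
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                       % sigmoid of X * theta
  reg = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);    % theta_0 is not regularized
  jVal = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h)) + reg;
  gradient = (1 / m) * (X' * (h - y));
  gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end);
end
It can then be minimized with, for example, fminunc(@(t) costFunctionReg(t, X, y, lambda), initialTheta, options).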