Regression Problems
- Supervised learning: Given a data set for which we already know what the correct output looks like, we want to discover the relationship between the input and the output. Supervised learning is categorized into "regression" and "classification" problems.
  - Regression: learn a function that generates a continuous output.
    - Example: house price prediction
  - Classification: classify the target into categories, i.e. a discrete output.
    - Example: classifying whether a tumor is malignant or benign
- Unsupervised learning: Given a data set that doesn't have any labels (or where every example has the same label), we want to find some structure in the data.
  - Clustering: group a large data set into clusters that are somehow similar or related by different variables or attributes.
    - Examples: news grouping, market segmentation
  - Non-clustering: find structure in a chaotic environment.
    - Example: the cocktail party problem (cocktail party algorithm)
Cost Function(Squared error function/Mean squared error):
$$
J = \frac{1}{2m}\sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2
$$
Dividing by either 2m or m yields the same optimal thetas; 2m is used here for convenience when taking the derivative.
We can break this function into two parts, the cost J and the hypothesis h:
$$
J(\theta_0, \theta_1)\\ \hat{y}^{(i)} = h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}
$$
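As a concrete illustration, here is a minimal Octave sketch of computing this cost for one-variable linear regression (the names computeCost, X, y, and theta are placeholders: X is assumed to be the design matrix with a leading column of ones, y the targets, and theta the parameter vector):

```octave
% Squared-error cost J(theta) for linear regression.
% X: m x 2 matrix whose first column is all ones, y: m x 1, theta: 2 x 1.
function J = computeCost(X, y, theta)
  m = length(y);                 % number of training examples
  predictions = X * theta;       % h_theta(x) for every example at once
  sqErrors = (predictions - y) .^ 2;
  J = sum(sqErrors) / (2 * m);
end
```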
At first we can set $\theta_0$ to 0 and try several fixed values of $\theta_1$, each giving a corresponding $h(x)$. For each choice we run the training set through the hypothesis to compute $h(x^{(i)})$, plug the results into J, and get a single value $J(0, \theta_1)$, as plotted below:
That is the case when J is a function of a single variable. When $\theta_0$ comes in as well, the J function is plotted as a surface, like this:
Transferring it to a contour plot looks like this:
Plotting our training set and selecting some fixed values for the two thetas gives us several hypotheses $h(x)$, and we can mark the corresponding $J(\theta)$ for each. Matching up the h's and the J's gives a clear picture of the relationship between these two functions:
Gradient descent
To minimize all kinds of functions, we use gradient descent. When we say we are using gradient descent, we always mean updating (subtracting from) all the parameters simultaneously.
As an intuition from the picture, the descent takes baby steps, like walking down a hill one step at a time until you reach the bottom of that route. If you start from a different position on the hill, you may end up at a different bottom, even if the two starting points are quite close.
The algorithm repeats the following calculation until convergence:
$$
\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1, \ldots)
$$
The α is called the learning rate; it controls the size of the step we take when updating the parameters, and it is always a positive number. The subscript j is the parameter index.
Pay attention to what we mean by simultaneously updating the parameters: calculate the new value of every parameter before assigning any of them. We should do it this way:
$$
\text{Correct:}\\ temp_0 := \theta_0 - \alpha\frac{\partial}{\partial\theta_0}J(\theta_0, \ldots)\\ temp_1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_0, \ldots)\\ \ldots\\ \theta_0 = temp_0\\ \theta_1 = temp_1\\ \ldots
$$
instead of doing it like this:
$$
\text{Incorrect:}\\ temp_0 := \theta_0 - \alpha\frac{\partial}{\partial\theta_0}J(\theta_0, \ldots)\\ \theta_0 = temp_0\\ temp_1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_0, \ldots)\\ \theta_1 = temp_1\\ \ldots
$$
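In code, "simultaneous" just means computing every new value before overwriting any θ. A minimal Octave sketch for two parameters, where grad0 and grad1 are hypothetical precomputed partial derivatives of J at the current θ:

```octave
% Simultaneous update of two parameters (a sketch).
% grad0 and grad1 are assumed to hold dJ/dtheta0 and dJ/dtheta1 at the current theta.
temp0  = theta0 - alpha * grad0;
temp1  = theta1 - alpha * grad1;
theta0 = temp0;   % assign only after BOTH temps have been computed
theta1 = temp1;
```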
In the algorithm, α controls how quickly the parameters approach the minimum. Neither too small nor too large is appropriate: too small takes too many steps to reach the minimum and wastes a lot of running time, while too large can overshoot the minimum and may fail to converge, or even diverge.
Batch Gradient Descent:
This refers to the fact that in every step of gradient descent we look at all of the training examples, so when computing the derivatives we sum over every example's squared error.
There are other versions of gradient descent that look at small subsets of the training set at a time.
(Note: ':=' means assignment, like '=' in Java; '=' here means equality, like '==' in Java.)
Version of the Cost Function with Multiple Variables
Hypothesis with multiple variables:
Use n to denote the number of features and $x_j$ to denote the value of the jth feature. Combined with a superscript (i) that indexes the training example, we get the jth feature of the ith training example:
$$
x_j^{(i)} = \text{value of feature } j \text{ in the } i^{th} \text{ training example}\\ x^{(i)} = \text{the input (features) of the } i^{th} \text{ training example}
$$
So the hypothesis function becomes like this:
$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n
$$
When we describe this with vectors, we assume there is an $x_0$ that always equals 1, so the feature vector is indexed from 0. Treating the features $\vec{x}$ as one vector and the parameters $\vec{\theta}$ as another, we get two (n+1)-by-1 vectors, and the function looks like this:
$$
h_\theta(x) = \begin{bmatrix}\theta_0 & \theta_1 & \theta_2 & \cdots & \theta_n\end{bmatrix}\begin{bmatrix}x_0\\ x_1\\ x_2\\ \vdots\\ x_n\end{bmatrix} = \theta^T x
$$
which is called multivariate linear regression.
Gradient Descent for multiple variables:
The gradient descent equation itself keeps the same form; we just have to repeat it for all n + 1 parameters:
$$
\text{repeat until convergence}:\ \{\\ \theta_0 := \theta_0 - \frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})\cdot x_0^{(i)}\\ \theta_1 := \theta_1 - \frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})\cdot x_1^{(i)}\\ \theta_2 := \theta_2 - \frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})\cdot x_2^{(i)}\\ \ldots\\ \}
$$
Or, in a compact form:
$$
\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})\cdot x_j^{(i)} \qquad \text{for } j := 0 \ldots n
$$
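A minimal vectorized Octave sketch of this loop (the names gradientDescent, alpha, and num_iters are placeholders; X is assumed to already include the x₀ = 1 column):

```octave
% Vectorized gradient descent for linear regression.
% X: m x (n+1) design matrix (first column all ones), y: m x 1, theta: (n+1) x 1.
function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    % X' * (X*theta - y) stacks all n+1 partial-derivative sums at once,
    % so every theta_j is updated simultaneously.
    theta = theta - (alpha / m) * (X' * (X * theta - y));
  end
end
```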
Feature Scaling and Mean Normalization:
It's a trick to make gradient descent faster when you have features whose ranges differ by orders of magnitude. Theta descends quickly on small ranges and slowly on large ranges, and so oscillates inefficiently down to the optimum when the variables are very uneven.
The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:
$$
-1 \le x_j \le 1\\ \text{or}\\ -0.5 \le x_j \le 0.5
$$
Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:
$$
x_j := \frac{x_j - \mu_j}{s_j}
$$

where $\mu_j$ is the average value of feature j and $s_j$ is its range (max minus min) or its standard deviation.
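A short Octave sketch of mean normalization plus scaling, assuming X here holds only the raw feature columns (no leading column of ones), so no column has zero range:

```octave
% Mean-normalize and scale every feature column of X.
mu = mean(X);               % 1 x n row of per-feature means
s  = max(X) - min(X);       % 1 x n row of per-feature ranges (could also use std(X))
X_norm = (X - mu) ./ s;     % broadcasting applies this column-wise
```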
Practical Tips to Get Gradient Descent Working Correctly:
One way is to plot the cost function $J(\theta)$ after each iteration. Our goal is to make sure gradient descent converges as the algorithm iterates. A healthy run may look like this:
There are some commonly seen bugs that make this plot look wrong during iteration:
They are almost always caused by too large a learning rate α. A rough way to find a workable α is to run gradient descent with a range of values, such as 0.001 and 0.01, plot $J(\theta)$ as a function of the number of iterations for each of these values of α, and then pick the value of α that seems to make J decrease rapidly. We can also increase the learning rate roughly threefold each time to get 0.003, and so on.
Polynomial Regression:
We can improve our features and the form of the hypothesis function in a couple of different ways:
We can combine multiple features into one. For example, we can combine $x_1$ and $x_2$ into a new feature $x_3$ by taking $x_1 \cdot x_2$.
Our hypothesis function need not be linear if a straight line doesn't fit the data well. We can change the behavior of the curve of our hypothesis function by making it a quadratic, cubic, or square-root function (or any other form).
For example, if we have a function like:
$$
h_\theta(x) = \theta_0 + \theta_1 x_1
$$
Then we can create additional features based on x1 to get quadratic function:
$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2
$$
or the cubic function:
$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3
$$
and we can create new features $x_2$ and $x_3$ where $x_2 = x_1^2$ and $x_3 = x_1^3$.
To make it a square root function, we could do:
$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2\sqrt{x_1}
$$
One important thing to keep in mind is that if we create features by squaring or cubing $x_1$ (or raising it to higher powers), feature scaling becomes very important, because if $x_1$ has range 1–1000, then $x_1^2$ has range 1–1,000,000.
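As a sketch, building the polynomial features in Octave might look like this (x1 is an assumed m x 1 vector holding the raw feature):

```octave
% Create polynomial features x2 = x1.^2 and x3 = x1.^3 from a single column x1.
X_poly = [ones(length(x1), 1), x1, x1 .^ 2, x1 .^ 3];
% Because the new columns span wildly different ranges, apply feature scaling
% (as above) to X_poly(:, 2:end) before running gradient descent.
```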
Normal Equation:
Instead of an iterative method like gradient descent, we can solve for the optimal values of θ all in one go.
Without proof, here is the formula for computing θ directly.
First let X be the m-by-(n+1) design matrix whose rows are the training examples (each with the leading $x_0 = 1$), and let y be the m-dimensional vector containing the known results. Then:
$$
\theta = (X^TX)^{-1}X^Ty
$$
- Example: m=4
| $x_0$ | Size (feet) $x_1$ | Number of bedrooms $x_2$ | Number of floors $x_3$ | Age of home (years) $x_4$ | Price in \$1000s ($y$) |
|---|---|---|---|---|---|
| 1 | 2104 | 5 | 1 | 45 | 460 |
| 1 | 1416 | 3 | 2 | 40 | 232 |
| 1 | 1534 | 3 | 2 | 30 | 315 |
| 1 | 852 | 2 | 1 | 36 | 178 |
$$
X = \begin{bmatrix}1 & 2104 & 5 & 1 & 45\\ 1 & 1416 & 3 & 2 & 40\\ 1 & 1534 & 3 & 2 & 30\\ 1 & 852 & 2 & 1 & 36\end{bmatrix} \qquad y = \begin{bmatrix}460\\ 232\\ 315\\ 178\end{bmatrix}\\ \theta = (X^TX)^{-1}X^Ty
$$
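A minimal Octave sketch of the normal equation on the data set above (pinv is used instead of inv so the computation still works if $X^TX$ happens to be non-invertible):

```octave
X = [1 2104 5 1 45;
     1 1416 3 2 40;
     1 1534 3 2 30;
     1  852 2 1 36];
y = [460; 232; 315; 178];
theta = pinv(X' * X) * X' * y;   % solves for theta in one step, no alpha, no iterations
```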
There is no need to do feature scaling when using the normal equation. The following is a comparison of gradient descent and the normal equation:
| Gradient Descent | Normal Equation |
|---|---|
| needs to choose α | no need to choose α |
| needs many iterations | no need to iterate |
| $O(kn^2)$ | $O(n^3)$, needs to calculate the inverse of $X^TX$ |
| works well with large n | slow if n is large |
In addition, gradient descent still works for some classification problems while normal equation doesn’t.
Classification (binary classification / multiclass classification):
Logistic Regression:
The prediction is between 0 and 1:
$$
0 \le h_\theta(x) \le 1
$$
where the output of h estimates the probability that y = 1 given input x.
So we need a function that only outputs values between 0 and 1. That is where the sigmoid function (also called the logistic function) meets our need:
$$
g(z) = \frac{1}{1 + e^{-z}}
$$
The sigmoid function constrains the output between 0 and 1 and looks like this:
So we can plug our linear hypothesis in as z:
$$
h_\theta(x) = g(\theta^T x)\\ z = \theta^T x
$$
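A minimal Octave sketch of the sigmoid (the name sigmoid is an assumed helper):

```octave
% Elementwise sigmoid; works on scalars, vectors, and matrices.
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end
```

With this helper, the hypothesis for a whole design matrix X at once is simply `h = sigmoid(X * theta)`.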
Decision boundary:
Suppose
- predict “y = 1” if the output is greater than or equal to 0.5
- predict “y = 0” if the output is less than 0.5
In other words, the boundary is defined by the parameterized input $\theta^T x$ and whether it is greater than or less than 0.
The decision boundary is a property not of the training set but of the hypothesis under its parameters. So as long as we are given the parameter vector θ, the decision boundary is defined. The training set is not what we use to define the decision boundary; the training set may be used to fit the parameters θ.
The hypothesis doesn't need to be linear; it could be a function that describes a circle, like:
$$
(\text{e.g. } z = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2)
$$
or any shape to fit our data.
Logistic regression cost function:
Using the same cost function as linear regression would cause multiple local optima, giving a wavy, non-convex plot. So we have to find a better cost function:
$$
J(\theta) = \frac{1}{m}\sum_{i=1}^m \text{Cost}(h_\theta(x^{(i)}), y^{(i)})\\ \text{Cost}(h_\theta(x), y) = -\ln(h_\theta(x)) \quad \text{if } y = 1\\ \text{Cost}(h_\theta(x), y) = -\ln(1 - h_\theta(x)) \quad \text{if } y = 0
$$
Writing the cost function in this way guarantees that J of theta is convex for logistic regression.
When y = 1, we get the following plot for the cost versus $h_\theta(x)$:
Similarly, when y = 0:
Whether y is 0 or 1, if our hypothesis approaches the opposite value, the cost function approaches infinity:
$$
\text{Cost}(h_\theta(x), y) = 0 \ \ \text{if}\ \ h_\theta(x) = y\\ \text{Cost}(h_\theta(x), y) \to \infty \ \ \text{if}\ \ y = 0 \ \text{and}\ h_\theta(x) \to 1\\ \text{Cost}(h_\theta(x), y) \to \infty \ \ \text{if}\ \ y = 1 \ \text{and}\ h_\theta(x) \to 0
$$
To make the cost function simpler, we can combine the two conditions into one expression by using y and (1 − y) as coefficients:
$$
y^{(i)}\ln h_\theta(x^{(i)}) + (1 - y^{(i)})\ln(1 - h_\theta(x^{(i)}))
$$
So the J of theta would end up looking like this:
$$
J(\theta) = \frac{1}{m}\sum_{i=1}^m \text{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m}\left[\sum_{i=1}^m y^{(i)}\ln h_\theta(x^{(i)}) + (1 - y^{(i)})\ln(1 - h_\theta(x^{(i)}))\right]
$$
And the gradient descent algorithm is identical in form to the one we used in linear regression, although the hypothesis h inside it is different:
$$
\text{Repeat:}\ \{\\ \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)\\ \}
$$
Working out the derivative part, we get:
$$
\text{Repeat:}\ \{\\ \theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\\ \}
$$
A vectorized implementation is:
$$
h = g(X\theta)\\ J(\theta) = \frac{1}{m}\cdot\left(-y^T\ln(h) - (1 - y)^T\ln(1 - h)\right)
$$
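A minimal Octave sketch of this vectorized cost, together with its gradient (the function name logisticCost is an assumption):

```octave
% Vectorized logistic-regression cost and gradient.
% X: m x (n+1) design matrix, y: m x 1 vector of 0/1 labels, theta: (n+1) x 1.
function [J, grad] = logisticCost(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                        % sigmoid of X*theta
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
  grad = (1 / m) * (X' * (h - y));                         % same shape as theta
end
```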
Note that feature scaling also works for logistic regression.
Advanced Optimization Algorithms:
There are more sophisticated algorithms, written by optimization experts, for calculating the optimal θ:
- Conjugate gradient
- BFGS
- L-BFGS
The advantage of using these is that they are often much faster than plain gradient descent, mostly because they can choose a more suitable learning rate during the iterations; some of them even pick a better α for every iteration.
The disadvantage is that they are more complicated. It is not easy to figure out what they are doing in the inner loop unless we have expertise in the area.
To apply these in Octave, we first have to write a costFunction() ourselves that returns $J(\theta)$ and a vector containing the gradient with respect to each θ:
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(theta)];
  gradient = zeros(n + 1, 1);           % one entry per parameter
  gradient(1) = [code to compute the partial derivative of J w.r.t. theta_0];
  ...
  gradient(n + 1) = [code to compute the partial derivative of J w.r.t. theta_n];
end
Then we call two built-in functions to make the advanced optimization algorithm work:
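In Octave the usual pair of built-ins is optimset (to configure the optimizer) and fminunc (to run it); a sketch, assuming the costFunction above and that n is the number of features:

```octave
options = optimset('GradObj', 'on', 'MaxIter', 400);   % 'GradObj' on: we supply the gradient
initialTheta = zeros(n + 1, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```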
Multiclass Classification:
The main idea of multiclass classification is separating each class of data from the others. As with binary classification, we have a cost function; the difference is that we compute one for each class. In practice, we pick one class and lump all the others into a single second class, do this repeatedly, apply binary logistic regression to each case, and then use the hypothesis that returns the highest value as our prediction:
$$
y \in \{0, 1, 2, \ldots, n\}\\ h_\theta^{(0)}(x) = P(y = 0 \mid x; \theta)\\ h_\theta^{(1)}(x) = P(y = 1 \mid x; \theta)\\ \ldots\\ h_\theta^{(n)}(x) = P(y = n \mid x; \theta)\\ \text{prediction} = \max_i\left(h_\theta^{(i)}(x)\right)
$$
To summarize, after we train each classifier, we feed the x we want to predict into every hypothesis and pick the class that maximizes h.
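A minimal Octave sketch of that prediction step, assuming a hypothetical matrix all_theta whose k-th row holds the fitted parameters of classifier k (classes numbered 1..num_labels so the column index is the label):

```octave
% One-vs-all prediction: pick the class whose classifier is most confident.
function p = predictOneVsAll(all_theta, X)
  h = 1 ./ (1 + exp(-(X * all_theta')));   % m x num_labels matrix of probabilities
  [~, p] = max(h, [], 2);                  % p(i) = winning class index for example i
end
```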
Underfitting and Overfitting:
When fitting the data, we may encounter these kinds of problem:
- underfitting: when the degree of our hypothesis function is too low. For instance, if we fit data with a quadratic shape using a straight line, the line won't fit the data well.
- overfitting: when the degree of our hypothesis function is too high, or when we add too many features, we may get a curve that fits almost every training example perfectly, but that doesn't mean it predicts well.
To deal with overfitting, we have two options:
- Reduce the number of features:
  - Manually select which features to keep
  - Use a model selection algorithm
- Regularization:
  - Keep all the features, but reduce the magnitude of the parameters $\theta_j$
  - Regularization works well when we have a lot of slightly useful features
Regularization:
Penalizing some of the thetas to make them smaller gives us:
- simpler hypothesis
- less prone to overfitting
If we have an overfitting problem, we can reduce the weight that some of the terms in our function carry by increasing their cost.
Say if we wanted to make the following function more quadratic:
$$
\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4
$$
We'll want to eliminate the influence of $\theta_3 x^3$ and $\theta_4 x^4$. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:
$$
\min_\theta\ \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 + 1000\cdot\theta_3^2 + 1000\cdot\theta_4^2
$$
We've added two extra terms at the end to inflate the cost of $\theta_3$ and $\theta_4$. Now, in order for the cost function to get close to zero, we will have to reduce the values of $\theta_3$ and $\theta_4$ to near zero. This will in turn greatly reduce the values of $\theta_3 x^3$ and $\theta_4 x^4$ in our hypothesis function. As a result, we see that the new hypothesis (depicted by the pink curve) looks like a quadratic function but fits the data better due to the extra small terms $\theta_3 x^3$ and $\theta_4 x^4$.
To generalize, we make our cost function look like this:
$$
\min_\theta\ \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^n \theta_j^2
$$
where $\lambda$ is the regularization parameter.
Gradient descent:
When doing gradient descent, every θ except $\theta_0$ also has to subtract the extra regularization term:
$$
\text{Repeat:}\ \{\\ \theta_0 := \theta_0 - \frac{\alpha}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}\\ \theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} - \frac{\alpha\lambda}{m}\theta_j\\ \}\\ \theta_j := \left(1 - \frac{\alpha\lambda}{m}\right)\theta_j - \frac{\alpha}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}, \qquad j = 1, 2, 3, \ldots, n \ \ (\theta_0\text{ excluded})
$$
The summation term is exactly what gradient descent was computing before; the only difference in appearance is the shrinkage factor in front of $\theta_j$.
Notice that $(1 - \frac{\alpha\lambda}{m})$ is going to be a number slightly smaller than 1.
Normal equation:
$$
\theta = (X^TX + \lambda\cdot L)^{-1}X^Ty\\ \text{where}\quad L = \begin{bmatrix}0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1\end{bmatrix}
$$
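A short Octave sketch of the regularized normal equation (lambda is the regularization parameter; the 0 in the top-left corner of L keeps $\theta_0$ unpenalized):

```octave
n = size(X, 2) - 1;                  % number of features (X already has the ones column)
L = eye(n + 1);
L(1, 1) = 0;                         % do not regularize theta_0
theta = pinv(X' * X + lambda * L) * X' * y;
```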
Regularized logistic regression:
Recall the cost function of the logistic regression was:
$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\ln(h_\theta(x^{(i)})) + (1 - y^{(i)})\ln(1 - h_\theta(x^{(i)}))\right]
$$
We can regularize this equation by adding a term to the end:
$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\ln(h_\theta(x^{(i)})) + (1 - y^{(i)})\ln(1 - h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2
$$
The second sum excludes $\theta_0$. But when we compute θ as a vector, we still have to take $\theta_0$ into account, so we should continuously update the two following equations:
$$
\text{Repeat:}\ \{\\ \theta_0 := \theta_0 - \frac{\alpha}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}\\ \theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} - \frac{\alpha\lambda}{m}\theta_j\\ \}
$$
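Putting the pieces together, a minimal Octave sketch of the regularized logistic-regression cost and gradient (the name costFunctionReg is an assumption; theta(1) corresponds to $\theta_0$ and is left unregularized):

```octave
function [J, grad] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                        % hypothesis for every example
  reg = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);       % skip theta_0 in the penalty
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)) + reg;
  grad = (1 / m) * (X' * (h - y));
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end); % regularize all but theta_0
end
```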