Regression Problems

Regression problems belong to supervised learning; the goal is to discover the relationship between the input and a continuous output, for example house price prediction. This article covers the squared error cost function, gradient descent, the multivariate cost function, feature scaling, and regularization, and touches on polynomial regression, logistic regression, decision boundaries, and overfitting versus underfitting.


Regression Problems

  • Supervised learning
    Given a data set for which we already know what the correct output looks like, we want to discover the relationship between the input and the output.
    Supervised learning is categorized into “regression” and “classification” problems.

    • regression - learn a function that generates a continuous output
      • example: house price prediction
    • classification - classify the target into categories, which is a discrete output
      • example: classifying whether a tumor is malignant or benign
  • Unsupervised learning:
    Given a data set that doesn’t have any labels (or where every example has the same label), we want to find some structure in the data.

    • clustering - group a large data set into clusters that are somehow similar or related by different variables or attributes.

      • examples:
        1. News grouping
        2. Market segmentation
    • non-clustering - find structure in a chaotic environment

      • examples:
        1. Cocktail party problem - cocktail party algorithm

Cost Function (Squared error function / Mean squared error):

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
Dividing by either 2m or m yields the same minimizing thetas; 2m is used here for convenience when taking the derivative.
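To see why, take the partial derivative with respect to $\theta_1$ (using the linear hypothesis defined below): the factor of 2 produced by the chain rule cancels the 2 in the denominator, leaving a clean $\frac{1}{m}$:
$$\frac{\partial}{\partial\theta_1}J = \frac{1}{2m}\sum_{i=1}^{m} 2\left(h_\theta(x^{(i)}) - y^{(i)}\right)\cdot x^{(i)} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$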
We can break this function into two parts: the cost $J$ as a function of the parameters, and the hypothesis $h$ as a function of $x$:
$$J(\theta_0, \theta_1)$$
$$h_\theta(x) = \theta_0 + \theta_1 x$$
At first we can fix $\theta_0$ at 0 and set $\theta_1$ to various values, each giving a corresponding $h(x)$. Then we feed the training set into the hypothesis to compute $h(x^{(i)})$, substitute the results into $J$, and obtain discrete values of $J(0, \theta_1)$, as plotted below:
[Figure: $h(x)$ for several values of $\theta_1$, and the corresponding points on the bowl-shaped $J(0, \theta_1)$ curve]
That is the case when $J$ is a univariate function. When $\theta_0$ comes in, $J$ becomes a surface, plotted like this:
[Figure: 3D surface plot of $J(\theta_0, \theta_1)$]

Transferred into a contour plot, it looks like this:
[Figure: contour plot of $J(\theta_0, \theta_1)$]

Plotting our training set and selecting fixed values for the two thetas, we get different hypotheses $h(x)$, and we can mark the corresponding value of $J(\theta)$ for each. Matching each $h$ with its $J$ gives a clear picture of the relationship between these two functions:
[Figure: hypothesis lines over the training set, matched with their positions on the $J(\theta_0, \theta_1)$ contour plot]
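To make this concrete, here is a minimal NumPy sketch of the cost computation (the function name, toy data, and variable names are my own illustrations, not from the original notes):

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(x)                          # number of training examples
    predictions = theta0 + theta1 * x   # h_theta(x) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Tiny made-up training set: y roughly follows 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

# Sweeping theta1 with theta0 fixed at 0 traces out the bowl-shaped J(0, theta1).
for t1 in [0.0, 1.0, 2.0, 3.0]:
    print(t1, compute_cost(0.0, t1, x, y))
```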

Gradient descent

To minimize all kinds of functions, we use gradient descent. When we say we are using gradient descent, we always mean simultaneously updating all the parameters.
For intuition, as pictured below, the descent proceeds in baby steps, like walking down a hill one step at a time, finally reaching the bottom of that route. Starting from another position on the hill, you may reach a different bottom, even if that starting point is quite close to the first.
[Figure: gradient descent taking small steps down a surface; two nearby starting points reach different local minima]

The algorithm repeats the following update until convergence:
$$\theta_i := \theta_i - \alpha\frac{\partial}{\partial\theta_i}J(\theta_0, \theta_1, \ldots)$$

$\alpha$ is called the learning rate: the size of the step we take when updating the parameters, and it is always a positive number. $i$ is the parameter index.

Pay attention to what we mean by simultaneously updating the parameters: compute every new value before assigning any of them. We should do it this way:
$$\text{Correct:}\\
\begin{aligned}
temp_0 &:= \theta_0 - \alpha\frac{\partial}{\partial\theta_0}J(\theta_0, \ldots)\\
temp_1 &:= \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_0, \ldots)\\
&\ \vdots\\
\theta_0 &:= temp_0\\
\theta_1 &:= temp_1
\end{aligned}$$
instead of doing it like this:
$$\text{Incorrect:}\\
\begin{aligned}
temp_0 &:= \theta_0 - \alpha\frac{\partial}{\partial\theta_0}J(\theta_0, \ldots)\\
\theta_0 &:= temp_0\\
temp_1 &:= \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_0, \ldots)\\
\theta_1 &:= temp_1
\end{aligned}$$
In the algorithm, $\alpha$ controls how quickly the parameters approach the minimum. Neither too small nor too large is appropriate: too small takes too many steps to reach the minimum, costing a lot of running time, while too large can overshoot the minimum and may fail to converge, or even diverge.
[Figure: effect of the learning rate $\alpha$: too small converges slowly, too large overshoots and diverges]
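Here is a minimal sketch of gradient descent with the simultaneous update for the univariate hypothesis, reusing the `compute_cost` toy data above (again, the names are illustrative, not from the notes):

```python
def gradient_descent(x, y, alpha=0.05, iters=2000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        error = (theta0 + theta1 * x) - y        # h(x^(i)) - y^(i)
        # Compute BOTH updates from the current thetas...
        temp0 = theta0 - alpha * np.sum(error) / m
        temp1 = theta1 - alpha * np.sum(error * x) / m
        # ...then assign them together (the simultaneous update).
        theta0, theta1 = temp0, temp1
    return theta0, theta1

theta0, theta1 = gradient_descent(x, y)
print(theta0, theta1)   # should land near 1 and 2 for the toy data above
```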

Batch Gradient Descent:

“Batch” refers to the fact that every step of gradient descent looks at all the training examples, so when computing the derivatives we sum over every squared error term.

There are other versions of gradient descent that look at small subsets of the training sets at a time.

(Note: ‘:=’ means assignment, like ‘=’ in Java; ‘=’ here means equality, like ‘==’ in Java.)
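As a sketch of the distinction, a hypothetical mini-batch step (the notes only say such variants exist; this particular form is my assumption) changes only which examples enter the gradient sums:

```python
def minibatch_step(theta0, theta1, x, y, alpha, batch_size=2):
    """One gradient descent step using a random subset of the training set."""
    idx = np.random.choice(len(x), size=batch_size, replace=False)
    xb, yb = x[idx], y[idx]                    # the sampled mini-batch
    error = (theta0 + theta1 * xb) - yb
    temp0 = theta0 - alpha * np.sum(error) / batch_size
    temp1 = theta1 - alpha * np.sum(error * xb) / batch_size
    return temp0, temp1                        # still a simultaneous update
```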

Version of the Cost Function with Multiple Variables

Hypothesis with multiple variables:

Use $n$ to denote the number of features, and $x_j$ to denote the value of the $j$th feature. Combining it with a superscript $(i)$, we get the value of the $j$th feature in the $i$th training example:
$$x_j^{(i)} = \text{value of feature } j \text{ in the } i^{th} \text{ training example}$$
$$x^{(i)} = \text{the input (features) of the } i^{th} \text{ training example}$$
So the hypothesis function becomes like this:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n$$
When we describe this with vectors, we assume there is an $x_0$ that equals 1, giving us a vector indexed from 0. Treating the features $\vec{x}$ as one vector and the parameters $\vec{\theta}$ as another, we get two $(n+1)$-by-1 vectors, and the function becomes:
$$h_\theta(x) = \begin{bmatrix}\theta_0 & \theta_1 & \theta_2 & \ldots & \theta_n\end{bmatrix}\begin{bmatrix}x_0\\x_1\\x_2\\\vdots\\x_n\end{bmatrix} = \theta^T x$$
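A minimal NumPy sketch of this vectorized form (the matrix contents are made-up illustrations):

```python
import numpy as np

# m = 3 training examples, n = 2 features, with the x0 = 1 column prepended.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])
theta = np.array([0.5, 1.0, -0.5])   # (n+1)-vector: theta0, theta1, theta2

# h_theta(x) = theta^T x, computed for every training example at once.
predictions = X @ theta
print(predictions)                   # [1. 2. 3.]
```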
