Regression Problems
- Supervised learning: given a data set for which we already know what the correct output looks like, we want to discover the relationship between the input and the output. Supervised learning is categorized into "regression" and "classification" problems.
  - Regression - a function that generates a continuous output.
    - Example: house price prediction.
  - Classification - classify the target into categories, i.e. a discrete output.
    - Example: classifying whether a tumor is malignant or benign.
- Unsupervised learning: given a data set that doesn't have any labels (or where every example has the same label), we want to find some structure in the data.
  - Clustering - group a large data set into clusters whose members are somehow similar or related by different variables or attributes.
    - Examples: news grouping, market segmentation.
  - Non-clustering - find structure in a chaotic environment.
    - Example: the cocktail party problem, solved by the cocktail party algorithm.
Cost Function (squared error function / mean squared error):
$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
Dividing by $2m$ or by $m$ yields the same optimal thetas; $2m$ is used here for convenience when taking the derivative, since the 2 cancels the exponent.
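As a minimal sketch of this formula, assuming NumPy and the univariate linear hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ (the name `compute_cost` and the toy data are ours, not from the course):

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) for univariate linear regression."""
    m = len(y)                           # number of training examples
    h = theta0 + theta1 * x              # hypothesis h_theta(x) for every example
    return np.sum((h - y) ** 2) / (2 * m)

# Made-up toy data: y is roughly 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])
print(compute_cost(0.0, 2.0, x, y))      # small cost: theta1 = 2 fits well
print(compute_cost(0.0, 0.0, x, y))      # large cost: h(x) = 0 fits poorly
```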
We can view this function in two parts: the cost $J$ itself, and the hypothesis it is built from:
$$J(\theta_0, \theta_1), \qquad y = h_\theta(x) = \theta_0 + \theta_1 x$$
At first we can fix $\theta_0 = 0$ and try several values of $\theta_1$, each giving a corresponding hypothesis $h(x)$. Feeding the training set through the hypothesis yields $h(x^{(i)})$, and plugging those results into $J$ gives discrete values of $J(0, \theta_1)$, as plotted below:
That is the case when $J$ is a function of a single variable. When $\theta_0$ comes in, $J$ becomes a surface and is plotted like this:
Transferred into a contour plot, it looks like this:
Plotting our training set and selecting fixed values for the two thetas gives different hypotheses $h(x)$, and we can point out the corresponding $J(\theta)$ for each. Matching the $h$'s with the $J$'s gives a clear picture of the relationship between these two functions:
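A sketch of how the surface and contour data could be generated, reusing `compute_cost` and the toy `x`, `y` from the sketch above (the grid ranges are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate J over a grid of (theta0, theta1) pairs.
theta0_vals = np.linspace(-10, 10, 100)
theta1_vals = np.linspace(-1, 4, 100)
J_vals = np.zeros((len(theta0_vals), len(theta1_vals)))
for i, t0 in enumerate(theta0_vals):
    for j, t1 in enumerate(theta1_vals):
        J_vals[i, j] = compute_cost(t0, t1, x, y)

# Fixing theta0 = 0 picks out the one-variable curve J(0, theta1);
# the full grid is what the 3-D surface and contour plots display.
plt.contour(theta1_vals, theta0_vals, J_vals, levels=np.logspace(-2, 3, 20))
plt.xlabel("theta1"); plt.ylabel("theta0")
plt.show()
```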
Gradient Descent
To minimize all kinds of functions, we use gradient descent. When we say "gradient descent", we always mean simultaneously updating all the parameters.
As a pictorial intuition, the descent proceeds in baby steps, like walking down a hill, until it finally reaches the bottom of that route. If you start from another position on the hill, you may reach a different bottom (a different local minimum), even if the two starting points are quite close together.
The algorithm repeats the following calculation until convergence:
$$\theta_i := \theta_i - \alpha\frac{\partial}{\partial\theta_i}J(\theta_0, \theta_1, \ldots)$$
Here $\alpha$ is called the learning rate: it is the size of the step we take when updating the parameters, and it is always a positive number; $i$ is the parameter index.
Pay attention to what "simultaneously update the parameters" means: calculate every new value before assigning any of them. We should do it this way:
$$\text{Correct:}\\
temp_0 := \theta_0 - \alpha\frac{\partial}{\partial\theta_0}J(\theta_0, \ldots)\\
temp_1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_0, \ldots)\\
\ldots\\
\theta_0 := temp_0\\
\theta_1 := temp_1\\
\ldots$$
instead of doing it like this:
$$\text{Incorrect:}\\
temp_0 := \theta_0 - \alpha\frac{\partial}{\partial\theta_0}J(\theta_0, \ldots)\\
\theta_0 := temp_0\\
temp_1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_0, \ldots)\\
\theta_1 := temp_1\\
\ldots$$
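In code, the difference looks like this (a self-contained sketch; the derivative functions are stand-ins built from the univariate squared-error cost and made-up toy data, but any cost's partials would do):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])

def d_theta0(t0, t1):
    """Partial derivative of J with respect to theta0."""
    return np.mean((t0 + t1 * x) - y)

def d_theta1(t0, t1):
    """Partial derivative of J with respect to theta1."""
    return np.mean(((t0 + t1 * x) - y) * x)

alpha, theta0, theta1 = 0.1, 0.0, 0.0

# Correct: both gradients are evaluated at the current (theta0, theta1)
# before either parameter is overwritten.
temp0 = theta0 - alpha * d_theta0(theta0, theta1)
temp1 = theta1 - alpha * d_theta1(theta0, theta1)
theta0, theta1 = temp0, temp1

# Incorrect (don't do this): theta0 changes before d_theta1 runs,
# so the second gradient is taken at a half-updated point.
# theta0 = theta0 - alpha * d_theta0(theta0, theta1)
# theta1 = theta1 - alpha * d_theta1(theta0, theta1)
```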
In the algorithm, $\alpha$ controls how quickly the parameters approach the minimum. Neither too small nor too large is appropriate: too small, and it takes many steps to reach the minimum, costing a lot of running time; too large, and it can overshoot the minimum and may fail to converge, or even diverge.
Batch Gradient Descent:
This means that in every step of gradient descent we look at all the training examples; when computing the derivatives, we compute a sum over every squared error.
There are other versions of gradient descent that look at small subsets of the training set at a time.
(Note: ':=' denotes assignment, like '=' in Java, while '=' here denotes equality, like '==' in Java.)
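Putting the pieces together, a sketch of batch gradient descent for univariate linear regression (the derivative formulas come from differentiating the cost $J$ defined earlier; the name `gradient_descent` and the default hyperparameters are ours):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iterations=1500):
    """Batch gradient descent: every step sums over ALL m training examples."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        h = theta0 + theta1 * x           # predictions for the whole batch
        d0 = np.sum(h - y) / m            # dJ/dtheta0, summed over all examples
        d1 = np.sum((h - y) * x) / m      # dJ/dtheta1, summed over all examples
        # Simultaneous update via tuple assignment
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])
print(gradient_descent(x, y))             # theta1 should come out close to 2
```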
Version of the Cost Function with Multiple Variables
Hypothesis with multiple variables:
Use $n$ to denote the number of features, and $x_n$ to denote the value of the $n$th feature. Combining the feature subscript with a superscript for the training example, we get the value of feature $j$ in the $i$th training example:
$$x_j^{(i)} = \text{value of feature } j \text{ in the } i^{th} \text{ training example}\\
x^{(i)} = \text{the input (features) of the } i^{th} \text{ training example}$$
So the hypothesis function becomes like this:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
When we describe this with vectors, we assume there is an $x_0$ that equals 1, so that the feature vector is indexed from 0. Taking $\vec{x}$ as the feature vector and $\vec{\theta}$ as the parameter vector, we get two $(n+1) \times 1$ vectors, and the function looks like this:
$$h_\theta(x) = \theta^T x = \begin{bmatrix}\theta_0 & \theta_1 & \theta_2 & \cdots & \theta_n\end{bmatrix}\begin{bmatrix}x_0\\ x_1\\ x_2\\ \vdots\\ x_n\end{bmatrix}$$
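A minimal sketch of this vectorized form, assuming NumPy (prepending a 1 plays the role of $x_0 = 1$; the name `hypothesis` and the example values are ours):

```python
import numpy as np

def hypothesis(theta, x):
    """h_theta(x) = theta^T x, with x_0 = 1 prepended to the features."""
    x = np.concatenate(([1.0], x))   # add x_0 = 1, giving an (n+1)-vector
    return theta @ x                 # inner product of the two (n+1)-vectors

theta = np.array([1.0, 0.5, 2.0])    # theta_0, theta_1, theta_2 (made up)
features = np.array([3.0, 4.0])      # x_1, x_2 for one training example
print(hypothesis(theta, features))   # 1 + 0.5*3 + 2*4 = 10.5
```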