1 SOME CONCEPTS
1.1 Supervised learning
There are inputs $x$ and labels $y$.
Supervised learning learns, from labeled examples, how to map an input $x$ to a predicted label $y'$ that is as close as possible to the true label.
$(x^{(1)},y^{(1)}),\;(x^{(2)},y^{(2)}),\;(x^{(3)},y^{(3)})$
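A minimal sketch of the idea, assuming toy data and a nearest-neighbour classifier (both are illustrative choices, not from the original notes): the model is fitted on labeled pairs and then asked for $y'$ on a new $x$.

```python
# Minimal supervised-learning sketch: fit on labeled pairs (x, y), predict y' for new x.
# The toy data and the choice of classifier are illustrative assumptions.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0.0], [1.0], [2.0], [3.0]]   # inputs x^{(i)}
y_train = [0, 0, 1, 1]                   # labels y^{(i)}

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)              # learn the mapping x -> y

print(model.predict([[2.5]]))            # predicted label y' for a new input x
```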
1.2 Unsupervised learning
In unsupervised learning, there are only inputs $x$ without labels $y$.
The method itself groups the inputs $x$ into different categories called clusters.
$(x^{(1)}),\;(x^{(2)}),\;(x^{(3)})$
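A minimal clustering sketch, assuming k-means as the unsupervised method (the toy data and the number of clusters are illustrative assumptions):

```python
# Minimal unsupervised-learning sketch: group unlabeled inputs x into clusters.
# The toy data and the choice of k-means with 2 clusters are illustrative assumptions.
from sklearn.cluster import KMeans

X = [[0.0], [0.2], [0.1], [5.0], [5.1], [4.9]]   # only inputs x^{(i)}, no labels

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)                   # cluster index assigned to each x

print(labels)                                    # cluster ids, e.g. [0 0 0 1 1 1]
```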
1.3 Overfitting & Underfitting
When the fitted curve stays far away from the data, as in the left picture, the model is underfitting.
When the curve follows the data too closely and fails to capture the underlying rule, as in the right picture, the model is overfitting.
1.4 Generalization
In machine learning, fitting the training data well is not enough; the model also has to work well on the test data. This ability is called generalization.
1.5 Cross-validation
All the data can be divided into several subsets. Each round, we choose one subset as the validation set, one as the test set, and the rest as the training set.
Rotating which subset serves as the validation set across rounds is called cross-validation.
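A minimal sketch of this rotation using scikit-learn's KFold (the 5-fold split and the toy data are illustrative assumptions; a separate held-out test set is omitted for brevity):

```python
# Minimal cross-validation sketch: rotate which fold is used for validation.
# The 5-fold split and the toy data are illustrative assumptions.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 examples, 2 features each
y = np.arange(10)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, valid_idx) in enumerate(kfold.split(X)):
    # train on the training folds, evaluate on the held-out validation fold
    print(f"fold {fold}: train={train_idx}, valid={valid_idx}")
```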
2 LINEAR REGRESSION
2.1 Principle
For a data set $(X, Y)$, we use a linear function, e.g. $f(x) = w^\top x + b$, to fit the data.
The purpose is to predict the value of $y$ for a given $x$ using the optimal parameters.
2.2 Loss function, cost function and objective function
The loss function computes the error for a single training example.
The cost function is the average of the loss functions of the entire training set.
In linear regression, you minimize the empirical risk or the structural (regularized) risk, called the objective function, to fit the data just right.
Loss function (single example): $|y_i - f(x_i)|$
Cost function (average over the training set): $\frac{1}{N}\sum_{i=1}^{N}|y_i - f(x_i)|$
Objective function (with regularization): $\min\left(\frac{1}{N}\sum_{i=1}^{N}|y_i - f(x_i)| + \lambda J(f)\right)$
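A minimal numeric sketch of the difference between the three quantities (the toy values are illustrative, and the regularizer $J(f)$ is taken as the squared weight, an assumption not specified above):

```python
# Minimal sketch: loss per example, cost over the set, and a regularized objective.
# The toy values and the choice J(f) = w^2 are illustrative assumptions.
import numpy as np

y = np.array([1.0, 2.0, 3.0])          # true values y_i
f_x = np.array([1.5, 1.8, 3.4])        # predictions f(x_i)
w, lam = 0.9, 0.1                      # model weight and regularization strength

loss = np.abs(y - f_x)                 # loss of each single example
cost = loss.mean()                     # cost: average loss over the training set
objective = cost + lam * w**2          # objective: cost plus lambda * J(f)

print(loss, cost, objective)
```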
2.3 Optimization
- Gradient descent
In a linear regression model, compute the gradient of the cost with respect to each parameter and update the parameter a step in the negative gradient (steepest descent) direction; see the sketch after this list.
- Newton method
Input: objective function $f(x)$, gradient $g(x) = \bigtriangledown f(x)$, Hessian matrix $H(x)$, precision $\epsilon$.
Output: the minimum point $x^*$ of $f(x)$.
Steps:
- Select an initial point $x_0$ randomly and set the iteration counter $k = 0$;
- Compute the gradient $g(x_k)$ and the Hessian matrix $H(x_k)$ of the objective at the point $x_k$; if $\|g(x_k)\| < \epsilon$, stop, and the approximate solution is $x^* = x_k$;
- Update $x_{k+1}$ according to the equation $x_{k+1} = x_k - H^{-1}(x_k)\bigtriangledown f(x_k)$, set $k = k + 1$, and return to the previous step.
- Quasi-Newton method
The basic idea of the quasi-Newton method is to replace $H^{-1}(x_k)$ with an approximation $G(x_k)$, simplifying the calculation in the Newton method.
The rules of the replacement are as follows:
- The matrix $G(x_k)$ is positive definite;
- $G(x_{k+1})$ satisfies the quasi-Newton condition: $G(x_{k+1})(\bigtriangledown f(x_{k+1}) - \bigtriangledown f(x_{k})) = x_{k+1} - x_k$.
Obviously, the choice of $G(x_k)$ is not unique; the common algorithms for it are DFP, BFGS, and Broyden's algorithm.
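A minimal gradient-descent sketch for linear regression, as referenced above (squared error is used as the cost for differentiability; the learning rate, iteration count, and toy data are illustrative assumptions):

```python
# Minimal gradient-descent sketch for linear regression with squared-error cost.
# Learning rate, iteration count, and toy data are illustrative assumptions.
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 5.0, 7.2, 8.9])           # roughly y = 2x + 1

w, b = 0.0, 0.0                              # parameters to learn
lr = 0.01                                    # learning rate

for _ in range(2000):
    y_pred = X[:, 0] * w + b
    error = y_pred - y
    grad_w = 2 * np.mean(error * X[:, 0])    # d(cost)/dw
    grad_b = 2 * np.mean(error)              # d(cost)/db
    w -= lr * grad_w                         # step in the negative gradient direction
    b -= lr * grad_b

print(w, b)                                  # should approach 2 and 1
```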
2.4 Evaluation index of linear regression
- R-Squared (coefficient of determination)
$R^2 = 1 - \frac{\sum(Y\_actual - Y\_predict)^2}{\sum(Y\_actual - Y\_mean)^2}$
- Adjusted R-Squared (degree-of-freedom adjusted coefficient of determination)
$R^2\_adjusted = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$
- RMSE
$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(Y\_actual - Y\_predict)^2}$
- MSE
$MSE = \frac{1}{N}\sum_{i=1}^{N}(Y\_actual - Y\_predict)^2$
- MAE
$MAE = \frac{1}{N}\sum_{i=1}^{N}|Y\_actual - Y\_predict|$
- SSE
$SSE = \sum(Y\_actual - Y\_predict)^2$
- F Statistic
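A minimal sketch computing these indices with scikit-learn and NumPy (the toy arrays are illustrative assumptions; the F statistic is omitted here):

```python
# Minimal sketch: evaluation indices for a regression model on toy data.
# The arrays are illustrative assumptions.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predict = np.array([2.8, 5.1, 7.3, 8.7])

r2 = r2_score(y_actual, y_predict)                     # R-Squared
mse = mean_squared_error(y_actual, y_predict)          # MSE
rmse = np.sqrt(mse)                                    # RMSE
mae = mean_absolute_error(y_actual, y_predict)         # MAE
sse = np.sum((y_actual - y_predict) ** 2)              # SSE

n, p = len(y_actual), 1                                # sample size, number of features
r2_adjusted = 1 - (1 - r2) * (n - 1) / (n - p - 1)     # Adjusted R-Squared

print(r2, r2_adjusted, rmse, mse, mae, sse)
```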
2.5 Parameters of sklearn
Call the function in sklearn:
sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)
- fit_intercept: whether to calculate the intercept for the model (if False, the intercept is fixed at zero)
- normalize: whether to normalize the input data before regression
- copy_X: whether to copy X (if False, X may be overwritten)
- n_jobs: the number of CPU cores used in the computation
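A minimal usage sketch with these parameters (the toy data are an illustrative assumption; the `normalize` parameter is omitted because recent scikit-learn versions no longer accept it):

```python
# Minimal usage sketch for sklearn.linear_model.LinearRegression on toy data.
# The data are an illustrative assumption; `normalize` is omitted since recent
# scikit-learn versions removed it.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 5.0, 7.2, 8.9])

reg = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=1)
reg.fit(X, y)

print(reg.coef_, reg.intercept_)     # learned slope and intercept
print(reg.predict([[5.0]]))          # prediction for a new x
```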