An Introduction to Gradient Descent

This article introduces the basic principles of the gradient descent algorithm and its applications in machine learning. It covers the concept of the gradient and the workflow of gradient descent, and demonstrates a concrete implementation through a linear regression example. It also compares gradient descent with linear regression and introduces the stochastic gradient descent algorithm.

Gradient Descent is a widely used optimization algorithm in machine learning and deep learning. It is a first-order iterative optimization algorithm for finding the minimum of a function. To understand the gradient, you need some background in mathematical analysis.

So let us start with the definition of the gradient. According to Wikipedia, the gradient is a multi-variable generalization of the derivative. For example, given a function \(f(x,y,z) = x^2 + 2y+1/z\), the gradient of \(f(x,y,z)\) is \(\nabla f(x,y,z)=[2x,2,-\frac{1}{z^2}]^T\). At the point \((x_0,y_0,z_0)=(0,1,2)\), the gradient is \(\nabla f(x_0,y_0,z_0)=[0,2,-0.25]^T\). In mathematics, the gradient points in the direction of the greatest rate of increase of the function. Hence, we can approach the minimum by moving in the direction opposite to the gradient.
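
As a quick sanity check, the short sketch below (not part of the original derivation) approximates this gradient with central finite differences; the step size h is an assumption chosen for illustration, and the printed result should be close to \([0, 2, -0.25]^T\).

import numpy as np

def f(v):
    # f(x, y, z) = x^2 + 2y + 1/z
    x, y, z = v
    return x**2 + 2*y + 1/z

def numerical_gradient(func, point, h=1e-6):
    # approximate each partial derivative with a central finite difference
    point = np.asarray(point, dtype=float)
    grad = np.zeros_like(point)
    for k in range(point.size):
        step = np.zeros_like(point)
        step[k] = h
        grad[k] = (func(point + step) - func(point - step)) / (2*h)
    return grad

print(numerical_gradient(f, [0.0, 1.0, 2.0]))  # approximately [0. 2. -0.25]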

Gradient Descent

Denote the objective function \(f(x), x\in \mathbb{R}^p\). A typical gradient descent step can be written as \[x_{n+1} = x_n - \lambda \nabla f(x_n)\]
Here, \(\lambda\) is called the learning rate. Choosing an appropriate learning rate is vital: a learning rate that is too small makes convergence very slow, while one that is too large can overshoot the minimum or even diverge.
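
To make the update rule and the role of \(\lambda\) concrete, here is a minimal one-dimensional sketch on the toy function \(f(x)=x^2\) (an assumed example, separate from the regression case below): a moderate learning rate drives the iterates toward the minimum at 0, while a learning rate above 1 makes them overshoot and diverge.

def run_1d_gradient_descent(x0, learning_rate, steps=20):
    # iterate x_{n+1} = x_n - lambda * f'(x_n), where f(x) = x^2 and f'(x) = 2x
    x = x0
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

print(run_1d_gradient_descent(5.0, 0.1))   # shrinks toward the minimum at 0
print(run_1d_gradient_descent(5.0, 1.1))   # overshoots and blows up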

Let us demonstrate gradient descent with a regression example whose independent variables are centered.
The regression model is \[y=\beta_1 x_1+\beta_2 x_2+\varepsilon\]
\(\beta_1\) and \(\beta_2\) are the parameters we want to estimate. And there are in total \(n\) observations.

In this case, from the definition of ordinary least square, the objective function is \[f(\beta_1,\beta_2) = \sum_{i=1}^{n}(y_i-\beta_1 x_{1,i} - \beta_2 x_{2,i})^2\]
The gradient is \[\nabla f(\beta_1,\beta_2) = \left[-2\sum_{i=1}^n x_{1,i}(y_i-\beta_1 x_{1,i} - \beta_2 x_{2,i}),\ \ -2\sum_{i=1}^n x_{2,i}(y_i-\beta_1 x_{1,i} - \beta_2 x_{2,i})\right]^T\]
The iteration formula:
\[\beta_{1,n+1}=\beta_{1,n}-\lambda \times\left(-2\sum_{i=1}^n x_{1,i}(y_i-\beta_{1,n} x_{1,i} - \beta_{2,n} x_{2,i})\right)\]
\[\beta_{2,n+1}=\beta_{2,n}-\lambda \times\left(-2\sum_{i=1}^n x_{2,i}(y_i-\beta_{1,n} x_{1,i} - \beta_{2,n} x_{2,i})\right)\]
The overall steps of gradient descent:

  1. Set an initial point \((\beta_{1,1},\beta_{2,1})\)
  2. Generate the next point using the formula listed above
  3. Iterate step 2
  4. Stop when the objective function converges or the maximum number of iterations is reached
# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# generate random variables for the simulated regression
x1 = np.random.randn(100)
x2 = np.random.randn(100)
residual = np.random.randn(100)
y = 2*x1 - x2 + residual
def gradient_descent(beta1,beta2,y,x1,x2):
    # gradient of the sum-of-squares objective with respect to (beta1, beta2)
    res = y-beta1*x1-beta2*x2
    return(-2*sum(np.multiply(x1,res)),-2*sum(np.multiply(x2,res)))
def objective_func(beta1,beta2,y,x1,x2):
    # sum of squared residuals
    res = y-beta1*x1-beta2*x2
    return(sum(res**2))
# set the initial point as (0,0)
beta1 = [0]
beta2 = [0]
beta1_iter = 0
beta2_iter = 0
object_value = [objective_func(beta1_iter,beta2_iter,y,x1,x2)]
chg_obj = 1
count = 0
learning_rate = 0.0001
while count==0 or chg_obj>0.000001:
    count += 1
    # simultaneous update of both parameters using the full-data gradient
    grad1, grad2 = gradient_descent(beta1_iter,beta2_iter,y,x1,x2)
    beta1_iter, beta2_iter = beta1_iter - learning_rate * grad1, beta2_iter - learning_rate * grad2
    object_value.append(objective_func(beta1_iter,beta2_iter,y,x1,x2))
    chg_obj = abs(object_value[count] / object_value[count-1] - 1)
    beta1.append(beta1_iter)
    beta2.append(beta2_iter)
# the parameters estimated by Gradient Descent
print(beta1[count],beta2[count])
1.8970883788098543 -1.053406380437433
print('The objective function estimated by Gradient Descent: ',objective_func(beta1[count],beta2[count],y,x1,x2))
The objective function estimated by Gradient Descent:  72.096375419068

Here, we would like to see the evolution path of the gradient descent.

n = 200
xlin = np.linspace(-1, 5, n)
ylin = np.linspace(-4, 2, n)
xlin, ylin = np.meshgrid(xlin, ylin)
obj = np.zeros((n, n))
for i in range(0,n):
    for j in range(0,n):
        obj[i][j] = objective_func(xlin[i][j],ylin[i][j],y,x1,x2)

plt.contourf(xlin, ylin, obj, 20, alpha = 0.75, cmap = 'coolwarm')
plt.plot(beta1,beta2,'b')
plt.show()

[Figure: contour plot of the objective function with the gradient descent path overlaid]

plt.plot(np.arange(0,count+1,10),object_value[0:(count+1):10], ls = '-',marker = 'o')
plt.title('Objective Function Value versus iteration')
plt.xlabel('Iteration')
plt.show()

[Figure: objective function value versus iteration for gradient descent]

Let us compare Gradient Descent with Linear Regression

# This part is Linear Regression estimation
X = pd.concat([pd.DataFrame(data = x1,columns = ['x1']),pd.DataFrame(data = x2, columns = ['x2'])],axis = 1)
Y = pd.DataFrame(data = y,columns = ['y'])
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X,Y)
print(lm.intercept_[0],lm.coef_[0][0],lm.coef_[0][1])
print('\nThe objective function estimated by Linear Regression: ',objective_func(lm.coef_[0][0],lm.coef_[0][1],y,x1,x2))
0.04112020730288102 2.026324766879077 -1.1180692401234138

The objective function estimated by Linear Regression:  92.76335553520987
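
Note that sklearn's LinearRegression fits an intercept by default, while the objective function above contains no intercept, so the two objective values are not strictly comparable. For a like-for-like comparison, the minimal sketch below (assuming the x1, x2, y arrays and objective_func defined earlier are still in scope) solves the same no-intercept model in closed form with np.linalg.lstsq; gradient descent should converge toward this solution.

# closed-form least squares for the no-intercept model y = beta1*x1 + beta2*x2
A = np.column_stack([x1, x2])
beta_closed_form, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print(beta_closed_form)
print('Objective at the closed-form solution: ', objective_func(beta_closed_form[0], beta_closed_form[1], y, x1, x2))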

Stochastic Gradient Descent

For gradient descent, we use all the data to compute the gradient. This raises the concern that it can be quite time-consuming when the objective function is complex and the dataset is very large. Therefore, the Stochastic Gradient Descent (hereinafter SGD) algorithm is introduced to speed up the process. In fact, SGD is widely used in machine learning nowadays.

The major difference between SGD and gradient descent is that, in every iteration, gradient descent updates the parameters using the gradient over all training samples, whereas SGD chooses a random batch of samples to compute the gradient. The formulas below show the parameter iteration.
\[x_{n+1} = x_n - \lambda \times \nabla f_{t_n}(x_n)\]
\[\nabla f_{t_n}(x_n) = \sum_{i\in t_n} \nabla f_i(x_n)\]
where \(t_n\) is a randomly chosen subset (mini-batch) of the data and \(f_i\) is the objective function evaluated on the \(i\)-th observation.

The general steps for SGD

  1. Set an initial point \((\beta_{1,1},\beta_{2,1})\)
  2. Choose a random subset (mini-batch) of the training data
  3. Generate the next point using the formula listed above
  4. Iterate steps 2 and 3
  5. Stop when the objective function converges

Let us implement SGD using the data from the previous case.

beta1 = [0]
beta2 = [0]
beta1_iter = 0
beta2_iter = 0
object_value = [objective_func(beta1_iter,beta2_iter,y,x1,x2)]
chg_obj = 1
count = 0
learning_rate = 0.0001
while count==0 or chg_obj>0.000001:
    count += 1
    # draw a random mini-batch of 30 indices (sampled with replacement)
    flag = np.random.randint(0, 100, size=30)
    grad1, grad2 = gradient_descent(beta1_iter,beta2_iter,y[flag],x1[flag],x2[flag])
    beta1_iter, beta2_iter = beta1_iter - learning_rate * grad1, beta2_iter - learning_rate * grad2
    object_value.append(objective_func(beta1_iter,beta2_iter,y,x1,x2))
    chg_obj = abs(object_value[count] / object_value[count-1] - 1)
    beta1.append(beta1_iter)
    beta2.append(beta2_iter)
# the parameters estimated by Stochastic Gradient Descent
print(beta1[count],beta2[count])
1.8763735882460528 -0.9968233184396047
print('The objective function estimated by Stochastic Gradient Descent: ',objective_func(beta1[count],beta2[count],y,x1,x2))
The objective function estimated by Stochastic Gradient Descent:  72.4169302773315
plt.contourf(xlin, ylin, obj, 20, alpha = 0.75, cmap = 'coolwarm')
plt.plot(beta1[0:(count+1):50],beta2[0:(count+1):50],'b')
plt.show()

[Figure: contour plot of the objective function with the stochastic gradient descent path overlaid]

plt.plot(np.arange(0,count+1,10),object_value[0:(count+1):10], ls = '-',marker = 'o')
plt.title('Objective Function Value versus iteration')
plt.xlabel('Iteration')
plt.show()

[Figure: objective function value versus iteration for stochastic gradient descent]

So far, we have only introduced the basic gradient descent algorithms. There are other extensions of gradient descent, such as AdaGrad and Momentum, which I will introduce in the future.

Reference

  1. Wikipedia, Gradient [https://en.wikipedia.org/wiki/Gradient#Definition]
  2. Wikipedia, Gradient descent [https://en.wikipedia.org/wiki/Gradient_descent]
  3. Slides from NUS Dept. of Statistics, ST4240 (2015), lectured by Alexandre Hoang THIERY
  4. Large-Scale Machine Learning with Stochastic Gradient Descent [http://leon.bottou.org/publications/pdf/compstat-2010.pdf]

Reposted from: https://www.cnblogs.com/PeterShengShijie/p/9243120.html
