An Introduction to Gradient Descent

This article introduces the basic principles of the gradient descent algorithm and its applications in machine learning. It covers the concept of the gradient and the workflow of gradient descent, and demonstrates a concrete implementation through a linear regression example. It also compares gradient descent with linear regression and introduces the stochastic gradient descent algorithm.

Gradient Descent is a widely used optimization algorithm in machine learning and deep learning. It is a first-order iterative optimization algorithm for finding the minimum of a function. To understand the gradient, you need some background in mathematical analysis.

So let us start with the definition of the gradient. According to Wikipedia, the gradient is a multi-variable generalization of the derivative. For example, given a function \(f(x,y,z) = x^2 + 2y+1/z\), the gradient of \(f(x,y,z)\) is \(\nabla f(x,y,z)=[2x,2,-\frac{1}{z^2}]^T\). At the point \((x_0,y_0,z_0)=(0,1,2)\), the gradient is \(\nabla f(x_0,y_0,z_0)=[0,2,-0.25]^T\). In mathematics, the gradient points in the direction of the greatest rate of increase of the function. Hence, we can approach the minimum by moving in the direction opposite to the gradient.
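
As a quick sanity check, the short sketch below (not part of the original derivation) approximates this gradient with central finite differences; the step size h is an assumption chosen for illustration, and the printed result should be close to \([0, 2, -0.25]^T\).

import numpy as np

def f(v):
    # f(x, y, z) = x^2 + 2y + 1/z
    x, y, z = v
    return x**2 + 2*y + 1/z

def numerical_gradient(func, point, h=1e-6):
    # approximate each partial derivative with a central finite difference
    point = np.asarray(point, dtype=float)
    grad = np.zeros_like(point)
    for k in range(point.size):
        step = np.zeros_like(point)
        step[k] = h
        grad[k] = (func(point + step) - func(point - step)) / (2*h)
    return grad

print(numerical_gradient(f, [0.0, 1.0, 2.0]))  # approximately [0. 2. -0.25]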

Gradient Descent

Denote the objective function \(f(x), x\in \mathbb{R}^p\). A typical gradient descent step can be written as \[x_{n+1} = x_n - \lambda \nabla f(x_n)\]
Here, \(\lambda\) is called the learning rate. Choosing an appropriate learning rate is vital: a learning rate that is too small makes convergence very slow, while one that is too large can overshoot the minimum or even diverge.
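
To make the update rule and the role of \(\lambda\) concrete, here is a minimal one-dimensional sketch on the toy function \(f(x)=x^2\) (an assumed example, separate from the regression case below): a moderate learning rate drives the iterates toward the minimum at 0, while a learning rate above 1 makes them overshoot and diverge.

def run_1d_gradient_descent(x0, learning_rate, steps=20):
    # iterate x_{n+1} = x_n - lambda * f'(x_n), where f(x) = x^2 and f'(x) = 2x
    x = x0
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

print(run_1d_gradient_descent(5.0, 0.1))   # shrinks toward the minimum at 0
print(run_1d_gradient_descent(5.0, 1.1))   # overshoots and blows up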

Let us demonstrate gradient descent with a regression example whose independent variables are centered.
The regression model is \[y=\beta_1 x_1+\beta_2 x_2+\varepsilon\]
\(\beta_1\) and \(\beta_2\) are the parameters we want to estimate. And there are in total \(n\) observations.

In this case, from the definition of ordinary least square, the objective function is \[f(\beta_1,\beta_2) = \sum_{i=1}^{n}(y_i-\beta_1 x_{1,i} - \beta_2 x_{2,i})^2\]
The gradient is \[\nabla f(\beta_1,\beta_2) = \left[-2\sum_{i=1}^n x_{1,i}(y_i-\beta_1 x_{1,i} - \beta_2 x_{2,i}),\ \ -2\sum_{i=1}^n x_{2,i}(y_i-\beta_1 x_{1,i} - \beta_2 x_{2,i})\right]^T\]
The iteration formula:
\[\beta_{1,n+1}=\beta_{1,n}-\lambda \times\left(-2\sum_{i=1}^n x_{1,i}(y_i-\beta_{1,n} x_{1,i} - \beta_{2,n} x_{2,i})\right)\]
\[\beta_{2,n+1}=\beta_{2,n}-\lambda \times\left(-2\sum_{i=1}^n x_{2,i}(y_i-\beta_{1,n} x_{1,i} - \beta_{2,n} x_{2,i})\right)\]
The overall steps of gradient descent:

  1. Set an initial point \((\beta_{1,1},\beta_{2,1})\)
  2. Generate the next point using the formula listed above
  3. Iterate step 2
  4. Stop when the objective function converges or the maximum number of iterations is reached
# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# generate random variables for the simulated regression
x1 = np.random.randn(100)
x2 = np.random.randn(100)
residual = np.random.randn(100)
y = 2*x1 - x2 + residual
def gradient_descent(beta1,beta2,y,x1,x2):
    # gradient of the sum-of-squares objective with respect to (beta1, beta2)
    res = y-beta1*x1-beta2*x2
    return(-2*sum(np.multiply(x1,res)),-2*sum(np.multiply(x2,res)))
def objective_func(beta1,beta2,y,x1,x2):
    # sum of squared residuals
    res = y-beta1*x1-beta2*x2
    return(sum(res**2))
# set the initial point as (0,0)
beta1 = [0]
beta2 = [0]
beta1_iter = 0
beta2_iter = 0
object_value = [objective_func(beta1_iter,beta2_iter,y,x1,x2)]
chg_obj = 1
count = 0
learning_rate = 0.0001
while count==0 or chg_obj>0.000001:
    count += 1
    # simultaneous update of both parameters using the full-data gradient
    grad1, grad2 = gradient_descent(beta1_iter,beta2_iter,y,x1,x2)
    beta1_iter, beta2_iter = beta1_iter - learning_rate * grad1, beta2_iter - learning_rate * grad2
    object_value.append(objective_func(beta1_iter,beta2_iter,y,x1,x2))
    chg_obj = abs(object_value[count] / object_value[count-1] - 1)
    beta1.append(beta1_iter)
    beta2.append(beta2_iter)
# the parameters estimated by Gradient Descent
print(beta1[count],beta2[count])
1.8970883788098543 -1.053406380437433
print('The objective function estimated by Gradient Descent: ',objective_func(beta1[count],beta2[count],y,x1,x2))
The objective function estimated by Gradient Descent:  72.096375419068

Here, we would like to see the evolution path of the gradient descent.

n = 200
xlin = np.linspace(-1, 5, n)
ylin = np.linspace(-4, 2, n)
xlin, ylin = np.meshgrid(xlin, ylin)
obj = np.zeros((n, n))
for i in range(0,n):
    for j in range(0,n):
        obj[i][j] = objective_func(xlin[i][j],ylin[i][j],y,x1,x2)

plt.contourf(xlin, ylin, obj, 20, alpha = 0.75, cmap = 'coolwarm')
plt.plot(beta1,beta2,'b')
plt.show()

[Figure: contour plot of the objective function with the gradient descent path overlaid]

plt.plot(np.arange(0,count+1,10),object_value[0:(count+1):10], ls = '-',marker = 'o')
plt.title('Objective Function Value versus iteration')
plt.xlabel('Iteration')
plt.show()

[Figure: objective function value versus iteration for gradient descent]

Let us compare Gradient Descent with Linear Regression

# This part is Linear Regression estimation
X = pd.concat([pd.DataFrame(data = x1,columns = ['x1']),pd.DataFrame(data = x2, columns = ['x2'])],axis = 1)
Y = pd.DataFrame(data = y,columns = ['y'])
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X,Y)
print(lm.intercept_[0],lm.coef_[0][0],lm.coef_[0][1])
print('\nThe objective function estimated by Linear Regression: ',objective_func(lm.coef_[0][0],lm.coef_[0][1],y,x1,x2))
0.04112020730288102 2.026324766879077 -1.1180692401234138

The objective function estimated by Linear Regression:  92.76335553520987
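
Note that sklearn's LinearRegression fits an intercept by default, while the objective function above contains no intercept, so the two objective values are not strictly comparable. For a like-for-like comparison, the minimal sketch below (assuming the x1, x2, y arrays and objective_func defined earlier are still in scope) solves the same no-intercept model in closed form with np.linalg.lstsq; gradient descent should converge toward this solution.

# closed-form least squares for the no-intercept model y = beta1*x1 + beta2*x2
A = np.column_stack([x1, x2])
beta_closed_form, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print(beta_closed_form)
print('Objective at the closed-form solution: ', objective_func(beta_closed_form[0], beta_closed_form[1], y, x1, x2))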

Stochastic Gradient Descent

For gradient descent, we use all the data to compute the gradient. This raises the concern that it can be quite time-consuming when the objective function is complex and the dataset is very large. Therefore, the Stochastic Gradient Descent (hereinafter SGD) algorithm is introduced to speed up the process. In fact, SGD is widely used in machine learning nowadays.

The major difference between SGD and gradient descent is that, in every iteration, gradient descent updates the parameters using the gradient over all training samples, whereas SGD chooses a random batch of samples to compute the gradient. The formulas below show the parameter iteration.
\[x_{n+1} = x_n - \lambda \times \nabla f_{t_n}(x_n)\]
\[\nabla f_{t_n}(x_n) = \sum_{i\in t_n} \nabla f_i(x_n)\]
where \(t_n\) is a randomly chosen subset (mini-batch) of the data and \(f_i\) is the objective function evaluated on the \(i\)-th observation.

The general steps for SGD

  1. Set an initial point \((\beta_{1,1},\beta_{2,1})\)
  2. Choose a random subset (mini-batch) of the training data
  3. Generate the next point using the formula listed above
  4. Iterate steps 2 and 3
  5. Stop when the objective function converges

Let us implement SGD using the data from the previous case.

beta1 = [0]
beta2 = [0]
beta1_iter = 0
beta2_iter = 0
object_value = [objective_func(beta1_iter,beta2_iter,y,x1,x2)]
chg_obj = 1
count = 0
learning_rate = 0.0001
while count==0 or chg_obj>0.000001:
    count += 1
    # draw a random mini-batch of 30 indices (sampled with replacement)
    flag = np.random.randint(0, 100, size=30)
    grad1, grad2 = gradient_descent(beta1_iter,beta2_iter,y[flag],x1[flag],x2[flag])
    beta1_iter, beta2_iter = beta1_iter - learning_rate * grad1, beta2_iter - learning_rate * grad2
    object_value.append(objective_func(beta1_iter,beta2_iter,y,x1,x2))
    chg_obj = abs(object_value[count] / object_value[count-1] - 1)
    beta1.append(beta1_iter)
    beta2.append(beta2_iter)
# the parameters estimated by Stochastic Gradient Descent
print(beta1[count],beta2[count])
1.8763735882460528 -0.9968233184396047
print('The objective function estimated by Stochastic Gradient Descent: ',objective_func(beta1[count],beta2[count],y,x1,x2))
The objective function estimated by Stochastic Gradient Descent:  72.4169302773315
plt.contourf(xlin, ylin, obj, 20, alpha = 0.75, cmap = 'coolwarm')
plt.plot(beta1[0:(count+1):50],beta2[0:(count+1):50],'b')
plt.show()

[Figure: contour plot of the objective function with the stochastic gradient descent path overlaid]

plt.plot(np.arange(0,count+1,10),object_value[0:(count+1):10], ls = '-',marker = 'o')
plt.title('Objective Function Value versus iteration')
plt.xlabel('Iteration')
plt.show()

[Figure: objective function value versus iteration for stochastic gradient descent]

So far, we have only introduced the basic gradient descent algorithms. There are other extensions of gradient descent, such as AdaGrad and Momentum, which I will introduce in the future.

Reference

  1. Wikipedia, Gradient [https://en.wikipedia.org/wiki/Gradient#Definition]
  2. Wikipedia, Gradient descent [https://en.wikipedia.org/wiki/Gradient_descent]
  3. Slides from NUS Dept. of Statistics, ST4240 (2015), lectured by Alexandre Hoang THIERY
  4. Large-Scale Machine Learning with Stochastic Gradient Descent [http://leon.bottou.org/publications/pdf/compstat-2010.pdf]

Reposted from: https://www.cnblogs.com/PeterShengShijie/p/9243120.html
