【MATLAB Reinforcement Learning Toolbox】Learning Notes -- Train DDPG Agent to Control Double Integrator System

DDPG is short for deep deterministic policy gradient.

Environment

The control goal is to regulate the position of a mass by applying a force input.

env = rlPredefinedEnv("DoubleIntegrator-Continuous")

The mass moves in one dimension, within bounds of [-4 m, +4 m];

The observations are the position and velocity of the mass;

The episode terminates when the mass moves more than 5 m from its original position or when |x| < 0.01.
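
As a quick illustration, the termination check can be sketched as follows (an assumed reconstruction of the stated condition, not the environment source code; x(1) is the position):

% Example state close enough to the goal to end the episode
x = [0.005; 0];                              % [position; velocity]
isDone = abs(x(1)) > 5 || abs(x(1)) < 0.01   % returns true here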

The reward r(t) provided at every time step is defined by:

r(t)=-(x(t)'Qx(t)+u(t)'Ru(t))

where:

x is the state vector of the mass;

u is the applied force;

Q is the weight matrix on control performance, Q=\begin{bmatrix} 10 & 0\\ 0 & 1 \end{bmatrix};

R is the weight on control effort, R=0.01.
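
As a worked example of the reward formula, plugging in a sample state and force gives:

% Reward for one time step with x = [1; 0.5] and u = 2
Q = [10 0; 0 1];
R = 0.01;
x = [1; 0.5];
u = 2;
r = -(x'*Q*x + u'*R*u)   % r = -(10.25 + 0.04) = -10.29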

This information can also be seen by inspecting the properties of env.

env = 
  DoubleIntegratorContinuousAction with properties:

             Gain: 1
               Ts: 0.1000
      MaxDistance: 5
    GoalThreshold: 0.0100
                Q: [2x2 double]
                R: 0.0100
         MaxForce: Inf
            State: [2x1 double]
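
The environment can also be exercised manually through the generic reset/step interface of MATLAB RL environments (a minimal sketch; if you try this, call reset(env) again before training):

obs0 = reset(env);                % initial observation [position; velocity]
[obs1,r1,isDone1] = step(env,2)   % apply a force of 2 for one sample time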

Observation information:

obsInfo = getObservationInfo(env)
numObservations = obsInfo.Dimension(1)

obsInfo = 
  rlNumericSpec with properties:

     LowerLimit: -Inf
     UpperLimit: Inf
           Name: "states"
    Description: "x, dx"
      Dimension: [2 1]
       DataType: "double"
numObservations =
     2

The observation range for both x and dx is (-\infty, +\infty). The first element is the position and the second is the velocity.

Action space information:

actInfo = getActionInfo(env)
numActions = numel(actInfo)

actInfo = 
  rlNumericSpec with properties:

     LowerLimit: -Inf
     UpperLimit: Inf
           Name: "force"
    Description: [0×0 string]
      Dimension: [1 1]
       DataType: "double"
numActions =
     1

Reset the random number generator seed for reproducibility:

rng(0)

Create the DDPG Agent

Critic network

statePath = imageInputLayer([numObservations 1 1],'Normalization','none','Name','state');
actionPath = imageInputLayer([numActions 1 1],'Normalization','none','Name','action');
commonPath = [concatenationLayer(1,2,'Name','concat')
             quadraticLayer('Name','quadratic')
             fullyConnectedLayer(1,'Name','StateValue','BiasLearnRateFactor',0,'Bias',0)];

criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);

criticNetwork = connectLayers(criticNetwork,'state','concat/in1');
criticNetwork = connectLayers(criticNetwork,'action','concat/in2');

The network structure is shown in the figure below:

plot(criticNetwork)

criticOpts = rlRepresentationOptions('LearnRate',5e-3,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,'Observation',{'state'},'Action',{'action'},criticOpts);
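
This critic represents the Q-value as a linear combination of the quadratic terms of the concatenated state and action, which suits the LQR-like structure of this problem. As a quick sanity check, the untrained critic can be evaluated for a sample observation/action pair (a sketch using getValue; the value is meaningless before training):

q0 = getValue(critic,{[0.5; 0]},{1})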

Actor network

actorNetwork = [
    imageInputLayer([numObservations 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(numActions,'Name','action','BiasLearnRateFactor',0,'Bias',0)];

actorOpts = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);

actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'state'},'Action',{'action'},actorOpts);
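
Because the actor is a single fully connected layer with zero bias, it amounts to a linear state-feedback law u = W*x whose weights are learned during training. It can be queried for a sample observation as a quick check (a sketch using getAction):

a0 = getAction(actor,{[0.5; 0]})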

Agent

Configure the agent options

agentOpts = rlDDPGAgentOptions(...
    'SampleTime',env.Ts,...
    'TargetSmoothFactor',1e-3,...
    'ExperienceBufferLength',1e6,...
    'DiscountFactor',0.99,...
    'MiniBatchSize',32);
agentOpts.NoiseOptions.StandardDeviation = 0.3;
agentOpts.NoiseOptions.StandardDeviationDecayRate = 1e-6;
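
The NoiseOptions above configure the Ornstein-Uhlenbeck exploration noise that DDPG adds to the actor output during training. A rough, standalone sketch of how such noise behaves with these settings is shown below (the mean-attraction constant 0.15 is an assumed default; this illustrates the idea, it is not the toolbox implementation):

% Simulate 200 samples of Ornstein-Uhlenbeck noise
Ts = 0.1; mu = 0; theta = 0.15;    % sample time, long-run mean, mean attraction (assumed)
sigma = 0.3; decay = 1e-6;         % StandardDeviation and StandardDeviationDecayRate
n = 0; noise = zeros(1,200);
for k = 1:200
    n = n + theta*(mu - n)*Ts + sigma*randn*sqrt(Ts);   % OU update
    sigma = sigma*(1 - decay);                          % slow decay of the noise level
    noise(k) = n;
end
plot(noise)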

Assemble the agent (note: this line is the standard call):

agent = rlDDPGAgent(actor,critic,agentOpts);

Train the Agent

Configure the training options

trainOpts = rlTrainingOptions(...
    'MaxEpisodes', 5000, ...
    'MaxStepsPerEpisode', 200, ...
    'Verbose', false, ...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',-66);

Start training

trainingStats = train(agent,env,trainOpts)

The training progress is shown in the figure below.

In the official documentation example, training does not reach the stop condition 'StopTrainingValue',-66 until around episode 3430. On my machine, training took about 3 hours to reach roughly episode 500, but the overall trend was already close to the final result and the oscillatory phase had passed.
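
To verify the result, the trained agent can be simulated in the environment and the accumulated reward inspected (a short validation sketch using the standard sim interface):

simOpts = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOpts);
totalReward = sum(experience.Reward)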
