Notes on AAMAS Review Comments

Three reviewers gave comments in total. My heart is bleeding; it feels like my own child has been put through ten kinds of torture.
The first reviewer's comments were harsh, but I am honestly convinced by them (still feeling wronged, though).
[Screenshots of the first reviewer's comments]
The second reviewer's comments were normal.
[Screenshot of the second reviewer's comments]
The third reviewer seems to have no background in this research direction.
[Screenshot of the third reviewer's comments]

My reply:
Reply to Reviewer 3:
Dear reviewer:
I am very grateful for your comments on the manuscript. Following your advice, we recognize the shortcomings of the manuscript and will do our best to improve the paper. Your questions are answered below:
The use of a deep cyclical phased actor-critic is motivated by the large-scale state and action spaces of continuous control tasks. The target critic network and the replay buffer are the same mechanisms as in [16], and Equation 12 also comes from that paper. The target critic is used in the calculation of the critic loss, in Equations 12 and 13. Since PACEE is off-policy, the replay buffer is used to store experiences and, through random sampling, to break sample correlation to some extent. The computation of the critic loss is the same as in DQN, and we use deterministic policy gradients to compute the actor loss. Compared with PACEE, the actors of C-PACEE work cyclically, which is their only difference, so we only show the PACEE algorithm. The time complexity of the algorithm is O(n^2), and it consumes some memory due to the replay buffer and the deep neural networks. The average time per episode in testing is about 0.8053, and each MuJoCo environment requires approximately 8 hours of training on average. The CPU of our machine is an Intel® Core™ i7-7770. TRPO, PPO, and DDPG are the methods in references [23], [24], and [16], respectively, and Ant is a continuous control task in MuJoCo.
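
For concreteness, here is a minimal PyTorch-style sketch of the kind of update this reply describes, i.e. a DDPG-like update as in [16]: the TD target for the critic uses the target networks, the critic loss is the DQN-style mean-squared TD error, and the actor is updated with the deterministic policy gradient. The class and function names, the soft-update rate `tau`, and the hyper-parameter values are illustrative assumptions, not taken from the paper; the exact loss terms are Equations 12 and 13 in the manuscript.

```python
import random
from collections import deque

import torch
import torch.nn as nn


class ReplayBuffer:
    """Uniform replay buffer; random sampling breaks sample correlation."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, float(done)))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, d = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
        return s, a, r.unsqueeze(1), s2, d.unsqueeze(1)


def update(actor, critic, target_actor, target_critic,
           actor_opt, critic_opt, buffer, batch_size=64, gamma=0.99, tau=0.005):
    s, a, r, s2, done = buffer.sample(batch_size)

    # Critic loss: DQN-style mean-squared TD error, with the TD target
    # computed from the *target* critic (and the target actor, since the
    # action space is continuous).
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_critic(s2, target_actor(s2))
    critic_loss = nn.functional.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor loss: deterministic policy gradient, i.e. ascend Q(s, actor(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Slowly track the online networks with the target networks.
    for net, target in ((critic, target_critic), (actor, target_actor)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```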

Reply to Reviewer 2:
Dear reviewer:
I am very grateful for your comments on the manuscript. Following your advice, we recognize the shortcomings of the manuscript and will do our best to improve the paper. Your questions are answered below:

  1. We think that ξ lies in (0, 1). We found that if ξ is greater than 0.5, the generated experiences are dominated by the experience network, which is not conducive to the actors' learning. So we turned the parameter down so that the actors dominate the generated experiences, and we finally found that 1e-5 is a good value (an illustrative sketch of this mixing follows this reply).
  2. Yes.
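
As a purely illustrative sketch of point 1: one plausible reading is that ξ acts as a mixing weight between the experience network's action and the current actor's action when experiences are generated, so a very small ξ (such as 1e-5) lets the actors dominate. The function name, the linear blend, and the network interfaces below are assumptions for illustration; the precise role of ξ is defined in the manuscript.

```python
import torch


def generate_action(actor, experience_net, state, xi=1e-5):
    """Blend the experience network's action with the actor's action.

    ASSUMPTION: xi is treated here as a linear mixing weight. With xi close
    to 0 the actor dominates the generated experience; with xi > 0.5 the
    experience network would dominate instead, which the reply reports is
    not conducive to the actors' learning.
    """
    with torch.no_grad():
        a_actor = actor(state)
        a_exp = experience_net(state)
    return xi * a_exp + (1.0 - xi) * a_actor
```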

Reply to Reviewer 1:
Dear reviewer:
I am very grateful for your comments on the manuscript. Following your advice, we recognize the shortcomings of the manuscript and will do our best to improve the paper. Your questions are answered below:
  1. For environments such as HalfCheetah, HumanoidStandup, Ant, and Swimmer, each episode is 1000 time steps, so they are all 1000 runs. Environments such as Hopper and Walker2d have episode lengths of less than 1000, so they exceed 1000 runs, but all are trained for one million time steps.
  2. Yes. From the second experiment, we can see that both Reacher and InvertedPendulum converge within one million time steps.
  3. Table 1, combined with the experimental result figures, illustrates the advantages of our approach.