Notes on AAMAS Review Comments

Three reviewers gave comments in total. My heart is bleeding; it feels like my own child has been put through ten kinds of torture.
The first reviewer's comments were harsh, but I am honestly convinced by them (still feeling wronged, though).
[Screenshots of the first reviewer's comments]
The second reviewer's comments were normal.
[Screenshot of the second reviewer's comments]
The third reviewer seems to have no background in this research direction.
[Screenshot of the third reviewer's comments]

My reply:
Reply to Reviewer 3:
Dear reviewer:
I am very grateful for your comments on the manuscript. Following your advice, we recognize the shortcomings of the manuscript and will do our best to improve the paper. Your questions are answered below:
The use of a deep cyclical phased actor-critic is motivated by the large-scale state and action spaces of continuous control tasks. The target critic network and the replay buffer are the same mechanisms as in [16], and Equation 12 also comes from that paper. The target critic is used in the calculation of the critic loss, in Equations 12 and 13. Since PACEE is off-policy, the replay buffer is used to store experiences and, through random sampling, to break sample correlation to some extent. The computation of the critic loss is the same as in DQN, and we use deterministic policy gradients to compute the actor loss. Compared with PACEE, the actors of C-PACEE work cyclically, which is their only difference, so we only show the PACEE algorithm. The time complexity of the algorithm is O(n^2), and it consumes some memory due to the replay buffer and the deep neural networks. The average time per episode in testing is about 0.8053, and each MuJoCo environment requires approximately 8 hours of training on average. The CPU of our machine is an Intel® Core™ i7-7770. TRPO, PPO, and DDPG are the methods in references [23], [24], and [16], respectively, and Ant is a continuous control task in MuJoCo.
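
For concreteness, here is a minimal PyTorch-style sketch of the kind of update this reply describes, i.e. a DDPG-like update as in [16]: the TD target for the critic uses the target networks, the critic loss is the DQN-style mean-squared TD error, and the actor is updated with the deterministic policy gradient. The class and function names, the soft-update rate `tau`, and the hyper-parameter values are illustrative assumptions, not taken from the paper; the exact loss terms are Equations 12 and 13 in the manuscript.

```python
import random
from collections import deque

import torch
import torch.nn as nn


class ReplayBuffer:
    """Uniform replay buffer; random sampling breaks sample correlation."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, float(done)))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, d = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
        return s, a, r.unsqueeze(1), s2, d.unsqueeze(1)


def update(actor, critic, target_actor, target_critic,
           actor_opt, critic_opt, buffer, batch_size=64, gamma=0.99, tau=0.005):
    s, a, r, s2, done = buffer.sample(batch_size)

    # Critic loss: DQN-style mean-squared TD error, with the TD target
    # computed from the *target* critic (and the target actor, since the
    # action space is continuous).
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_critic(s2, target_actor(s2))
    critic_loss = nn.functional.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor loss: deterministic policy gradient, i.e. ascend Q(s, actor(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Slowly track the online networks with the target networks.
    for net, target in ((critic, target_critic), (actor, target_actor)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```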

Reply to Reviewer 2:
Dear reviewer:
I am very grateful for your comments on the manuscript. Following your advice, we recognize the shortcomings of the manuscript and will do our best to improve the paper. Your questions are answered below:

  1. We think that ξ lies in (0, 1). We found that if ξ is greater than 0.5, the generated experiences are dominated by the experience network, which is not conducive to the actors' learning. So we turned the parameter down so that the actors dominate the generated experiences, and we finally found that 1e-5 is a good value (an illustrative sketch of this mixing follows this reply).
  2. Yes.
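
As a purely illustrative sketch of point 1: one plausible reading is that ξ acts as a mixing weight between the experience network's action and the current actor's action when experiences are generated, so a very small ξ (such as 1e-5) lets the actors dominate. The function name, the linear blend, and the network interfaces below are assumptions for illustration; the precise role of ξ is defined in the manuscript.

```python
import torch


def generate_action(actor, experience_net, state, xi=1e-5):
    """Blend the experience network's action with the actor's action.

    ASSUMPTION: xi is treated here as a linear mixing weight. With xi close
    to 0 the actor dominates the generated experience; with xi > 0.5 the
    experience network would dominate instead, which the reply reports is
    not conducive to the actors' learning.
    """
    with torch.no_grad():
        a_actor = actor(state)
        a_exp = experience_net(state)
    return xi * a_exp + (1.0 - xi) * a_actor
```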

Reply to Reviewer 1:
Dear reviewer:
I am very grateful for your comments on the manuscript. Following your advice, we recognize the shortcomings of the manuscript and will do our best to improve the paper. Your questions are answered below:
  1. For environments such as HalfCheetah, HumanoidStandup, Ant, and Swimmer, each episode is 1000 time steps, so they are all 1000 runs. Environments such as Hopper and Walker2d have episode lengths of less than 1000, so they exceed 1000 runs, but all are trained for one million time steps.
  2. Yes. From the second experiment, we can see that both Reacher and InvertedPendulum converge within one million time steps.
  3. Table 1, combined with the experimental result figures, illustrates the advantages of our approach.