Reinforcement Learning: Policy Gradient

This post walks through policy gradient methods in reinforcement learning, covering Finite Difference Policy Gradient, Monte-Carlo Policy Gradient, and Actor-Critic Policy Gradient. It highlights the advantages of policy-based RL in high-dimensional or continuous action spaces, along with its potential problems of local optima and high variance, and explains how Actor-Critic methods improve convergence by reducing variance while avoiding bias.




What the previous lecture and this lecture cover


Specifically:

Finite Difference Policy Gradient

Monte-Carlo Policy Gradient

Actor-Critic Policy Gradient

Differences and connections between them:

 

Advantages of Policy-Based RL:
Better convergence properties
Effective in high-dimensional or continuous action spaces
Can learn stochastic policies (the slides include an Aliased Gridworld example that makes this point easy to understand)


Disadvantages of Policy-Based RL:
Typically converge to a local rather than global optimum
Evaluating a policy is typically inefficient and high variance
 


Mathematical formulation of the policy-gradient RL problem:


All three of the policy objective functions above lead to the same form of policy gradient; the approach is as follows (essentially, finding the optimal θ by following the gradient).
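The original slides are not reproduced here; as a rough reconstruction of the standard formulation (so treat the exact notation as an assumption), the three policy objective functions and the gradient-ascent update are:

```latex
% Three ways to measure the quality of a policy \pi_\theta
% (d^{\pi_\theta}: stationary distribution of the Markov chain induced by \pi_\theta)
J_1(\theta)     = V^{\pi_\theta}(s_1)                                              % start value (episodic)
J_{avV}(\theta) = \textstyle\sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)           % average value
J_{avR}(\theta) = \textstyle\sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, R_s^a % average reward per time-step

% Policy-gradient RL searches for a local maximum of J(\theta) by ascending its gradient
\Delta\theta = \alpha\, \nabla_\theta J(\theta)
```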









Finite Difference Policy Gradient (numerically approximating the gradient, in the same spirit as what we usually call gradient checking):
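A minimal numerical sketch of the idea, where `evaluate_policy` (an estimate of J(θ), e.g. averaged over a few rollouts) and the perturbation size `eps` are hypothetical placeholders:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate dJ/dtheta by perturbing each of the n parameters in turn.

    Needs n policy evaluations per gradient estimate; simple and works
    even for non-differentiable policies, but noisy and slow when theta
    is high-dimensional.
    """
    grad = np.zeros_like(theta)
    base = evaluate_policy(theta)            # J(theta)
    for k in range(len(theta)):
        unit = np.zeros_like(theta)
        unit[k] = eps                        # perturb only the k-th parameter
        grad[k] = (evaluate_policy(theta + unit) - base) / eps
    return grad

# gradient ascent on the policy parameters:
# theta = theta + alpha * finite_difference_gradient(evaluate_policy, theta)
```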






Monte-Carlo Policy Gradient:

From here on, π_θ(s,a) is treated as a function of θ, namely the mapping θ ↦ P(a | s, θ).

MC-PG samples episodes and uses them to find the optimal θ.
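Two common differentiable policy parameterisations from the lecture, with their score functions ∇_θ log π_θ(s,a) (notation reconstructed here, so treat it as a sketch):

```latex
% Softmax policy (discrete actions), with linear features \phi(s,a):
\pi_\theta(s,a) \propto e^{\phi(s,a)^{\top}\theta},
\qquad \nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb{E}_{\pi_\theta}\!\left[\phi(s,\cdot)\right]

% Gaussian policy (continuous actions), mean \mu(s) = \phi(s)^{\top}\theta, fixed variance \sigma^2:
a \sim \mathcal{N}(\mu(s), \sigma^2),
\qquad \nabla_\theta \log \pi_\theta(s,a) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}
```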

1) Take a one-step MDP as an example:
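Roughly, the derivation on the missing slide uses the likelihood-ratio (score function) trick:

```latex
% One-step MDP: start in s ~ d(s), take one action a ~ \pi_\theta(s,a), receive r = R_{s,a}, terminate
J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_s d(s) \sum_a \pi_\theta(s,a)\, R_{s,a}

% Likelihood-ratio trick: \nabla_\theta \pi_\theta(s,a) = \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\; r \right]
```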



2) The Policy Gradient Theorem extends the one-step MDP result to arbitrary differentiable policies and arbitrary policy objective functions:
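The theorem replaces the one-step reward r with the long-term action value Q^{π_θ}(s,a); stated here without proof, with notation reconstructed from the lecture:

```latex
% Policy Gradient Theorem: for any differentiable policy \pi_\theta(s,a) and
% for any of the objectives above (up to a constant factor),
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\; Q^{\pi_\theta}(s,a) \right]
```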



3)Monte-Carlo Policy Gradient (REINFORCE) 
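A minimal sketch of REINFORCE, which uses the sampled return G_t as an unbiased estimate of Q^{π_θ}(s_t, a_t); the linear softmax parameterisation and the episode format below are assumptions for illustration, not from the original post:

```python
import numpy as np

def softmax_policy(theta, features):
    """pi_theta(s, .): softmax over linear scores theta^T phi(s, a)."""
    scores = features @ theta               # one score per action
    scores -= scores.max()                  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """episode: list of (features_of_all_actions, action_index, reward)."""
    returns, G = [], 0.0
    for _, _, r in reversed(episode):       # compute the returns G_t backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (features, a, _), G_t in zip(episode, returns):
        probs = softmax_policy(theta, features)
        # score function of a softmax policy: phi(s,a) - E_pi[phi(s,.)]
        score = features[a] - probs @ features
        theta += alpha * score * G_t        # stochastic gradient ascent step
    return theta
```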


Its drawbacks: high variance and a slow convergence rate.





Actor-Critic Policy Gradient:

1) Motivation

To reduce the variance, we can first use the value-function approximation methods from the previous lecture to learn a critic Q_w(s,a) ≈ Q^{π_θ}(s,a), and then use Q_w(s,a) in place of the true action value when computing the gradient update for the policy.
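In symbols (a sketch consistent with the notation above):

```latex
% Critic: estimate the action-value function with parameters w
Q_w(s,a) \approx Q^{\pi_\theta}(s,a)

% Actor: ascend an approximate policy gradient suggested by the critic
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\; Q_w(s,a) \right],
\qquad \Delta\theta = \alpha\, \nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)
```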



2) A simple actor-critic algorithm based on action-value function approximation and TD(0)
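A sketch of this algorithm (often called QAC) with a linear critic Q_w(s,a) = φ(s,a)ᵀw; the environment interface, the feature map `phi`, and the actor helpers `sample_action` / `score` are hypothetical placeholders:

```python
def qac(env, phi, sample_action, score, theta, w,
        alpha=0.01, beta=0.01, gamma=0.99, episodes=1000):
    """Q Actor-Critic: linear critic Q_w(s,a) = phi(s,a) @ w with a TD(0) update.

    sample_action(theta, s) draws a ~ pi_theta(s, .);
    score(theta, s, a) returns grad_theta log pi_theta(s, a).
    """
    for _ in range(episodes):
        s = env.reset()
        a = sample_action(theta, s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = sample_action(theta, s2) if not done else None
            q_next = 0.0 if done else phi(s2, a2) @ w
            delta = r + gamma * q_next - phi(s, a) @ w          # TD(0) error
            theta += alpha * score(theta, s, a) * (phi(s, a) @ w)  # actor step
            w += beta * delta * phi(s, a)                          # critic step
            s, a = s2, a2
    return theta, w
```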


As the algorithm shows, both θ and w get updated.


3) Compatible Function Approximation Theorem:

The problem with the actor-critic algorithm above:

Approximating the policy gradient introduces bias. A biased policy gradient may not find the right solution.
Luckily, if we choose the value function approximation carefully, we can avoid introducing any bias, i.e. we can still follow the exact policy gradient (the two conditions are stated below).
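The two conditions of the theorem, reconstructed here from the lecture:

```latex
% Compatible Function Approximation Theorem. If
% (1) the critic's features are "compatible" with the policy:
\nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(s,a)
% and (2) the critic parameters minimise the mean-squared error
\varepsilon = \mathbb{E}_{\pi_\theta}\!\left[ \big(Q^{\pi_\theta}(s,a) - Q_w(s,a)\big)^2 \right],
% then the policy gradient computed with the critic is exact:
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\; Q_w(s,a) \right]
```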

The proof is omitted here.


Advantage Actor-Critic and Natural Actor-Critic are not written up in detail here either.

But a brief summary:
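Roughly, the lecture's summary slide collects the different forms the policy gradient can take (reconstructed from memory of the slide, so treat the exact symbols as approximate); in each case the actor ascends ∇_θ log π_θ(s,a) multiplied by some estimate of how good the action was:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\; v_t \right]          % REINFORCE (Monte-Carlo return)
  = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\; Q_w(s,a) \right]     % Q Actor-Critic
  = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\; A_w(s,a) \right]     % Advantage Actor-Critic
  = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\; \delta \right]       % TD Actor-Critic
  = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\; \delta\, e \right]   % TD(\lambda) Actor-Critic (eligibility trace e)
```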







