Policy Learning (Policy-Based Reinforcement Learning)

This article introduces the use of policy networks in reinforcement learning: a policy network approximates the policy function, taking a state as input and outputting a probability distribution over actions. Policy gradient is one approach to policy learning; it updates the policy network's parameters by gradient ascent so as to maximize the expected state value. The article presents two forms of the policy gradient and discusses how Monte Carlo approximation is used to compute it in practice. The goal of policy learning is to optimize the policy network so as to improve the agent's chances of winning from a given state.


The policy function is a probability density function: its input is the state s, and its output is a probability distribution that reflects how likely each action is to be taken next. The agent draws a random sample from this distribution; for example, if "up" has probability 0.7, the sampled action is likely, but not certain, to be "up". A minimal sketch of this sampling step is shown below.
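The snippet below is a minimal sketch (not from the original article) of how an agent samples an action from the policy's output distribution; the three action names and their probabilities are illustrative assumptions only:

```python
# Sampling an action from a policy's output distribution.
import numpy as np

actions = ["left", "right", "up"]   # hypothetical action set
probs = np.array([0.2, 0.1, 0.7])   # policy output pi(a | s), sums to 1

# Random sampling: "up" is drawn most often, but not every time.
sampled_action = np.random.choice(actions, p=probs)
print(sampled_action)
```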

Policy network: use a neural network to approximate the policy function. For example, the input is the current state (perhaps a screen image); several convolutional layers turn it into a feature vector; a fully connected layer maps the feature vector to a three-dimensional vector (because the game has three actions); finally, a softmax activation (which makes all outputs positive and sum to 1) turns this vector into a probability distribution, i.e., the probability of each action.
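A hedged PyTorch sketch of such a policy network follows; the layer sizes, the 84×84 grayscale input, and the class name PolicyNetwork are assumptions made for illustration, not details taken from the original article:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Image in, probability distribution over actions out."""
    def __init__(self, num_actions: int = 3):
        super().__init__()
        # Convolutional feature extractor (assumed sizes).
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # Fully connected head ending in softmax:
        # outputs are positive and sum to 1.
        self.head = nn.Sequential(
            nn.Linear(32 * 9 * 9, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
            nn.Softmax(dim=-1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, 1, 84, 84) grayscale frame -> (batch, num_actions)
        return self.head(self.features(state))

net = PolicyNetwork()
state = torch.randn(1, 1, 84, 84)   # dummy screen image
action_probs = net(state)           # e.g. tensor([[0.3, 0.2, 0.5]])
```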

Action-value function: Q_π(s_t, a_t) is the conditional expectation of the return U_t. This expectation integrates out all states s and actions a from time t+1 onward, so it depends only on the current state s_t and the current action a_t. It also depends on the policy function π: different choices of π give different values. It can be used to evaluate how good it is for the agent to take action a_t in state s_t.
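Stated in symbols (a standard textbook formulation assumed to match the article's notation, with discount factor γ and rewards R_k not defined elsewhere in this excerpt):

```latex
% Discounted return and action-value function (standard definitions,
% assumed to match the notation used in the article).
\[
  U_t = \sum_{k=t}^{\infty} \gamma^{\,k-t} R_k, \qquad
  Q_\pi(s_t, a_t) = \mathbb{E}\left[\, U_t \mid S_t = s_t,\ A_t = a_t \,\right]
\]
```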

Multi-agent reinforcement learning (MARL) is a subfield of reinforcement learning (RL) in which multiple agents learn simultaneously in a shared environment. MARL has been studied for several decades, but recent advances in deep learning and computational power have led to significant progress in the field. Its development can be divided into several key stages:

1. Early approaches: early MARL algorithms were based on game theory and heuristic methods. These approaches were limited in their ability to handle complex environments or large numbers of agents.
2. Independent Learners: the Independent Learners (IL) algorithm, proposed in the 1990s, allowed agents to learn independently while interacting with a shared environment. This approach was successful in simple environments but often led to convergence issues in more complex scenarios.
3. Decentralized Partially Observable Markov Decision Process (Dec-POMDP): the Dec-POMDP framework was introduced to address the challenges of coordinating multiple agents in a decentralized manner. It models the environment as a Partially Observable Markov Decision Process (POMDP), which allows agents to reason about the beliefs and actions of other agents.
4. Deep MARL: deep learning techniques such as deep neural networks have enabled MARL in more complex environments. Deep MARL algorithms, such as Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG), have achieved state-of-the-art performance in many applications.
5. Multi-Agent Actor-Critic (MAAC): MAAC is a more recent algorithm that combines the advantages of policy-based and value-based methods. It uses an actor-critic architecture to learn decentralized policies and value functions for each agent, while also incorporating a centralized critic to estimate the global value function.

Overall, the development of MARL has been driven by the need to coordinate multiple agents in complex environments. While there is still much to be learned in this field, recent advances in deep learning and reinforcement learning have opened up new possibilities for developing more effective MARL algorithms.