
Article link: https://blog.youkuaiyun.com/micklongen/article/details/121029888

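This article outlines frontier techniques in reinforcement learning, including the exploitation-exploration trade-off, sample efficiency, and model-based RL. It focuses on the AlphaGo family of algorithms, covering their application to board games and their training process. It also covers Sim2Real Transfer applications such as a robot hand solving a Rubik's Cube, and the role of Meta-RL in fast adaptation and in learning hyperparameters. The challenges of multi-agent RL and approaches such as the Social Influence as Intrinsic Motivation mechanism are described as well. Finally, AlphaStar's success in StarCraft illustrates the potential of reinforcement learning in complex environments.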


C. Artificial Intelligence - Reinforcement Learning - Frontier Techniques

Challenges

Exploitation vs. Exploration
Sample Efficiency
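
To make the exploitation vs. exploration trade-off concrete, here is a minimal sketch of an epsilon-greedy agent on a multi-armed bandit. The bandit setup, the function name `run_epsilon_greedy`, and the `noise` parameter are illustrative assumptions, not something taken from the original notes.

```python
import random

def run_epsilon_greedy(true_means, epsilon, steps, noise=1.0, seed=0):
    """Play a Gaussian bandit; explore with probability epsilon, else exploit."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of pulls per arm
    estimates = [0.0] * n_arms   # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], noise)
        counts[arm] += 1
        # Incremental update of the running mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

# Pure exploitation (epsilon=0) on noiseless rewards never tries the better arm.
counts, _ = run_epsilon_greedy([0.2, 0.9], epsilon=0.0, steps=100, noise=0.0)
print(counts)
```

With `epsilon=0.0` the agent locks onto the first arm that looks good and never discovers the better one, which is exactly the failure that exploration is meant to prevent. Every exploratory pull also costs real samples, which is where the sample-efficiency concern comes in.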

Model-based RL

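Overview
- Build a model of the real environment
- The model network provides feedback to the policy network
Applications
- Board games
Characteristics
- Pros
  - Enables better planning on top of the environment model
- Cons
  - It is hard to reproduce the real environment perfectly
Algorithms
- AlphaGo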
  - Training
    - Pre-train the policy network using Supervised Learning
    - Self-play and improve the policy network using Policy Gradient
    - Train the value network on state-result pairs (collected during self-play)
  - Inference using MCTS
    - Expand a tree node according to the policy network
    - Evaluate states with the help of value network
- AlphaGo Zero
  - No pre-training
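  - Self-play with MCTS (AlphaGo's self-play used the policy network without MCTS)
  - Network training: policy and value are trained jointly (AlphaGo trained them separately)
- AlphaZero
- MuZero
  - Requires an encoding of the board state (embeddings?)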
- Dream to Control
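  - Use case
    - The environment cannot be fully modeled
  - Idea
    - Alternate between modeling the environment and training, refining both over time
  - Details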
    - Learn dynamics using representation learning
      - Representation
      - Transition
      - Reward
    - Learn behavior with imagined trajectories
      - Action
      - Value
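
The two-phase idea above (learn the dynamics, then learn behavior from imagined trajectories) can be sketched in a few lines. This is a deliberately simplified stand-in, not the Dream to Control method: it swaps the learned latent representation for lookup tables and the actor/value networks for value iteration. The 5-state chain environment and all names below are hypothetical.

```python
import random

N_STATES, ACTIONS, GOAL = 5, (-1, +1), 4

def real_step(state, action):
    """Hypothetical real environment: a 5-state chain, reward at the right end."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def learn_model(n_samples=500, seed=0):
    """Phase 1: fit transition and reward tables from random real interactions."""
    rng = random.Random(seed)
    next_state, reward = {}, {}
    for _ in range(n_samples):
        s, a = rng.randrange(N_STATES), rng.choice(ACTIONS)
        next_state[(s, a)], reward[(s, a)] = real_step(s, a)
    return next_state, reward

def plan_in_imagination(next_state, reward, gamma=0.9, sweeps=50):
    """Phase 2: value iteration that queries only the learned model."""
    value = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES):
            backups = [reward[(s, a)] + gamma * value[next_state[(s, a)]]
                       for a in ACTIONS if (s, a) in next_state]
            if backups:
                value[s] = max(backups)
    # Greedy policy with respect to the imagined one-step returns.
    policy = {}
    for s in range(N_STATES):
        known = [a for a in ACTIONS if (s, a) in next_state]
        if known:
            policy[s] = max(
                known,
                key=lambda a: reward[(s, a)] + gamma * value[next_state[(s, a)]])
    return value, policy

model = learn_model()
value, policy = plan_in_imagination(*model)
print(policy)
```

Once the tables are fitted, `plan_in_imagination` never calls `real_step`. That is the appeal of model-based RL, and also its weakness: the plan is only as good as the learned model.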

Large-scale RL projects

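Solving a Rubik's Cube with a robot hand
- Problem definition
  - Observation: camera views from multiple angles
  - State: a CNN converts the observations into a state (vector)
  - Action: specified in advance
  - Reward
- Sim2Real Transfer
  - Train in a simulated environment rather than the real one
- Automatic Domain Randomization
  - Motivated by the differences between the real and simulated environments
    - Friction
    - Gravity
    - Stains on the cube's surface
    - And so on
  - Idea
    - Keep increasing the complexity of the environment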
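
As a rough illustration of "keep increasing the complexity of the environment", the sketch below widens the randomization range of a single simulator parameter whenever the agent performs well, and narrows it when the agent struggles. The thresholds, step size, and names (`ParamRange`, `adr_update`, `sample_param`) are assumptions made for this example, not values from the original project.

```python
import random
from dataclasses import dataclass

@dataclass
class ParamRange:
    """Randomization range for one simulator parameter (e.g. a friction scale)."""
    low: float
    high: float

def sample_param(param_range, rng):
    """Draw the parameter value used for one training episode."""
    return rng.uniform(param_range.low, param_range.high)

def adr_update(param_range, success_rate, step=0.1,
               expand_at=0.8, shrink_at=0.4, nominal=1.0):
    """Widen the range when the agent does well, narrow it when it struggles."""
    if success_rate >= expand_at:
        return ParamRange(param_range.low - step, param_range.high + step)
    if success_rate <= shrink_at:
        # Never shrink past the nominal (un-randomized) value.
        return ParamRange(min(param_range.low + step, nominal),
                          max(param_range.high - step, nominal))
    return param_range

friction = ParamRange(1.0, 1.0)                   # start with no randomization
friction = adr_update(friction, 0.9, step=0.25)   # agent succeeds: widen
print(friction, sample_param(friction, random.Random(0)))
```

Training starts on a single, easy environment, and the distribution of environments only grows as fast as the policy can keep up.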

Meta-RL

Requires tracking past history.
Meta-RL can be used to learn RL hyperparameters, loss functions, and exploration strategies.

Priors

Overview
- To obtain effective and fast-adapting agents, the agent can rely upon previously distilled knowledge in the form of a prior distribution.
Papers
- Simultaneous learning of a goal-agnostic default policy
- Learning a dense embedding space to represent a large set of expert behaviors

Multi-agent RL

Definition
- Multiple agents share the same environment, learning from and influencing one another
Challenges
- Optimal policy is dependent on the other agents’ policies
- Convergence to optimal behavior is not guaranteed
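
A small sketch of the first challenge, using rock-paper-scissors as an assumed example (the game and the function name `best_response` are not from the original notes): the best action for one agent changes as soon as the other agent's policy changes.

```python
ACTIONS = ("rock", "paper", "scissors")
# PAYOFF[mine][theirs]: +1 win, 0 draw, -1 loss, from my point of view.
PAYOFF = [
    [0, -1, 1],   # rock
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]

def best_response(opponent_policy):
    """Return the action with the highest expected payoff against a mixed policy."""
    expected = [
        sum(p * PAYOFF[mine][theirs] for theirs, p in enumerate(opponent_policy))
        for mine in range(len(ACTIONS))
    ]
    return ACTIONS[max(range(len(ACTIONS)), key=lambda a: expected[a])]

print(best_response([1.0, 0.0, 0.0]))  # opponent always plays rock
print(best_response([0.0, 1.0, 0.0]))  # opponent always plays paper
```

Because every agent is learning at the same time, each one is chasing a best response to a moving target, which is one reason convergence is not guaranteed.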
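Task categories
- Analysis of emergent behaviors
  - No explicit goal; observe the behavior a group of agents ends up with
- Learning communication
  - First teach the agents to communicate
- Learning cooperation
  - First teach the agents to cooperate
- Agents modeling agents
  - The ability of agents to learn from one another
Algorithms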
- Social Influence as Intrinsic Motivation
  - A mechanism for achieving coordination in multi-agent RL by rewarding agents for having causal influence over other agents' actions.
    - Actions that lead to bigger changes in other agents' behavior are considered influential and are rewarded.
    - Influence is assessed using counterfactual reasoning.
  - An agent's immediate reward is modified to:
    - environmental reward + causal influence reward
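- AlphaStar: a StarCraft bot
  - It first learns from human experience. The agents on the topmost line then train through self-play.
  - However, it also stores historical versions of itself from along the way and plays against them, to keep training from drifting in the wrong direction.
  - In addition, it keeps some past versions of itself that defeated it, and plays against those as well.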
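
The causal influence reward listed under Social Influence as Intrinsic Motivation above can be sketched as follows. The counterfactual step asks how agent j's action distribution would look had agent i acted differently, and measures the gap with a KL divergence. The weights `alpha` and `beta` and all function names are assumptions for this example; the notes only state that the two rewards are added.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def influence_reward(policy_i, cond_policy_j, action_i):
    """Counterfactual influence of agent i's chosen action on agent j.

    policy_i[a]         : probability that agent i takes action a
    cond_policy_j[a][b] : probability that agent j takes b if agent i takes a
    Assumes policy_i[action_i] > 0, so the marginal covers the conditional.
    """
    n_actions_j = len(cond_policy_j[0])
    # Marginalize over the counterfactual actions agent i could have taken.
    marginal = [
        sum(policy_i[a] * cond_policy_j[a][b] for a in range(len(policy_i)))
        for b in range(n_actions_j)
    ]
    return kl_divergence(cond_policy_j[action_i], marginal)

def total_reward(env_reward, influence, alpha=1.0, beta=0.1):
    """Environmental reward plus weighted causal influence reward."""
    return alpha * env_reward + beta * influence

# Agent j copies agent i: the influence is maximal, log(2) for two actions.
print(influence_reward([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], action_i=0))
# Agent j ignores agent i: the influence is zero.
print(influence_reward([0.5, 0.5], [[0.3, 0.7], [0.3, 0.7]], action_i=0))
```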