C. Artificial Intelligence - Reinforcement Learning: Frontier Topics
Challenges
- Exploration vs. Exploitation
- Sample Efficiency
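As a toy illustration of the first challenge, here is a minimal epsilon-greedy sketch on a made-up 3-armed bandit (all names and numbers are illustrative, not from any paper): a larger epsilon explores more but wastes pulls on bad arms, while a smaller one exploits sooner but risks locking onto a suboptimal arm.

```python
import random

class Bandit:
    """Toy 3-armed bandit; the agent does not know these mean rewards."""
    means = [0.2, 0.5, 0.8]

    def pull(self, arm):
        return self.means[arm] + random.gauss(0, 0.1)

def epsilon_greedy(steps=1000, epsilon=0.1):
    bandit = Bandit()
    counts = [0] * 3      # how often each arm was tried
    values = [0.0] * 3    # running mean reward per arm
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(3)                        # explore
        else:
            arm = max(range(3), key=lambda a: values[a])     # exploit
        reward = bandit.pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return values

print(epsilon_greedy())  # estimates should approach [0.2, 0.5, 0.8]
```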
Model-based RL
- Overview
  - Build a model of the real environment
  - The model network feeds its predictions back to the policy network (see the Dyna-style sketch below)
- Application scenarios
  - Board games
- Characteristics
  - Pros
    - Enables better planning, since the agent can look ahead in the learned environment model
  - Cons
    - A perfect reproduction of the real environment is hard to achieve
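A minimal Dyna-style sketch of this loop, with toy tabular stand-ins rather than any specific paper's code: the learned model supplies extra imagined transitions that feed back into value learning, which is where the sample-efficiency gain of model-based RL comes from.

```python
import random

model = {}   # learned environment model: (state, action) -> (next_state, reward)
q = {}       # tabular action-value estimates

def q_update(s, a, r, s2, actions=(0, 1), lr=0.5, gamma=0.9):
    best_next = max(q.get((s2, a2), 0.0) for a2 in actions)
    q[(s, a)] = q.get((s, a), 0.0) + lr * (r + gamma * best_next - q.get((s, a), 0.0))

def dyna_step(s, a, r, s2, planning_steps=10):
    q_update(s, a, r, s2)            # learn from the real transition
    model[(s, a)] = (s2, r)          # refine the environment model
    for _ in range(planning_steps):  # feedback: plan on imagined transitions
        (ps, pa), (ps2, pr) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps2)

dyna_step(s=0, a=1, r=1.0, s2=1)     # one real step yields many planning updates
```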
- Algorithms
  - AlphaGo
    - Training
      - Pre-train the policy network with supervised learning on human games
      - Improve the policy network through self-play, using policy gradients
      - Train the value network on state-result pairs collected during self-play
    - Inference using MCTS (sketched below)
      - Expand tree nodes according to the policy network
      - Evaluate states with the help of the value network
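A minimal sketch of this inference scheme, assuming hypothetical `policy_net(state)` (returning an action-to-probability dict), `value_net(state)`, and `env.step(state, action)` interfaces; the real AlphaGo differs in many details (rollouts, virtual loss, value-sign handling between players).

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a) from the policy network
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}        # action -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_action(node, c_puct=1.0):
    """PUCT rule: trade off the value estimate Q against the policy prior."""
    total = sum(child.visits for child in node.children.values())
    def score(a):
        child = node.children[a]
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
        return child.q() + u
    return max(node.children, key=score)

def simulate(root, state, policy_net, value_net, env):
    """One MCTS simulation: select down to a leaf, expand, evaluate, back up."""
    node, path = root, [root]
    while node.children:                          # selection
        action = select_action(node)
        state = env.step(state, action)
        node = node.children[action]
        path.append(node)
    for action, p in policy_net(state).items():   # expansion via policy network
        node.children[action] = Node(prior=p)
    value = value_net(state)                      # evaluation via value network
    for n in path:                                # backup (two-player sign
        n.visits += 1                             #  flips omitted for brevity)
        n.value_sum += value
```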
  - AlphaGo Zero
    - No pre-training on human games
    - Self-play (with vs. without MCTS)
    - Network training (separately vs. jointly trained networks)
  - AlphaZero
  - MuZero
    - Requires encoding the board position into learned embeddings (see the sketch below)
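A toy sketch of that idea, with random untrained weights standing in for MuZero's three learned functions (representation h, dynamics g, prediction f); the sizes are arbitrary, and the point is only that planning operates on a learned latent encoding rather than on real board positions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(size=(8, 16))   # representation weights (toy)
W_g = rng.normal(size=(17, 16))  # dynamics weights (latent + action)
W_f = rng.normal(size=(16, 5))   # prediction weights (4 actions + value)

def representation(obs):          # h: observation -> latent state
    return np.tanh(obs @ W_h)

def dynamics(latent, action):     # g: (latent, action) -> (next latent, reward)
    x = np.concatenate([latent, [action]])
    nxt = np.tanh(x @ W_g)
    return nxt, float(nxt.sum())  # toy reward head

def prediction(latent):           # f: latent -> (policy logits, value)
    out = latent @ W_f
    return out[:4], float(out[4])

# Planning unrolls entirely in latent space: no environment simulator needed.
s = representation(rng.normal(size=8))
s, r = dynamics(s, action=2)
logits, v = prediction(s)
```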
  - Dream to Control
    - Application scenarios
      - Environments that cannot be fully modeled
    - Idea
      - Alternate between modeling the environment and training the agent, so both improve over time (see the sketch after this list)
    - Details
      - Learn dynamics using representation learning
        - Representation
        - Transition
        - Reward
      - Learn behavior with imagined trajectories
        - Action
        - Value
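A structural sketch of this alternation, assuming hypothetical `world_model`, `actor`, and `critic` objects with the methods shown; it mirrors the two learning phases named above rather than the actual Dreamer implementation.

```python
def train(env, world_model, actor, critic, steps=1000, horizon=15):
    """Alternate model learning on real data with behavior learning in imagination."""
    replay = []
    obs = env.reset()
    for _ in range(steps):
        # interact with the real environment using the current policy
        action = actor.act(world_model.encode(obs))
        next_obs, reward, done = env.step(action)
        replay.append((obs, action, reward, next_obs))
        obs = env.reset() if done else next_obs

        # (1) learn dynamics: representation, transition, and reward models
        world_model.update(replay)

        # (2) learn behavior purely on imagined latent trajectories
        latent = world_model.encode(obs)
        imagined = []
        for _ in range(horizon):                  # no environment calls here
            action = actor.act(latent)
            latent, reward = world_model.imagine(latent, action)
            imagined.append((latent, action, reward))
        critic.update(imagined)                   # value estimates of imagined states
        actor.update(imagined, critic)            # improve actions by imagined value
```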
Large-scale RL projects
- Robotic hand solving a Rubik's Cube (OpenAI)
  - Problem definition
    - Observation: cameras viewing the scene from multiple angles
    - State: observations are converted into a state vector by a CNN
    - Action: chosen from a predefined action set
    - Reward
  - Sim2Real Transfer
    - Train in a simulated environment rather than the real one
    - Automatic Domain Randomization
      - Motivated by the gap between the real environment and the simulation
        - Friction
        - Gravity
        - Smudges on the cube's surface
        - etc.
      - Idea
        - Keep increasing the complexity of the simulated environment (see the sketch below)
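A minimal sketch of the ADR idea, with made-up parameter names, ranges, and thresholds: each randomized physics parameter starts near its nominal simulator value, and its sampling range widens whenever the policy succeeds often enough at the current difficulty, gradually covering the sim-to-real gap.

```python
import random

class ADR:
    def __init__(self):
        # start with narrow ranges around nominal simulator values (toy numbers)
        self.ranges = {"friction": [0.9, 1.1], "gravity": [9.7, 9.9]}

    def sample_env(self):
        """Draw the physics parameters for one training episode."""
        return {k: random.uniform(lo, hi) for k, (lo, hi) in self.ranges.items()}

    def update(self, param, success_rate, step=0.05, threshold=0.8):
        """Widen a parameter's range once the agent handles it well enough."""
        if success_rate > threshold:
            self.ranges[param][0] -= step
            self.ranges[param][1] += step

adr = ADR()
env_params = adr.sample_env()
adr.update("friction", success_rate=0.9)   # range grows; environment gets harder
```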
Meta-RL
- The agent must condition on its interaction history (e.g., via a recurrent policy)
- Meta-RL can be used to learn RL hyperparameters, loss functions, and exploration strategies (see the sketch below)
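A toy sketch of one common Meta-RL recipe (an RL²-style recurrent policy); the sizes, the simple tanh recurrence, and the random stand-ins for the environment are all illustrative. The key point is that the previous action and reward are fed back in, so the hidden state accumulates the history the agent needs in order to adapt to a new task without gradient updates.

```python
import numpy as np

rng = np.random.default_rng(0)
IN, HID, ACTIONS = 4 + 1 + 1, 32, 4            # obs + prev action + prev reward
W_x = rng.normal(scale=0.1, size=(IN, HID))
W_h = rng.normal(scale=0.1, size=(HID, HID))
W_out = rng.normal(scale=0.1, size=(HID, ACTIONS))

def step(hidden, obs, prev_action, prev_reward):
    x = np.concatenate([obs, [prev_action, prev_reward]])
    hidden = np.tanh(x @ W_x + hidden @ W_h)   # history is carried here
    logits = hidden @ W_out
    return hidden, int(np.argmax(logits))

h = np.zeros(HID)
a, r = 0, 0.0
for t in range(10):                            # one "trial" on a new task
    obs = rng.normal(size=4)                   # stand-in for an env observation
    h, a = step(h, obs, a, r)
    r = rng.random()                           # stand-in for an env reward
```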
Priors
- Overview
  - To obtain effective, fast-adapting agents, the agent can rely on previously distilled knowledge in the form of a prior distribution.
- Papers
  - Simultaneous learning of a goal-agnostic default policy
  - Learning a dense embedding space to represent a large set of expert behaviors
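A minimal sketch of one way such a prior is used (KL-regularized RL, a standard framing rather than the specific method of either paper): the task reward is traded off against staying close to a goal-agnostic default policy. The distributions and the coefficient `alpha` below are toy values.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def regularized_return(env_reward, policy_probs, prior_probs, alpha=0.1):
    # maximize task reward while staying near previously distilled behavior;
    # alpha trades off the task reward against fidelity to the prior
    return env_reward - alpha * kl(policy_probs, prior_probs)

policy = np.array([0.7, 0.2, 0.1])   # current task-specific policy pi(a|s)
prior  = np.array([0.4, 0.4, 0.2])   # goal-agnostic default policy pi_0(a|s)
print(regularized_return(env_reward=1.0, policy_probs=policy, prior_probs=prior))
```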
Multi-agent RL
- Definition
  - Multiple agents act in the same environment, learning from and influencing one another
- Challenges
  - The optimal policy depends on the other agents' policies
  - Convergence to optimal behavior is not guaranteed
- Task categories
  - Analysis of emergent behaviors
    - No explicit objective; observe what behavior a group of agents eventually exhibits
  - Learning communication
    - First teach the agents how to communicate
  - Learning cooperation
    - First teach the agents how to cooperate
  - Agents modeling agents
    - The ability of agents to learn models of one another
- Algorithms
  - Social Influence as Intrinsic Motivation
    - A mechanism for achieving coordination in multi-agent RL by rewarding agents for having causal influence over other agents' actions.
    - Actions that lead to bigger changes in other agents' behavior are considered influential and are rewarded.
    - Influence is assessed using counterfactual reasoning (see the sketch below).
    - An agent's immediate reward is modified to:
      - environmental reward + causal influence reward
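A minimal sketch of the counterfactual computation, under the simplifying (assumed) setup that agent k has a probabilistic model of how agent j reacts to each of k's possible actions; all the distributions below are toy numbers. Influence is the divergence between j's policy given k's actual action and j's marginal policy averaged over the actions k could have taken instead.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def influence_reward(policy_j_given_a, actual_a, prior_over_a):
    """Counterfactual influence of agent k's action on agent j's behavior."""
    conditional = policy_j_given_a[actual_a]
    marginal = sum(prior_over_a[a] * policy_j_given_a[a]
                   for a in range(len(prior_over_a)))
    return kl(conditional, marginal)

# toy model of agent j's reaction to each of agent k's two possible actions
policy_j = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]
c = influence_reward(policy_j, actual_a=0, prior_over_a=[0.5, 0.5])
total_reward = 1.0 + 0.5 * c   # environmental reward + weighted influence reward
print(total_reward)
```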
  - AlphaStar: a StarCraft II agent
    - It first learns from human games (supervised learning); the main line of agents then improves through self-play.
    - However, frozen snapshots of the evolving agent's past "selves" are also stored and used as self-play opponents, to keep training from drifting in a wrong direction.
    - In addition, some past agents that once defeated the current agent are kept and likewise used as opponents (see the league sketch below).
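A structural sketch of this league idea, assuming a generic `agent` object; the real AlphaStar league uses far more elaborate matchmaking and dedicated exploiter training, so this only captures the opponent-pool bookkeeping described above.

```python
import copy
import random

class League:
    def __init__(self, agent):
        self.learner = agent
        self.past_selves = []      # frozen historical checkpoints of the learner
        self.exploiters = []       # past agents that defeated the learner

    def snapshot(self):
        """Freeze the current learner and keep it as a future opponent."""
        self.past_selves.append(copy.deepcopy(self.learner))

    def record_defeat(self, opponent):
        """Remember an opponent that beat the learner, to train against it."""
        self.exploiters.append(copy.deepcopy(opponent))

    def pick_opponent(self):
        pool = ([self.learner] +      # plain self-play
                self.past_selves +    # guards against forgetting old strategies
                self.exploiters)      # guards against known weaknesses
        return random.choice(pool)
```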