Reinforcement learning is a subfield of machine learning in which an agent learns by interacting with its environment, observing the results of those interactions, and receiving corresponding rewards. This style of learning mimics the way humans and animals learn.
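To make this loop concrete, here is a minimal sketch of the agent-environment interaction just described. The toy environment, its states, and its rewards are invented for illustration; this is not any particular library's API.

```python
# A minimal sketch of the agent-environment interaction loop.
# The environment and policy here are hypothetical stand-ins.
import random

class Environment:
    """Toy 1-D world: the agent tries to reach position 0 from a random start."""
    def reset(self):
        self.state = random.randint(-5, 5)
        return self.state

    def step(self, action):          # action is -1 or +1
        self.state += action
        reward = -abs(self.state)    # closer to 0 => higher (less negative) reward
        done = self.state == 0
        return self.state, reward, done

env = Environment()
state = env.reset()
done = False
while not done:
    action = -1 if state > 0 else 1          # trivial hand-coded policy, for illustration
    state, reward, done = env.step(action)   # act, then observe the outcome and the reward
```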
We humans have direct sensory contact with our environment: we can perform actions and witness the effects those actions produce. This idea of cause and effect is, without question, key to how we build up our understanding of the world over our lives. This article introduces reinforcement learning for robots from the following angles:
- Where reinforcement learning is applied today
- Why reinforcement learning is closely tied to robotics
- An introduction to reinforcement learning for robots
  - Value function approximation
- Challenges of reinforcement learning in robotics
  - The curse of dimensionality
  - The curse of real-world samples
  - The curse of under-modeling and model uncertainty
- Principles of reinforcement learning in robotics
  - Effective representations
  - Approximate models
  - Prior knowledge
Where reinforcement learning is applied today
Many problems have already been solved with reinforcement learning. Because RL agents do not need an expert to supervise their learning, RL is best suited to complex problems that have no obvious, or no easily hand-engineered, solution procedure. For example:
- Game playing – choosing the best move in a game often depends on many factors, especially when the number of possible game states is very large. Covering that many states with traditional methods would mean writing an enormous number of hand-crafted rules; with RL, no rules need to be written by hand, because the agent can learn them by playing the game. For two-player adversarial games such as backgammon, agents can be trained by playing against human players or against other RL agents.
- Control problems – for example, elevator dispatching. Again, there is no obvious strategy that provides the best and most time-efficient elevator service. For control problems like this, an RL agent can learn in a simulated environment and eventually arrive at a good control policy. Advantages of RL for control problems include adapting easily to changes in the environment and training continuously, improving performance all the while (a minimal code sketch follows below).
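As a concrete, deliberately tiny illustration of learning a control policy from interaction, here is a minimal tabular Q-learning sketch on an assumed toy problem: an agent learns to walk down a 1-D corridor to a goal cell. The environment, rewards, and hyperparameters are all illustrative assumptions, not an elevator simulator.

```python
# A minimal tabular Q-learning sketch on a hypothetical toy problem:
# learn to reach the last cell of a 1-D corridor.
import random

n_states, actions = 10, [-1, 1]          # states 0..9, move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount, exploration rate

for episode in range(500):
    s = random.randrange(n_states - 1)   # start anywhere except the goal
    while s != n_states - 1:
        # epsilon-greedy action selection
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else -0.01
        # one-step Q-learning update toward the bootstrapped target
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        s = s_next
```

Note that the table grows with every state-action combination, which already hints at the curse of dimensionality discussed later in this article.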
A good recent example is DeepMind's paper Human-level control through deep reinforcement learning.
Why reinforcement learning is closely tied to robotics
J. Kober, J. Andrew (Drew) Bagnell, and J. Peters point out in Reinforcement Learning in Robotics: A Survey:
Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors
At the same time, reinforcement learning is gradually becoming a ubiquitous tool in the real world. Generally speaking, problems that are complex for humans may be easy for robots to solve, while problems that are simple for us humans can be very hard for robots. In other words:
What’s complex for a human, robots can do easily and vice versa - Víctor Mayoral Vilches
As a simple example, imagine a three-joint robot manipulator on our desk performing some repetitive task. Traditionally, a robotics engineer handling such a task would either program the entire application from scratch or use existing tools (provided by the manufacturer) to program the use case. Whatever the complexity of the tool and the task, we will run into:
- Per-motor (per-joint) errors arising from inverse kinematics (a sketch follows this list)
- The accuracy of the model when closing the control loop
- Designing the entire control pipeline
- Frequent programmatic calibration
All of this just to get the robot to produce a deterministic motion in a controlled environment.
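To give a feel for the first item above, here is a minimal sketch of analytic inverse kinematics for a hypothetical 2-link planar arm, showing how small per-joint errors translate into end-effector error. The link lengths and error magnitudes are assumptions for illustration.

```python
# A minimal inverse-kinematics sketch for a hypothetical 2-link planar arm.
import math

L1, L2 = 0.3, 0.25                               # assumed link lengths (meters)

def ik_2link(x, y):
    """Return joint angles (q1, q2) that place the end-effector at (x, y)."""
    cos_q2 = (x * x + y * y - L1**2 - L2**2) / (2 * L1 * L2)
    cos_q2 = max(-1.0, min(1.0, cos_q2))         # clamp floating-point error
    q2 = math.acos(cos_q2)                       # elbow-down solution
    q1 = math.atan2(y, x) - math.atan2(L2 * math.sin(q2), L1 + L2 * math.cos(q2))
    return q1, q2

# Each motor only realizes the commanded angle approximately; even a small
# per-joint error visibly shifts the end-effector.
q1, q2 = ik_2link(0.4, 0.2)
q1_real, q2_real = q1 + 0.01, q2 - 0.01          # simulated 0.01 rad joint errors
x_real = L1 * math.cos(q1_real) + L2 * math.cos(q1_real + q2_real)
y_real = L1 * math.sin(q1_real) + L2 * math.sin(q1_real + q2_real)
print(f"commanded (0.400, 0.200), reached ({x_real:.3f}, {y_real:.3f})")
```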
But the truth is: the real world is not controlled.
Problems in robotics are often best represented with high-dimensional, continuous states and actions (note that the 10-30 dimensional continuous actions common in robot reinforcement learning are considered large (Powell, 2012)). In robotics, it is often unrealistic to assume that the true state is completely observable and noise-free.
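A toy sketch of the quoted point: the controller never gets the true state, only a noisy and partial observation of it. The state layout and noise level here are assumptions for illustration.

```python
# The true robot state is hidden; the agent only sees noisy, partial readings.
import random

true_state = [0.12, -0.55, 0.98, 0.03]   # e.g. joint angles + velocities (hidden)

def observe(state, noise_std=0.02):
    # Only the first two components (the angles) are measurable, and each
    # reading is corrupted by Gaussian sensor noise.
    return [s + random.gauss(0.0, noise_std) for s in state[:2]]

obs = observe(true_state)                # what the learning agent actually receives
```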
Back to J. Kober et al.'s Reinforcement Learning in Robotics: A Survey:
Reinforcement learning (RL) enables a robot to autonomously discover an optimal behavior through trial-and-error interactions with its environment. Instead of explicitly detailing the solution to a problem, in reinforcement learning the designer of a control task provides feedback in terms of a scalar objective function that measures the one-step performance of the robot.
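To make "feedback in terms of a scalar objective function" concrete, here is a minimal sketch of such a one-step reward for a hypothetical reaching task; the distance term and effort weighting are assumptions, not a prescription from the survey.

```python
# Instead of telling the robot *how* to reach the target, we only score
# each step with a single scalar. Task and weights are illustrative.
import math

def reward(end_effector, target, effort):
    """One-step performance: closer to the target and cheaper motion is better."""
    distance = math.dist(end_effector, target)
    return -distance - 0.01 * effort     # a scalar signal, no instructions on how

r = reward(end_effector=(0.40, 0.18), target=(0.42, 0.20), effort=1.3)
```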
This makes a lot of sense. Take a basketball shot as an example (a code sketch of this loop follows the list):
1. I get myself behind the 3-point line and get ready for a shot.
2. At this point, my consciousness has no information whatsoever about the exact distance to the basket, nor about the strength I should use to make the shot, so my brain produces an estimate based on the model I have (built upon years of trial-and-error shots).
3. With this estimate, I produce a shot. Let's assume I miss, which I notice through my eyes (sensors). The information perceived through my eyes is not accurate either, so what reaches my brain is not "I missed the shot by 5.45 cm to the right" but rather "the shot went slightly too far to the right and I missed". This information updates the model in my brain, which receives a negative reward. We could debate why my estimate failed. Was the model wrong, despite the hundreds of 3-pointers I have made with that exact model before? Was it the wind (playing outdoors, generally)? Or did I not eat properly that morning? It could easily be all of those or none of them, but many of those aspects simply cannot be controlled by me. So I proceed iterating.
4. With the updated model, I make another shot; if it fails, it drives me back to step 2, and if I make it, I proceed to step 5.
5. Making a shot means that my model did a good job, so my brain strengthens the links that produced a proper shot by giving them a positive reward.
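The loop above is, in miniature, the reinforcement learning loop. Below is a toy sketch of it in code: an internal estimate of the required shot strength is updated from coarse, noisy feedback. All quantities (the true strength, the noise level, the update step) are made-up assumptions for illustration.

```python
# Trial-and-error shot learning: act from an internal estimate, observe a
# coarse noisy outcome, and nudge the estimate until a shot goes in.
import random

true_strength = 7.2          # the unknown "correct" shot strength
estimate = 5.0               # the model in my head
step = 0.5                   # how much one miss updates the model

for attempt in range(1, 101):
    shot = estimate + random.gauss(0.0, 0.2)     # execution is never exact
    error = shot - true_strength
    if abs(error) < 0.3:                         # close enough: the shot goes in,
        break                                    # positive reward reinforces the model
    # I only perceive "too short" or "too long", never the exact error
    estimate += step if error < 0 else -step
```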