Problem statement:
To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations.
Model:
a deep Q-network (DQN)
combines RL with deep neural networks, so the agent can learn concepts such as object categories directly from raw sensory data
uses a deep CNN to approximate the optimal action-value function Q*(s, a) (defined below)
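For reference, the optimal action-value function being approximated is the maximum expected return achievable by any behaviour policy π after observing state s and taking action a, with future rewards discounted by γ per step:

```latex
Q^{*}(s,a) \;=\; \max_{\pi}\ \mathbb{E}\!\left[\, r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots \,\middle|\, s_t = s,\ a_t = a,\ \pi \,\right]
```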
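As a concrete illustration, here is a minimal PyTorch sketch of such a Q-network. PyTorch itself and the class/variable names are my assumptions (the paper prescribes no framework); the layer shapes follow the architecture it reports: four stacked 84x84 grayscale frames in, one Q-value per action out.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 spatial map after the convs
            nn.Linear(512, n_actions),              # Q(s, a) for every action a
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x / 255.0)  # scale raw uint8 pixels into [0, 1]
```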
Why RL is known to be unstable (or even to diverge) when a neural network represents Q:
- correlations in the sequence of observations (the data are not independent)
- a small update to Q may significantly change the policy, and therefore the data distribution
- correlations between the action-values Q and the target values r + γ max_a' Q(s', a')
Solutions:
- Experience replay: randomizes over the data, removing correlations in the observation sequence and smoothing over changes in the data distribution (see the replay-buffer sketch after this list)
- An iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target (see the target-network sketch after this list)
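A minimal sketch of the experience-replay mechanism, assuming a plain Python buffer of (s, a, r, s', done) transitions; the class and parameter names here are illustrative, not from the paper:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions; uniform random sampling breaks
    the temporal correlations between consecutive observations."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)  # uniform, without replacement
        return list(zip(*batch))  # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```

Because each minibatch mixes transitions from many points in time (and potentially many episodes), consecutive gradient steps no longer train on highly correlated data.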
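And a sketch of the periodically updated target, reusing the hypothetical `DQN` class from above; `GAMMA`, `SYNC_EVERY` (the paper's update period C), and the function names are assumed values, not the paper's code:

```python
import copy
import torch

GAMMA = 0.99          # discount factor
SYNC_EVERY = 10_000   # C: steps between target-network refreshes (assumed value)

q_net = DQN(n_actions=4)           # online network (class from the sketch above)
target_net = copy.deepcopy(q_net)  # frozen copy holding the target parameters

def td_targets(rewards, next_states, dones):
    """y = r + GAMMA * max_a' Q(s', a'; theta^-), with theta^- held fixed
    between syncs so the targets do not chase every update to the online Q."""
    with torch.no_grad():                                    # no gradient through targets
        next_q = target_net(next_states).max(dim=1).values   # max over actions a'
    return rewards + GAMMA * (1.0 - dones.float()) * next_q  # no bootstrap at terminal s'

def maybe_sync(step: int):
    """Every SYNC_EVERY gradient steps, copy the online weights into the target."""
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```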