Reinforcement Learning
An agent performs actions in an environment and receives rewards
Goal: Learn how to take actions that maximize reward
Stochasticity: Rewards and state transitions may be random
Credit assignment: Reward $r_t$ may not directly depend on action $a_t$
Nondifferentiable: Can’t backprop through the world
Nonstationary: What the agent experiences depends on how it acts
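Sketch of the agent-environment loop: the points above can be made concrete with a small toy example. Everything in it (the `ToyEnv` class, its two states, the reward and transition probabilities, and the `always_one` policy) is a hypothetical illustration, not part of any particular library. The environment is a black box that returns a random reward and a random next state, so the agent can only learn from sampled experience rather than by backpropagating through `step`.

```python
import random


class ToyEnv:
    """Hypothetical two-state environment with random rewards and transitions."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)  # private RNG so runs are reproducible
        self.state = 0

    def step(self, action):
        # Stochastic reward: action 1 pays off more often than action 0.
        p_reward = 0.8 if action == 1 else 0.2
        reward = 1.0 if self.rng.random() < p_reward else 0.0
        # Stochastic transition: the next state is random, biased by the action.
        p_state1 = 0.7 if action == 1 else 0.3
        self.state = 1 if self.rng.random() < p_state1 else 0
        return self.state, reward


def run_episode(env, policy, num_steps):
    """Run the interaction loop; return a list of (state, action, reward)."""
    history = []
    state = env.state
    for _ in range(num_steps):
        action = policy(state)                 # agent picks an action
        next_state, reward = env.step(action)  # environment responds
        history.append((state, action, reward))
        state = next_state
    return history


def always_one(state):
    """Trivial fixed policy (an assumption for illustration): always pick action 1."""
    return 1


history = run_episode(ToyEnv(seed=0), always_one, num_steps=100)
total_reward = sum(reward for _, _, reward in history)
print("total reward over 100 steps:", total_reward)
```

Swapping `always_one` for a different policy changes which states and rewards the agent sees, which is the nonstationarity point: the data distribution depends on how the agent acts.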
Markov Decision Process (MDP)
Mathematical formalization of the RL problem: A tuple $(S, A, R, P, \gamma)$
$S$