Q-learning VS. Sarsa
- Q-learning update rule (off-policy):
$Q(s,a) = Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$
Then update only the state: $s = s'$.
- Sarsa update rule (on-policy):
$Q(s,a) = Q(s,a) + \alpha \left( r + \gamma Q(s',a') - Q(s,a) \right)$
Then update both the state, $s = s'$, and the action, $a = a'$ (a tabular sketch of both updates follows this list).
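A minimal tabular sketch of the two updates in NumPy, assuming a small discrete environment with integer state/action indices and an ε-greedy behavior policy; the function names and default hyperparameters are illustrative, not from the original notes:

```python
import numpy as np

def epsilon_greedy(Q, s, eps=0.1):
    # Behavior policy used by both algorithms: explore with probability eps, else act greedily.
    n_actions = Q.shape[1]
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap with max_{a'} Q(s', a'), independent of the action taken next.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return s_next                       # only the state is carried over

def sarsa_step(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap with Q(s', a') for the action a' the behavior policy actually takes.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return s_next, a_next               # both state and action are carried over
```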
Hung-yi Lee (李宏毅) Reinforcement Learning
- Policy based --> Learning an actor --> representative method: Policy Gradient
- Value based --> Learning a critic --> representative method: Q-learning
Policy Gradient
Goal: adjust the actor's parameters $\theta$ to maximize the expected total reward $E[R]$.
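A minimal REINFORCE-style sketch of this objective, assuming a discrete-action softmax actor in PyTorch; the network architecture, the `reinforce_update` name, and the hyperparameters are illustrative assumptions, not code from the course:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative actor: maps a 4-dim observation to logits over 2 discrete actions.
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    """One policy-gradient step: ascend E[R] via sum_t G_t * grad log pi_theta(a_t | s_t)."""
    log_probs = F.log_softmax(actor(states), dim=-1)           # (T, n_actions)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()                          # negate: optimizer minimizes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```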
Off-policy (the PPO approach)
Motivation: to reuse sampled data.
- PPO (Proximal Policy Optimization) addresses the requirement that $\theta$ and $\theta'$ should not differ too much (a sketch of the clipped objective follows).
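A minimal sketch of PPO's clipped surrogate loss, assuming per-sample advantages and the old policy's log-probabilities have already been computed; the function name and the `clip_eps` default are illustrative:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio r = pi_theta(a|s) / pi_theta'(a|s), clipped so the update
    # cannot push theta too far away from the data-collecting policy theta'.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()   # negate for gradient descent
```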
Deep Q-learning
2020.12.09: dropped David Silver's course.
Markov Decision Processes
Markov Process
- No reward, no action.
- Only the state set $S$ and the state transition matrix $P$: $\langle S, P \rangle$ (see the sketch after this list).
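A minimal sketch of such a $\langle S, P \rangle$ process: a made-up 3-state transition matrix and a rollout that needs nothing except $P$ (all numbers are illustrative):

```python
import numpy as np

# Toy Markov process <S, P>: P[i, j] = probability of moving from state i to state j.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])

def sample_chain(P, s0=0, steps=10, seed=0):
    # Each step depends only on the current state (Markov property); no actions, no rewards.
    rng = np.random.default_rng(seed)
    s, path = s0, [s0]
    for _ in range(steps):
        s = int(rng.choice(len(P), p=P[s]))
        path.append(s)
    return path

print(sample_chain(P))   # a list of visited states
```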
Markov Reward Process
- Adds rewards.
- Requires a reward function $R$ and a discount factor $\gamma$: $\langle S, P, R, \gamma \rangle$.
- The reward function $R$ only gives the reward of the current state (the immediate reward).
- Goal: maximize the return $G_t = R_{t+1} + \gamma R_{t+2} + \dots$
- Value function: $v(s) = E[G_t \mid S_t = s]$, i.e. the expectation of $G_t$ (a sketch of evaluating it follows this list).
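A minimal sketch of evaluating $v(s)$ for a small MRP, using the matrix form $v = R + \gamma P v$ (which follows from the definition of $G_t$ above); the transition matrix, rewards, and $\gamma$ are made up for illustration:

```python
import numpy as np

# Toy MRP <S, P, R, gamma> with 3 states; all numbers are illustrative.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
R = np.array([1.0, 0.0, -1.0])      # immediate reward of each state
gamma = 0.9

# v(s) = E[G_t | S_t = s] satisfies v = R + gamma * P v,
# so for a finite MRP it is the solution of a linear system.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)                            # value of each state
```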
Markov Decision Process
- Adds decisions (actions) $A$: $\langle S, P, A, R, \gamma \rangle$.
Dynamic Programming
| Problem    | Bellman Equation                                          | Algorithm                   |
|------------|-----------------------------------------------------------|-----------------------------|
| Prediction | Bellman Expectation Equation                              | Iterative Policy Evaluation |
| Control    | Bellman Expectation Equation + Greedy Policy Improvement  | Policy Iteration            |
| Control    | Bellman Optimality Equation                               | Value Iteration             |
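A minimal sketch of the control case via Value Iteration (the Bellman Optimality backup), assuming the finite MDP is given as arrays `P[s, a, s']` and `R[s, a]`; the array layout, function name, and tolerances are illustrative conventions, not from the table above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6, max_iters=10_000):
    """P[s, a, s'] = transition probabilities, R[s, a] = expected immediate reward."""
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(max_iters):
        # Bellman Optimality backup: Q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) v(s')
        q = R + gamma * (P @ v)            # shape (n_states, n_actions)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    policy = q.argmax(axis=1)              # greedy policy w.r.t. the converged values
    return v, policy
```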