Reinforcement Learning Exercise 3.29

This exercise reformulates the Bellman equations for the four key value functions in reinforcement learning — $v_\pi$, $v_*$, $q_\pi$, and $q_*$ — in terms of the state-transition probability function $p$ and the expected reward function $r$. A derivation is given for each function, showing how it can be expressed using $p$ and $r$ alone.


Exercise 3.29 Rewrite the four Bellman equations for the four value functions ($v_\pi$, $v_*$, $q_\pi$, and $q_*$) in terms of the three-argument function $p$ (3.4) and the two-argument function $r$ (3.5).

For $v_\pi$:

$$
\begin{aligned}
v_\pi(s) &= \sum_a \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr] \\
&= \sum_a \pi(a|s) \Bigl[ \sum_{s', r} r\, p(s', r \mid s, a) + \sum_{s', r} \gamma\, v_\pi(s')\, p(s', r \mid s, a) \Bigr] \\
&= \sum_a \pi(a|s) \Bigl[ \sum_{r} r\, p(r \mid s, a) + \sum_{s'} \gamma\, v_\pi(s')\, p(s' \mid s, a) \Bigr] \\
&= \sum_a \pi(a|s) \Bigl[ r(s,a) + \gamma \sum_{s'} v_\pi(s')\, p(s' \mid s, a) \Bigr]
\end{aligned}
$$
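The final form lends itself directly to iterative policy evaluation: sweep over states, applying the right-hand side until the values stop changing. Below is a minimal sketch using an invented two-state, two-action MDP (the arrays `p`, `r`, and `pi` are hypothetical, chosen only for illustration; `p[s, a, s']` stores $p(s'|s,a)$ and `r[s, a]` stores $r(s,a)$).

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9

# p[s, a, s'] = p(s'|s,a); each row over s' sums to 1 (invented numbers)
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
# r[s, a] = r(s,a), the expected immediate reward (invented numbers)
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
# pi[s, a] = pi(a|s): a fixed uniform-random policy
pi = np.full((n_states, n_actions), 0.5)

v = np.zeros(n_states)
for _ in range(1000):
    # q[s, a] = r(s,a) + gamma * sum_s' p(s'|s,a) * v(s')
    q = r + gamma * p @ v
    # v_new(s) = sum_a pi(a|s) * q[s, a]
    v_new = (pi * q).sum(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new
```

After convergence, `v` satisfies the last line of the derivation as a fixed point.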
For $v_*$:

$$
\begin{aligned}
v_*(s) &= \max_a \Bigl\{ \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_*(s') \bigr] \Bigr\} \\
&= \max_a \Bigl\{ \sum_{s', r} r\, p(s', r \mid s, a) + \sum_{s', r} \gamma\, v_*(s')\, p(s', r \mid s, a) \Bigr\} \\
&= \max_a \Bigl\{ \sum_{r} r\, p(r \mid s, a) + \sum_{s'} \gamma\, v_*(s')\, p(s' \mid s, a) \Bigr\} \\
&= \max_a \Bigl\{ r(s,a) + \gamma \sum_{s'} v_*(s')\, p(s' \mid s, a) \Bigr\}
\end{aligned}
$$
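Replacing the policy average with a max over actions turns the same sweep into value iteration. A sketch, reusing the same invented two-state MDP as above (the `p` and `r` arrays are hypothetical):

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9

# p[s, a, s'] = p(s'|s,a); r[s, a] = r(s,a) (invented numbers)
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

v = np.zeros(n_states)
for _ in range(1000):
    # v_new(s) = max_a { r(s,a) + gamma * sum_s' p(s'|s,a) * v(s') }
    v_new = (r + gamma * p @ v).max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new
```

The converged `v` is a fixed point of the Bellman optimality equation in its $r(s,a)$ / $p(s'|s,a)$ form.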
For $q_\pi$, see Exercise 3.19: https://blog.youkuaiyun.com/ballade2012/article/details/89164995
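For completeness, the same manipulation used for $v_\pi$ above (splitting the sum and marginalizing out $r$ and $s'$) gives $q_\pi$ in terms of $r(s,a)$ and $p(s'|s,a)$:

```latex
\begin{aligned}
q_\pi(s,a) &= \sum_{s', r} p(s', r \mid s, a) \Bigl[ r + \gamma \sum_{a'} \pi(a'|s')\, q_\pi(s', a') \Bigr] \\
&= r(s,a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a'} \pi(a'|s')\, q_\pi(s', a')
\end{aligned}
```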

For $q_*$:

$$
\begin{aligned}
q_*(s,a) &= \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma \max_{a'} q_*(s', a') \bigr] \\
&= \sum_{s', r} r\, p(s', r \mid s, a) + \sum_{s', r} p(s', r \mid s, a)\, \gamma \max_{a'} q_*(s', a') \\
&= \sum_{r} r\, p(r \mid s, a) + \sum_{s'} p(s' \mid s, a)\, \gamma \max_{a'} q_*(s', a') \\
&= r(s,a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} q_*(s', a')
\end{aligned}
$$
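This last equation is exactly the update behind Q-value iteration: iterate the right-hand side over the full $q$ table. A sketch on the same invented two-state MDP (the `p` and `r` arrays are hypothetical placeholders, not from the exercise):

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9

# p[s, a, s'] = p(s'|s,a); r[s, a] = r(s,a) (invented numbers)
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

q = np.zeros((n_states, n_actions))
for _ in range(1000):
    # q_new(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) * max_a' q(s',a')
    q_new = r + gamma * p @ q.max(axis=1)
    if np.max(np.abs(q_new - q)) < 1e-10:
        q = q_new
        break
    q = q_new
```

At the fixed point, `q.max(axis=1)` recovers $v_*$, consistent with the $v_*$ derivation above.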
