Reinforcement Learning Exercise 3.11

Given the current state $S_t$ and actions selected according to a stochastic policy $\pi$, this post derives in detail the expectation of $R_{t+1}$ in that state, expressed in terms of the four-argument function $p$ (3.2), and arrives at a formula for the expected reward.


Exercise 3.11: If the current state is $S_t$, and actions are selected according to stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?

$$
\begin{aligned}
\Pr(S_t = s, A_t = a) &= \Pr(A_t = a \mid S_t = s) \cdot \Pr(S_t = s) \\
&= \pi(a \mid s) \cdot \Pr(S_t = s) \qquad (1)
\end{aligned}
$$
$$
\begin{aligned}
\mathbb{E}[R_{t+1} \mid S_t = s]
&= \sum_{r \in \mathcal{R}} r \cdot \Pr(R_{t+1} = r \mid S_t = s) \\
&= \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s) \\
&= \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} \frac{\Pr(R_{t+1} = r, S_{t+1} = s', S_t = s)}{\Pr(S_t = s)} \\
&= \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} \frac{\sum_{a \in \mathcal{A}} \Pr(R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a)}{\Pr(S_t = s)} \\
&= \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} \frac{\sum_{a \in \mathcal{A}} \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \cdot \Pr(S_t = s, A_t = a)}{\Pr(S_t = s)} \qquad (2)
\end{aligned}
$$
Substituting equation (1) into (2) gives:
$$
\begin{aligned}
\mathbb{E}[R_{t+1} \mid S_t = s]
&= \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} \frac{\sum_{a \in \mathcal{A}} \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \cdot \pi(a \mid s) \cdot \Pr(S_t = s)}{\Pr(S_t = s)} \\
&= \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \cdot \pi(a \mid s) \\
&= \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(s', r \mid s, a) \cdot \pi(a \mid s) \qquad (3)
\end{aligned}
$$
Equation (3) is the result. Rearranged, it is the familiar form $\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} r \, p(s', r \mid s, a)$.
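
As a sanity check, here is a minimal Python sketch (not part of the original solution) that evaluates equation (3) exactly on a small hypothetical two-state MDP and compares it with a Monte Carlo estimate obtained by sampling $A_t \sim \pi(\cdot \mid s)$ and then $(S_{t+1}, R_{t+1}) \sim p(\cdot, \cdot \mid s, A_t)$. All states, actions, rewards, and probabilities below are made-up illustrative values.

```python
import random

# Hypothetical toy MDP, used only to sanity-check equation (3).
# p[(s, a)] is a list of (s_next, r, prob) triples: the four-argument
# dynamics function p(s', r | s, a) from (3.2).
p = {
    ("s0", "left"):  [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
    ("s0", "right"): [("s1", 2.0, 0.5), ("s0", -1.0, 0.5)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 3.0, 1.0)],
}

# Stochastic policy pi(a | s).
pi = {
    "s0": {"left": 0.4, "right": 0.6},
    "s1": {"left": 0.9, "right": 0.1},
}

def expected_reward(s):
    """E[R_{t+1} | S_t = s] computed exactly via equation (3)."""
    return sum(
        prob_a * prob_sr * r
        for a, prob_a in pi[s].items()
        for (s_next, r, prob_sr) in p[(s, a)]
    )

def expected_reward_mc(s, n=200_000, seed=0):
    """The same expectation estimated by Monte Carlo sampling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        # Sample A_t ~ pi(. | s), then (S_{t+1}, R_{t+1}) ~ p(., . | s, A_t).
        actions, weights = zip(*pi[s].items())
        a = rng.choices(actions, weights=weights)[0]
        outcomes = p[(s, a)]
        _, r, _ = rng.choices(outcomes, weights=[w for *_, w in outcomes])[0]
        total += r
    return total / n

for s in ("s0", "s1"):
    print(s, expected_reward(s), round(expected_reward_mc(s), 3))
```

For the dynamics above, the exact values from equation (3) are 0.42 for s0 and 0.3 for s1; the Monte Carlo column should agree up to sampling noise.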
