Reinforcement Learning: An Explanation of Formula (5.2)

The book does not explain formula (5.2) clearly, and the second and third lines of the formula on page 101 confused me. Here I work through both steps in detail.
First, let $\pi'$ be the $\epsilon$-greedy policy with respect to $q_\pi$, and let $A^* = \arg\max_a q_\pi(s,a)$ be the greedy action. The first line of the formula is

$$
q_\pi(s, \pi'(s)) = \sum_a \pi'(a \mid s) q_\pi(s,a)
$$

By the definition of the $\epsilon$-greedy policy,

$$
\pi'(a \mid s) =
\begin{cases}
1 - \epsilon + \epsilon / |\mathcal A(s)| & \text{if } a = A^* \\
\epsilon / |\mathcal A(s)| & \text{if } a \neq A^*
\end{cases}
$$

Therefore

$$
\begin{aligned}
q_\pi(s, \pi'(s)) &= \sum_{a \neq A^*} \frac{\epsilon}{|\mathcal A(s)|} q_\pi(s,a) + \Bigl(1 - \epsilon + \frac{\epsilon}{|\mathcal A(s)|}\Bigr) q_\pi(s, a = A^*) \\
&= \frac{\epsilon}{|\mathcal A(s)|} \sum_{a \neq A^*} q_\pi(s,a) + \frac{\epsilon}{|\mathcal A(s)|} q_\pi(s, a = A^*) + (1-\epsilon) q_\pi(s, a = A^*) \\
&= \frac{\epsilon}{|\mathcal A(s)|} \sum_{a} q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a)
\end{aligned}
$$

This is the second line of formula (5.2).
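The identity in the second line can be checked numerically. The sketch below is only an illustration: the action values in `q` are made-up numbers, and the helper names `epsilon_greedy_probs` and `second_line_sides` are hypothetical, not taken from the book.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return pi'(a|s) for the epsilon-greedy policy with respect to q_values."""
    n = len(q_values)
    # Index of the greedy action A*; ties are broken by the lowest index.
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n          # every action gets epsilon / |A(s)|
    probs[greedy] += 1.0 - epsilon     # the greedy action gets the extra 1 - epsilon
    return probs


def second_line_sides(q_values, epsilon):
    """Return both sides of the second line of (5.2) for one state."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    # Left side: sum_a pi'(a|s) q_pi(s, a)
    lhs = sum(p * q for p, q in zip(probs, q_values))
    # Right side: eps/|A(s)| * sum_a q_pi(s, a) + (1 - eps) * max_a q_pi(s, a)
    rhs = epsilon / len(q_values) * sum(q_values) + (1.0 - epsilon) * max(q_values)
    return lhs, rhs


# Hypothetical action values q_pi(s, a) for a single state with four actions.
q = [1.0, -2.0, 0.5, 3.0]
lhs, rhs = second_line_sides(q, epsilon=0.1)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```

Here $\pi'$ puts probability $0.025$ on each non-greedy action and $0.925$ on the greedy one, so both sides are the same weighted sum written in two ways.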
Next, consider the quantity $x$, defined as

$$
x = \sum_a \Bigl[ \pi(a \mid s) - \frac{\epsilon}{|\mathcal A(s)|} \Bigr] q_\pi(s,a)
$$
Here $\pi$ is any $\epsilon$-soft policy, so $\pi(a \mid s) \geq \epsilon / |\mathcal A(s)|$ for every action. Each weight $\pi(a \mid s) - \epsilon / |\mathcal A(s)|$ is therefore nonnegative, and the weights sum to $1 - \epsilon$. Replacing every $q_\pi(s,a)$ by the maximum gives

$$
\begin{aligned}
x &\leq \sum_a \Bigl[ \pi(a \mid s) - \frac{\epsilon}{|\mathcal A(s)|} \Bigr] \max_{a'} q_\pi(s,a') \\
&= (1-\epsilon) \max_a q_\pi(s,a)
\end{aligned}
$$

Equality holds, for example, when $\pi$ is itself $\epsilon$-greedy with respect to $q_\pi$. In that case $\pi(a \mid s) = \epsilon / |\mathcal A(s)|$ for $a \neq A^*$, so only the $a = A^*$ term survives:

$$
\begin{aligned}
x &= \Bigl[ \pi(a = A^* \mid s) - \frac{\epsilon}{|\mathcal A(s)|} \Bigr] q_\pi(s, a = A^*) \\
&= \Bigl[ 1 - \epsilon + \frac{\epsilon}{|\mathcal A(s)|} - \frac{\epsilon}{|\mathcal A(s)|} \Bigr] q_\pi(s, a = A^*) \\
&= (1 - \epsilon) q_\pi(s, a = A^*) \\
&= (1-\epsilon) \max_a q_\pi(s,a)
\end{aligned}
$$
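The bound $x \leq (1-\epsilon)\max_a q_\pi(s,a)$ can also be checked on random inputs. This is a sketch under stated assumptions, not a proof: the helpers `random_epsilon_soft_policy` and `x_value` are hypothetical names, and the policies and action values are randomly generated rather than taken from any real task. Negative action values are included on purpose, since the bound has to hold for them too.

```python
import random


def random_epsilon_soft_policy(n_actions, epsilon, rng):
    """Sample a policy with pi(a|s) >= epsilon / n_actions for every action."""
    raw = [rng.random() for _ in range(n_actions)]
    total = sum(raw)
    # Give every action the floor epsilon / n, then spread the remaining
    # probability mass 1 - epsilon in proportion to the random draws.
    return [epsilon / n_actions + (1.0 - epsilon) * r / total for r in raw]


def x_value(policy, q_values, epsilon):
    """Compute x = sum_a [pi(a|s) - eps/|A(s)|] * q_pi(s, a)."""
    n = len(q_values)
    return sum((p - epsilon / n) * q for p, q in zip(policy, q_values))


rng = random.Random(0)  # fixed seed so the run is repeatable
epsilon = 0.2
worst_gap = float("inf")
for _ in range(1000):
    n = rng.randint(2, 6)
    q = [rng.uniform(-5.0, 5.0) for _ in range(n)]
    pi = random_epsilon_soft_policy(n, epsilon, rng)
    # gap = (1 - eps) * max_a q - x, which should never be negative
    gap = (1.0 - epsilon) * max(q) - x_value(pi, q, epsilon)
    worst_gap = min(worst_gap, gap)

# Allow a tiny tolerance for floating-point rounding.
print(worst_gap >= -1e-12)
```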
Also, multiplying and dividing by $1 - \epsilon$, $x$ can be written as

$$
x = (1-\epsilon) \sum_a \frac{\pi(a \mid s) - \frac{\epsilon}{|\mathcal A(s)|}}{1 - \epsilon} q_\pi(s,a)
$$

Since $(1-\epsilon)\max_a q_\pi(s,a) \geq x$, it follows that

$$
\begin{aligned}
q_\pi(s, \pi'(s)) &= \frac{\epsilon}{|\mathcal A(s)|} \sum_{a} q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a) \\
&\geq \frac{\epsilon}{|\mathcal A(s)|} \sum_{a} q_\pi(s,a) + (1-\epsilon) \sum_a \frac{\pi(a \mid s) - \frac{\epsilon}{|\mathcal A(s)|}}{1 - \epsilon} q_\pi(s,a)
\end{aligned}
$$
This is the third line of formula (5.2), which covers both of the steps that were unclear.
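For completeness, here is a short sketch of how the third line simplifies further, assuming the usual identity $v_\pi(s) = \sum_a \pi(a \mid s) q_\pi(s,a)$. The factor $1 - \epsilon$ cancels, and the two $\epsilon / |\mathcal A(s)|$ sums cancel each other:

$$
\begin{aligned}
& \frac{\epsilon}{|\mathcal A(s)|} \sum_{a} q_\pi(s,a) + (1-\epsilon) \sum_a \frac{\pi(a \mid s) - \frac{\epsilon}{|\mathcal A(s)|}}{1 - \epsilon} q_\pi(s,a) \\
&= \frac{\epsilon}{|\mathcal A(s)|} \sum_{a} q_\pi(s,a) - \frac{\epsilon}{|\mathcal A(s)|} \sum_{a} q_\pi(s,a) + \sum_a \pi(a \mid s) q_\pi(s,a) \\
&= v_\pi(s)
\end{aligned}
$$

So the chain of steps shows $q_\pi(s, \pi'(s)) \geq v_\pi(s)$ for any $\epsilon$-soft $\pi$.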
