Soft Actor-Critic(SAC算法)

1. 基本概念

1.1 soft Q-value

τ π Q ( s t , a t ) = r ( s t , a t ) + γ ⋅ E s t + 1 ∼ p [ V ( s t + 1 ) ] \tau ^\pi Q(s_t,a_t)=r(s_t,a_t) + \gamma \cdot E_{s_{t +1}\sim p}[V(s_{t+1})] τπQ(st,at)=r(st,at)+γEst+1p[V(st+1)]

1.2 soft state value function

V ( s t ) = E a t ∼ π [ Q ( s t , a t ) − α ⋅ l o g π ( a t ∣ s t ) ] V(s_t)=E_{a_t \sim \pi}[Q(s_t,a_t)-\alpha \cdot log\pi(a_t|s_t)] V(st)=Eatπ[Q(st,at)αlogπ(atst)]

1.3 Soft Policy Evaluation

Q k + 1 = τ π Q k Q^{k+1}=\tau^\pi Q^k Qk+1=τπQk
当k趋于无穷时, Q k Q^k Qk将收敛至 π \pi π的soft Q-value。
证明:
r π ( s t , a t ) = r ( s t , a t ) + γ ⋅ E s t + 1 ∼ p [ H ( π ( ⋅ ∣ s t + 1 ) ) ] r_\pi(s_t,a_t)=r(s_t,a_t)+\gamma \cdot E_{s_{t+1}\sim p}[H(\pi(\cdot | s_{t+1}))] rπ(st,at)=r(st,at)+γEst+1p[H(π(st+1))]
Q ( s t , a t ) = r ( s t , a t ) + γ ⋅ E s t + 1 ∼ p [ H ( π ( ⋅ ∣ s t + 1 ) ) + E s t + 1 , a t + 1 ∼ ρ π [ Q ( s t + 1 , a t + 1 ) ] Q(s_t,a_t) = r(s_t,a_t)+\gamma \cdot E_{s_{t+1}\sim p}[H(\pi(\cdot | s_{t+1})) + E_{s_{t+1},a_{t+1}\sim \rho_\pi}[Q(s_{t+1},a_{t+1})] Q(st,at)=r(st,at)+γEst+1p[H(π(st+1))+Est+1,at+1ρπ[Q(st+1,at+1)]
Q ( s t , a t ) = r ( s t , a t ) + γ ⋅ E s t + 1 , a t + 1 ∼ ρ π [ − l o g ( π ( a t + 1 ∣ s t + 1 ) ) + E s t + 1 , a t + 1 ∼ ρ π [ Q ( s t + 1 , a t + 1 ) ] Q(s_t,a_t) = r(s_t,a_t)+\gamma \cdot E_{s_{t+1},a_{t+1}\sim \rho_\pi}[-log(\pi(a_{t+1} | s_{t+1})) + E_{s_{t+1},a_{t+1}\sim \rho_\pi}[Q(s_{t+1},a_{t+1})] Q(st,at)=r(st,at)+γEst+1,at+1ρπ[log(π(at+1st+1))+Est+1,at+1ρπ[Q(st+1,at+1)]
Q ( s t , a t ) = r ( s t , a t ) + γ ⋅ E s t + 1 , a t + 1 ∼ ρ π [ Q ( s t + 1 , a t + 1 ) − l o g ( π ( a t + 1 ∣ s t + 1 ) ) Q(s_t,a_t) = r(s_t,a_t)+\gamma \cdot E_{s_{t+1},a_{t+1}\sim \rho_\pi}[Q(s_{t+1},a_{t+1})-log(\pi(a_{t+1} | s_{t+1})) Q(st,at)=r(st,at)+γEst+1,at+1ρπ[Q(st+1,a

评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值