Reinforcement Learning: Soft Actor-Critic (SAC)
1. Basic Concepts
1.1 Soft Q-Value
$$\tau^\pi Q(s_t,a_t) = r(s_t,a_t) + \gamma\,\mathbb{E}_{s_{t+1}\sim p}\big[V(s_{t+1})\big]$$
1.2 Soft State Value Function
$$V(s_t) = \mathbb{E}_{a_t \sim \pi}\big[Q(s_t,a_t) - \alpha \log \pi(a_t \mid s_t)\big]$$
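For a finite MDP, both definitions are easy to write down directly. Below is a minimal NumPy sketch assuming tabular arrays and a discrete action space; the names `soft_value` and `soft_backup` and all shapes are illustrative choices, not part of the SAC paper:

```python
import numpy as np

def soft_value(q_row, pi_row, alpha):
    """Soft state value: V(s) = E_{a~pi}[ Q(s,a) - alpha * log pi(a|s) ].
    q_row  -- Q(s, .) over the discrete actions, shape (A,)
    pi_row -- pi(.|s), a probability vector, shape (A,)
    """
    return np.sum(pi_row * (q_row - alpha * np.log(pi_row)))

def soft_backup(q, pi, r, p, gamma, alpha):
    """One application of the soft Bellman backup tau^pi:
    (tau^pi Q)(s,a) = r(s,a) + gamma * E_{s'~p}[ V(s') ].
    q, pi, r -- shape (S, A); p -- transition kernel p(s'|s,a), shape (S, A, S)
    """
    v = np.array([soft_value(q[s], pi[s], alpha) for s in range(q.shape[0])])
    return r + gamma * (p @ v)  # p @ v sums over s', giving shape (S, A)
```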
1.3 Soft Policy Evaluation
$$Q^{k+1} = \tau^\pi Q^k$$
As $k \to \infty$, $Q^k$ converges to the soft Q-value of $\pi$.
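Before the proof, here is what this iteration looks like in code, reusing `soft_backup` from above. The initialization $Q^0 = 0$ and the stopping tolerance are my own choices for the sketch:

```python
def soft_policy_evaluation(pi, r, p, gamma=0.99, alpha=1.0, tol=1e-8):
    """Iterate Q^{k+1} = tau^pi Q^k from Q^0 = 0 until successive iterates
    differ by less than tol in sup-norm; the limit approximates the
    soft Q-value of pi."""
    q = np.zeros_like(r)
    while True:
        q_next = soft_backup(q, pi, r, p, gamma, alpha)
        if np.max(np.abs(q_next - q)) < tol:
            return q_next
        q = q_next
```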
Proof: Define the entropy-augmented reward
$$r_\pi(s_t,a_t) = r(s_t,a_t) + \gamma\,\mathbb{E}_{s_{t+1}\sim p}\big[\mathcal{H}\big(\pi(\cdot \mid s_{t+1})\big)\big]$$
Expanding the backup with the definition of $V$ (taking $\alpha = 1$ here; a general $\alpha$ only rescales the entropy term):
$$Q(s_t,a_t) = r(s_t,a_t) + \gamma\,\mathbb{E}_{s_{t+1}\sim p}\big[\mathcal{H}\big(\pi(\cdot \mid s_{t+1})\big)\big] + \gamma\,\mathbb{E}_{s_{t+1},a_{t+1}\sim \rho_\pi}\big[Q(s_{t+1},a_{t+1})\big]$$
$$Q(s_t,a_t) = r(s_t,a_t) + \gamma\,\mathbb{E}_{s_{t+1},a_{t+1}\sim \rho_\pi}\big[-\log \pi(a_{t+1} \mid s_{t+1})\big] + \gamma\,\mathbb{E}_{s_{t+1},a_{t+1}\sim \rho_\pi}\big[Q(s_{t+1},a_{t+1})\big]$$
$$Q(s_t,a_t) = r(s_t,a_t) + \gamma\,\mathbb{E}_{s_{t+1},a_{t+1}\sim \rho_\pi}\big[Q(s_{t+1},a_{t+1}) - \log \pi(a_{t+1} \mid s_{t+1})\big]$$
By the definition of $r_\pi$, this is exactly the standard Bellman backup $Q(s_t,a_t) = r_\pi(s_t,a_t) + \gamma\,\mathbb{E}_{s_{t+1},a_{t+1}\sim \rho_\pi}[Q(s_{t+1},a_{t+1})]$ under the augmented reward, so the standard convergence result for policy evaluation applies and $Q^k$ converges to the soft Q-value of $\pi$. ∎
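This identity can also be checked numerically: on a random finite MDP, iterating the soft backup (with $\alpha = 1$) and running ordinary policy evaluation under the augmented reward $r_\pi$ reach the same fixed point. A small sanity check reusing the helpers above; the random MDP is purely illustrative:

```python
rng = np.random.default_rng(0)
S, A = 4, 3

# Random reward, transition kernel, and stochastic policy.
r = rng.standard_normal((S, A))
p = rng.random((S, A, S)); p /= p.sum(axis=-1, keepdims=True)
pi = rng.random((S, A));   pi /= pi.sum(axis=-1, keepdims=True)

# Entropy-augmented reward: r_pi(s,a) = r(s,a) + gamma * E_{s'~p}[ H(pi(.|s')) ].
gamma = 0.99
entropy = -np.sum(pi * np.log(pi), axis=1)   # H(pi(.|s)), shape (S,)
r_pi = r + gamma * (p @ entropy)

q_soft = soft_policy_evaluation(pi, r, p, gamma, alpha=1.0)     # soft backup
q_std  = soft_policy_evaluation(pi, r_pi, p, gamma, alpha=0.0)  # standard backup on r_pi
assert np.allclose(q_soft, q_std, atol=1e-5)  # same fixed point
```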
