Warning: math formulas ahead.
Policy Gradient
$$
\begin{aligned}
J(\pi_\theta) &= \int_S \rho^\pi(s) \int_A \pi_\theta(s,a)\, r(s,a)\, da\, ds \\
&= E_{s\sim \rho^\pi,\, a\sim \pi_\theta}\big[r(s,a)\big]
\end{aligned}
$$
$$
\rho^\pi(s') = \int_S \sum_{t=1}^{\infty} \gamma^{t-1}\, p_1(s)\, p(s\to s', t, \pi)\, ds
$$
where $p_1(s)$ is the probability that the initial state is $s$, and $p(s\to s', t, \pi)$ is the probability of reaching state $s'$ from state $s$ after $t$ steps under policy $\pi$.
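To make the objective concrete, below is a minimal Monte Carlo sketch that estimates $J(\pi_\theta)$ by rolling out trajectories and discounting rewards by $\gamma^{t-1}$, which is the sampled counterpart of integrating $r(s,a)$ over $\rho^\pi$ and $\pi_\theta$. The `env` object (a Gym-style `reset()`/`step()` interface assumed to return `(next_state, reward, done)`) and the `sample_action` function are hypothetical placeholders, not part of the original post.

```python
import numpy as np

def estimate_objective(env, sample_action, gamma=0.99, episodes=100, horizon=200):
    """Monte Carlo estimate of J(pi_theta).

    Rolling out trajectories and accumulating gamma^(t-1) * r_t is the
    sampled counterpart of the double integral over rho^pi and pi_theta.
    `env` and `sample_action` are assumed interfaces for this sketch.
    """
    returns = []
    for _ in range(episodes):
        state = env.reset()
        ret, discount = 0.0, 1.0              # discount = gamma^(t-1), so 1.0 at t = 1
        for _ in range(horizon):
            action = sample_action(state)     # a ~ pi_theta(. | s)
            state, reward, done = env.step(action)
            ret += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(ret)
    return float(np.mean(returns))            # sample mean approximates J(pi_theta)
```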
SPG
stochastic policy gradient
"Stochastic" means the policy is a probability distribution over actions: $\pi_\theta(a|s) = P[a \mid s; \theta]$.
$$
\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \int_S \rho^\pi(s) \int_A \nabla_\theta \pi_\theta(s,a)\, Q^\pi(s,a)\, da\, ds \\
&= E_{s\sim \rho^\pi,\, a\sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\, Q^\pi(s,a)\big]
\end{aligned}
$$
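In practice the expectation is estimated from samples, and the $\log$ form is what the score-function (REINFORCE-style) estimator implements. Below is a minimal PyTorch-style sketch for a discrete action space; `policy_net` (a network mapping states to action logits) and `q_value` (any estimate of $Q^\pi(s,a)$, e.g. the sampled return) are assumptions for illustration.

```python
import torch

def spg_loss(policy_net, state, q_value):
    """Single-sample stochastic policy gradient (score-function) loss.

    Backpropagating -log pi_theta(a|s) * Q(s,a) yields the gradient
    grad_theta log pi_theta(a|s) * Q(s,a) from the formula above.
    `policy_net` and `q_value` are assumed inputs for this sketch.
    """
    logits = policy_net(state)                            # unnormalized action preferences
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                                # a ~ pi_theta(. | s)
    log_prob = dist.log_prob(action)                      # log pi_theta(a | s)
    loss = -log_prob * float(q_value)                     # negate: optimizers minimize
    return loss, action
```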
DPG
deterministic policy gradient
The resulting policy is deterministic: for each state it outputs a single action, $a = \mu_\theta(s)$.
$$
\begin{aligned}
J(\mu_\theta) &= \int_S \rho^\mu(s)\, r(s, \mu_\theta(s))\, ds \\
&= E_{s\sim \rho^\mu}\big[r(s, \mu_\theta(s))\big]
\end{aligned}
$$
$$
\begin{aligned}
\nabla_\theta J(\mu_\theta) &= \int_S \rho^\mu(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)}\, ds \\
&= E_{s\sim \rho^\mu}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)}\big]
\end{aligned}
$$
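In an autodiff framework there is no need to form the chain-rule product $\nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)|_{a=\mu_\theta(s)}$ by hand: backpropagating the loss $-Q(s, \mu_\theta(s))$ through the critic into the actor computes exactly that product. A minimal PyTorch-style sketch follows; the `actor` and `critic` networks are assumptions for illustration.

```python
import torch

def dpg_actor_loss(actor, critic, states):
    """Deterministic policy gradient expressed as an autodiff loss.

    Differentiating -Q(s, mu_theta(s)) w.r.t. theta gives
    -grad_theta mu_theta(s) * grad_a Q(s,a)|_{a=mu_theta(s)},
    so minimizing this loss ascends J(mu_theta).
    `actor(states) -> actions` and `critic(states, actions) -> Q` are assumed interfaces.
    """
    actions = actor(states)              # a = mu_theta(s), differentiable w.r.t. theta
    q_values = critic(states, actions)   # Q^mu(s, mu_theta(s))
    return -q_values.mean()
```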
DDPG
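DDPG (deep deterministic policy gradient, Lillicrap et al. 2015) applies the DPG theorem with deep networks for the actor $\mu_\theta$ and the critic $Q_w$, stabilized by an experience replay buffer and slowly updated target networks. Below is a minimal sketch of one DDPG update step, not a full implementation; the actor/critic networks, their target copies, the optimizers, and the transition batch are all assumed inputs.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG update step (sketch; all networks and the batch are assumptions).

    batch = (states, actions, rewards, next_states, dones), each a tensor.
    """
    states, actions, rewards, next_states, dones = batch

    # Critic: regress Q_w(s,a) toward a bootstrapped target computed with the
    # *target* networks, the key stabilization trick in DDPG.
    with torch.no_grad():
        next_actions = target_actor(next_states)
        target_q = rewards + gamma * (1 - dones) * target_critic(next_states, next_actions)
    critic_loss = F.mse_loss(critic(states, actions), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, exactly the DPG formula above.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft (Polyak) update of target networks: theta' <- tau*theta + (1-tau)*theta'.
    for target, online in ((target_actor, actor), (target_critic, critic)):
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```

In a full training loop one would also add exploration noise (e.g. Gaussian or Ornstein-Uhlenbeck) to the actions written into the replay buffer, since the deterministic policy does not explore on its own.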
