$\pi : \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$ : policy, mapping a state to a probability distribution over actions
$a_t \in \mathcal{A} = \mathbb{R}^N$ : action, taken from an $N$-dimensional continuous action space
$\mathcal{S}$ : state space
$p(s_{t+1} \mid s_t, a_t)$ : transition dynamics
$r(s_t, a_t)$ : reward function
$R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(s_i, a_i)$ : discounted future reward (return) from step $t$
Discounted future reward: the closed-form definition above, plus a recursive way of computing it (sketched below).
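A short sketch of that recursive form, obtained by splitting the first reward out of the sum (standard manipulation, not spelled out in the original notes):

$$
\begin{aligned}
R_t &= \sum_{i=t}^{T} \gamma^{(i-t)} r(s_i, a_i) \\
    &= r(s_t, a_t) + \gamma \sum_{i=t+1}^{T} \gamma^{(i-t-1)} r(s_i, a_i) \\
    &= r(s_t, a_t) + \gamma R_{t+1}
\end{aligned}
$$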
Bellman Equation
$Q^{\mu}(s_t, a_t) = E_{r_t, s_{t+1} \sim E}\left[\, r(s_t, a_t) + \gamma\, Q^{\mu}(s_{t+1}, \mu(s_{t+1})) \,\right]$
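Where this comes from, as a brief sketch (standard reasoning, assuming the recursion above and the deterministic next action $a_{t+1} = \mu(s_{t+1})$):

$$
\begin{aligned}
Q^{\mu}(s_t, a_t) &= E\left[\, R_t \mid s_t, a_t \,\right] \\
&= E_{r_t, s_{t+1} \sim E}\left[\, r(s_t, a_t) + \gamma\, E\left[ R_{t+1} \mid s_{t+1}, \mu(s_{t+1}) \right] \,\right] \\
&= E_{r_t, s_{t+1} \sim E}\left[\, r(s_t, a_t) + \gamma\, Q^{\mu}(s_{t+1}, \mu(s_{t+1})) \,\right]
\end{aligned}
$$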
Training objective for the Q function:
$Loss(\theta^{Q}) = E_{s_t \sim \rho^{\beta},\, a_t \sim \beta,\, r_t \sim E}\left[\, \left( Q(s_t, a_t \mid \theta^{Q}) - y_t \right)^2 \,\right]$
$y_t = r(s_t, a_t) + \gamma\, Q(s_{t+1}, \mu(s_{t+1}) \mid \theta^{Q})$
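A minimal PyTorch-style sketch of this critic update; `critic`, `actor`, and the batch layout are illustrative assumptions, not from the original notes:

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, actor, batch, gamma=0.99):
    """Mean squared TD error Loss(theta^Q) over one sampled batch (s_t, a_t, r_t, s_{t+1})."""
    s, a, r, s_next = batch  # tensors: s [B, state_dim], a [B, action_dim], r [B, 1], s_next [B, state_dim]

    # TD target y_t = r(s_t, a_t) + gamma * Q(s_{t+1}, mu(s_{t+1}) | theta^Q)
    with torch.no_grad():
        y = r + gamma * critic(s_next, actor(s_next))

    # (Q(s_t, a_t | theta^Q) - y_t)^2, averaged over the batch
    return F.mse_loss(critic(s, a), y)
```

In the full DDPG algorithm the target $y_t$ is computed with separate, slowly updated target networks $Q'$ and $\mu'$; the sketch follows the formula exactly as written here.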
Online, the Q function is used greedily: pick the action with the largest Q value, $a_t = \arg\max_a Q(s_t, a)$.
This is not well suited to continuous action spaces, because the maximization over $a$ is itself a hard optimization problem at every step (see the sketch below).
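To make the point concrete: greedy selection is easy when the actions can be enumerated, but there is nothing to enumerate in $\mathcal{A} = \mathbb{R}^N$. A minimal sketch, assuming a PyTorch critic `critic(s, a)` (hypothetical name):

```python
import torch

def greedy_action_discrete(critic, s, candidate_actions):
    """Greedy use of Q: evaluate every candidate action and pick the argmax."""
    q_values = torch.stack([critic(s, a) for a in candidate_actions])
    return candidate_actions[int(q_values.argmax())]

# With a continuous action space A = R^N there is no finite candidate list;
# maximizing Q(s, a) over a would require solving an optimization problem at
# every step, which motivates learning a deterministic actor mu(s | theta^mu).
```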
actor function : $\mu(s \mid \theta^{\mu})$
critic function : $Q(s, a)$
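A minimal PyTorch sketch of these two parameterized functions; the layer sizes and the tanh squashing are illustrative assumptions, not prescribed by the notes:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s | theta^mu): state -> one action in R^N."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # squash each action dimension to [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Action-value function Q(s, a | theta^Q): (state, action) -> scalar score."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```

At decision time the actor stands in for the intractable $\arg\max_a Q(s, a)$: the action is simply $a_t = \mu(s_t \mid \theta^{\mu})$, usually with exploration noise added during training.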
To be updated…