TNNLS 2022
paper
code
Human-guided off-policy RL applied to autonomous driving, with TD3 as the underlying RL algorithm. The method consists of three main parts:
- Human expert demonstration data guides policy optimization through a behavior-cloning (BC) regularization term added to the actor loss (a minimal sketch follows the list):

$$
\begin{aligned}
\mathcal{L}^{\pi}(\phi) &= \frac{1}{N_{1}}\sum_{i}^{N_{1}}\left[-Q(\mathbf{s}_{i},\pi(\cdot|\mathbf{s}_{i};\phi);\theta)\right] + \frac{1}{N_{2}}\sum_{j}^{N_{2}}\left[\omega\cdot\|\mathbf{a}_{j}^{H}-\pi(\cdot|\mathbf{s}_{j};\phi)\|_{2}^{2}\right],\\
\mathcal{L}^{Q}(\theta) &= \frac{1}{N_{1}}\sum_{i}^{N_{1}}\left\|r_{i}+\gamma Q(\mathbf{s}_{i+1},\pi(\cdot|\mathbf{s}_{i+1});\theta)-Q(\mathbf{s}_{i},\mathbf{a}_{i}^{RL};\theta)\right\|_{2}^{2}\\
&\quad+\frac{1}{N_{2}}\sum_{j}^{N_{2}}\left\|r_{j}+\gamma Q(\mathbf{s}_{j+1},\pi(\cdot|\mathbf{s}_{j+1});\theta)-Q(\mathbf{s}_{j},\mathbf{a}_{j}^{H};\theta)\right\|_{2}^{2}.
\end{aligned}
$$
- On top of the standard PER sampling scheme, an additional Q-value residual term brings in the human demonstrations, so that samples with a large TD-error and whose policy action is valued below the human demonstration are more likely to be sampled for optimization (see the priority sketch after the list):

$$
p_{i}\triangleq|\delta_{i}^{TD}|+\varepsilon+\exp\left[Q(\mathbf{s}_{i},\mathbf{a}_{i}^{H};\theta)-Q(\mathbf{s}_{i},\pi(\cdot|\mathbf{s}_{i});\theta)\right].
$$
The resulting sampling distribution over the buffer is:

$$
p_{\mathcal{I}'}(i)=\frac{p_{i}^{\alpha}}{\sum_{k}p_{k}^{\alpha}}.
$$
To avoid the bias in Q-value estimation introduced by the prioritization, each sample is reweighted with the following importance-sampling weight:

$$
w_{IS}(i)=\left[p_{\mathcal{I}'}(i)\right]^{-\beta}.
$$
- Reward shaping under human intervention is applied only at the first intervened step; during consecutive interventions, no shaping is applied after the first step (a sketch follows below):

$$
r_t^{\mathrm{shape}}=r_t+r_{\mathrm{pen}}\cdot\mathbb{1}\left[(\Delta_t=\mathbf{I}^{\mathrm{dim}(\mathcal{A})})\wedge(\Delta_{t-1}=\mathbf{0}^{\mathrm{dim}(\mathcal{A})})\right].
$$
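Below is a minimal PyTorch-style sketch of the BC-regularized update from the first bullet. It assumes an `actor`, a `critic` with target copies, and two batches (`rl_batch` from agent rollouts, `demo_batch` from human demonstrations) are already available; `omega` is the BC weight ω. TD3 details such as twin critics, target-policy smoothing, and delayed actor updates are omitted here.

```python
import torch
import torch.nn.functional as F

def actor_critic_losses(actor, critic, actor_target, critic_target,
                        rl_batch, demo_batch, gamma=0.99, omega=0.1):
    """BC-regularized losses over a mixed batch (names are illustrative).

    rl_batch / demo_batch: dicts with keys 's', 'a', 'r', 's_next'
    holding agent-collected transitions and human demonstrations.
    """
    # Actor loss: -Q on agent states plus the BC regularizer on human states.
    q_pi = critic(rl_batch['s'], actor(rl_batch['s']))
    bc_term = ((actor(demo_batch['s']) - demo_batch['a']) ** 2).sum(dim=-1)
    actor_loss = (-q_pi).mean() + omega * bc_term.mean()

    # Critic loss: TD targets on both agent and human transitions.
    def td_loss(batch):
        with torch.no_grad():
            target = batch['r'] + gamma * critic_target(
                batch['s_next'], actor_target(batch['s_next']))
        return F.mse_loss(critic(batch['s'], batch['a']), target)

    critic_loss = td_loss(rl_batch) + td_loss(demo_batch)
    return actor_loss, critic_loss
```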
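A sketch of the demonstration-aware priority and the importance-sampling weight from the second bullet, assuming per-sample TD-errors and Q-value estimates are already computed; the function names and the `eps`/`alpha`/`beta` defaults are illustrative, not from the paper.

```python
import numpy as np

def demo_aware_priority(td_error, q_human, q_policy, eps=1e-3):
    """TD-error magnitude plus an exponential bonus whenever the human
    action is valued above the current policy action."""
    return np.abs(td_error) + eps + np.exp(q_human - q_policy)

def sample_indices(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Sample indices proportionally to p_i^alpha and return the
    importance-sampling weights p(i)^(-beta) that de-bias the update."""
    rng = rng or np.random.default_rng()
    probs = priorities ** alpha
    probs /= probs.sum()
    idx = rng.choice(len(priorities), size=batch_size, p=probs)
    is_weights = probs[idx] ** (-beta)   # implementations often also divide
                                         # by the max weight for stability
    return idx, is_weights
```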
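A sketch of the one-shot reward shaping from the third bullet, assuming `delta_t` is the per-dimension intervention mask at step t (all ones while the human overrides, all zeros otherwise); `r_pen` is an illustrative penalty value.

```python
import numpy as np

def shaped_reward(r_t, delta_t, delta_prev, r_pen=-1.0):
    """Add the intervention penalty only on the first intervened step,
    i.e. when the mask switches from all zeros to all ones."""
    first_step = np.all(delta_t == 1) and np.all(delta_prev == 0)
    return r_t + (r_pen if first_step else 0.0)
```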

