DPR


License: CC 4.0 BY-SA


Actor:

language model

$\begin{equation} P(a_t \mid a_0, \dots, a_{t-1}) = P(a_t \mid c_t) \end{equation}$

Observation sequence

$\begin{equation} \bar{o} = \sum \limits_{i=1}^n \phi(o_i) \end{equation}$
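The sum in Eq. (2) can be sketched in a few lines of Python. This is only an illustration: the notes do not say what $\phi$ is, so the lookup table `EMBED`, its keys, and its values are hypothetical stand-ins for an embedding function.

```python
# Hypothetical embedding table standing in for phi; keys and values are illustrative only.
EMBED = {
    "open_door": [1.0, 0.0, 0.5],
    "take_key": [0.0, 2.0, 0.5],
    "walk": [0.5, 0.5, 0.0],
}


def phi(obs):
    """Look up the embedding vector of a single observation."""
    return EMBED[obs]


def summarize(observations):
    """Return o_bar, the element-wise sum of phi(o_i) over the sequence."""
    dim = len(next(iter(EMBED.values())))
    o_bar = [0.0] * dim
    for obs in observations:
        for j, x in enumerate(phi(obs)):
            o_bar[j] += x
    return o_bar


o_bar = summarize(["open_door", "take_key", "walk"])
print(o_bar)  # [1.5, 2.5, 1.0]
```

Because the summary is a plain sum, it does not depend on the order of the observations.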
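Based on $\bar{o}$, find the category of the most similar plan, which identifies the topic $T$ that the observation sequence belongs to; then select the $k$ most similar plans within topic $T$.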
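The two-stage lookup described above (first pick the topic, then the $k$ closest plans inside it) might look like the sketch below. It rests on assumptions not stated in the notes: similarity is taken to be cosine similarity, plans are stored as vectors grouped by topic in a dictionary, and the names `retrieve` and `PLANS` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; an assumed choice, the notes do not name the measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(o_bar, plans_by_topic, k):
    """Pick the topic of the single most similar plan, then the top-k plans in it."""
    topic = max(
        plans_by_topic,
        key=lambda t: max(cosine(p, o_bar) for p in plans_by_topic[t]),
    )
    ranked = sorted(
        plans_by_topic[topic], key=lambda p: cosine(p, o_bar), reverse=True
    )
    return topic, ranked[:k]


# Hypothetical plan vectors grouped by topic.
PLANS = {
    "cooking": [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5]],
    "travel": [[0.0, 1.0], [0.1, 0.9]],
}
topic, top_plans = retrieve([1.0, 0.2], PLANS, k=2)
print(topic, top_plans)  # cooking [[0.9, 0.1], [1.0, 0.0]]
```

Restricting the top-$k$ search to one topic keeps the second stage cheap, at the cost of missing similar plans that were filed under a different topic.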

$\begin{equation} \begin{aligned} \bar{s} &= \operatorname{TopKSim}(p_{ij}, \bar{o})\\ \tilde{s} &= \sum_i \alpha_i s_i\\ s_i &\in \bar{s}\\ \alpha_i &= p_i \odot \bar{o}\\ \tilde{o_t} &= f(\bar{o}, a_t)\\ s_t &= \operatorname{concatenate}(\tilde{o_t}, \tilde{s_t}) \end{aligned} \end{equation}$
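A rough Python sketch of how the state $s_t$ could be assembled from these pieces follows. Several things here are assumptions rather than details from the notes: $\alpha_i$ is computed as an unnormalised dot product between the plan vector and $\bar{o}$ (the notes write $\odot$), $s_i$ and $p_i$ are treated as the same retrieved plan vector, and $f$ is left as a parameter, with element-wise addition used only as a placeholder.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(top_plans, o_bar):
    """s_tilde = sum_i alpha_i * s_i, with alpha_i = <p_i, o_bar> (assumed, unnormalised)."""
    s_tilde = [0.0] * len(o_bar)
    for plan in top_plans:
        alpha = dot(plan, o_bar)
        for j, x in enumerate(plan):
            s_tilde[j] += alpha * x
    return s_tilde


def build_state(o_bar, action_vec, top_plans, f):
    """s_t = concatenate(f(o_bar, a_t), s_tilde)."""
    o_tilde = f(o_bar, action_vec)
    return o_tilde + attend(top_plans, o_bar)  # list concatenation


def add(u, v):
    """Placeholder for f: element-wise sum. The real f is not given in the notes."""
    return [a + b for a, b in zip(u, v)]


state = build_state([1.0, 2.0], [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], add)
print(state)  # [1.5, 2.5, 1.0, 2.0]
```

The first half of `state` is the action-conditioned observation summary and the second half is the attention-weighted plan summary, so the state dimension is the sum of the two.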

$E[R_{1:\infty}]$

$\begin{eqnarray} \frac{\partial E[R_{1:\infty}]}{\partial \theta} &=& E \left[ \frac{\partial}{\partial \theta} \log \pi(a|s) \left( Q^\pi(s,a) - V^\pi(s) \right) \right]\nonumber\\ &=& E \left[ \frac{\partial}{\partial \theta} \log P(a|s) \left( Q^\pi(s,a) - \sum_{a'} P(a'|s)\, Q^\pi(s,a') \right) \right] \end{eqnarray}$
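To make the advantage term concrete, here is a small numerical check for a single state. It assumes a softmax policy with one logit per action, which is a simplification chosen for illustration and not the parameterisation used in the paper. It uses the standard identity $V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)$ as the baseline and computes the expectation exactly by enumerating the actions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expected_policy_gradient(logits, q_values):
    """Exact E_a[ d/dtheta log pi(a|s) * (Q(s,a) - V(s)) ] for one state."""
    pi = softmax(logits)
    v = sum(p * q for p, q in zip(pi, q_values))  # V(s) = sum_a pi(a|s) Q(s,a)
    grad = [0.0] * len(logits)
    for a, (p_a, q_a) in enumerate(zip(pi, q_values)):
        advantage = q_a - v
        for j in range(len(logits)):
            # d log pi(a) / d theta_j for a softmax policy
            dlogp = (1.0 if j == a else 0.0) - pi[j]
            grad[j] += p_a * dlogp * advantage
    return grad, v


grad, v = expected_policy_gradient([0.0, 0.0], [1.0, 3.0])
print(grad, v)  # [-0.5, 0.5] 2.0
```

The gradient raises the logit of the action whose $Q$ value is above the baseline and lowers the other one; under this parameterisation each component reduces to $\pi(j|s)\,(Q(s,j) - V(s))$.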