Reinforcement Learning_Paper_1983_Neuronlike adaptive elements that can solve difficult learning control problems

Paper link: sci-hub: Neuronlike adaptive elements that can solve difficult learning control problems

1 Summary

The paper shows that a system composed of two neuron-like adaptive elements can solve a difficult learning control problem.
  • Environment: cart-pole (similar to gym's classic_control/cart_pole)
  • Algorithm: ASE + ACE
    • associative search element (ASE): reinforces the association between inputs and outputs
    • adaptive critic element (ACE): constructs an evaluation signal $\hat r_t$ that is more informative than the raw reinforcement feedback alone
  • Main contributions:
    • Capability of the adaptive elements: the combination of ASE and ACE can solve a difficult learning control problem even when the feedback signal carries little information
    • Implications for neuroscience: if components of biological networks have similar adaptive capabilities, then neurons resembling the adaptive elements described here may exist
    • Implications for AI: this kind of adaptive element offers a new way to build networks that can solve difficult problems

2 Core Algorithm

The notes below revisit the paper from the perspective of current RL practice.

2.1 ASE

The ASE optimizes the search process: by reinforcing input-output associations it narrows the search space and makes the search more efficient.

It can work toward rewarding events and avoid punishing events that might occur at any time ("ASE, on the other hand, is capable of working to achieve rewarding events and to avoid punishing events which might occur at any time").

predict: $y_t = f\big[\sum_{i=1}^n w^i_t x^i_t + \text{noise}_t\big]$ … (1)

  • $y_t \in \{-1, 1\}$
  • $\text{noise}_t \sim N(0, \sigma^2)$
  • $f$: can be a threshold function, a sigmoid, or the identity function (a small numeric sketch follows this list)
    • threshold function: $f(x) = \begin{cases} 1, & \text{if } x \ge 0 \ (\text{control action right}) \\ -1, & \text{if } x < 0 \ (\text{control action left}) \end{cases}$
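A minimal numeric sketch of eq. (1) with the threshold form of $f$; the state and weight values below are made up for illustration, not taken from the paper:

import numpy as np

# made-up 4-dim cart-pole state (x, x_dot, theta, theta_dot) and made-up weights
x_t = np.array([0.02, -0.10, 0.03, 0.15])
w_t = np.array([0.5, -0.2, 1.0, 0.8])
noise_t = np.random.normal(0, 0.01)        # exploration noise, sigma = 0.01

s = np.dot(w_t, x_t) + noise_t             # weighted sum plus noise
y_t = 1 if s >= 0 else -1                  # threshold f: +1 = push right, -1 = push left
print(s, y_t)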

update: $w^i_{t+1} = w^i_t + \alpha \hat r_t e^i_t$ … (2)

  • $\alpha$: learning rate
  • $\hat r_t$: the internal reinforcement signal produced by the ACE (Fig. 3)
  • $e^i_t$: eligibility, with trace $e^i_{t+1} = \delta e^i_t + (1-\delta)\, y_t x^i_t$
  • Viewed as approximating Q: minimize the MSE $0.5\,(y_{tar} - Q(s, a))^2$ (a quick numeric check follows this list)
    • $\frac{\partial l}{\partial w} = \frac{\partial l}{\partial Q}\frac{\partial Q}{\partial (wx)}\, x = \begin{cases} -(y_{tar} - Q(s, a))\, x, & \text{if } wx < 0 \\ (y_{tar} - Q(s, a))\, x, & \text{if } wx \ge 0 \end{cases}$
      • with $y_{tar} - Q(s, a) \simeq \hat r$ and $y = \begin{cases} -1, & \text{if } wx < 0 \\ 1, & \text{if } wx \ge 0 \end{cases}$
        • because the function producing $y_t$ outputs the action directly, it cannot evaluate the value of the state
      • so the update direction reduces to $\hat r\, y\, x = \hat r\, e_t$
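A quick numeric check of the reduction above, reusing numpy from the earlier sketch; all values are made up, and $\hat r$ simply stands in for $y_{tar} - Q(s, a)$:

x = np.array([0.02, -0.10, 0.03, 0.15])      # made-up input
w = np.array([0.5, -0.2, 1.0, 0.8])          # made-up weights
hat_r = 0.7                                  # stands in for y_tar - Q(s, a)

y = 1.0 if np.dot(w, x) >= 0 else -1.0       # action from the sign of wx
grad = (hat_r * x) if np.dot(w, x) >= 0 else (-hat_r * x)  # the case-split gradient above
assert np.allclose(grad, hat_r * y * x)      # reduces to hat_r * y * x = hat_r * e_t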

The basic idea expressed by (2) is that whenever certain conditions (to be discussed later) hold for input pathway i, then that pathway becomes eligible to have its weight modified, and it remains eligible for some period of time after the conditions cease to hold.

If the reinforcement indicates improved performance, then the weights of the eligible pathways are changed so as to make the element more likely to do whatever it did that made those pathways eligible.

  • $\hat{r}$ is larger, i.e., the estimated return $U_t$ under the new state is higher

If reinforcement indicates decreased performance, then the weights of the eligible pathways are changed to make the element more likely to do something else

  • $\hat{r}$ is smaller, i.e., the estimated return $U_t$ under the new state is lower
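To make "it remains eligible for some period of time" concrete, here is a tiny sketch, with made-up numbers, of how the trace decays geometrically with $\delta$ once the pathway stops being active:

delta = 0.9
e = 0.0
for t in range(8):
    yx = 1.0 if t == 0 else 0.0        # the pathway is active only at t = 0
    e = delta * e + (1 - delta) * yx   # e_{t+1} = delta * e_t + (1 - delta) * y_t * x_t
    print(t, round(e, 4))              # 0.1, 0.09, 0.081, ... the pathway stays eligible, then fades

The full ASE element used in this post is implemented below.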
import numpy as np


class ASE:
    def __init__(self, input_dim, alpha=0.1, delta=0.2, sigma=0.01):
        """associative search element"""
        self.w = np.random.randn(input_dim, 1)
        self.lr = alpha
        self.et = np.zeros(input_dim)  # eligibility trace e^i_t
        self.delta = delta
        self.sigma = sigma

    def forward(self, state):
        noise = np.random.normal(0, self.sigma)  # noise term for random exploration
        return self.f_func(np.dot(state, self.w) + noise)

    def f_func(self, a):
        # threshold f, mapped onto gym's action space {0, 1}
        if a < 0:
            return 0
        return 1

    def action_postfix(self, a):
        # map the gym action {0, 1} back to the paper's bipolar output {-1, 1}
        if a == 0:
            return -1
        return 1

    def update(self, s, a, hat_r):
        # 1- conditions of action: the pathway that produced action a in state s becomes eligible
        e_t = self.action_postfix(a) * s
        # 2- it remains eligible for some period of time after the conditions cease to hold
        # e^i_{t+1} = \delta e^i_t + (1 - \delta) y_t x^i_t
        self.et = self.delta * self.et + (1 - self.delta) * e_t
        # w^i_{t+1} = w^i_t + \alpha \hat r_t e^i_t  -- the update uses the decaying trace
        self.w += (self.lr * hat_r * self.et).reshape(self.w.shape)
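A quick usage sketch of the class above on a made-up state; the hat_r value here is a dummy, while in training it comes from the ACE:

np.random.seed(0)
ase_demo = ASE(input_dim=4, alpha=0.1, delta=0.9, sigma=0.01)
s_demo = np.array([0.02, -0.10, 0.03, 0.15])   # made-up cart-pole state
a_demo = ase_demo.forward(s_demo)              # 0 or 1, directly usable as a gym action
ase_demo.update(s_demo, a_demo, 0.5)           # pretend the critic returned hat_r = 0.5
print(a_demo, ase_demo.w.ravel())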

2.2 ACE

The central idea behind the ACE algorithm is that "predictions are formed that predict not just reinforcement but also future predictions of reinforcement".

predict: $p_t = \sum_{i=1}^n v^i_t x^i_t$

update: $v^i_{t+1} = v^i_t + \beta \hat r_t \overline{x}^i_t$

  • $r_t \in \{0, -1\}$: 0 while the pole stays up, -1 on failure
  • Viewed as approximating V: $u_t = E[\sum_a^A Q(s, a)] = E[V(s)] = r + v_{t+1} = v_t$; minimizing the TD-error MSE gives (a small numeric sketch follows this list):
    • $\frac{\partial\, 0.5\hat r_t^2}{\partial v} = \frac{\partial\, 0.5\hat r_t^2}{\partial p_t}\frac{\partial p_t}{\partial v} = \hat r_t x_t$
  • reinforcement signal: $\hat r_t = r_t + \gamma p_t - p_{t-1}$
    • analogous to the TD error
  • $\overline{x}^i_{t+1} = \lambda \overline{x}^i_t + (1 - \lambda) x^i_t$
    • at convergence $\hat r_t \approx 0$, and both the ACE and the ASE stop updating
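A minimal numeric sketch of the reinforcement signal and of the input trace, with made-up values for $r_t$, $p_t$, $p_{t-1}$ and $x_t$:

import numpy as np

gamma, lmbda = 0.95, 0.8
r_t, p_t, p_prev = 0.0, 0.40, 0.50           # made-up predictions
hat_r = r_t + gamma * p_t - p_prev           # hat_r_t = r_t + gamma * p_t - p_{t-1}
print(hat_r)                                 # approx -0.12: performance judged worse

x_bar = np.zeros(4)
x_t = np.array([0.02, -0.10, 0.03, 0.15])    # made-up input
x_bar = lmbda * x_bar + (1 - lmbda) * x_t    # x_bar_{t+1} = lambda * x_bar_t + (1 - lambda) * x_t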


Papers analyzing the algorithm:

  • (1982) Simulation of anticipatory responses in classical conditioning by a neuron-like adaptive element
  • (1981) Toward a modern theory of adaptive networks: Expectation and prediction

class ACE:
    """adaptive critic element"""
    def __init__(self, input_dim, beta=0.1, gamma=0.95, lmbda=0.8):
        self.v = np.random.randn(input_dim)
        self.lr = beta
        self.gamma = gamma
        self.lmbda = lmbda
        self.overline_x = np.zeros(input_dim)

    def forward(self, state):
        # p_t = \sum_i v^i_t x^i_t
        return np.dot(state, self.v)

    def reward_postfix(self, r, terminated=False):
        # gym's CartPole returns +1 every step; the paper's r_t is -1 on failure and 0 otherwise
        return -1.0 if terminated else 0.0

    def reinforcement_signal(self, r, s, before_s, terminated=False):
        r = self.reward_postfix(r, terminated)  # map to {0, -1}
        # the trial ends on failure, so the prediction of the state after failure is taken as 0
        p_next = 0.0 if terminated else self.forward(s)
        # \hat r_t = r_t + \gamma p_t - p_{t-1}
        return r + self.gamma * p_next - self.forward(before_s)

    def update(self, s, r, before_s, terminated=False):
        hat_r = self.reinforcement_signal(r, s, before_s, terminated)

        # v^i_{t+1} = v^i_t + \beta \hat r_t \overline{x}^i_t
        self.v += self.lr * hat_r * self.overline_x

        # \overline{x}^i_{t+1} = \lambda \overline{x}^i_t + (1 - \lambda) x^i_t
        self.overline_x = self.lmbda * self.overline_x + (1 - self.lmbda) * s
        return hat_r
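Continuing from the classes above, a quick usage sketch of the critic; the two states are made up, and r=1.0 mimics gym's per-step reward:

ace_demo = ACE(input_dim=4, beta=0.5, gamma=0.95, lmbda=0.8)
s0 = np.array([0.02, -0.10, 0.03, 0.15])     # made-up state at time t-1
s1 = np.array([0.03, -0.05, 0.02, 0.10])     # made-up state at time t
hat_r = ace_demo.update(s1, 1.0, s0, terminated=False)
print(hat_r)                                 # internal reinforcement that would be fed to the ASE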

2.3 Training and Evaluation

import numpy as np
import gym  # requires the 5-tuple step API (gym >= 0.26); gymnasium also works
from tqdm import tqdm

# create the environment
seed_ = 19831983
np.random.seed(seed_)
env = gym.make('CartPole-v1')
input_dim = env.observation_space.shape[0]
output_dim = env.action_space.n


# the paper uses alpha = 1000
ase = ASE(input_dim, alpha=580, delta=0.9, sigma=0.01)
ace = ACE(input_dim, beta=0.5, gamma=0.95, lmbda=0.8)

num_episodes = 1200  # 500000
reward_l = []
tq_bar = tqdm(range(num_episodes))
for episode in tq_bar:
    s, _ = env.reset(seed=20250314)   # 500
    done = False
    r_tt = 0
    while not done:
        a = ase.forward(s)
        n_s, r, terminated, truncated, infos = env.step(a)
        r_tt += r
        # update the ACE: the failure signal comes from `terminated`
        hat_r = ace.update(n_s, r, s, terminated)
        # update the ASE with the state-action pair that was actually executed
        ase.update(s, a, hat_r)
        s = n_s
        done = terminated or truncated

    reward_l.append(r_tt)
    tq_bar.set_postfix({
        'rewards': r_tt,
        'last_mean': np.mean(reward_l[-10:]),
        'last_std': np.std(reward_l[-10:]),
    })
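A minimal evaluation sketch, assuming the `ase` trained above: roll it out once with the exploration noise switched off to check the learned controller.

# evaluation: one rollout of the trained ASE without exploration noise
eval_env = gym.make('CartPole-v1')
old_sigma, ase.sigma = ase.sigma, 0.0        # temporarily disable the noise term
s, _ = eval_env.reset(seed=42)
total, done = 0.0, False
while not done:
    a = ase.forward(s)
    s, r, terminated, truncated, _ = eval_env.step(a)
    total += r
    done = terminated or truncated
ase.sigma = old_sigma
print('eval return:', total)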
