PTAN in Practice 3 || The ExperienceSource Class

Original article · last modified 2022-04-10 10:44:19

License: CC 4.0 BY-SA

Tags: #python #pytorch #neural-networks

First published 2022-03-25 16:23:54

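This article shows how to use the ExperienceSource class from the PTAN library. It builds a simple environment and a simple agent, then uses ExperienceSource and ExperienceSourceFirstLast to run the interaction and collect experience.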

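The ExperienceSource class yields tuples of (s, a, r, done) steps and is constructed from (env, agent). The agent can be, for example, a DQNAgent or a PolicyAgent, and each agent class in turn takes a net and an action selector.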

The ExperienceSourceFirstLast class yields (s, a, r, s_).
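To make the two record shapes concrete, here is a minimal sketch using plain namedtuples. The two types are stand-ins defined here for illustration only; their field names follow the (s, a, r, done) and (s, a, r, s_) tuples described above, and the sample values are made up.

```python
from collections import namedtuple

# Stand-in record types for illustration; they are defined here, not imported from ptan.
Experience = namedtuple("Experience", ["state", "action", "reward", "done"])
ExperienceFirstLast = namedtuple(
    "ExperienceFirstLast", ["state", "action", "reward", "last_state"]
)

# One single environment step: (s, a, r, done).
step = Experience(state=0, action=1, reward=1.0, done=False)

# One compact transition: (s, a, r, s_), with no done flag.
compact = ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)

print(step)
print(compact)
```

The first shape keeps every intermediate step, while the second keeps only the first state, the action taken there, the accumulated reward, and the last state.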

Both classes drive the interaction between the agent and the env automatically:

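1. The environment's reset() is called to obtain the initial state.
2. The agent selects the action to execute from the returned state.
3. step() is executed, which gives the reward and the next state s_.
4. s_ is passed to the agent to obtain the next action.
5. The transition from s to s_ is returned, together with its reward r.
6. The process repeats from step 3 until iteration over the experience source stops.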
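The steps above can be sketched in plain Python. This is a simplified illustration, not ptan's actual implementation: `Step`, `CountingEnv`, `constant_agent`, and `experience_windows` are hypothetical names invented here, and the env is assumed to follow the older gym convention in which reset() returns only the state and step() returns (state, reward, done, info).

```python
from collections import deque, namedtuple

# Hypothetical stand-ins for illustration; none of these names come from gym or ptan.
Step = namedtuple("Step", ["state", "action", "reward", "done"])


class CountingEnv:
    """Tiny env with an old-style gym interface; the state is a step counter."""

    def __init__(self, episode_length=3):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, float(action), done, {}


def constant_agent(state):
    """Always choose action 1, whatever the state is."""
    return 1


def experience_windows(env, agent, steps_count, max_steps):
    """Yield sliding windows of up to steps_count consecutive steps."""
    window = deque(maxlen=steps_count)
    state = env.reset()                                  # step 1
    for _ in range(max_steps):
        action = agent(state)                            # steps 2 and 4
        next_state, reward, done, _ = env.step(action)   # step 3
        window.append(Step(state, action, reward, done))
        if len(window) == steps_count:
            yield tuple(window)                          # step 5
        state = next_state
        if done:
            # An episode shorter than steps_count is emitted once as a short window.
            if 0 < len(window) < steps_count:
                yield tuple(window)
            # Emit the shorter tails of the finished episode, then start over.
            while len(window) > 1:
                window.popleft()
                yield tuple(window)
            window.clear()
            state = env.reset()


windows = list(
    experience_windows(CountingEnv(), constant_agent, steps_count=2, max_steps=3)
)
for w in windows:
    print(w)
```

With a three-step episode and steps_count=2, this sketch produces two full windows followed by one shorter tail window, which is why the items near the end of an episode can be shorter than steps_count.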

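Note: with steps_count=2 in ExperienceSourceFirstLast, each returned item covers two agent steps. In the returned (s, a, r, s_), r is not the immediate reward but the discounted two-step sum r_0 + γ·r_1, and s_ is the state after two steps rather than after one.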
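The arithmetic in the note can be checked with a few lines of Python. This is only a sketch of the discounted sum; `n_step_reward` is a hypothetical helper written for this example, and the reward values and gamma=0.9 are made up.

```python
def n_step_reward(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma**2 * r_2 + ... for the given rewards."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total


# Two steps with reward 1.0 each: 1.0 + 0.9 * 1.0, roughly 1.9.
two_step = n_step_reward([1.0, 1.0], gamma=0.9)

# A single step reduces to the immediate reward.
one_step = n_step_reward([1.0], gamma=0.9)

print(two_step, one_step)
```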

import gym
import ptan
from typing import List, Optional, Tuple, Any

class ToyEnv(gym.Env):

    def __init__(self):
        super(ToyEnv, self).__init__()
        self.observation_space = gym.spaces.Discrete(n=5)   # discrete observation space, values 0..4
        self.action_space = gym.spaces.Discrete(n=3)        # discrete action space, values 0..2
        self.step_index = 0

    def reset(self):    # reset the step counter and return the initial state
        self.step_index = 0
        return self.step_index

    def step(self, action):
        is_done = self.step_index == 10     # the episode ends once the counter reaches 10
        if is_done:     # episode over
            # return (s, r, done, info)
            return self.step_index % self.observation_space.n, \
                   0.0, is_done, {}

        # episode still running: return (s_, r, done, info)
        self.step_index += 1
        return self.step_index % self.observation_space.n, \
               float(action), False, {}

# An agent that returns the same action whatever the observation is (action=1 in the demo below)
class DullAgent(ptan.agent.BaseAgent):

    def __init__(self, action: int):
        self.action = action

    # Optional[List] = None: state is either a List or None and defaults to None; Tuple is the tuple type
    def __call__(self, observations: List[Any], state: Optional[List] = None) \
            -> Tuple[List[int], Optional[List]]:
        return [self.action for _ in observations], state

if __name__ == "__main__":
    env = ToyEnv()
    s = env.reset()
    print("env.reset() -> %s" % s)
    s = env.step(1)
    print("env.step(1) -> %s" % str(s))
    s = env.step(2)
    print("env.step(2) -> %s" % str(s))

    for _ in range(10):
        r = env.step(0)
        print(r)

    agent = DullAgent(action=1)
    print("agent:", agent([1, 2])[0])

    env = ToyEnv()
    agent = DullAgent(action=1)
    # ExperienceSource yields tuples of (s, a, r, done) steps
    exp_source = ptan.experience.ExperienceSource(env=env, agent=agent, steps_count=2)
    for idx, exp in enumerate(exp_source):
        if idx > 15:                    # stop after 16 experience tuples (idx 0..15), each up to 2 steps long
            break
        print('exp=>',exp)

    exp_source = ptan.experience.ExperienceSource(env=env, agent=agent, steps_count=4)
    print(next(iter(exp_source)))       # take one experience tuple of 4 steps; tuples at the end of an episode can be shorter than 4

    exp_source = ptan.experience.ExperienceSource(env=[ToyEnv(), ToyEnv()], agent=agent, steps_count=4)
    for idx, exp in enumerate(exp_source):
        if idx > 4:
            break
        print(exp)

    print("ExperienceSourceFirstLast")
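    # steps_count=3: three consecutive steps are collapsed into one item with a single accumulated reward
    # gamma=0.8: the second step's reward is weighted by 0.8 and the third step's by 0.8 ** 2
    # If the episode ends inside the window, only the steps that remain are accumulated,
    # so the reward covers fewer than three steps and s_ (last_state) is None
    # ExperienceSourceFirstLast yields (s, a, r, s_)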
    exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=0.8, steps_count=3)
    for idx, exp in enumerate(exp_source):
        print(exp)
        if idx > 10:
            break
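The end-of-episode behaviour of the compact (s, a, r, s_) items is easiest to see on a small, fully written-out episode. The following is a sketch of the general n-step idea under stated assumptions, not ptan's code: `FirstLast` and `first_last_transitions` are hypothetical names, the episode data is invented, and ptan's exact handling of episode tails may differ in detail.

```python
from collections import namedtuple

# Hypothetical record type for illustration; it is not imported from ptan.
FirstLast = namedtuple("FirstLast", ["state", "action", "reward", "last_state"])


def first_last_transitions(states, actions, rewards, gamma, steps_count):
    """Collapse one finished episode into n-step (s, a, r, s_) items.

    states[i] is the state observed before actions[i]; the last action
    in the lists is the one that ended the episode.
    """
    episode_len = len(actions)
    result = []
    for i in range(episode_len):
        # Number of steps actually available from position i.
        n = min(steps_count, episode_len - i)
        reward = sum((gamma ** k) * rewards[i + k] for k in range(n))
        # If the episode ends within steps_count steps, there is no later state to report.
        if i + steps_count < episode_len:
            last_state = states[i + steps_count]
        else:
            last_state = None
        result.append(FirstLast(states[i], actions[i], reward, last_state))
    return result


# A made-up four-step episode in which every step pays a reward of 1.0.
items = first_last_transitions(
    states=[0, 1, 2, 3],
    actions=[1, 1, 1, 1],
    rewards=[1.0, 1.0, 1.0, 1.0],
    gamma=0.8,
    steps_count=3,
)
for item in items:
    print(item)
```

In this sketch the accumulated reward shrinks as the window runs into the end of the episode (three steps, then two, then one), and last_state is None for every item whose window reaches the terminal step.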