[Reinforcement Learning] Epoch and timestep

Preface

In reinforcement learning (RL), the term “timestep” has a specific meaning that differs from “epoch.” Understanding the distinction is crucial for interpreting how an RL algorithm operates and processes data.

Timestep:

Represents one discrete interaction: action → environment response (observation, reward, done signal); see the sketch after this list.
Fundamental unit of experience in RL.
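
A minimal sketch of a single timestep using the Gymnasium API (CartPole-v1 and the random action are illustrative choices, not anything prescribed by this post):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()

# One timestep: the agent acts, the environment responds.
action = env.action_space.sample()                           # stand-in for a learned policy
obs, reward, terminated, truncated, info = env.step(action)  # environment response
done = terminated or truncated                               # the "done" signal

env.close()
```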

Epoch:

Represents a collection of timesteps, often aggregated before performing an update to the policy or value networks.
Helps organize the training process, especially in batch-based RL algorithms such as PPO.

Why Timestep Matters:

RL algorithms rely on sequential data where each timestep’s outcome can influence future actions.
Tracking the change in a quantity between consecutive timesteps (e.g., delta_pitch) helps in understanding the dynamics and progression of the agent’s behavior, as sketched below.
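
The delta_pitch above presumably comes from the author’s own environment; as a stand-in, here is a hypothetical sketch that tracks the per-timestep change of CartPole’s pole angle (observation index 2 in CartPole-v1):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
ANGLE_IDX = 2  # in CartPole-v1 the pole angle is observation index 2

obs, info = env.reset()
prev_angle = obs[ANGLE_IDX]

for t in range(100):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    delta_angle = obs[ANGLE_IDX] - prev_angle  # change between consecutive timesteps
    prev_angle = obs[ANGLE_IDX]
    if terminated or truncated:                # episode ended: reset the baseline
        obs, info = env.reset()
        prev_angle = obs[ANGLE_IDX]

env.close()
```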

Epochs in RL:

While timesteps are about individual actions, epochs in RL organize these actions into manageable batches for updating the model.
For example, after collecting a certain number of timesteps, the agent may perform gradient updates to improve the policy based on the aggregated experience.
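
A minimal sketch of this collect-then-update pattern, assuming a fixed number of timesteps per epoch (the 2048-step rollout length, the random-action policy, and the update_policy placeholder are assumptions; a real PPO update would compute advantages and run several gradient steps on the batch):

```python
import gymnasium as gym

TIMESTEPS_PER_EPOCH = 2048  # rollout length per epoch (a common PPO-style default)

def update_policy(batch):
    # Hypothetical placeholder: a real update would compute advantages from
    # the batch and run several gradient steps on the policy/value networks.
    pass

env = gym.make("CartPole-v1")
obs, info = env.reset()

for epoch in range(10):
    batch = []  # experience aggregated over this epoch
    for t in range(TIMESTEPS_PER_EPOCH):
        action = env.action_space.sample()  # stand-in for the current policy
        next_obs, reward, terminated, truncated, info = env.step(action)
        batch.append((obs, action, reward, terminated or truncated))
        obs = next_obs
        if terminated or truncated:
            obs, info = env.reset()
    update_policy(batch)  # one policy update per epoch of collected timesteps

env.close()
```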

Visual Representation:

Epoch
│
├─ Timestep 1: Action A1 → Observation O1, Reward R1
├─ Timestep 2: Action A2 → Observation O2, Reward R2
├─ Timestep 3: Action A3 → Observation O3, Reward R3
│
└─ Policy Update based on collected timesteps