[Reinforcement Learning] Epoch and timestep

Preface

In reinforcement learning (RL), the term “timestep” has a specific meaning that differs from “epoch.” Understanding the distinction is crucial for interpreting how an RL algorithm operates and processes data.

Timestep:

Represents one discrete interaction: action → environment response (observation, reward, done signal); see the sketch after this list.
Fundamental unit of experience in RL.
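
A minimal sketch of a single timestep using the Gymnasium API (CartPole-v1 and the random action are illustrative choices, not anything prescribed by this post):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()

# One timestep: the agent acts, the environment responds.
action = env.action_space.sample()                           # stand-in for a learned policy
obs, reward, terminated, truncated, info = env.step(action)  # environment response
done = terminated or truncated                               # the "done" signal

env.close()
```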

Epoch:

Represents a collection of timesteps, often aggregated before performing an update to the policy or value networks.
Helps organize the training process, especially in batch-based RL algorithms such as PPO.

Why Timestep Matters:

RL algorithms rely on sequential data where each timestep’s outcome can influence future actions.
Tracking the change in a quantity between consecutive timesteps (e.g., delta_pitch) helps in understanding the dynamics and progression of the agent’s behavior, as sketched below.
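
The delta_pitch above presumably comes from the author’s own environment; as a stand-in, here is a hypothetical sketch that tracks the per-timestep change of CartPole’s pole angle (observation index 2 in CartPole-v1):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
ANGLE_IDX = 2  # in CartPole-v1 the pole angle is observation index 2

obs, info = env.reset()
prev_angle = obs[ANGLE_IDX]

for t in range(100):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    delta_angle = obs[ANGLE_IDX] - prev_angle  # change between consecutive timesteps
    prev_angle = obs[ANGLE_IDX]
    if terminated or truncated:                # episode ended: reset the baseline
        obs, info = env.reset()
        prev_angle = obs[ANGLE_IDX]

env.close()
```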

Epochs in RL:

While timesteps are about individual actions, epochs in RL organize these actions into manageable batches for updating the model.
For example, after collecting a certain number of timesteps, the agent may perform gradient updates to improve the policy based on the aggregated experience.
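
A minimal sketch of this collect-then-update pattern, assuming a fixed number of timesteps per epoch (the 2048-step rollout length, the random-action policy, and the update_policy placeholder are assumptions; a real PPO update would compute advantages and run several gradient steps on the batch):

```python
import gymnasium as gym

TIMESTEPS_PER_EPOCH = 2048  # rollout length per epoch (a common PPO-style default)

def update_policy(batch):
    # Hypothetical placeholder: a real update would compute advantages from
    # the batch and run several gradient steps on the policy/value networks.
    pass

env = gym.make("CartPole-v1")
obs, info = env.reset()

for epoch in range(10):
    batch = []  # experience aggregated over this epoch
    for t in range(TIMESTEPS_PER_EPOCH):
        action = env.action_space.sample()  # stand-in for the current policy
        next_obs, reward, terminated, truncated, info = env.step(action)
        batch.append((obs, action, reward, terminated or truncated))
        obs = next_obs
        if terminated or truncated:
            obs, info = env.reset()
    update_policy(batch)  # one policy update per epoch of collected timesteps

env.close()
```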

Visual Representation:

Epoch
│
├─ Timestep 1: Action A1 → Observation O1, Reward R1
├─ Timestep 2: Action A2 → Observation O2, Reward R2
├─ Timestep 3: Action A3 → Observation O3, Reward R3
│
└─ Policy Update based on collected timesteps