Policy Network (Actor):
The policy network (actor) maps the current state of the system (e.g., robot joint angles, cube’s pose) to an action distribution. The agent samples actions from this distribution to interact with the environment.
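A minimal sketch of such an actor, assuming a PyTorch MLP with illustrative layer sizes and names (this is not the repo's actual network code):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps an observation vector to a Gaussian action distribution."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)                # action mean
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # learned log std

    def forward(self, obs):
        h = self.trunk(obs)
        # The agent samples its actions from this distribution.
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())
```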
Value Network (Critic):
The critic network uses the current state to estimate the value function (how good it is to be in that state). This helps the PPO algorithm update the policy more efficiently.
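A matching critic sketch under the same assumptions (illustrative, not the repo's code): the same kind of MLP, but ending in a single scalar output V(s) that PPO uses for advantage estimation.

```python
import torch.nn as nn

class ValueNet(nn.Module):
    """Maps an observation vector to a scalar state-value estimate V(s)."""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),   # single scalar value per state
        )

    def forward(self, obs):
        return self.net(obs)
```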
Likely Entry Point:
If you search through amp_models.py or hrl_models.py, you will often find a call like AMPBuilder.build(...) or something similar. That is where the network is instantiated from the builder. After this, the code in common_agent.py or amp_players.py uses that constructed network to run through the RL pipeline.
CommonAgent (in common_agent.py) is created and sets up the training run. The RL process then uses ModelAMPContinuous.build() in amp_models.py, which in turn calls self.network_builder.build('amp', **config) to instantiate the policy network.
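A simplified sketch of that builder pattern (the class internals here are hypothetical placeholders; the real AMPBuilder and ModelAMPContinuous classes do considerably more). It only illustrates how a model's build() can delegate network construction to a builder keyed by the name 'amp':

```python
class AMPBuilder:
    def build(self, name, **config):
        # In the real code this assembles the actor, critic, and
        # discriminator sub-networks from the config; here we just
        # return a placeholder object.
        return {"name": name, "config": config}

class ModelAMPContinuous:
    def __init__(self, network_builder):
        self.network_builder = network_builder

    def build(self, config):
        # Mirrors the call mentioned above:
        # self.network_builder.build('amp', **config)
        return self.network_builder.build('amp', **config)

model = ModelAMPContinuous(AMPBuilder())
network = model.build({"obs_dim": 105, "act_dim": 28})  # illustrative dims
```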
Flow of Data (Observations) During Training:
- The training loop (likely inside Runner or code called by Runner) repeatedly:
  - Resets the environment(s).
  - Retrieves observations from the environment.
  - Passes these observations to the Player (in this case, AMPPlayerContinuous).
  - The Player normalizes/preprocesses the observations if needed and then provides them to the Agent.
  - The Agent (e.g., AMPAgent) uses the Model (e.g., ModelAMPContinuous), which in turn calls the Network built by AMPBuilder.
  - The Network takes the observation as input and produces an action distribution (the policy) and value estimates (the critic).
  - The Agent selects actions, steps the environment, and the process repeats (a simplified sketch of this loop follows the list).
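A hedged sketch of that rollout loop. The env, player, and agent objects and their method names (preprocess, store, update) are stand-ins for illustration, not the exact rl_games API:

```python
def collect_rollout(env, player, agent, horizon=32):
    """Collect one batch of experience and run a PPO update on it."""
    obs = env.reset()
    for _ in range(horizon):
        norm_obs = player.preprocess(obs)        # normalize observations
        dist, value = agent.model(norm_obs)      # policy distribution + value estimate
        action = dist.sample()                   # sample an action from the policy
        obs, reward, done, info = env.step(action)
        agent.store(norm_obs, action, reward, value, done)
    agent.update()                               # PPO update on the collected batch
```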