Add docs
vwxyzjn authored Dec 20, 2022
commit 5f53fbe4ffd3b8b795e60ed326628064b64a2ff2
7 changes: 5 additions & 2 deletions docs/rl-algorithms/ppo.md
@@ -580,11 +580,14 @@ Additionally, we record the following metric:

???+ info

Note that we use `charts/avg_episodic_return` in place of `charts/episodic_return` and `charts/episodic_length` because under EnvPool's XLA interface we can only record fixed-shape metrics, whereas the number of raw episodic returns / lengths can vary. To resolve this, we create variables (e.g., `returned_episode_returns`, `returned_episode_lengths`) to keep track of the *latest* episodic returns / lengths of each environment and average them for reporting purposes.
Note that we use `charts/avg_episodic_return` and `charts/avg_episodic_length` in place of `charts/episodic_return` and `charts/episodic_length` because under EnvPool's XLA interface we can only record fixed-shape metrics, whereas the number of raw episodic returns / lengths can vary. To resolve this, we create variables (e.g., `returned_episode_returns`, `returned_episode_lengths`) to keep track of the *latest* episodic returns / lengths of each environment and average them for reporting purposes.

### Implementation details

[ppo_atari_envpool_xla_jax.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_envpool_xla_jax.py) uses a customized `RecordEpisodeStatistics` to work with EnvPool's experimental [XLA interface](https://envpool.readthedocs.io/en/latest/content/xla_interface.html) but otherwise shares the implementation details of `ppo_atari.py` (see [related docs](/rl-algorithms/ppo/#implementation-details_1)), except that, for simplicity, [ppo_atari_envpool_xla_jax.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_envpool_xla_jax.py) does not use value function clipping.
[ppo_atari_envpool_xla_jax.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_envpool_xla_jax.py) shares the implementation details of `ppo_atari.py` (see [related docs](/rl-algorithms/ppo/#implementation-details_1)), with two differences:

1. [ppo_atari_envpool_xla_jax.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_envpool_xla_jax.py) does not use value function clipping by default, because there is insufficient evidence that value function clipping actually improves performance.
1. [ppo_atari_envpool_xla_jax.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_envpool_xla_jax.py) uses a customized `EpisodeStatistics` to record episode statistics instead of the `RecordEpisodeStatistics` used in other variants. `RecordEpisodeStatistics` is a *stateful* Python wrapper, which is incompatible with EnvPool's *stateless* XLA interface. To address this issue, we use an `EpisodeStatistics` dataclass that reimplements the logic of `RecordEpisodeStatistics`. However, `EpisodeStatistics` comes with a major limitation: its storage has a fixed shape and can only record the *latest* episodic return of each sub-environment. Furthermore, the default episodic return values in `EpisodeStatistics` are set to zero, which does not necessarily correspond to the episodic return obtained by a random policy. For example, we would report `charts/avg_episodic_return=0` for `Pong-v5`, even though it should have been `charts/avg_episodic_return=-21`. That said, this issue goes away as soon as the sub-environments finish their first episodes, so it does not affect the reported results.
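
The fixed-shape bookkeeping described above can be illustrated with a minimal sketch. It is not the code from `ppo_atari_envpool_xla_jax.py`: it assumes a `NamedTuple` instead of a dataclass, a single boolean `done` array instead of EnvPool's `info` dictionary, and the hypothetical helper names `init_stats` and `update_stats`. Only the field names `returned_episode_returns` and `returned_episode_lengths` come from the docs.

```python
from typing import NamedTuple

import jax
import jax.numpy as jnp


class EpisodeStatistics(NamedTuple):
    # Running totals of the episode currently in progress, one slot per sub-environment.
    episode_returns: jnp.ndarray
    episode_lengths: jnp.ndarray
    # Totals of the *latest finished* episode, one slot per sub-environment.
    returned_episode_returns: jnp.ndarray
    returned_episode_lengths: jnp.ndarray


def init_stats(num_envs: int) -> EpisodeStatistics:
    # Hypothetical helper: every slot starts at zero, which is why the reported
    # average is 0 until the first episodes finish.
    return EpisodeStatistics(
        episode_returns=jnp.zeros(num_envs, dtype=jnp.float32),
        episode_lengths=jnp.zeros(num_envs, dtype=jnp.int32),
        returned_episode_returns=jnp.zeros(num_envs, dtype=jnp.float32),
        returned_episode_lengths=jnp.zeros(num_envs, dtype=jnp.int32),
    )


@jax.jit
def update_stats(stats: EpisodeStatistics, reward: jnp.ndarray, done: jnp.ndarray) -> EpisodeStatistics:
    # Hypothetical helper: a pure function that returns a new state instead of
    # mutating the old one, so it can be traced and compiled by XLA.
    new_returns = stats.episode_returns + reward
    new_lengths = stats.episode_lengths + 1
    return EpisodeStatistics(
        # Reset the running totals of sub-environments whose episode just ended.
        episode_returns=jnp.where(done, 0.0, new_returns),
        episode_lengths=jnp.where(done, 0, new_lengths),
        # Overwrite the "latest finished episode" slots only where an episode ended.
        returned_episode_returns=jnp.where(done, new_returns, stats.returned_episode_returns),
        returned_episode_lengths=jnp.where(done, new_lengths, stats.returned_episode_lengths),
    )


# Toy rollout with 2 sub-environments and made-up rewards / done flags.
stats = init_stats(num_envs=2)
initial_avg_return = jnp.mean(stats.returned_episode_returns)  # no episode has finished yet

rollout = [
    (jnp.array([1.0, -1.0]), jnp.array([False, False])),
    (jnp.array([1.0, -1.0]), jnp.array([False, True])),   # env 1 finishes with return -2
    (jnp.array([0.0, -1.0]), jnp.array([True, False])),   # env 0 finishes with return 2
]
for reward, done in rollout:
    stats = update_stats(stats, reward, done)

# Fixed-shape reductions, analogous to `charts/avg_episodic_return` and
# `charts/avg_episodic_length`.
avg_episodic_return = jnp.mean(stats.returned_episode_returns)
avg_episodic_length = jnp.mean(stats.returned_episode_lengths)
```

Because every field keeps the shape `(num_envs,)` regardless of how many episodes end in a given step, the averages are always well-defined scalars. The cost is the limitation noted above: each slot holds only the most recent finished episode, and it holds the zero default until that sub-environment completes its first one.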


???+ info