7 changes: 4 additions & 3 deletions docs/scripts/gen_gifs.py
@@ -57,15 +57,16 @@
frames = []
while True:
state = env.reset()
done = False
while not done and len(frames) <= LENGTH:
terminated = False
truncated = False
while not (terminated or truncated) and len(frames) <= LENGTH:

frame = env.render(mode='rgb_array')
repeat = int(60/env.metadata["render_fps"]) if env_type == "toy_text" else 1
for i in range(repeat):
frames.append(Image.fromarray(frame))
action = env.action_space.sample()
state_next, reward, done, info = env.step(action)
state_next, reward, terminated, truncated, info = env.step(action)

if len(frames) > LENGTH:
break
58 changes: 50 additions & 8 deletions docs/source/content/api.md
@@ -36,10 +36,8 @@ to a specific point in space. If it succeeds in doing this (or makes some progre
alongside the observation for this timestep. The reward may also be negative or 0, if the agent did not yet succeed (or did not make any progress).
The agent will then be trained to maximize the reward it accumulates over many timesteps.

After some timesteps, the environment may enter a terminal state. For instance, the robot may have crashed! In that case,
we want to reset the environment to a new initial state. The environment issues a done signal to the agent if it enters such a terminal state.
Not all done signals must be triggered by a "catastrophic failure": Sometimes we also want to issue a done signal after
a fixed number of timesteps, or if the agent has succeeded in completing some task in the environment.
After some timesteps, the environment may enter a terminal state. For instance, the robot may have crashed, or the agent may have succeeded in completing a task. In that case, we want to reset the environment to a new initial state. The environment issues a terminated signal to the agent if it enters such a terminal state. Sometimes we also want to end the episode after a fixed number of timesteps; in this case, the environment issues a truncated signal.
This is a new change in the API. Previously, a single done signal was issued when an episode ended for any reason. This has been changed in favour of issuing two separate signals - terminated and truncated.

Let's see what the agent-environment loop looks like in Gym.
This example will run an instance of the `LunarLander-v2` environment for 1000 timesteps, rendering the environment at each step. You should see a window pop up rendering the environment
@@ -53,9 +51,9 @@ observation, info = env.reset(seed=42, return_info=True)

for _ in range(1000):
env.render()
observation, reward, done, info = env.step(env.action_space.sample())
observation, reward, terminated, truncated, info = env.step(env.action_space.sample())

if done:
if terminated or truncated:
observation, info = env.reset(return_info=True)

env.close()
@@ -73,6 +71,50 @@ the format of valid observations is specified by `env.observation_space`.
In the example above we sampled random actions via `env.action_space.sample()`. Note that we need to seed the action space separately from the
environment to ensure reproducible samples.
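
For example, a minimal sketch of seeding the action space so that the sampled actions are reproducible:

```python
env.action_space.seed(42)
actions = [env.action_space.sample() for _ in range(3)]  # same sequence on every run with this seed
```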

### Change in env.step API

Previously, the step method returned a single boolean, `done`. This is being deprecated in favour of returning two booleans, `terminated` and `truncated`.

Environment termination can happen for any number of reasons inherent to the environment, e.g. task completion or failure. This is distinctly different from an episode ending due to a user-set time limit, which is referred to as 'truncation'.

The `terminated` signal is set to `True` when the core environment terminates inherently, e.g. because of task completion or failure.
The `truncated` signal is set to `True` when the episode ends because of a time limit that is not inherent to the environment.

`terminated=True` and `truncated=True` can occur together when termination and truncation happen at the same step.
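
As a quick sketch of the change in signature (mirroring the loop example above):

```python
action = env.action_space.sample()

# Old API (deprecated): a single boolean
# observation, reward, done, info = env.step(action)

# New API: separate termination and truncation signals
observation, reward, terminated, truncated, info = env.step(action)
if terminated or truncated:
    observation, info = env.reset(return_info=True)
```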

#### Motivation
This is done to remove the ambiguity in the `done` signal. `done=True` does not distinguish between the environment terminating and the episode being truncated. This problem was previously avoided by setting `info['TimeLimit.truncated']` in case of a time limit through the TimeLimit wrapper. However, this forces the truncation to happen only through a wrapper, and does not accommodate environments which truncate in the core environment.
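
For reference, under the old API this distinction had to be recovered from the `info` dict, roughly like this (a sketch of the common pattern):

```python
observation, reward, done, info = env.step(action)
truncated = info.get("TimeLimit.truncated", False)
terminated = done and not truncated
```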

"However, this forces the truncation to happen only through a wrapper"

This is not true, any env can set info['TimeLimit.truncated'] = True...

Contributor Author

Oh of course, my bad!


The main reason we need to distinguish between these two signals is the implications for implementing the Temporal Difference update, where the reward along with the value of the next state(s) is used to update the value of the current state. At terminal states, there is no contribution from the value of the next state(s) and we only consider the value as a function of the reward at that state. However, this is not true when we forcibly truncate the episode. Time limits are often set during training to diversify experience, but the goal of the agent is not to maximize reward over this time period. The `done` signal is often used to control whether the value of the next state

The main difference is more high-level than the TD update: it is whether we treat the task as a finite- or infinite-horizon one.
See https://github.com/DLR-RM/stable-baselines3/blob/master/tests/test_gae.py#L136 for concrete example and test.

Contributor Author

Thank you for your comment! I thought about it a bit. Here's my understanding -

For infinite horizon tasks, the situation is simple - we always have to set practical time limits, so we need the terminated / truncated distinction here, where basically terminated is always False (and without the distinction, bootstrapping the next state value is skipped, leading to incorrect value estimates; this is also the test you linked).

For finite horizon tasks, it's a little bit tricky - this can be further split into two scenarios:

  1. The agent is aware of the time limit (included in the agent's observation) - here we need the distinction. Suppose the task time limit is 1000 steps, the agent is aware of this, and the user sets an additional episode time limit of 50 steps while training but lets it run to the end during eval; this then needs a distinction, since the agent should not optimize for the 50-step truncation. (So, at 1000 steps we don't bootstrap, at 50 steps we bootstrap.)
  2. The agent is not aware of the time limit - here the distinction doesn't matter, since the agent is unaware of either time limit, and we can always safely bootstrap.

So it's not just about infinite vs finite horizon, but also about whether the agent is time-aware 🤔 since these lead to different situations. Please correct me if I'm wrong.

For the TD update, my point was to emphasize the reward + gamma*v(next_state) target (when terminated=False, truncated=True) versus just the reward (terminated=True). My specific example I think restricts to value-based approaches, but a similar bootstrapping happens in almost all RL (in SB3 on-policy algos, e.g., when terminated=True, the final reward is bootstrapped with the next value). Maybe I can update the example with this bootstrapping, instead of mentioning it as a TD error update?

Apologies for the random long comment, I hope this explanation is accurate. I will update the docs to be more clear and general.


where basically terminated is always False

Terminated could be true (for instance in case of a failure), but I would prefer to talk about termination reasons (see my comments in openai/gym#2510 (comment)), as in practice, an episode is terminated if you reach the timeout.

bootstrapping next state value is skipped leading to incorrect value estimates,

yes, unless you pass the remaining time as input (as described in https://arxiv.org/abs/1712.00378, what you call "time-aware" below).

suppose if the task timelimit is 1000 steps, and the agent is aware of this, and the user sets an additional episode time limit

I'm not sure I get the example, in the sense that it does not make sense to me... do you have a concrete example where you want to set two time limits?
In your example, it feels like your new problem (the 50-step limit) is treated as an infinite-horizon problem when you bootstrap (because the agent never sees the real time limit). As in the infinite-horizon case, the true horizon depends on the discount factor.

The only case where time-aware and infinite horizon is not equivalent for me is when you want your agent to do something special depending on the time, for instance a jump in the last steps.

The agent is not aware of the timelimit - here the distinction doesn't matter, since the agent is unaware of either timelimit, and we can always safely bootstrap.

In that case, you are breaking the Markov assumption...

For the TD update,

My point was more to give a high-level reason of why we bootstrap or not, and dive into the details later.

Contributor Author

Terminated could be true (for instance in case of a failure)

Ohh oops yes, I meant terminated cannot be True due to a time limit here (given this PR's meaning and usage of terminated)

I'm not sure to get the example, in the sense that it does not make sense to me... do you have a concrete example where you want to set two time limits?

I was imagining a scenario where the agent had an age that was very long (say 10k steps) which it's aware of, but for the purpose of diversifying we wish to restart the episode from when the agent is at t>>0, and let it learn for 50 steps at a time. But it occurs to me that the age would no longer be considered a time limit (since it's no longer 10k steps from the start of the episode), rather a regular termination condition. So, my point is invalid 🤔 I think I also need to think more about this and what 'time limit' means. I'll get back on the issue thread.

In that case, you are breaking Markov assumption...

Right, so if the agent has to optimize for a time limit, it has to be time-aware to satisfy the Markov assumption

My point was more to give a high-level reason of why we bootstrap or not, and dive into the details later.

Oh okay, that makes sense

is used for backup.
```python
vf_target = rew + gamma * (1 - done) * vf(next_obs)
```
However, this leads to the next-state backup being ignored during truncations as well, which is incorrect (see [this paper](https://arxiv.org/abs/1712.00378) for details).

Instead, using explicit `terminated` and `truncated` signals resolves this problem.

```python
# vf_target = rew + gamma * (1 - done) * vf(next_obs)  # wrong
vf_target = rew + gamma * (1 - terminated) * vf(next_obs)  # correct
```
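
Put together with the new step API, a one-step target computation looks roughly like the following sketch (continuing from an `env` created as above; `vf` stands in for whatever value-function approximator is being trained):

```python
obs, info = env.reset(seed=0, return_info=True)
action = env.action_space.sample()
next_obs, rew, terminated, truncated, info = env.step(action)

gamma = 0.99
# Bootstrap from the next state unless the environment truly terminated;
# a pure time-limit truncation still bootstraps.
vf_target = rew + gamma * (1 - terminated) * vf(next_obs)
```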

#### Backward compatibility
Gym will retain support for the old API till v1.0 for ease of transition.

Users can toggle the old API through `make` by setting `return_two_dones=False`.

```python
env = gym.make("CartPole-v1", return_two_dones=False)
```
This can also be done explicitly through a wrapper:
```python
from gym.wrappers import StepCompatibility
env = StepCompatibility(CustomEnv(), return_two_dones=False)
```
For more details see the wrappers section.
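
For instance, a training loop written against the old API via this toggle might look like the following sketch (assuming the `return_two_dones` flag described above):

```python
env = gym.make("CartPole-v1", return_two_dones=False)

observation = env.reset()
done = False
while not done:
    observation, reward, done, info = env.step(env.action_space.sample())
env.close()
```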


## Standard methods

### Stepping
@@ -232,7 +274,7 @@ reward based on data in `info`). Such wrappers
can be implemented by inheriting from `Wrapper`.
Gym already provides many commonly used wrappers for you. Some examples:

- `TimeLimit`: Issue a done signal if a maximum number of timesteps has been exceeded (or the base environment has issued a done signal).
- `TimeLimit`: Issue a truncated signal if a maximum number of timesteps has been exceeded.
- `ClipAction`: Clip the action such that it lies in the action space (of type Box).
- `RescaleAction`: Rescale actions to lie in a specified interval.
- `TimeAwareObservation`: Add information about the index of timestep to observation. In some cases helpful to ensure that transitions are Markov.
@@ -275,7 +317,7 @@ where we obtain the corresponding key ID constants from pygame. If the `key_to_a

Furthermore, if you wish to plot real-time statistics as you play, you can use `gym.utils.play.PlayPlot`. Here's some sample code for plotting the reward for the last 5 seconds of gameplay:
```python
def callback(obs_t, obs_tp1, action, rew, done, info):
def callback(obs_t, obs_tp1, action, rew, terminated, truncated, info):
return [rew,]
plotter = PlayPlot(callback, 30 * 5, ["reward"])
env = gym.make("Pong-v0")
21 changes: 10 additions & 11 deletions docs/source/content/environment_creation.md
@@ -43,7 +43,7 @@ target on the grid that has been placed randomly at the beginning of the episode

- Observations provide the location of the target and agent.
- There are 4 actions in our environment, corresponding to the movements "right", "up", "left", and "down".
- A done signal is issued as soon as the agent has navigated to the grid cell where the target is located.
- A terminated signal is issued as soon as the agent has navigated to the grid cell where the target is located.
- Rewards are binary and sparse, meaning that the immediate reward is always zero, unless the agent has reached the target, in which case it is 1.

An episode in this environment (with `size=5`) might look like this:
@@ -136,7 +136,7 @@ terms). In that case, we would have to update the dictionary that is returned by

### Reset
The `reset` method will be called to initiate a new episode. You may assume that the `step` method will not
be called before `reset` has been called. Moreover, `reset` should be called whenever a done signal has been issued.
be called before `reset` has been called. Moreover, `reset` should be called whenever a terminated or truncated signal has been issued.
Users may pass the `seed` keyword to `reset` to initialize any random number generator that is used by the environment
to a deterministic state. It is recommended to use the random number generator `self.np_random` that is provided by the environment's
base class, `gym.Env`. If you only use this RNG, you do not need to worry much about seeding, *but you need to remember to
@@ -168,9 +168,9 @@ and `_get_info` that we implemented earlier for that:

### Step
The `step` method usually contains most of the logic of your environment. It accepts an `action`, computes the state of
the environment after applying that action and returns the 4-tuple `(observation, reward, done, info)`.
Once the new state of the environment has been computed, we can check whether it is a terminal state and we set `done`
accordingly. Since we are using sparse binary rewards in `GridWorldEnv`, computing `reward` is trivial once we know `done`. To gather
the environment after applying that action and returns the 5-tuple `(observation, reward, terminated, truncated, info)`.
Once the new state of the environment has been computed, we can check whether it is a terminal state and we set `terminated`
accordingly. Since we are using sparse binary rewards in `GridWorldEnv`, computing `reward` is trivial once we know `terminated`. `truncated` is not set here; it is more convenient to set it through a wrapper, as shown below, but if you prefer not to use a wrapper, you could also set it here. To gather
`observation` and `info`, we can again make use of `_get_obs` and `_get_info`:

```python
@@ -181,13 +181,13 @@
self._agent_location = np.clip(
self._agent_location + direction, 0, self.size - 1
)
# An episode is done iff the agent has reached the target
done = np.array_equal(self._agent_location, self._target_location)
reward = 1 if done else 0 # Binary sparse rewards
# An episode is terminated iff the agent has reached the target
terminated = np.array_equal(self._agent_location, self._target_location)
reward = 1 if terminated else 0 # Binary sparse rewards
observation = self._get_obs()
info = self._get_info()

return observation, reward, done, info
return observation, reward, terminated, False, info
```

### Rendering
@@ -289,8 +289,7 @@ register(
```
The keyword argument `max_episode_steps=300` will ensure that GridWorld environments that are instantiated via `gym.make`
will be wrapped in a `TimeLimit` wrapper (see [the wrapper documentation](https://www.gymlibrary.ml/pages/wrappers/index)
for more information). A done signal will then be produced if the agent has reached the target *or* 300 steps have been
executed in the current episode. To distinguish truncation and termination, you can check `info["TimeLimit.truncated"]`.
for more information). A `terminated` signal will then be produced if the agent has reached the target. A `truncated` signal will be issued if 300 steps have been executed in the current episode.
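
A rollout against the registered environment could then look roughly like this (a sketch; the id below is a placeholder for whatever id was passed to `register` above):

```python
env = gym.make("GridWorld-v0")  # placeholder id; use the id registered above
observation, info = env.reset(seed=42, return_info=True)

terminated = truncated = False
while not (terminated or truncated):
    observation, reward, terminated, truncated, info = env.step(env.action_space.sample())

# terminated -> the agent reached the target
# truncated  -> the 300-step TimeLimit was hit
```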

Apart from `id` and `entrypoint`, you may pass the following additional keyword arguments to `register`:

38 changes: 22 additions & 16 deletions docs/source/content/vector_api.md
@@ -22,7 +22,7 @@ The following example runs 3 copies of the ``CartPole-v1`` environment in parall
>>> envs = gym.vector.make("CartPole-v1", num_envs=3)
>>> envs.reset()
>>> actions = np.array([1, 0, 1])
>>> observations, rewards, dones, infos = envs.step(actions)
>>> observations, rewards, terminateds, truncateds, infos = envs.step(actions)

>>> observations
array([[ 0.00122802, 0.16228443, 0.02521779, -0.23700266],
Expand All @@ -31,7 +31,9 @@ array([[ 0.00122802, 0.16228443, 0.02521779, -0.23700266],
dtype=float32)
>>> rewards
array([1., 1., 1.])
>>> dones
>>> terminateds
array([False, False, False])
>>> truncateds
array([False, False, False])
>>> infos
({}, {}, {})
@@ -91,7 +93,7 @@ While standard Gym environments take a single action and return a single observa
dtype=float32)

>>> actions = np.array([1, 0, 1])
>>> observations, rewards, dones, infos = envs.step(actions)
>>> observations, rewards, terminateds, truncateds, infos = envs.step(actions)

>>> observations
array([[ 0.00187507, 0.18986781, -0.03168437, -0.301252 ],
Expand All @@ -100,7 +102,9 @@ While standard Gym environments take a single action and return a single observa
dtype=float32)
>>> rewards
array([1., 1., 1.])
>>> dones
>>> terminateds
array([False, False, False])
>>> truncateds
array([False, False, False])
>>> infos
({}, {}, {})
@@ -123,7 +127,7 @@ Vectorized environments are compatible with any sub-environment, regardless of t
...
... def step(self, action):
... observation = self.observation_space.sample()
... return (observation, 0., False, {})
... return (observation, 0., False, False, {})

>>> envs = gym.vector.AsyncVectorEnv([lambda: DictEnv()] * 3)
>>> envs.observation_space
Expand All @@ -137,7 +141,7 @@ Vectorized environments are compatible with any sub-environment, regardless of t
... "jump": np.array([0, 1, 0]),
... "acceleration": np.random.uniform(-1., 1., size=(3, 2))
... }
>>> observations, rewards, dones, infos = envs.step(actions)
>>> observations, rewards, terminateds, truncateds, infos = envs.step(actions)
>>> observations
{"position": array([[-0.5337036 , 0.7439302 , 0.41748118],
[ 0.9373266 , -0.5780453 , 0.8987405 ],
@@ -152,8 +156,8 @@ The sub-environments inside a vectorized environment automatically call `gym.Env
>>> envs = gym.vector.make("FrozenLake-v1", num_envs=3, is_slippery=False)
>>> envs.reset()
array([0, 0, 0])
>>> observations, rewards, dones, infos = envs.step(np.array([1, 2, 2]))
>>> observations, rewards, dones, infos = envs.step(np.array([1, 2, 1]))
>>> observations, rewards, terminateds, truncateds, infos = envs.step(np.array([1, 2, 2]))
>>> observations, rewards, terminateds, truncateds, infos = envs.step(np.array([1, 2, 1]))

>>> terminateds
array([False, False, True])
@@ -201,7 +205,7 @@ This is convenient, for example, if you instantiate a policy. In the following e
... )
>>> observations = envs.reset()
>>> actions = policy(weights, observations).argmax(axis=1)
>>> observations, rewards, dones, infos = envs.step(actions)
>>> observations, rewards, terminateds, truncateds, infos = envs.step(actions)


## Intermediate Usage
@@ -235,11 +239,11 @@ Because sometimes things may not go as planned, the exceptions raised in sub-env
... if action == 1:
... raise ValueError("An error occurred.")
... observation = self.observation_space.sample()
... return (observation, 0., False, {})
... return (observation, 0., False, False, {})

>>> envs = gym.vector.AsyncVectorEnv([lambda: ErrorEnv()] * 3)
>>> observations = envs.reset()
>>> observations, rewards, dones, infos = envs.step(np.array([0, 0, 1]))
>>> observations, rewards, terminateds, truncateds, infos = envs.step(np.array([0, 0, 1]))
ERROR: Received the following error from Worker-2: ValueError: An error occurred.
ERROR: Shutting down Worker-2.
ERROR: Raising the last exception back to the main process.
@@ -272,15 +276,15 @@ In the following example, we create a new environment `SMILESEnv`, whose observa
...
... def step(self, action):
... self._state += self.observation_space.symbols[action]
... reward = done = (action == 0)
... return (self._state, float(reward), done, {})
... reward = terminated = (action == 0)
... return (self._state, float(reward), terminated, False, {})

>>> envs = gym.vector.AsyncVectorEnv(
... [lambda: SMILESEnv()] * 3,
... shared_memory=False
... )
>>> envs.reset()
>>> observations, rewards, dones, infos = envs.step(np.array([2, 5, 4]))
>>> observations, rewards, terminateds, truncateds, infos = envs.step(np.array([2, 5, 4]))
>>> observations
('[(', '[O', '[C')
```
@@ -352,7 +356,7 @@ If you use `AsyncVectorEnv` with a custom observation space, you must set ``shar
>>> envs = gym.vector.make("CartPole-v1", num_envs=3)
>>> envs.reset()
>>> actions = np.array([1, 0, 1])
>>> observations, rewards, dones, infos = envs.step(actions)
>>> observations, rewards, terminateds, truncateds, infos = envs.step(actions)

>>> observations
array([[ 0.00122802, 0.16228443, 0.02521779, -0.23700266],
Expand All @@ -361,7 +365,9 @@ If you use `AsyncVectorEnv` with a custom observation space, you must set ``shar
dtype=float32)
>>> rewards
array([1., 1., 1.])
>>> dones
>>> terminateds
array([False, False, False])
>>> truncateds
array([False, False, False])
>>> infos
({}, {}, {})