7 changes: 4 additions & 3 deletions docs/scripts/gen_gifs.py
@@ -57,15 +57,16 @@
frames = []
while True:
state = env.reset()
done = False
while not done and len(frames) <= LENGTH:
terminated = False
truncated = False
while not (terminated or truncated) and len(frames) <= LENGTH:

frame = env.render(mode='rgb_array')
repeat = int(60/env.metadata["render_fps"]) if env_type == "toy_text" else 1
for i in range(repeat):
frames.append(Image.fromarray(frame))
action = env.action_space.sample()
state_next, reward, done, info = env.step(action)
state_next, reward, terminated, truncated, info = env.step(action)

if len(frames) > LENGTH:
break
58 changes: 50 additions & 8 deletions docs/source/content/api.md
@@ -36,10 +36,8 @@ to a specific point in space. If it succeeds in doing this (or makes some progre
alongside the observation for this timestep. The reward may also be negative or 0, if the agent did not yet succeed (or did not make any progress).
The agent will then be trained to maximize the reward it accumulates over many timesteps.

After some timesteps, the environment may enter a terminal state. For instance, the robot may have crashed! In that case,
we want to reset the environment to a new initial state. The environment issues a done signal to the agent if it enters such a terminal state.
Not all done signals must be triggered by a "catastrophic failure": Sometimes we also want to issue a done signal after
a fixed number of timesteps, or if the agent has succeeded in completing some task in the environment.
After some timesteps, the environment may enter a terminal state. For instance, the robot may have crashed, or the agent may have succeeded in completing a task. In that case, we want to reset the environment to a new initial state. The environment issues a terminated signal to the agent if it enters such a terminal state. Sometimes we also want to end the episode after a fixed number of timesteps; in this case, the environment issues a truncated signal.
This is a new change in the API. Previously, a single done signal was issued when an episode ended for any reason. This has been changed in favour of issuing two separate signals - terminated and truncated.

Let's see what the agent-environment loop looks like in Gym.
This example will run an instance of the `LunarLander-v2` environment for 1000 timesteps, rendering the environment at each step. You should see a window pop up rendering the environment
@@ -53,9 +51,9 @@ observation, info = env.reset(seed=42, return_info=True)

for _ in range(1000):
env.render()
observation, reward, done, info = env.step(env.action_space.sample())
observation, reward, terminated, truncated, info = env.step(env.action_space.sample())

if done:
if terminated or truncated:
observation, info = env.reset(return_info=True)

env.close()
@@ -73,6 +71,50 @@ the format of valid observations is specified by `env.observation_space`.
In the example above we sampled random actions via `env.action_space.sample()`. Note that we need to seed the action space separately from the
environment to ensure reproducible samples.
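
For example, a minimal sketch of seeding the action space so that the sampled actions are reproducible:

```python
env.action_space.seed(42)
actions = [env.action_space.sample() for _ in range(3)]  # same sequence on every run with this seed
```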

### Change in env.step API

Previously, the step method returned a single boolean, `done`. This is being deprecated in favour of returning two booleans, `terminated` and `truncated`.

Environment termination can happen for any number of reasons inherent to the environment, e.g. task completion or failure. This is distinctly different from an episode ending due to a user-set time limit, which is referred to as 'truncation'.

The `terminated` signal is set to `True` when the core environment terminates inherently, e.g. because of task completion or failure.
The `truncated` signal is set to `True` when the episode ends because of a time limit that is not inherent to the environment.

`terminated=True` and `truncated=True` can occur together when termination and truncation happen at the same step.
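
As a quick sketch of the change in signature (mirroring the loop example above):

```python
action = env.action_space.sample()

# Old API (deprecated): a single boolean
# observation, reward, done, info = env.step(action)

# New API: separate termination and truncation signals
observation, reward, terminated, truncated, info = env.step(action)
if terminated or truncated:
    observation, info = env.reset(return_info=True)
```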

#### Motivation
This is done to remove the ambiguity in the `done` signal. `done=True` does not distinguish between the environment terminating and the episode being truncated. This problem was previously avoided by setting `info['TimeLimit.truncated']` in case of a time limit through the TimeLimit wrapper. However, this forces the truncation to happen only through a wrapper, and does not accommodate environments which truncate in the core environment.
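
For reference, under the old API this distinction had to be recovered from the `info` dict, roughly like this (a sketch of the common pattern):

```python
observation, reward, done, info = env.step(action)
truncated = info.get("TimeLimit.truncated", False)
terminated = done and not truncated
```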

"However, this forces the truncation to happen only through a wrapper"

This is not true, any env can set info['TimeLimit.truncated'] = True...

Contributor Author

Oh of course, my bad!


The main reason we need to distinguish between these two signals is the implications for implementing the Temporal Difference update, where the reward along with the value of the next state(s) is used to update the value of the current state. At terminal states, there is no contribution from the value of the next state(s) and we only consider the value as a function of the reward at that state. However, this is not true when we forcibly truncate the episode. Time limits are often set during training to diversify experience, but the goal of the agent is not to maximize reward over this time period. The `done` signal is often used to control whether the value of the next state

The main difference is more high-level than the TD update: it is whether we treat the task as a finite- or infinite-horizon one.
See https://github.com/DLR-RM/stable-baselines3/blob/master/tests/test_gae.py#L136 for concrete example and test.

Contributor Author

Thank you for your comment! I thought about it a bit. Here's my understanding -

For infinite horizon tasks, the situation is simple - we always have to set practical time limits, so we need the terminated / truncated distinction here, where basically terminated is always False (and without the distinction, bootstrapping the next state value is skipped, leading to incorrect value estimates; this is also the test you linked).

For finite horizon tasks, it's a little bit tricky - this can be further split into two scenarios:

  1. The agent is aware of the time limit (included in the agent's observation) - here we need the distinction. Suppose the task time limit is 1000 steps, the agent is aware of this, and the user sets an additional episode time limit of 50 steps while training but lets it run to the end during eval; this then needs a distinction, since the agent should not optimize for the 50-step truncation. (So, at 1000 steps we don't bootstrap, at 50 steps we bootstrap.)
  2. The agent is not aware of the time limit - here the distinction doesn't matter, since the agent is unaware of either time limit, and we can always safely bootstrap.

So it's not just about infinite vs finite horizon, but also about whether the agent is time-aware 🤔 since these lead to different situations. Please correct me if I'm wrong.

For the TD update, my point was to emphasize the reward + gamma*v(next_state) target (when terminated=False, truncated=True) versus just the reward (terminated=True). My specific example I think restricts to value-based approaches, but a similar bootstrapping happens in almost all RL (in SB3 on-policy algos, e.g., when terminated=True, the final reward is bootstrapped with the next value). Maybe I can update the example with this bootstrapping, instead of mentioning it as a TD error update?

Apologies for the random long comment, I hope this explanation is accurate. I will update the docs to be more clear and general.


where basically terminated is always False

Terminated could be true (for instance in case of a failure), but I would prefer to talk about termination reasons (see my comments in openai/gym#2510 (comment)), as in practice, an episode is terminated if you reach the timeout.

bootstrapping next state value is skipped leading to incorrect value estimates,

yes, unless you pass the remaining time as input (as described in https://arxiv.org/abs/1712.00378, what you call "time-aware" below).

suppose if the task timelimit is 1000 steps, and the agent is aware of this, and the user sets an additional episode time limit

I'm not sure I get the example, in the sense that it does not make sense to me... do you have a concrete example where you want to set two time limits?
In your example, it feels like your new problem (the 50-step limit) is treated as an infinite-horizon problem when you bootstrap (because the agent never sees the real time limit). As in the infinite-horizon case, the true horizon depends on the discount factor.

The only case where time-aware and infinite horizon is not equivalent for me is when you want your agent to do something special depending on the time, for instance a jump in the last steps.

The agent is not aware of the timelimit - here the distinction doesn't matter, since the agent is unaware of either timelimit, and we can always safely bootstrap.

In that case, you are breaking the Markov assumption...

For the TD update,

My point was more to give a high-level reason of why we bootstrap or not, and dive into the details later.

Contributor Author

Terminated could be true (for instance in case of a failure)

Ohh oops yes, I meant terminated cannot be True due to a time limit here (given this PR's meaning and usage of terminated)

I'm not sure to get the example, in the sense that it does not make sense to me... do you have a concrete example where you want to set two time limits?

I was imagining a scenario where the agent had an age that was very long (say 10k steps) which it's aware of, but for the purpose of diversifying we wish to restart the episode from when the agent is at t>>0, and let it learn for 50 steps at a time. But it occurs to me that the age would no longer be considered a time limit (since it's no longer 10k steps from the start of the episode), rather a regular termination condition. So, my point is invalid 🤔 I think I also need to think more about this and what 'time limit' means. I'll get back on the issue thread.

In that case, you are breaking Markov assumption...

Right, so if the agent has to optimize for a time limit, it has to be time-aware to satisfy the Markov assumption

My point was more to give a high-level reason of why we bootstrap or not, and dive into the details later.

Oh okay, that makes sense

is used for backup.
```python
vf_target = rew + gamma * (1 - done) * vf(next_obs)
```
However, this leads to the next-state backup being ignored during truncations as well, which is incorrect (see [this paper](https://arxiv.org/abs/1712.00378) for details).

Instead, using explicit `terminated` and `truncated` signals resolves this problem.

```python
# vf_target = rew + gamma * (1 - done) * vf(next_obs)  # wrong
vf_target = rew + gamma * (1 - terminated) * vf(next_obs)  # correct
```
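
Put together with the new step API, a one-step target computation looks roughly like the following sketch (continuing from an `env` created as above; `vf` stands in for whatever value-function approximator is being trained):

```python
obs, info = env.reset(seed=0, return_info=True)
action = env.action_space.sample()
next_obs, rew, terminated, truncated, info = env.step(action)

gamma = 0.99
# Bootstrap from the next state unless the environment truly terminated;
# a pure time-limit truncation still bootstraps.
vf_target = rew + gamma * (1 - terminated) * vf(next_obs)
```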

#### Backward compatibility
Gym will retain support for the old API till v1.0 for ease of transition.

Users can toggle the old API through `make` by setting `return_two_dones=False`.

```python
env = gym.make("CartPole-v1", return_two_dones=False)
```
This can also be done explicitly through a wrapper:
```python
from gym.wrappers import StepCompatibility
env = StepCompatibility(CustomEnv(), return_two_dones=False)
```
For more details see the wrappers section.
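
For instance, a training loop written against the old API via this toggle might look like the following sketch (assuming the `return_two_dones` flag described above):

```python
env = gym.make("CartPole-v1", return_two_dones=False)

observation = env.reset()
done = False
while not done:
    observation, reward, done, info = env.step(env.action_space.sample())
env.close()
```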


## Standard methods

### Stepping
@@ -232,7 +274,7 @@ reward based on data in `info`). Such wrappers
can be implemented by inheriting from `Wrapper`.
Gym already provides many commonly used wrappers for you. Some examples:

- `TimeLimit`: Issue a done signal if a maximum number of timesteps has been exceeded (or the base environment has issued a done signal).
- `TimeLimit`: Issue a truncated signal if a maximum number of timesteps has been exceeded.
- `ClipAction`: Clip the action such that it lies in the action space (of type Box).
- `RescaleAction`: Rescale actions to lie in a specified interval.
- `TimeAwareObservation`: Add information about the index of timestep to observation. In some cases helpful to ensure that transitions are Markov.
@@ -275,7 +317,7 @@ where we obtain the corresponding key ID constants from pygame. If the `key_to_a

Furthermore, if you wish to plot real-time statistics as you play, you can use `gym.utils.play.PlayPlot`. Here's some sample code for plotting the reward for the last 5 seconds of gameplay:
```python
def callback(obs_t, obs_tp1, action, rew, done, info):
def callback(obs_t, obs_tp1, action, rew, terminated, truncated, info):
return [rew,]
plotter = PlayPlot(callback, 30 * 5, ["reward"])
env = gym.make("Pong-v0")
21 changes: 10 additions & 11 deletions docs/source/content/environment_creation.md
@@ -43,7 +43,7 @@ target on the grid that has been placed randomly at the beginning of the episode

- Observations provide the location of the target and agent.
- There are 4 actions in our environment, corresponding to the movements "right", "up", "left", and "down".
- A done signal is issued as soon as the agent has navigated to the grid cell where the target is located.
- A terminated signal is issued as soon as the agent has navigated to the grid cell where the target is located.
- Rewards are binary and sparse, meaning that the immediate reward is always zero, unless the agent has reached the target, in which case it is 1.

An episode in this environment (with `size=5`) might look like this:
@@ -136,7 +136,7 @@ terms). In that case, we would have to update the dictionary that is returned by

### Reset
The `reset` method will be called to initiate a new episode. You may assume that the `step` method will not
be called before `reset` has been called. Moreover, `reset` should be called whenever a done signal has been issued.
be called before `reset` has been called. Moreover, `reset` should be called whenever a terminated or truncated signal has been issued.
Users may pass the `seed` keyword to `reset` to initialize any random number generator that is used by the environment
to a deterministic state. It is recommended to use the random number generator `self.np_random` that is provided by the environment's
base class, `gym.Env`. If you only use this RNG, you do not need to worry much about seeding, *but you need to remember to
@@ -168,9 +168,9 @@ and `_get_info` that we implemented earlier for that:

### Step
The `step` method usually contains most of the logic of your environment. It accepts an `action`, computes the state of
the environment after applying that action and returns the 4-tuple `(observation, reward, done, info)`.
Once the new state of the environment has been computed, we can check whether it is a terminal state and we set `done`
accordingly. Since we are using sparse binary rewards in `GridWorldEnv`, computing `reward` is trivial once we know `done`. To gather
the environment after applying that action and returns the 5-tuple `(observation, reward, terminated, truncated, info)`.
Once the new state of the environment has been computed, we can check whether it is a terminal state and we set `terminated`
accordingly. Since we are using sparse binary rewards in `GridWorldEnv`, computing `reward` is trivial once we know `terminated`. `truncated` is not set here; it is more convenient to set it through a wrapper, as shown below, but if you prefer not to use a wrapper, you could also set it here. To gather
`observation` and `info`, we can again make use of `_get_obs` and `_get_info`:

```python
@@ -181,13 +181,13 @@
self._agent_location = np.clip(
self._agent_location + direction, 0, self.size - 1
)
# An episode is done iff the agent has reached the target
done = np.array_equal(self._agent_location, self._target_location)
reward = 1 if done else 0 # Binary sparse rewards
# An episode is terminated iff the agent has reached the target
terminated = np.array_equal(self._agent_location, self._target_location)
reward = 1 if terminated else 0 # Binary sparse rewards
observation = self._get_obs()
info = self._get_info()

return observation, reward, done, info
return observation, reward, terminated, False, info
```

### Rendering
@@ -289,8 +289,7 @@ register(
```
The keyword argument `max_episode_steps=300` will ensure that GridWorld environments that are instantiated via `gym.make`
will be wrapped in a `TimeLimit` wrapper (see [the wrapper documentation](https://www.gymlibrary.ml/pages/wrappers/index)
for more information). A done signal will then be produced if the agent has reached the target *or* 300 steps have been
executed in the current episode. To distinguish truncation and termination, you can check `info["TimeLimit.truncated"]`.
for more information). A `terminated` signal will then be produced if the agent has reached the target. A `truncated` signal will be issued if 300 steps have been executed in the current episode.
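
A rollout against the registered environment could then look roughly like this (a sketch; the id below is a placeholder for whatever id was passed to `register` above):

```python
env = gym.make("GridWorld-v0")  # placeholder id; use the id registered above
observation, info = env.reset(seed=42, return_info=True)

terminated = truncated = False
while not (terminated or truncated):
    observation, reward, terminated, truncated, info = env.step(env.action_space.sample())

# terminated -> the agent reached the target
# truncated  -> the 300-step TimeLimit was hit
```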

Apart from `id` and `entrypoint`, you may pass the following additional keyword arguments to `register`:

38 changes: 22 additions & 16 deletions docs/source/content/vector_api.md
@@ -22,7 +22,7 @@ The following example runs 3 copies of the ``CartPole-v1`` environment in parall
>>> envs = gym.vector.make("CartPole-v1", num_envs=3)
>>> envs.reset()
>>> actions = np.array([1, 0, 1])
>>> observations, rewards, dones, infos = envs.step(actions)
>>> observations, rewards, terminateds, truncateds, infos = envs.step(actions)

>>> observations
array([[ 0.00122802, 0.16228443, 0.02521779, -0.23700266],
Expand All @@ -31,7 +31,9 @@ array([[ 0.00122802, 0.16228443, 0.02521779, -0.23700266],
dtype=float32)
>>> rewards
array([1., 1., 1.])
>>> dones
>>> terminateds
array([False, False, False])
>>> truncateds
array([False, False, False])
>>> infos
({}, {}, {})
@@ -91,7 +93,7 @@ While standard Gym environments take a single action and return a single observa
dtype=float32)

>>> actions = np.array([1, 0, 1])
>>> observations, rewards, dones, infos = envs.step(actions)
>>> observations, rewards, terminateds, truncateds, infos = envs.step(actions)

>>> observations
array([[ 0.00187507, 0.18986781, -0.03168437, -0.301252 ],
Expand All @@ -100,7 +102,9 @@ While standard Gym environments take a single action and return a single observa
dtype=float32)
>>> rewards
array([1., 1., 1.])
>>> dones
>>> terminateds
array([False, False, False])
>>> truncateds
array([False, False, False])
>>> infos
({}, {}, {})
@@ -123,7 +127,7 @@ Vectorized environments are compatible with any sub-environment, regardless of t
...
... def step(self, action):
... observation = self.observation_space.sample()
... return (observation, 0., False, {})
... return (observation, 0., False, False, {})

>>> envs = gym.vector.AsyncVectorEnv([lambda: DictEnv()] * 3)
>>> envs.observation_space
Expand All @@ -137,7 +141,7 @@ Vectorized environments are compatible with any sub-environment, regardless of t
... "jump": np.array([0, 1, 0]),
... "acceleration": np.random.uniform(-1., 1., size=(3, 2))
... }
>>> observations, rewards, dones, infos = envs.step(actions)
>>> observations, rewards, terminateds, truncateds, infos = envs.step(actions)
>>> observations
{"position": array([[-0.5337036 , 0.7439302 , 0.41748118],
[ 0.9373266 , -0.5780453 , 0.8987405 ],
@@ -152,8 +156,8 @@ The sub-environments inside a vectorized environment automatically call `gym.Env
>>> envs = gym.vector.make("FrozenLake-v1", num_envs=3, is_slippery=False)
>>> envs.reset()
array([0, 0, 0])
>>> observations, rewards, dones, infos = envs.step(np.array([1, 2, 2]))
>>> observations, rewards, dones, infos = envs.step(np.array([1, 2, 1]))
>>> observations, rewards, terminateds, truncateds, infos = envs.step(np.array([1, 2, 2]))
>>> observations, rewards, terminateds, truncateds, infos = envs.step(np.array([1, 2, 1]))

>>> terminateds
array([False, False, True])
@@ -201,7 +205,7 @@ This is convenient, for example, if you instantiate a policy. In the following e
... )
>>> observations = envs.reset()
>>> actions = policy(weights, observations).argmax(axis=1)
>>> observations, rewards, dones, infos = envs.step(actions)
>>> observations, rewards, terminateds, truncateds, infos = envs.step(actions)


## Intermediate Usage
@@ -235,11 +239,11 @@ Because sometimes things may not go as planned, the exceptions raised in sub-env
... if action == 1:
... raise ValueError("An error occurred.")
... observation = self.observation_space.sample()
... return (observation, 0., False, {})
... return (observation, 0., False, False, {})

>>> envs = gym.vector.AsyncVectorEnv([lambda: ErrorEnv()] * 3)
>>> observations = envs.reset()
>>> observations, rewards, dones, infos = envs.step(np.array([0, 0, 1]))
>>> observations, rewards, terminateds, truncateds, infos = envs.step(np.array([0, 0, 1]))
ERROR: Received the following error from Worker-2: ValueError: An error occurred.
ERROR: Shutting down Worker-2.
ERROR: Raising the last exception back to the main process.
@@ -272,15 +276,15 @@ In the following example, we create a new environment `SMILESEnv`, whose observa
...
... def step(self, action):
... self._state += self.observation_space.symbols[action]
... reward = done = (action == 0)
... return (self._state, float(reward), done, {})
... reward = terminated = (action == 0)
... return (self._state, float(reward), terminated, False, {})

>>> envs = gym.vector.AsyncVectorEnv(
... [lambda: SMILESEnv()] * 3,
... shared_memory=False
... )
>>> envs.reset()
>>> observations, rewards, dones, infos = envs.step(np.array([2, 5, 4]))
>>> observations, rewards, terminateds, truncateds, infos = envs.step(np.array([2, 5, 4]))
>>> observations
('[(', '[O', '[C')
```
@@ -352,7 +356,7 @@ If you use `AsyncVectorEnv` with a custom observation space, you must set ``shar
>>> envs = gym.vector.make("CartPole-v1", num_envs=3)
>>> envs.reset()
>>> actions = np.array([1, 0, 1])
>>> observations, rewards, dones, infos = envs.step(actions)
>>> observations, rewards, terminateds, truncateds, infos = envs.step(actions)

>>> observations
array([[ 0.00122802, 0.16228443, 0.02521779, -0.23700266],
Expand All @@ -361,7 +365,9 @@ If you use `AsyncVectorEnv` with a custom observation space, you must set ``shar
dtype=float32)
>>> rewards
array([1., 1., 1.])
>>> dones
>>> terminateds
array([False, False, False])
>>> truncateds
array([False, False, False])
>>> infos
({}, {}, {})