diff --git a/docs/scripts/gen_gifs.py b/docs/scripts/gen_gifs.py
index ccad23c7..64b880c1 100644
--- a/docs/scripts/gen_gifs.py
+++ b/docs/scripts/gen_gifs.py
@@ -57,15 +57,16 @@
     frames = []
     while True:
         state = env.reset()
-        done = False
-        while not done and len(frames) <= LENGTH:
+        terminated = False
+        truncated = False
+        while not (terminated or truncated) and len(frames) <= LENGTH:
             frame = env.render(mode='rgb_array')
             repeat = int(60/env.metadata["render_fps"]) if env_type == "toy_text" else 1
             for i in range(repeat):
                 frames.append(Image.fromarray(frame))
             action = env.action_space.sample()
-            state_next, reward, done, info = env.step(action)
+            state_next, reward, terminated, truncated, info = env.step(action)
 
         if len(frames) > LENGTH:
             break
diff --git a/docs/source/content/api.md b/docs/source/content/api.md
index 7e92d715..be09095a 100644
--- a/docs/source/content/api.md
+++ b/docs/source/content/api.md
@@ -36,10 +36,8 @@ to a specific point in space. If it succeeds in doing this (or makes some progre
 alongside the observation for this timestep. The reward may also be negative or 0, if the agent did not yet succeed (or did not make any progress). The agent will then be trained to
 maximize the reward it accumulates over many timesteps.
 
-After some timesteps, the environment may enter a terminal state. For instance, the robot may have crashed! In that case,
-we want to reset the environment to a new initial state. The environment issues a done signal to the agent if it enters such a terminal state.
-Not all done signals must be triggered by a "catastrophic failure": Sometimes we also want to issue a done signal after
-a fixed number of timesteps, or if the agent has succeeded in completing some task in the environment.
+After some timesteps, the environment may enter a terminal state. For instance, the robot may have crashed, or the agent may have succeeded in completing a task. In that case, we want to reset the environment to a new initial state. The environment issues a terminated signal to the agent if it enters such a terminal state. Sometimes we also want to end the episode after a fixed number of timesteps; in that case, the environment issues a truncated signal.
+This is a recent change to the API: previously, a single done signal was issued when an episode ended for any reason. This has been replaced by two separate signals, terminated and truncated.
 
 Let's see what the agent-environment loop looks like in Gym.
 This example will run an instance of `LunarLander-v2` environment for 1000 timesteps, rendering the environment at each step. You should see a window pop up rendering the environment
@@ -53,9 +51,9 @@ observation, info = env.reset(seed=42, return_info=True)
 
 for _ in range(1000):
     env.render()
-    observation, reward, done, info = env.step(env.action_space.sample())
+    observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
 
-    if done:
+    if terminated or truncated:
         observation, info = env.reset(return_info=True)
 
 env.close()
@@ -73,6 +71,50 @@ the format of valid observations is specified by `env.observation_space`.
 In the example above we sampled random actions via `env.action_space.sample()`. Note that we need to seed the action space separately from the environment to ensure reproducible samples.
 
+### Change in env.step API
+
+Previously, the step method returned only one boolean, `done`. This is being deprecated in favour of returning two booleans, `terminated` and `truncated`.
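+
+For illustration, here is how the two call signatures compare (a minimal sketch; it assumes the new API is the default applied by `make`, as described in the backward-compatibility section below):
+
+```python
+import gym
+
+env = gym.make("CartPole-v1")
+obs = env.reset()
+action = env.action_space.sample()
+
+# Old (deprecated) step API returned a single `done` flag:
+# obs, reward, done, info = env.step(action)
+
+# New step API reports termination and truncation separately:
+obs, reward, terminated, truncated, info = env.step(action)
+env.close()
+```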
+
+Environment termination can happen for any number of reasons inherent to the environment, e.g. task completion or failure. This is distinct from an episode ending because of a user-set time limit, which is referred to as 'truncation'.
+
+The `terminated` signal is set to `True` when the core environment terminates inherently, e.g. because of task completion or failure.
+The `truncated` signal is set to `True` when the episode ends specifically because of a time limit that is not inherent to the environment.
+
+It is possible for both `terminated=True` and `truncated=True` to be returned when termination and truncation occur at the same step.
+
+#### Motivation
+This is done to remove the ambiguity in the `done` signal. `done=True` does not distinguish between the environment terminating and the episode being truncated. This problem was previously worked around by setting `info['TimeLimit.truncated']` whenever a time limit was reached through the `TimeLimit` wrapper. However, this forces truncation to happen only through a wrapper, and does not accommodate environments that truncate in the core environment.
+
+The main reason we need to distinguish between these two signals is their implication for the temporal-difference (TD) update, where the reward, along with the value of the next state(s), is used to update the value of the current state. At terminal states, there is no contribution from the value of the next state(s), and we only consider the value as a function of the reward at that state. However, this is not true when we forcibly truncate the episode. Time limits are often set during training to diversify experience, but the goal of the agent is not to maximize reward over this time period. The `done` signal is often used to control whether the value of the next state
+is used for backup.
+```python
+vf_target = rew + gamma * (1 - done) * vf(next_obs)
+```
+However, this also causes the next-state backup to be ignored during truncations, which is incorrect (see [this paper](https://arxiv.org/abs/1712.00378) for details).
+
+Instead, using explicit `terminated` and `truncated` signals resolves this problem.
+
+```python
+# vf_target = rew + gamma * (1 - done) * vf(next_obs)       # wrong
+vf_target = rew + gamma * (1 - terminated) * vf(next_obs)  # correct
+```
+
+#### Backward compatibility
+Gym will retain support for the old API until v1.0 for ease of transition.
+
+Users can toggle the old API through `make` by setting `return_two_dones=False`.
+
+```python
+env = gym.make("CartPole-v1", return_two_dones=False)
+```
+This can also be done explicitly through a wrapper:
+```python
+from gym.wrappers import StepCompatibility
+env = StepCompatibility(CustomEnv(), return_two_dones=False)
+```
+For more details see the wrappers section.
+
+
 ## Standard methods
 
 ### Stepping
@@ -232,7 +274,7 @@ reward based on data in `info`). Such wrappers can be implemented by inheriting
 from `Wrapper`.
 
 Gym already provides many commonly used wrappers for you. Some examples:
-- `TimeLimit`: Issue a done signal if a maximum number of timesteps has been exceeded (or the base environment has issued a done signal).
+- `TimeLimit`: Issue a truncated signal if a maximum number of timesteps has been exceeded (see the example after this list).
 - `ClipAction`: Clip the action such that it lies in the action space (of type Box).
 - `RescaleAction`: Rescale actions to lie in a specified interval
 - `TimeAwareObservation`: Add information about the index of timestep to observation. In some cases helpful to ensure that transitions are Markov.
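+
+For a quick illustration of the `TimeLimit` wrapper above, here is a minimal sketch (it assumes the new step API and the standard `CartPole-v1` registration):
+
+```python
+import gym
+from gym.wrappers import TimeLimit
+
+# Truncate episodes after 50 steps, regardless of what the inner environment does.
+env = TimeLimit(gym.make("CartPole-v1"), max_episode_steps=50)
+obs = env.reset()
+for _ in range(200):
+    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
+    if terminated or truncated:
+        # terminated: the pole fell over; truncated: the 50-step limit was hit
+        obs = env.reset()
+env.close()
+```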
@@ -275,7 +317,7 @@ where we obtain the corresponding key ID constants from pygame. If the `key_to_a Furthermore, you wish to plot real time statistics as you play, you can use `gym.utils.play.PlayPlot`. Here's some sample code for plotting the reward for last 5 second of gameplay: ```python -def callback(obs_t, obs_tp1, action, rew, done, info): +def callback(obs_t, obs_tp1, action, rew, terminated, truncated, info): return [rew,] plotter = PlayPlot(callback, 30 * 5, ["reward"]) env = gym.make("Pong-v0") diff --git a/docs/source/content/environment_creation.md b/docs/source/content/environment_creation.md index 10709910..8877e795 100644 --- a/docs/source/content/environment_creation.md +++ b/docs/source/content/environment_creation.md @@ -43,7 +43,7 @@ target on the grid that has been placed randomly at the beginning of the episode - Observations provide the location of the target and agent. - There are 4 actions in our environment, corresponding to the movements "right", "up", "left", and "down". -- A done signal is issued as soon as the agent has navigated to the grid cell where the target is located. +- A terminated signal is issued as soon as the agent has navigated to the grid cell where the target is located. - Rewards are binary and sparse, meaning that the immediate reward is always zero, unless the agent has reached the target, then it is 1 An episode in this environment (with `size=5`) might look like this: @@ -136,7 +136,7 @@ terms). In that case, we would have to update the dictionary that is returned by ### Reset The `reset` method will be called to initiate a new episode. You may assume that the `step` method will not -be called before `reset` has been called. Moreover, `reset` should be called whenever a done signal has been issued. +be called before `reset` has been called. Moreover, `reset` should be called whenever a terminated or truncated signal has been issued. Users may pass the `seed` keyword to `reset` to initialize any random number generator that is used by the environment to a deterministic state. It is recommended to use the random number generator `self.np_random` that is provided by the environment's base class, `gym.Env`. If you only use this RNG, you do not need to worry much about seeding, *but you need to remember to @@ -168,9 +168,9 @@ and `_get_info` that we implemented earlier for that: ### Step The `step` method usually contains most of the logic of your environment. It accepts an `action`, computes the state of -the environment after applying that action and returns the 4-tuple `(observation, reward, done, info)`. -Once the new state of the environment has been computed, we can check whether it is a terminal state and we set `done` -accordingly. Since we are using sparse binary rewards in `GridWorldEnv`, computing `reward` is trivial once we know `done`. To gather +the environment after applying that action and returns the 5-tuple `(observation, reward, terminated, truncated, info)`. +Once the new state of the environment has been computed, we can check whether it is a terminal state and we set `terminated` +accordingly. Since we are using sparse binary rewards in `GridWorldEnv`, computing `reward` is trivial once we know `terminated`. `truncated` is not set here. It is more convenient to set this through a wrapper as shown below. But if you prefer to not use a wrapper, you could also set it here. To gather `observation` and `info`, we can again make use of `_get_obs` and `_get_info`: ```python @@ -181,13 +181,13 @@ accordingly. 
Since we are using sparse binary rewards in `GridWorldEnv`, computi self._agent_location = np.clip( self._agent_location + direction, 0, self.size - 1 ) - # An episode is done iff the agent has reached the target - done = np.array_equal(self._agent_location, self._target_location) - reward = 1 if done else 0 # Binary sparse rewards + # An episode is terminated iff the agent has reached the target + terminated = np.array_equal(self._agent_location, self._target_location) + reward = 1 if terminated else 0 # Binary sparse rewards observation = self._get_obs() info = self._get_info() - return observation, reward, done, info + return observation, reward, terminated, False, info ``` ### Rendering @@ -289,8 +289,7 @@ register( ``` The keyword argument `max_episode_steps=300` will ensure that GridWorld environments that are instantiated via `gym.make` will be wrapped in a `TimeLimit` wrapper (see [the wrapper documentation](https://www.gymlibrary.ml/pages/wrappers/index) -for more information). A done signal will then be produced if the agent has reached the target *or* 300 steps have been -executed in the current episode. To distinguish truncation and termination, you can check `info["TimeLimit.truncated"]`. +for more information). A terminated signal will then be produced if the agent has reached the target. A `truncated` signal will be issued if 300 steps have been executed in the current episode. Apart from `id` and `entrypoint`, you may pass the following additional keyword arguments to `register`: diff --git a/docs/source/content/vector_api.md b/docs/source/content/vector_api.md index 143323e4..f2cb7954 100644 --- a/docs/source/content/vector_api.md +++ b/docs/source/content/vector_api.md @@ -22,7 +22,7 @@ The following example runs 3 copies of the ``CartPole-v1`` environment in parall >>> envs = gym.vector.make("CartPole-v1", num_envs=3) >>> envs.reset() >>> actions = np.array([1, 0, 1]) ->>> observations, rewards, dones, infos = envs.step(actions) +>>> observations, rewards, terminateds, truncateds, infos = envs.step(actions) >>> observations array([[ 0.00122802, 0.16228443, 0.02521779, -0.23700266], @@ -31,7 +31,9 @@ array([[ 0.00122802, 0.16228443, 0.02521779, -0.23700266], dtype=float32) >>> rewards array([1., 1., 1.]) ->>> dones +>>> terminateds +array([False, False, False]) +>>> truncateds array([False, False, False]) >>> infos ({}, {}, {}) @@ -91,7 +93,7 @@ While standard Gym environments take a single action and return a single observa dtype=float32) >>> actions = np.array([1, 0, 1]) - >>> observations, rewards, dones, infos = envs.step(actions) + >>> observations, rewards, terminateds, truncateds, infos = envs.step(actions) >>> observations array([[ 0.00187507, 0.18986781, -0.03168437, -0.301252 ], @@ -100,7 +102,9 @@ While standard Gym environments take a single action and return a single observa dtype=float32) >>> rewards array([1., 1., 1.]) - >>> dones + >>> terminateds + array([False, False, False]) + >>> truncateds array([False, False, False]) >>> infos ({}, {}, {}) @@ -123,7 +127,7 @@ Vectorized environments are compatible with any sub-environment, regardless of t ... ... def step(self, action): ... observation = self.observation_space.sample() - ... return (observation, 0., False, {}) + ... return (observation, 0., False, False, {}) >>> envs = gym.vector.AsyncVectorEnv([lambda: DictEnv()] * 3) >>> envs.observation_space @@ -137,7 +141,7 @@ Vectorized environments are compatible with any sub-environment, regardless of t ... "jump": np.array([0, 1, 0]), ... 
"acceleration": np.random.uniform(-1., 1., size=(3, 2)) ... } - >>> observations, rewards, dones, infos = envs.step(actions) + >>> observations, rewards, terminateds, truncateds, infos = envs.step(actions) >>> observations {"position": array([[-0.5337036 , 0.7439302 , 0.41748118], [ 0.9373266 , -0.5780453 , 0.8987405 ], @@ -152,8 +156,8 @@ The sub-environments inside a vectorized environment automatically call `gym.Env >>> envs = gym.vector.make("FrozenLake-v1", num_envs=3, is_slippery=False) >>> envs.reset() array([0, 0, 0]) - >>> observations, rewards, dones, infos = envs.step(np.array([1, 2, 2])) - >>> observations, rewards, dones, infos = envs.step(np.array([1, 2, 1])) + >>> observations, rewards, terminateds, truncateds, infos = envs.step(np.array([1, 2, 2])) + >>> observations, rewards, terminateds, truncateds, infos = envs.step(np.array([1, 2, 1])) >>> dones array([False, False, True]) @@ -201,7 +205,7 @@ This is convenient, for example, if you instantiate a policy. In the following e ... ) >>> observations = envs.reset() >>> actions = policy(weights, observations).argmax(axis=1) - >>> observations, rewards, dones, infos = envs.step(actions) + >>> observations, rewards, terminateds, truncateds, infos = envs.step(actions) ## Intermediate Usage @@ -235,11 +239,11 @@ Because sometimes things may not go as planned, the exceptions raised in sub-env ... if action == 1: ... raise ValueError("An error occurred.") ... observation = self.observation_space.sample() - ... return (observation, 0., False, {}) + ... return (observation, 0., False, False, {}) >>> envs = gym.vector.AsyncVectorEnv([lambda: ErrorEnv()] * 3) >>> observations = envs.reset() - >>> observations, rewards, dones, infos = envs.step(np.array([0, 0, 1])) + >>> observations, rewards, terminateds, truncateds, infos = envs.step(np.array([0, 0, 1])) ERROR: Received the following error from Worker-2: ValueError: An error occurred. ERROR: Shutting down Worker-2. ERROR: Raising the last exception back to the main process. @@ -272,15 +276,15 @@ In the following example, we create a new environment `SMILESEnv`, whose observa ... ... def step(self, action): ... self._state += self.observation_space.symbols[action] -... reward = done = (action == 0) -... return (self._state, float(reward), done, {}) +... reward = terminated = (action == 0) +... return (self._state, float(reward), terminated, False, {}) >>> envs = gym.vector.AsyncVectorEnv( ... [lambda: SMILESEnv()] * 3, ... shared_memory=False ... 
)
 >>> envs.reset()
->>> observations, rewards, dones, infos = envs.step(np.array([2, 5, 4]))
+>>> observations, rewards, terminateds, truncateds, infos = envs.step(np.array([2, 5, 4]))
 >>> observations
 ('[(', '[O', '[C')
 ```
@@ -352,7 +356,7 @@ If you use `AsyncVectorEnv` with a custom observation space, you must set ``shar
     >>> envs = gym.vector.make("CartPole-v1", num_envs=3)
     >>> envs.reset()
     >>> actions = np.array([1, 0, 1])
-    >>> observations, rewards, dones, infos = envs.step(actions)
+    >>> observations, rewards, terminateds, truncateds, infos = envs.step(actions)
 
     >>> observations
     array([[ 0.00122802, 0.16228443, 0.02521779, -0.23700266],
@@ -361,7 +365,9 @@ If you use `AsyncVectorEnv` with a custom observation space, you must set ``shar
     dtype=float32)
     >>> rewards
     array([1., 1., 1.])
-    >>> dones
+    >>> terminateds
+    array([False, False, False])
+    >>> truncateds
     array([False, False, False])
     >>> infos
     ({}, {}, {})
diff --git a/docs/source/content/wrappers.md b/docs/source/content/wrappers.md
index a9f902ad..78b44aa0 100644
--- a/docs/source/content/wrappers.md
+++ b/docs/source/content/wrappers.md
@@ -131,31 +131,31 @@
 
 Some users may want a wrapper which will automatically reset its wrapped environment when its wrapped environment reaches the done state. An advantage of this environment is that it will never produce undefined behavior as standard gym environments do when stepping beyond the done state.
 
-When calling step causes self.env.step() to return done,
+When calling step causes self.env.step() to return (terminated or truncated)=True,
 self.env.reset() is called,
 and the return format of self.step() is as follows:
 
 ```python
-new_obs, terminal_reward, terminal_done, info
+new_obs, closing_reward, closing_terminated, closing_truncated, info
 ```
 
 new_obs is the first observation after calling self.env.reset(),
 
-terminal_reward is the reward after calling self.env.step(),
+closing_reward is the reward after calling self.env.step(),
 prior to calling self.env.reset()
 
-terminal_done is always True
+The expression (closing_terminated or closing_truncated) is always True
 
 info is a dict containing all the keys from the info dict returned by
-the call to self.env.reset(), with an additional key "terminal_observation"
+the call to self.env.reset(), with an additional key "closing_observation"
 containing the observation returned by the last call to self.env.step()
-and "terminal_info" containing the info dict returned by the last call
+and "closing_info" containing the info dict returned by the last call
 to self.env.step().
 
-If done is not true when self.env.step() is called, self.step() returns
+If (terminated or truncated) is not true when self.env.step() is called, self.step() returns
 
 ```python
-obs, reward, done, info
+obs, reward, terminated, truncated, info
 ```
 as normal.
@@ -175,15 +175,83 @@ The AutoResetWrapper can also be applied using its constructor:
 
 ### Warning
 
 When using the AutoResetWrapper to collect rollouts, note
-that the when self.env.step() returns done, a
+that when self.env.step() returns (terminated or truncated)=True, a
 new observation from after calling self.env.reset() is returned
 by self.step() alongside the terminal reward and done state from the previous
 episode . If you need the terminal state from the previous
-episode, you need to retrieve it via the the "terminal_observation" key
+episode, you need to retrieve it via the "closing_observation" key
 in the info dict.
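+
+For example, a rollout loop that keeps the final observation of each episode might look like this (a minimal sketch, assuming `AutoResetWrapper` and the "closing_observation" key described above):
+
+```python
+import gym
+from gym.wrappers import AutoResetWrapper
+
+env = AutoResetWrapper(gym.make("CartPole-v1"))
+obs = env.reset()
+for _ in range(200):
+    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
+    if terminated or truncated:
+        closing_obs = info["closing_observation"]  # last observation of the episode that just ended
+        # `obs` is already the first observation of the new episode
+env.close()
+```
+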
 Make sure you know what you're doing if you use this wrapper!
+## StepCompatibilityWrapper
+Due to the breaking change of the step method returning two booleans instead of one, this wrapper is introduced to ease the transition. It is applied by default in `make` to transform any environment to the new API.
+
+```python
+>>> import gym
+>>> env = gym.make("CartPole-v1")
+>>> env
+>>>>
+>>> env.reset()
+array([-0.03018865, -0.02190439, -0.02665936, 0.02980426], dtype=float32)
+>>> env.step(env.action_space.sample())
+(array([-0.03062674, 0.17358953, -0.02606327, -0.27116933], dtype=float32), 1.0, False, False, {})
+```
+
+`return_two_dones=False` can be set in `make` to transform the new step API back to the old one.
+
+```python
+>>> env = gym.make("CartPole-v1", return_two_dones=False)
+>>> env.reset()
+array([0.02902522, 0.01894217, 0.00593221, 0.03430589], dtype=float32)
+>>> env.step(env.action_space.sample())
+(array([ 0.03368363, 0.40900537, 0.00148834, -0.54708755], dtype=float32), 1.0, False, {})
+```
+
+Registered environments that still use the old API are automatically transformed to the new API, since this is the default setting (e.g. Atari envs).
+
+```python
+>>> env = gym.make("ALE/Breakout-v5")
+>>> obs = env.reset()
+>>> step_returns = env.step(env.action_space.sample())
+>>> len(step_returns)
+5
+```
+
+To retain the old API, set `return_two_dones=False`:
+
+```python
+>>> env = gym.make("ALE/Breakout-v5", return_two_dones=False)
+>>> obs = env.reset()
+>>> step_returns = env.step(env.action_space.sample())
+>>> len(step_returns)
+4
+```
+
+### StepCompatibilityVectorWrapper
+Vector envs do not directly support the old step API. Instead, the `StepCompatibilityVector` wrapper can be used with `return_two_dones=False`. This can be done through `make` for registered environments, or by applying the wrapper explicitly.
+
+
+```python
+>>> from gym.vector import StepCompatibilityVector, SyncVectorEnv
+
+>>> envs = gym.vector.make("CartPole-v1", num_envs=3)
+>>> obs = envs.reset()
+>>> step_returns = envs.step(envs.action_space.sample())
+>>> len(step_returns)
+5
+>>> envs = gym.vector.make("CartPole-v1", num_envs=3, return_two_dones=False)
+>>> obs = envs.reset()
+>>> step_returns = envs.step(envs.action_space.sample())
+>>> len(step_returns)
+4
+>>> envs = StepCompatibilityVector(SyncVectorEnv([NewAPIEnv, NewAPIEnv]), return_two_dones=False)
+>>> obs = envs.reset()
+>>> step_returns = envs.step(envs.action_space.sample())
+>>> len(step_returns)
+4
+```
+Here, `NewAPIEnv` stands for any environment class that implements the new API; it is not defined in this snippet.
 
 ## General Wrappers
 
@@ -210,9 +278,9 @@ class ReacherRewardWrapper(gym.Wrapper):
         self.reward_ctrl_weight = reward_ctrl_weight
 
     def step(self, action):
-        obs, _, done, info = self.env.step(action)
+        obs, _, terminated, truncated, info = self.env.step(action)
         reward = self.reward_dist_weight*info["reward_dist"] + self.reward_ctrl_weight*info["reward_ctrl"]
-        return obs, reward, done, info
+        return obs, reward, terminated, truncated, info
 ```
 
 ```{note}
@@ -237,6 +305,8 @@ It is *not* sufficient to use a `RewardWrapper` in this case!
 | `RecordVideo` | `gym.Wrapper` | `env`, `video_folder: str`, `episode_trigger: Callable[[int], bool] = None`, `step_trigger: Callable[[int], bool] = None`, `video_length: int = 0`, `name_prefix: str = "rl-video"` | This wrapper will record videos of rollouts. The results will be saved in the folder specified via `video_folder`. You can specify a prefix for the filenames via `name_prefix`. 
Usually, you only want to record the environment intermittently, say every hundreth episode. To allow this, you can pass `episode_trigger` or `step_trigger`. At most one of these should be passed. These functions will accept an episode index or step index, respectively. They should return a boolean that indicates whether a recording should be started at this point. If neither `episode_trigger`, nor `step_trigger` is passed, a default `episode_trigger` will be used. By default, the recording will be stopped once a done signal has been emitted by the environment. However, you can also create recordings of fixed length (possibly spanning several episodes) by passing a strictly positive value for `video_length`. | | `RescaleAction` | `gym.ActionWrapper` | `env`, `min_action`, `max_action` | Rescales the continuous action space of the environment to a range \[`min_action`, `max_action`], where `min_action` and `max_action` are numpy arrays or floats. | | `ResizeObservation` | `gym.ObservationWrapper` | `env`, `shape` | This wrapper works on environments with image observations (or more generally observations of shape AxBxC) and resizes the observation to the shape given by the tuple `shape`. The argument `shape` may also be an integer. In that case, the observation is scaled to a square of sidelength `shape` | +| `StepCompatibility` | `gym.Wrapper` | `env`, `return_two_dones: bool` | Transforms environments from old step API to new and vice-versa. Old env.step returns one boolean `done`. New API returns two booleans `terminated` and `truncated`. | +| `StepCompatibilityVector` | `gym.vector.VectorEnvWrapper` | `env: gym.vector.VectorEnv`, `return_two_dones: bool` | Transforms vector environments from new step API to old. Old env.step returns one boolean vector `dones`. New API returns two booleans `terminateds` and `truncateds`. | `TimeAwareObservation` | `gym.ObservationWrapper` | `env` | Augment the observation with current time step in the trajectory (by appending it to the observation). This can be useful to ensure that things stay Markov. Currently it only works with one-dimensional observation spaces. | | `TimeLimit` | `gym.Wrapper` | `env`, `max_episode_steps=None` | Probably the most useful wrapper in Gym. This wrapper will emit a done signal if the speciefied number of steps is exceeded in an episode. In order to be able to distinguish termination and truncation, you need to check `info`. If it does not contain the key `"TimeLimit.truncated"`, the environment did not reach the timelimit. Otherwise, `info["TimeLimit.truncated"]` will be true if the episode was terminated because of the time limit. | | `TransformObservation` | `gym.ObservationWrapper` | `env`, `f` | This wrapper will apply `f` to observations | diff --git a/docs/source/index.md b/docs/source/index.md index 1fbfdfc5..b0677b10 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -21,9 +21,9 @@ hide-toc: true for _ in range(1000): env.render() action = policy(observation) # User-defined policy function - observation, reward, done, info = env.step(action) + observation, reward, terminated, truncated, info = env.step(action) - if done: + if terminated or truncated: observation, info = env.reset(return_info=True) env.close() ```