diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index e48b065e4ea..4c376ad0c26 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -11,7 +11,6 @@ jobs:
- uses: actions/checkout@v2
- run: |
docker build -f py.Dockerfile \
- --build-arg MUJOCO_KEY=$MUJOCO_KEY \
--build-arg PYTHON_VERSION=${{ matrix.python-version }} \
--tag gym-docker .
- name: Run tests
diff --git a/.github/workflows/lint_python.yml b/.github/workflows/lint_python.yml
index 65d69454c66..1855cbeeecb 100644
--- a/.github/workflows/lint_python.yml
+++ b/.github/workflows/lint_python.yml
@@ -8,13 +8,15 @@ jobs:
strategy:
matrix:
python-platform: ["Linux"]
- python-version: ["3.7"]
+ python-version: ["3.7", "3.8", "3.9", "3.10"]
fail-fast: false
env:
- PYRIGHT_VERSION: 1.1.204
+ PYRIGHT_VERSION: 1.1.235
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
+ with:
+ python-version: ${{ matrix.python-version }}
- run: pip install -e .[nomujoco]
- uses: jakebailey/pyright-action@v1
with:
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 994fd61721c..fbbf53e2649 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -9,7 +9,7 @@ repos:
hooks:
- id: codespell
args:
- - --ignore-words-list=nd,reacher,thist,ths
+ - --ignore-words-list=nd,reacher,thist,ths,ure,referenc
- repo: https://gitlab.com/pycqa/flake8
rev: 3.9.2
hooks:
diff --git a/README.md b/README.md
index 06555fe0288..d6649dfd2dd 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,9 @@
Gym is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API. Since its release, Gym's API has become the field standard for doing this.
-Gym documentation website is at [https://www.gymlibrary.ml/](https://www.gymlibrary.ml/), and you can propose fixes and changes [here](https://github.com/Farama-Foundation/gym-docs).
+Gym documentation website is at [https://www.gymlibrary.ml/](https://www.gymlibrary.ml/), and you can propose fixes and changes to it [here](https://github.com/Farama-Foundation/gym-docs).
+
+Gym also has a Discord server for development purposes, which you can join here: https://discord.gg/nHg2JRN489
## Installation
@@ -44,6 +46,11 @@ env.close()
Gym keeps strict versioning for reproducibility reasons. All environments end in a suffix like "\_v0". When changes are made to environments that might impact learning results, the number is increased by one to prevent potential confusion.
+## MuJoCo Environments
+
+The latest "\_v4" and future versions of the MuJoCo environments will no longer depend on `mujoco-py`. Instead, `mujoco` will be the required dependency for future gym MuJoCo environment versions. Old gym MuJoCo environment versions that depend on `mujoco-py` will be kept but no longer maintained.
+To install the dependencies for the latest gym MuJoCo environments, use `pip install gym[mujoco]`. Dependencies for the old `mujoco-py`-based environments can still be installed with `pip install gym[mujoco_py]`.
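+
+A minimal usage sketch of one of the new `mujoco`-based environments (assuming `gym[mujoco]` is installed, and using the `Ant-v4` registration added in this PR):
+
+```python
+import gym
+
+env = gym.make("Ant-v4")
+obs = env.reset()
+obs, reward, done, info = env.step(env.action_space.sample())
+env.close()
+```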
+
## Citation
A whitepaper from when Gym just came out is available https://arxiv.org/pdf/1606.01540, and can be cited with the following bibtex entry:
diff --git a/gym/core.py b/gym/core.py
index 266ffb9dc7b..eabaea69508 100644
--- a/gym/core.py
+++ b/gym/core.py
@@ -78,7 +78,7 @@ def step(self, action: ActType) -> Tuple[ObsType, float, bool, dict]:
reward (float) : amount of reward returned after previous action
done (bool): whether the episode has ended, in which case further :meth:`step` calls will return undefined results. A done signal may be emitted for different reasons: Maybe the task underlying the environment was solved successfully, a certain timelimit was exceeded, or the physics simulation has entered an invalid state. ``info`` may contain additional information regarding the reason for a ``done`` signal.
info (dict): contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain:
-
+
- metrics that describe the agent's performance or
- state variables that are hidden from observations or
- information that distinguishes truncation and termination or
@@ -106,7 +106,8 @@ def reset(
integer seed right after initialization and then never again.
Args:
- seed (int or None): The seed that is used to initialize the environment's PRNG. If the environment does not already have a PRNG and ``seed=None`` (the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG and ``seed=None`` is pased, the PRNG will *not* be reset. If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer *right after the environment has been initialized and then never again*. Please refer to the minimal example above to see this paradigm in action.
+ seed (int or None): The seed that is used to initialize the environment's PRNG. If the environment does not already have a PRNG and ``seed=None`` (the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG and ``seed=None`` is passed, the PRNG will *not* be reset.
+ If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer *right after the environment has been initialized and then never again*. Please refer to the minimal example above to see this paradigm in action.
return_info (bool): If true, return additional information along with initial observation. This info should be analogous to the info returned in :meth:`step`
options (dict or None): Additional information to specify how the environment is reset (optional, depending on the specific environment)
@@ -135,7 +136,7 @@ def render(self, mode="human"):
- ansi: Return a string (str) or StringIO.StringIO containing a
terminal-style text representation. The text can include newlines
and ANSI escape sequences (e.g. for colors).
-
+
Note:
Make sure that your class's metadata 'render_modes' key includes
the list of supported modes. It's recommended to call super()
diff --git a/gym/envs/__init__.py b/gym/envs/__init__.py
index 78fe08c6844..7b6936c1f36 100644
--- a/gym/envs/__init__.py
+++ b/gym/envs/__init__.py
@@ -89,6 +89,14 @@
reward_threshold=900,
)
+register(
+ id="CarRacingDomainRandomize-v1",
+ entry_point="gym.envs.box2d:CarRacing",
+ kwargs={"domain_randomize": True},
+ max_episode_steps=1000,
+ reward_threshold=900,
+)
+
# Toy Text
# ----------------------------------------
@@ -138,6 +146,13 @@
reward_threshold=-3.75,
)
+register(
+ id="Reacher-v4",
+ entry_point="gym.envs.mujoco.reacher_v4:ReacherEnv",
+ max_episode_steps=50,
+ reward_threshold=-3.75,
+)
+
register(
id="Pusher-v2",
entry_point="gym.envs.mujoco:PusherEnv",
@@ -145,6 +160,13 @@
reward_threshold=0.0,
)
+register(
+ id="Pusher-v4",
+ entry_point="gym.envs.mujoco.pusher_v4:PusherEnv",
+ max_episode_steps=100,
+ reward_threshold=0.0,
+)
+
register(
id="InvertedPendulum-v2",
entry_point="gym.envs.mujoco:InvertedPendulumEnv",
@@ -152,6 +174,13 @@
reward_threshold=950.0,
)
+register(
+ id="InvertedPendulum-v4",
+ entry_point="gym.envs.mujoco.inverted_pendulum_v4:InvertedPendulumEnv",
+ max_episode_steps=1000,
+ reward_threshold=950.0,
+)
+
register(
id="InvertedDoublePendulum-v2",
entry_point="gym.envs.mujoco:InvertedDoublePendulumEnv",
@@ -159,6 +188,13 @@
reward_threshold=9100.0,
)
+register(
+ id="InvertedDoublePendulum-v4",
+ entry_point="gym.envs.mujoco.inverted_double_pendulum_v4:InvertedDoublePendulumEnv",
+ max_episode_steps=1000,
+ reward_threshold=9100.0,
+)
+
register(
id="HalfCheetah-v2",
entry_point="gym.envs.mujoco:HalfCheetahEnv",
@@ -173,6 +209,13 @@
reward_threshold=4800.0,
)
+register(
+ id="HalfCheetah-v4",
+ entry_point="gym.envs.mujoco.half_cheetah_v4:HalfCheetahEnv",
+ max_episode_steps=1000,
+ reward_threshold=4800.0,
+)
+
register(
id="Hopper-v2",
entry_point="gym.envs.mujoco:HopperEnv",
@@ -187,6 +230,13 @@
reward_threshold=3800.0,
)
+register(
+ id="Hopper-v4",
+ entry_point="gym.envs.mujoco.hopper_v4:HopperEnv",
+ max_episode_steps=1000,
+ reward_threshold=3800.0,
+)
+
register(
id="Swimmer-v2",
entry_point="gym.envs.mujoco:SwimmerEnv",
@@ -201,6 +251,13 @@
reward_threshold=360.0,
)
+register(
+ id="Swimmer-v4",
+ entry_point="gym.envs.mujoco.swimmer_v4:SwimmerEnv",
+ max_episode_steps=1000,
+ reward_threshold=360.0,
+)
+
register(
id="Walker2d-v2",
max_episode_steps=1000,
@@ -213,6 +270,12 @@
entry_point="gym.envs.mujoco.walker2d_v3:Walker2dEnv",
)
+register(
+ id="Walker2d-v4",
+ max_episode_steps=1000,
+ entry_point="gym.envs.mujoco.walker2d_v4:Walker2dEnv",
+)
+
register(
id="Ant-v2",
entry_point="gym.envs.mujoco:AntEnv",
@@ -227,6 +290,13 @@
reward_threshold=6000.0,
)
+register(
+ id="Ant-v4",
+ entry_point="gym.envs.mujoco.ant_v4:AntEnv",
+ max_episode_steps=1000,
+ reward_threshold=6000.0,
+)
+
register(
id="Humanoid-v2",
entry_point="gym.envs.mujoco:HumanoidEnv",
@@ -239,8 +309,20 @@
max_episode_steps=1000,
)
+register(
+ id="Humanoid-v4",
+ entry_point="gym.envs.mujoco.humanoid_v4:HumanoidEnv",
+ max_episode_steps=1000,
+)
+
register(
id="HumanoidStandup-v2",
entry_point="gym.envs.mujoco:HumanoidStandupEnv",
max_episode_steps=1000,
)
+
+register(
+ id="HumanoidStandup-v4",
+ entry_point="gym.envs.mujoco.humanoidstandup_v4:HumanoidStandupEnv",
+ max_episode_steps=1000,
+)
diff --git a/gym/envs/box2d/bipedal_walker.py b/gym/envs/box2d/bipedal_walker.py
index 8fc2dc8f643..ee072cf5311 100644
--- a/gym/envs/box2d/bipedal_walker.py
+++ b/gym/envs/box2d/bipedal_walker.py
@@ -181,12 +181,71 @@ def __init__(self, hardcore: bool = False):
categoryBits=0x0001,
)
- high = np.array([np.inf] * 24).astype(np.float32)
+ # we use 5.0 to represent the joints moving at maximum
+ # 5 x the rated speed due to impulses from ground contact etc.
+ low = np.array(
+ [
+ -math.pi,
+ -5.0,
+ -5.0,
+ -5.0,
+ -math.pi,
+ -5.0,
+ -math.pi,
+ -5.0,
+ -0.0,
+ -math.pi,
+ -5.0,
+ -math.pi,
+ -5.0,
+ -0.0,
+ ]
+ + [-1.0] * 10
+ ).astype(np.float32)
+ high = np.array(
+ [
+ math.pi,
+ 5.0,
+ 5.0,
+ 5.0,
+ math.pi,
+ 5.0,
+ math.pi,
+ 5.0,
+ 5.0,
+ math.pi,
+ 5.0,
+ math.pi,
+ 5.0,
+ 5.0,
+ ]
+ + [1.0] * 10
+ ).astype(np.float32)
self.action_space = spaces.Box(
np.array([-1, -1, -1, -1]).astype(np.float32),
np.array([1, 1, 1, 1]).astype(np.float32),
)
- self.observation_space = spaces.Box(-high, high)
+ self.observation_space = spaces.Box(low, high)
+
+ # state = [
+ # self.hull.angle, # Normal angles up to 0.5 here, but sure more is possible.
+ # 2.0 * self.hull.angularVelocity / FPS,
+ # 0.3 * vel.x * (VIEWPORT_W / SCALE) / FPS, # Normalized to get -1..1 range
+ # 0.3 * vel.y * (VIEWPORT_H / SCALE) / FPS,
+ # self.joints[
+ # 0
+ # ].angle, # This will give 1.1 on high up, but it's still OK (and there should be spikes on hitting the ground, that's normal too)
+ # self.joints[0].speed / SPEED_HIP,
+ # self.joints[1].angle + 1.0,
+ # self.joints[1].speed / SPEED_KNEE,
+ # 1.0 if self.legs[1].ground_contact else 0.0,
+ # self.joints[2].angle,
+ # self.joints[2].speed / SPEED_HIP,
+ # self.joints[3].angle + 1.0,
+ # self.joints[3].speed / SPEED_KNEE,
+ # 1.0 if self.legs[3].ground_contact else 0.0,
+ # ]
+ # state += [l.fraction for l in self.lidar]
def _destroy(self):
if not self.terrain:
@@ -444,7 +503,7 @@ def ReportFixture(self, fixture, point, normal, fraction):
else:
return self.step(np.array([0, 0, 0, 0]))[0], {}
- def step(self, action):
+ def step(self, action: np.ndarray):
# self.hull.ApplyForceToCenter((0, 20), True) -- Uncomment this to receive a bit of stability help
control_speed = False # Should be easier as well
if control_speed:
@@ -489,9 +548,8 @@ def step(self, action):
2.0 * self.hull.angularVelocity / FPS,
0.3 * vel.x * (VIEWPORT_W / SCALE) / FPS, # Normalized to get -1..1 range
0.3 * vel.y * (VIEWPORT_H / SCALE) / FPS,
- self.joints[
- 0
- ].angle, # This will give 1.1 on high up, but it's still OK (and there should be spikes on hiting the ground, that's normal too)
+ self.joints[0].angle,
+ # This will give 1.1 on high up, but it's still OK (and there should be spikes on hitting the ground, that's normal too)
self.joints[0].speed / SPEED_HIP,
self.joints[1].angle + 1.0,
self.joints[1].speed / SPEED_KNEE,
@@ -531,7 +589,7 @@ def step(self, action):
done = True
return np.array(state, dtype=np.float32), reward, done, {}
- def render(self, mode="human"):
+ def render(self, mode: str = "human"):
import pygame
from pygame import gfxdraw
diff --git a/gym/envs/box2d/car_racing.py b/gym/envs/box2d/car_racing.py
index b95c2cde04b..de2b1811625 100644
--- a/gym/envs/box2d/car_racing.py
+++ b/gym/envs/box2d/car_racing.py
@@ -34,8 +34,6 @@
BORDER = 8 / SCALE
BORDER_MIN_COUNT = 4
-ROAD_COLOR = [0.4, 0.4, 0.4]
-
class FrictionDetector(contactListener):
def __init__(self, env, lap_complete_percent):
@@ -63,9 +61,8 @@ def _contact(self, contact, begin):
if not tile:
return
- tile.color[0] = ROAD_COLOR[0]
- tile.color[1] = ROAD_COLOR[1]
- tile.color[2] = ROAD_COLOR[2]
+ # inherit tile color from env
+ tile.color = self.env.road_color / 255
if not obj or "tiles" not in obj.__dict__:
return
if begin:
@@ -128,10 +125,15 @@ class CarRacing(gym.Env, EzPickle):
receive -100 reward and die.
### Arguments
- There are no arguments supported in constructing the environment.
+ `lap_complete_percent` dictates the percentage of tiles that must be visited by
+ the agent before a lap is considered complete.
+
+ Passing `domain_randomize=True` enables the domain-randomized variant of the environment.
+ In this scenario, the background and track colours are different on every reset.
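+
+ A minimal sketch of constructing the randomized variant, using the `CarRacingDomainRandomize-v1` registration added in this PR (the flag can also be passed to the base environment through `gym.make`):
+
+ ```python
+ import gym
+
+ env = gym.make("CarRacingDomainRandomize-v1")
+ obs = env.reset()  # background and track colours are re-sampled on every reset
+ ```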
### Version History
- - v0: Current version
+ - v1: Change track completion logic and add domain randomization (0.24.0)
+ - v0: Original version
### References
- Chris Campbell (2014), http://www.iforce2d.net/b2dtut/top-down-car.
@@ -145,8 +147,16 @@ class CarRacing(gym.Env, EzPickle):
"render_fps": FPS,
}
- def __init__(self, verbose=1, lap_complete_percent=0.95):
+ def __init__(
+ self,
+ verbose: bool = True,
+ lap_complete_percent: float = 0.95,
+ domain_randomize: bool = False,
+ ):
EzPickle.__init__(self)
+ self.domain_randomize = domain_randomize
+ self._init_colors()
+
self.contactListener_keepref = FrictionDetector(self, lap_complete_percent)
self.world = Box2D.b2World((0, 0), contactListener=self.contactListener_keepref)
self.screen = None
@@ -183,6 +193,22 @@ def _destroy(self):
self.road = []
self.car.destroy()
+ def _init_colors(self):
+ if self.domain_randomize:
+ # domain randomize the bg and grass colour
+ self.road_color = self.np_random.uniform(0, 210, size=3)
+
+ self.bg_color = self.np_random.uniform(0, 210, size=3)
+
+ self.grass_color = np.copy(self.bg_color)
+ idx = self.np_random.integers(3)
+ self.grass_color[idx] += 20
+ else:
+ # default colours
+ self.road_color = np.array([102, 102, 102])
+ self.bg_color = np.array([102, 204, 102])
+ self.grass_color = np.array([102, 230, 102])
+
def _create_track(self):
CHECKPOINTS = 12
@@ -280,7 +306,7 @@ def _create_track(self):
elif pass_through_start and i1 == -1:
i1 = i
break
- if self.verbose == 1:
+ if self.verbose:
print("Track generation: %i..%i -> %i-tiles track" % (i1, i2, i2 - i1))
assert i1 != -1
assert i2 != -1
@@ -338,8 +364,8 @@ def _create_track(self):
self.fd_tile.shape.vertices = vertices
t = self.world.CreateStaticBody(fixtures=self.fd_tile)
t.userData = t
- c = 0.01 * (i % 3)
- t.color = [ROAD_COLOR[0] + c, ROAD_COLOR[1] + c, ROAD_COLOR[2] + c]
+ c = 0.01 * (i % 3) * 255
+ t.color = self.road_color + c
t.road_visited = False
t.road_friction = 1.0
t.idx = i
@@ -365,7 +391,10 @@ def _create_track(self):
y2 + side * (TRACK_WIDTH + BORDER) * math.sin(beta2),
)
self.road_poly.append(
- ([b1_l, b1_r, b2_r, b2_l], (1, 1, 1) if i % 2 == 0 else (1, 0, 0))
+ (
+ [b1_l, b1_r, b2_r, b2_l],
+ (255, 255, 255) if i % 2 == 0 else (255, 0, 0),
+ )
)
self.track = track
return True
@@ -385,12 +414,13 @@ def reset(
self.t = 0.0
self.new_lap = False
self.road_poly = []
+ self._init_colors()
while True:
success = self._create_track()
if success:
break
- if self.verbose == 1:
+ if self.verbose:
print(
"retry to generate track (normal if there are not many"
"instances of this message)"
@@ -402,7 +432,7 @@ def reset(
else:
return self.step(None)[0], {}
- def step(self, action):
+ def step(self, action: np.ndarray):
if action is not None:
self.car.steer(-action[0])
self.car.gas(action[1])
@@ -432,7 +462,7 @@ def step(self, action):
return self.state, step_reward, done, {}
- def render(self, mode="human"):
+ def render(self, mode: str = "human"):
import pygame
pygame.font.init()
@@ -459,13 +489,13 @@ def render(self, mode="human"):
trans = pygame.math.Vector2((scroll_x, scroll_y)).rotate_rad(angle)
trans = (WINDOW_W / 2 + trans[0], WINDOW_H / 4 + trans[1])
- self.render_road(zoom, trans, angle)
+ self._render_road(zoom, trans, angle)
self.car.draw(self.surf, zoom, trans, angle, mode != "state_pixels")
self.surf = pygame.transform.flip(self.surf, False, True)
# showing stats
- self.render_indicators(WINDOW_W, WINDOW_H)
+ self._render_indicators(WINDOW_W, WINDOW_H)
font = pygame.font.Font(pygame.font.get_default_font(), 42)
text = font.render("%04i" % self.reward, True, (255, 255, 255), (0, 0, 0))
@@ -487,7 +517,7 @@ def render(self, mode="human"):
else:
return self.isopen
- def render_road(self, zoom, translation, angle):
+ def _render_road(self, zoom, translation, angle):
bounds = PLAYFIELD
field = [
(2 * bounds, 2 * bounds),
@@ -495,11 +525,13 @@ def render_road(self, zoom, translation, angle):
(0, 0),
(0, 2 * bounds),
]
- trans_field = []
- self.draw_colored_polygon(
- self.surf, field, (102, 204, 102), zoom, translation, angle
+
+ # draw background
+ self._draw_colored_polygon(
+ self.surf, field, self.bg_color, zoom, translation, angle
)
+ # draw grass patches
k = bounds / (20.0)
grass = []
for x in range(0, 40, 2):
@@ -513,17 +545,18 @@ def render_road(self, zoom, translation, angle):
]
)
for poly in grass:
- self.draw_colored_polygon(
- self.surf, poly, (102, 230, 102), zoom, translation, angle
+ self._draw_colored_polygon(
+ self.surf, poly, self.grass_color, zoom, translation, angle
)
+ # draw road
for poly, color in self.road_poly:
# converting to pixel coordinates
poly = [(p[0] + PLAYFIELD, p[1] + PLAYFIELD) for p in poly]
- color = [int(c * 255) for c in color]
- self.draw_colored_polygon(self.surf, poly, color, zoom, translation, angle)
+ color = [int(c) for c in color]
+ self._draw_colored_polygon(self.surf, poly, color, zoom, translation, angle)
- def render_indicators(self, W, H):
+ def _render_indicators(self, W, H):
import pygame
s = W / 40.0
@@ -592,7 +625,7 @@ def render_if_min(value, points, color):
(255, 0, 0),
)
- def draw_colored_polygon(self, surface, poly, color, zoom, translation, angle):
+ def _draw_colored_polygon(self, surface, poly, color, zoom, translation, angle):
import pygame
from pygame import gfxdraw
diff --git a/gym/envs/box2d/lunar_lander.py b/gym/envs/box2d/lunar_lander.py
index 75d9cdfe538..c76646247ff 100644
--- a/gym/envs/box2d/lunar_lander.py
+++ b/gym/envs/box2d/lunar_lander.py
@@ -123,8 +123,33 @@ class LunarLander(gym.Env, EzPickle):
`continuous=True` argument like below:
```python
import gym
- env = gym.make("LunarLander-v2", continuous=True)
+ env = gym.make(
+     "LunarLander-v2",
+     continuous=True,
+     gravity=-10.0,
+     enable_wind=False,
+     wind_power=15.0,
+ )
```
+ If `continuous=True` is passed, continuous actions (corresponding to the throttle of the engines) will be used and the
+ action space will be `Box(-1, +1, (2,), dtype=np.float32)`.
+ The first coordinate of an action determines the throttle of the main engine, while the second
+ coordinate specifies the throttle of the lateral boosters.
+ Given an action `np.array([main, lateral])`, the main engine will be turned off completely if
+ `main < 0` and the throttle scales affinely from 50% to 100% for `0 <= main <= 1` (in particular, the
+ main engine doesn't work with less than 50% power).
+ Similarly, if `-0.5 < lateral < 0.5`, the lateral boosters will not fire at all. If `lateral < -0.5`, the left
+ booster will fire, and if `lateral > 0.5`, the right booster will fire. Again, the throttle scales affinely
+ from 50% to 100% between -1 and -0.5 (and 0.5 and 1, respectively).
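+
+ As a sketch (paraphrasing the mapping described above, not the environment's internal code), a continuous action translates into engine throttles roughly as follows:
+
+ ```python
+ import numpy as np
+
+ def throttles(action):
+     main, lateral = np.clip(action, -1.0, 1.0)
+     # main engine fires only for main > 0, scaling affinely from 50% to 100% power
+     main_power = 0.5 + 0.5 * main if main > 0.0 else 0.0
+     # a lateral booster fires only for |lateral| > 0.5, also scaling from 50% to 100%
+     side_power = np.clip(abs(lateral), 0.5, 1.0) if abs(lateral) > 0.5 else 0.0
+     side = "left" if lateral < -0.5 else ("right" if lateral > 0.5 else None)
+     return main_power, side, side_power
+ ```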
+
+ `gravity` dictates the gravitational constant; it must lie strictly between -12 and 0.
+
+ If `enable_wind=True` is passed, there will be wind effects applied to the lander.
+ The wind is generated using the function `tanh(sin(2 k (t+C)) + sin(pi k (t+C)))`.
+ `k` is set to 0.01.
+ `C` is sampled randomly between -9999 and 9999.
+
+ `wind_power` dictates the maximum magnitude of wind.
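+
+ For example (illustrative parameter values), a windy, low-gravity lander could be constructed with:
+
+ ```python
+ env = gym.make("LunarLander-v2", gravity=-5.0, enable_wind=True, wind_power=10.0)
+ ```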
### Version History
- v2: Count energy spent
@@ -141,12 +166,32 @@ class LunarLander(gym.Env, EzPickle):
metadata = {"render_modes": ["human", "rgb_array"], "render_fps": FPS}
- def __init__(self, continuous: bool = False):
+ def __init__(
+ self,
+ continuous: bool = False,
+ gravity: float = -10.0,
+ enable_wind: bool = False,
+ wind_power: float = 15.0,
+ ):
EzPickle.__init__(self)
+
+ assert (
+ -12.0 < gravity and gravity < 0.0
+ ), f"gravity (current value: {gravity}) must be between -12 and 0"
+ self.gravity = gravity
+
+ assert (
+ 0.0 < wind_power and wind_power < 20.0
+ ), f"wind_power (current value: {wind_power}) must be between 0 and 20"
+ self.wind_power = wind_power
+
+ self.enable_wind = enable_wind
+ self.wind_idx = np.random.randint(-9999, 9999)
+
self.screen = None
self.clock = None
self.isopen = True
- self.world = Box2D.b2World()
+ self.world = Box2D.b2World(gravity=(0, gravity))
self.moon = None
self.lander = None
self.particles = []
@@ -155,10 +200,41 @@ def __init__(self, continuous: bool = False):
self.continuous = continuous
+ low = np.array(
+ [
+ # these are bounds for position
+ # realistically the environment should have ended
+ # long before we reach more than 50% outside
+ -1.5,
+ -1.5,
+ # velocity bounds are 5x the rated speed
+ -5.0,
+ -5.0,
+ -math.pi,
+ -5.0,
+ -0.0,
+ -0.0,
+ ]
+ ).astype(np.float32)
+ high = np.array(
+ [
+ # these are bounds for position
+ # realistically the environment should have ended
+ # long before we reach more than 50% outside
+ 1.5,
+ 1.5,
+ # velocity bounds are 5x the rated speed
+ 5.0,
+ 5.0,
+ math.pi,
+ 5.0,
+ 1.0,
+ 1.0,
+ ]
+ ).astype(np.float32)
+
# useful range is -1 .. +1, but spikes can be higher
- self.observation_space = spaces.Box(
- -np.inf, np.inf, shape=(8,), dtype=np.float32
- )
+ self.observation_space = spaces.Box(low, high)
if self.continuous:
# Action is two floats [main engine, left-right engines].
@@ -320,6 +396,25 @@ def _clean_particles(self, all):
self.world.DestroyBody(self.particles.pop(0))
def step(self, action):
+ # Update wind
+ if self.enable_wind and not (
+ self.legs[0].ground_contact or self.legs[1].ground_contact
+ ):
+ # the function used for wind is tanh(sin(2 k x) + sin(pi k x)),
+ # which is proven to never be periodic, k = 0.01
+ wind_mag = (
+ math.tanh(
+ math.sin(0.02 * self.wind_idx)
+ + (math.sin(math.pi * 0.01 * self.wind_idx))
+ )
+ * self.wind_power
+ )
+ self.wind_idx += 1
+ self.lander.ApplyForceToCenter(
+ (wind_mag, 0.0),
+ True,
+ )
+
if self.continuous:
action = np.clip(action, -1, +1).astype(np.float32)
else:
@@ -342,9 +437,8 @@ def step(self, action):
assert m_power >= 0.5 and m_power <= 1.0
else:
m_power = 1.0
- ox = (
- tip[0] * (4 / SCALE + 2 * dispersion[0]) + side[0] * dispersion[1]
- ) # 4 is move a bit downwards, +-2 for randomness
+ # 4 is move a bit downwards, +-2 for randomness
+ ox = tip[0] * (4 / SCALE + 2 * dispersion[0]) + side[0] * dispersion[1]
oy = -tip[1] * (4 / SCALE + 2 * dispersion[0]) - side[1] * dispersion[1]
impulse_pos = (self.lander.position[0] + ox, self.lander.position[1] + oy)
p = self._create_particle(
@@ -608,6 +702,10 @@ def heuristic(env, s):
def demo_heuristic_lander(env, seed=None, render=False):
+
+ # wind power must be reduced for heuristic landing
+ env.wind_power = 0.2
+
total_reward = 0
steps = 0
s = env.reset(seed=seed)
diff --git a/gym/envs/classic_control/continuous_mountain_car.py b/gym/envs/classic_control/continuous_mountain_car.py
index 8918b38ef8c..b7a45e63c2d 100644
--- a/gym/envs/classic_control/continuous_mountain_car.py
+++ b/gym/envs/classic_control/continuous_mountain_car.py
@@ -125,7 +125,7 @@ def __init__(self, goal_velocity=0):
low=self.low_state, high=self.high_state, dtype=np.float32
)
- def step(self, action):
+ def step(self, action: np.ndarray):
position = self.state[0]
velocity = self.state[1]
diff --git a/gym/envs/classic_control/mountain_car.py b/gym/envs/classic_control/mountain_car.py
index 0f49af6a6f0..6c9a555b517 100644
--- a/gym/envs/classic_control/mountain_car.py
+++ b/gym/envs/classic_control/mountain_car.py
@@ -112,7 +112,7 @@ def __init__(self, goal_velocity=0):
self.action_space = spaces.Discrete(3)
self.observation_space = spaces.Box(self.low, self.high, dtype=np.float32)
- def step(self, action):
+ def step(self, action: int):
assert self.action_space.contains(
action
), f"{action!r} ({type(action)}) invalid"
diff --git a/gym/envs/mujoco/__init__.py b/gym/envs/mujoco/__init__.py
index c81f0dec3fe..f4e5ee2eb14 100644
--- a/gym/envs/mujoco/__init__.py
+++ b/gym/envs/mujoco/__init__.py
@@ -8,6 +8,7 @@
from gym.envs.mujoco.inverted_double_pendulum import InvertedDoublePendulumEnv
from gym.envs.mujoco.inverted_pendulum import InvertedPendulumEnv
from gym.envs.mujoco.mujoco_env import MujocoEnv
+from gym.envs.mujoco.mujoco_rendering import RenderContextOffscreen, Viewer
from gym.envs.mujoco.pusher import PusherEnv
from gym.envs.mujoco.reacher import ReacherEnv
from gym.envs.mujoco.swimmer import SwimmerEnv
diff --git a/gym/envs/mujoco/ant.py b/gym/envs/mujoco/ant.py
index e61b787db98..70728dee5ea 100644
--- a/gym/envs/mujoco/ant.py
+++ b/gym/envs/mujoco/ant.py
@@ -6,7 +6,7 @@
class AntEnv(mujoco_env.MujocoEnv, utils.EzPickle):
def __init__(self):
- mujoco_env.MujocoEnv.__init__(self, "ant.xml", 5)
+ mujoco_env.MujocoEnv.__init__(self, "ant.xml", 5, mujoco_bindings="mujoco_py")
utils.EzPickle.__init__(self)
def step(self, a):
diff --git a/gym/envs/mujoco/ant_v3.py b/gym/envs/mujoco/ant_v3.py
index aeffa507523..4260fd7d2f0 100644
--- a/gym/envs/mujoco/ant_v3.py
+++ b/gym/envs/mujoco/ant_v3.py
@@ -9,167 +9,6 @@
class AntEnv(mujoco_env.MujocoEnv, utils.EzPickle):
- """
- ### Description
-
- This environment is based on the environment introduced by Schulman,
- Moritz, Levine, Jordan and Abbeel in ["High-Dimensional Continuous Control
- Using Generalized Advantage Estimation"](https://arxiv.org/abs/1506.02438).
- The ant is a 3D robot consisting of one torso (free rotational body) with
- four legs attached to it with each leg having two links. The goal is to
- coordinate the four legs to move in the forward (right) direction by applying
- torques on the eight hinges connecting the two links of each leg and the torso
- (nine parts and eight hinges).
-
- ### Action Space
- The action space is a `Box(-1, 1, (8,), float32)`. An action represents the torques applied at the hinge joints.
-
- | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|----------------------|---------------|----------------|---------------------------------------|-------|------|
- | 0 | Torque applied on the rotor between the torso and front left hip | -1 | 1 | hip_1 (front_left_leg) | hinge | torque (N m) |
- | 1 | Torque applied on the rotor between the front left two links | -1 | 1 | angle_1 (front_left_leg) | hinge | torque (N m) |
- | 2 | Torque applied on the rotor between the torso and front right hip | -1 | 1 | hip_2 (front_right_leg) | hinge | torque (N m) |
- | 3 | Torque applied on the rotor between the front right two links | -1 | 1 | angle_2 (front_right_leg) | hinge | torque (N m) |
- | 4 | Torque applied on the rotor between the torso and back left hip | -1 | 1 | hip_3 (back_leg) | hinge | torque (N m) |
- | 5 | Torque applied on the rotor between the back left two links | -1 | 1 | angle_3 (back_leg) | hinge | torque (N m) |
- | 6 | Torque applied on the rotor between the torso and back right hip | -1 | 1 | hip_4 (right_back_leg) | hinge | torque (N m) |
- | 7 | Torque applied on the rotor between the back right two links | -1 | 1 | angle_4 (right_back_leg) | hinge | torque (N m) |
-
- ### Observation Space
-
- Observations consist of positional values of different body parts of the ant,
- followed by the velocities of those individual parts (their derivatives) with all
- the positions ordered before all the velocities.
-
- By default, observations do not include the x- and y-coordinates of the ant's torso. These may
- be included by passing `exclude_current_positions_from_observation=False` during construction.
- In that case, the observation space will have 113 dimensions where the first two dimensions
- represent the x- and y- coordinates of the ant's torso.
- Regardless of whether `exclude_current_positions_from_observation` was set to true or false, the x- and y-coordinates
- of the torso will be returned in `info` with keys `"x_position"` and `"y_position"`, respectively.
-
- However, by default, an observation is a `ndarray` with shape `(111,)`
- where the elements correspond to the following:
-
- | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|-------------------------------------------------------------|----------------|-----------------|----------------------------------------|-------|------|
- | 0 | z-coordinate of the torso (centre) | -Inf | Inf | torso | free | position (m) |
- | 1 | x-orientation of the torso (centre) | -Inf | Inf | torso | free | angle (rad) |
- | 2 | y-orientation of the torso (centre) | -Inf | Inf | torso | free | angle (rad) |
- | 3 | z-orientation of the torso (centre) | -Inf | Inf | torso | free | angle (rad) |
- | 4 | w-orientation of the torso (centre) | -Inf | Inf | torso | free | angle (rad) |
- | 5 | angle between torso and first link on front left | -Inf | Inf | hip_1 (front_left_leg) | hinge | angle (rad) |
- | 6 | angle between the two links on the front left | -Inf | Inf | ankle_1 (front_left_leg) | hinge | angle (rad) |
- | 7 | angle between torso and first link on front right | -Inf | Inf | hip_2 (front_right_leg) | hinge | angle (rad) |
- | 8 | angle between the two links on the front right | -Inf | Inf | ankle_2 (front_right_leg) | hinge | angle (rad) |
- | 9 | angle between torso and first link on back left | -Inf | Inf | hip_3 (back_leg) | hinge | angle (rad) |
- | 10 | angle between the two links on the back left | -Inf | Inf | ankle_3 (back_leg) | hinge | angle (rad) |
- | 11 | angle between torso and first link on back right | -Inf | Inf | hip_4 (right_back_leg) | hinge | angle (rad) |
- | 12 | angle between the two links on the back right | -Inf | Inf | ankle_4 (right_back_leg) | hinge | angle (rad) |
- | 13 | x-coordinate velocity of the torso | -Inf | Inf | torso | free | velocity (m/s) |
- | 14 | y-coordinate velocity of the torso | -Inf | Inf | torso | free | velocity (m/s) |
- | 15 | z-coordinate velocity of the torso | -Inf | Inf | torso | free | velocity (m/s) |
- | 16 | x-coordinate angular velocity of the torso | -Inf | Inf | torso | free | angular velocity (rad/s) |
- | 17 | y-coordinate angular velocity of the torso | -Inf | Inf | torso | free | angular velocity (rad/s) |
- | 18 | z-coordinate angular velocity of the torso | -Inf | Inf | torso | free | angular velocity (rad/s) |
- | 19 | angular velocity of angle between torso and front left link | -Inf | Inf | hip_1 (front_left_leg) | hinge | angle (rad) |
- | 20 | angular velocity of the angle between front left links | -Inf | Inf | ankle_1 (front_left_leg) | hinge | angle (rad) |
- | 21 | angular velocity of angle between torso and front right link| -Inf | Inf | hip_2 (front_right_leg) | hinge | angle (rad) |
- | 22 | angular velocity of the angle between front right links | -Inf | Inf | ankle_2 (front_right_leg) | hinge | angle (rad) |
- | 23 | angular velocity of angle between torso and back left link | -Inf | Inf | hip_3 (back_leg) | hinge | angle (rad) |
- | 24 | angular velocity of the angle between back left links | -Inf | Inf | ankle_3 (back_leg) | hinge | angle (rad) |
- | 25 | angular velocity of angle between torso and back right link | -Inf | Inf | hip_4 (right_back_leg) | hinge | angle (rad) |
- | 26 |angular velocity of the angle between back right links | -Inf | Inf | ankle_4 (right_back_leg) | hinge | angle (rad) |
-
-
- The remaining 14*6 = 84 elements of the observation are contact forces
- (external forces - force x, y, z and torque x, y, z) applied to the
- center of mass of each of the links. The 14 links are: the ground link,
- the torso link, and 3 links for each leg (1 + 1 + 12) with the 6 external forces.
-
- The (x,y,z) coordinates are translational DOFs while the orientations are rotational
- DOFs expressed as quaternions. One can read more about free joints on the [Mujoco Documentation](https://mujoco.readthedocs.io/en/latest/XMLreference.html).
-
-
- **Note:** There have been reported issues that using a Mujoco-Py version > 2.0 results
- in the contact forces always being 0. As such we recommend to use a Mujoco-Py version < 2.0
- when using the Ant environment if you would like to report results with contact forces (if
- contact forces are not used in your experiments, you can use version > 2.0).
-
- ### Rewards
- The reward consists of three parts:
- - *healthy_reward*: Every timestep that the ant is healthy (see definition in section "Episode Termination"), it gets a reward of fixed value `healthy_reward`
- - *forward_reward*: A reward of moving forward which is measured as
- *(x-coordinate before action - x-coordinate after action)/dt*. *dt* is the time
- between actions and is dependent on the `frame_skip` parameter (default is 5),
- where the frametime is 0.01 - making the default *dt = 5 * 0.01 = 0.05*.
- This reward would be positive if the ant moves forward (in positive x direction).
- - *ctrl_cost*: A negative reward for penalising the ant if it takes actions
- that are too large. It is measured as *`ctrl_cost_weight` * sum(action2)*
- where *`ctr_cost_weight`* is a parameter set for the control and has a default value of 0.5.
- - *contact_cost*: A negative reward for penalising the ant if the external contact
- force is too large. It is calculated *`contact_cost_weight` * sum(clip(external contact
- force to `contact_force_range`)2)*.
-
- The total reward returned is ***reward*** *=* *healthy_reward + forward_reward - ctrl_cost - contact_cost* and `info` will also contain the individual reward terms.
-
- ### Starting State
- All observations start in state
- (0.0, 0.0, 0.75, 1.0, 0.0 ... 0.0) with a uniform noise in the range
- of [-`reset_noise_scale`, `reset_noise_scale`] added to the positional values and standard normal noise
- with mean 0 and standard deviation `reset_noise_scale` added to the velocity values for
- stochasticity. Note that the initial z coordinate is intentionally selected
- to be slightly high, thereby indicating a standing up ant. The initial orientation
- is designed to make it face forward as well.
-
- ### Episode Termination
- The ant is said to be unhealthy if any of the following happens:
-
- 1. Any of the state space values is no longer finite
- 2. The z-coordinate of the torso is **not** in the closed interval given by `healthy_z_range` (defaults to [0.2, 1.0])
-
- If `terminate_when_unhealthy=True` is passed during construction (which is the default),
- the episode terminates when any of the following happens:
-
- 1. The episode duration reaches a 1000 timesteps
- 2. The ant is unhealthy
-
- If `terminate_when_unhealthy=False` is passed, the episode is terminated only when 1000 timesteps are exceeded.
-
- ### Arguments
-
- No additional arguments are currently supported in v2 and lower.
-
- ```
- env = gym.make('Ant-v2')
- ```
-
- v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
-
- ```
- env = gym.make('Ant-v3', ctrl_cost_weight=0.1, ...)
- ```
-
- | Parameter | Type | Default |Description |
- |-------------------------|------------|--------------|-------------------------------|
- | `xml_file` | **str** | `"ant.xml"` | Path to a MuJoCo model |
- | `ctrl_cost_weight` | **float** | `0.5` | Weight for *ctrl_cost* term (see section on reward) |
- | `contact_cost_weight` | **float** | `5e-4` | Weight for *contact_cost* term (see section on reward) |
- | `healthy_reward` | **float** | `1` | Constant reward given if the ant is "healthy" after timestep |
- | `terminate_when_unhealthy` | **bool**| `True` | If true, issue a done signal if the z-coordinate of the torso is no longer in the `healthy_z_range` |
- | `healthy_z_range` | **tuple** | `(0.2, 1)` | The ant is considered healthy if the z-coordinate of the torso is in this range |
- | `contact_force_range` | **tuple** | `(-1, 1)` | Contact forces are clipped to this range in the computation of *contact_cost* |
- | `reset_noise_scale` | **float** | `0.1` | Scale of random perturbations of initial position and velocity (see section on Starting State) |
- | `exclude_current_positions_from_observation`| **bool** | `True`| Whether or not to omit the x- and y-coordinates from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behavior in policies |
-
- ### Version History
-
- * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
- * v2: All continuous control environments now use mujoco_py >= 1.50
- * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
- * v0: Initial versions release (1.0.0)
- """
-
def __init__(
self,
xml_file="ant.xml",
@@ -199,7 +38,7 @@ def __init__(
exclude_current_positions_from_observation
)
- mujoco_env.MujocoEnv.__init__(self, xml_file, 5)
+ mujoco_env.MujocoEnv.__init__(self, xml_file, 5, mujoco_bindings="mujoco_py")
@property
def healthy_reward(self):
diff --git a/gym/envs/mujoco/ant_v4.py b/gym/envs/mujoco/ant_v4.py
new file mode 100644
index 00000000000..ff30f4c9022
--- /dev/null
+++ b/gym/envs/mujoco/ant_v4.py
@@ -0,0 +1,298 @@
+import numpy as np
+
+from gym import utils
+from gym.envs.mujoco import mujoco_env
+
+DEFAULT_CAMERA_CONFIG = {
+ "distance": 4.0,
+}
+
+
+class AntEnv(mujoco_env.MujocoEnv, utils.EzPickle):
+ """
+ ### Description
+
+ This environment is based on the environment introduced by Schulman,
+ Moritz, Levine, Jordan and Abbeel in ["High-Dimensional Continuous Control
+ Using Generalized Advantage Estimation"](https://arxiv.org/abs/1506.02438).
+ The ant is a 3D robot consisting of one torso (free rotational body) with
+ four legs attached to it with each leg having two links. The goal is to
+ coordinate the four legs to move in the forward (right) direction by applying
+ torques on the eight hinges connecting the two links of each leg and the torso
+ (nine parts and eight hinges).
+
+ ### Action Space
+ The agent takes an 8-element vector of actions.
+
+ The action space is a continuous vector `(action, action, action, action, action, action,
+ action, action)` with all entries in `[-1, 1]`, where each `action` represents the numerical torque
+ applied at the corresponding hinge joint.
+
+ | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|----------------------|---------------|----------------|---------------------------------------|-------|------|
+ | 0 | Torque applied on the rotor between the torso and front left hip | -1 | 1 | hip_1 (front_left_leg) | hinge | torque (N m) |
+ | 1 | Torque applied on the rotor between the front left two links | -1 | 1 | angle_1 (front_left_leg) | hinge | torque (N m) |
+ | 2 | Torque applied on the rotor between the torso and front right hip | -1 | 1 | hip_2 (front_right_leg) | hinge | torque (N m) |
+ | 3 | Torque applied on the rotor between the front right two links | -1 | 1 | angle_2 (front_right_leg) | hinge | torque (N m) |
+ | 4 | Torque applied on the rotor between the torso and back left hip | -1 | 1 | hip_3 (back_leg) | hinge | torque (N m) |
+ | 5 | Torque applied on the rotor between the back left two links | -1 | 1 | angle_3 (back_leg) | hinge | torque (N m) |
+ | 6 | Torque applied on the rotor between the torso and back right hip | -1 | 1 | hip_4 (right_back_leg) | hinge | torque (N m) |
+ | 7 | Torque applied on the rotor between the back right two links | -1 | 1 | angle_4 (right_back_leg) | hinge | torque (N m) |
+
+ ### Observation Space
+
+ The state space consists of positional values of different body parts of the ant,
+ followed by the velocities of those individual parts (their derivatives) with all
+ the positions ordered before all the velocities.
+
+ The observation is an `ndarray` with shape `(111,)` where the elements correspond to the following:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|-------------------------------------------------------------|----------------|-----------------|----------------------------------------|-------|------|
+ | 0 | x-coordinate of the torso (centre) | -Inf | Inf | torso | free | position (m) |
+ | 1 | y-coordinate of the torso (centre) | -Inf | Inf | torso | free | position (m) |
+ | 2 | z-coordinate of the torso (centre) | -Inf | Inf | torso | free | position (m) |
+ | 3 | x-orientation of the torso (centre) | -Inf | Inf | torso | free | angle (rad) |
+ | 4 | y-orientation of the torso (centre) | -Inf | Inf | torso | free | angle (rad) |
+ | 5 | z-orientation of the torso (centre) | -Inf | Inf | torso | free | angle (rad) |
+ | 6 | w-orientation of the torso (centre) | -Inf | Inf | torso | free | angle (rad) |
+ | 7 | angle between torso and first link on front left | -Inf | Inf | hip_1 (front_left_leg) | hinge | angle (rad) |
+ | 8 | angle between the two links on the front left | -Inf | Inf | ankle_1 (front_left_leg) | hinge | angle (rad) |
+ | 9 | angle between torso and first link on front right | -Inf | Inf | hip_2 (front_right_leg) | hinge | angle (rad) |
+ | 10 | angle between the two links on the front right | -Inf | Inf | ankle_2 (front_right_leg) | hinge | angle (rad) |
+ | 11 | angle between torso and first link on back left | -Inf | Inf | hip_3 (back_leg) | hinge | angle (rad) |
+ | 12 | angle between the two links on the back left | -Inf | Inf | ankle_3 (back_leg) | hinge | angle (rad) |
+ | 13 | angle between torso and first link on back right | -Inf | Inf | hip_4 (right_back_leg) | hinge | angle (rad) |
+ | 14 | angle between the two links on the back right | -Inf | Inf | ankle_4 (right_back_leg) | hinge | angle (rad) |
+ | 15 | x-coordinate velocity of the torso | -Inf | Inf | torso | free | velocity (m/s) |
+ | 16 | y-coordinate velocity of the torso | -Inf | Inf | torso | free | velocity (m/s) |
+ | 17 | z-coordinate velocity of the torso | -Inf | Inf | torso | free | velocity (m/s) |
+ | 18 | x-coordinate angular velocity of the torso | -Inf | Inf | torso | free | angular velocity (rad/s) |
+ | 19 | y-coordinate angular velocity of the torso | -Inf | Inf | torso | free | angular velocity (rad/s) |
+ | 20 | z-coordinate angular velocity of the torso | -Inf | Inf | torso | free | angular velocity (rad/s) |
+ | 21 | angular velocity of angle between torso and front left link | -Inf | Inf | hip_1 (front_left_leg) | hinge | angle (rad) |
+ | 22 | angular velocity of the angle between front left links | -Inf | Inf | ankle_1 (front_left_leg) | hinge | angle (rad) |
+ | 23 | angular velocity of angle between torso and front right link| -Inf | Inf | hip_2 (front_right_leg) | hinge | angle (rad) |
+ | 24 | angular velocity of the angle between front right links | -Inf | Inf | ankle_2 (front_right_leg) | hinge | angle (rad) |
+ | 25 | angular velocity of angle between torso and back left link | -Inf | Inf | hip_3 (back_leg) | hinge | angle (rad) |
+ | 26 | angular velocity of the angle between back left links | -Inf | Inf | ankle_3 (back_leg) | hinge | angle (rad) |
+ | 27 | angular velocity of angle between torso and back right link | -Inf | Inf | hip_4 (right_back_leg) | hinge | angle (rad) |
+ | 28 |angular velocity of the angle between back right links | -Inf | Inf | ankle_4 (right_back_leg) | hinge | angle (rad) |
+
+
+ The remaining 14*6 = 84 elements in the state are contact forces
+ (external forces - force x, y, z and torque x, y, z) applied to the
+ center of mass of each of the links. The 14 links are: the ground link,
+ the torso link, and 3 links for each leg (1 + 1 + 12) with the 6 external forces.
+
+ The (x,y,z) coordinates are translational DOFs while the orientations are rotational
+ DOFs expressed as quaternions. One can read more about free joints on the [Mujoco Documentation](https://mujoco.readthedocs.io/en/latest/XMLreference.html).
+
+ **Note:** The 29 elements in the table above, together with the 84 contact-force values, give rise to a
+ `(113,)`-dimensional state. In practice (and in the Gym implementation), the first two positional
+ elements (the x- and y-coordinates of the torso) are omitted from the observation, since the reward
+ function is calculated based on the x-coordinate value. This value is hidden from the algorithm, which
+ in turn has to develop an abstract understanding of it from the observed rewards. Therefore, the
+ observation space has shape `(111,)` instead of `(113,)`, and the first two rows of the table are not part of the default observation.
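+
+ A quick sketch of checking this, assuming the `Ant-v4` registration added elsewhere in this PR:
+
+ ```python
+ import gym
+
+ env = gym.make("Ant-v4")
+ obs = env.reset()
+ assert obs.shape == (111,)  # torso x/y positions are excluded by default
+ ```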
+
+ **Note:** The Ant-v4 environment no longer has the contact-forces issue described below.
+ For Ant versions prior to v4, there have been reported issues that using a mujoco-py version > 2.0 results
+ in the contact forces always being 0. As such, we recommend using a mujoco-py version < 2.0
+ with those older Ant environments if you would like to report results with contact forces (if
+ contact forces are not used in your experiments, you can use version > 2.0).
+
+ ### Rewards
+ The reward consists of three parts:
+ - *survive_reward*: Every timestep that the ant is alive, it gets a reward of 1.
+ - *forward_reward*: A reward of moving forward which is measured as
+ *(x-coordinate after action - x-coordinate before action)/dt*. *dt* is the time
+ between actions and is dependent on the frame_skip parameter (default is 5),
+ where the *dt* for one frame is 0.01 - making the default *dt = 5 * 0.01 = 0.05*.
+ This reward is positive if the ant moves forward (in the positive x direction).
+ - *ctrl_cost*: A negative reward for penalising the ant if it takes actions
+ that are too large. It is measured as *coefficient × sum(action^2)*,
+ where *coefficient* is a parameter set for the control and has a default value of 0.5.
+ - *contact_cost*: A negative reward for penalising the ant if the external contact
+ force is too large. It is calculated as *0.5 × 0.001 × sum(clip(external contact
+ force to [-1,1])^2)*.
+
+ The total reward returned is ***reward*** *=* *survive_reward + forward_reward - ctrl_cost - contact_cost*.
+
+ ### Starting State
+ All observations start in state
+ (0.0, 0.0, 0.75, 1.0, 0.0 ... 0.0) with a uniform noise in the range
+ of [-0.1, 0.1] added to the positional values and standard normal noise
+ with 0 mean and 0.1 standard deviation added to the velocity values for
+ stochasticity. Note that the initial z coordinate is intentionally selected
+ to be slightly high, thereby indicating a standing up ant. The initial orientation
+ is designed to make it face forward as well.
+
+ ### Episode Termination
+ The episode terminates when any of the following happens:
+
+ 1. The episode duration reaches 1000 timesteps
+ 2. Any of the state space values is no longer finite
+ 3. The z-coordinate of the torso (index 2 in the state vector) is **not** in the range `[0.2, 1.0]`
+
+ ### Arguments
+
+ No additional arguments are currently supported (in v2 and lower), but modifications
+ can be made to the XML file in the assets folder (or by changing the path to a modified
+ XML file in another folder).
+
+ ```
+ env = gym.make('Ant-v2')
+ ```
+
+ v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
+
+ ```
+ env = gym.make('Ant-v3', ctrl_cost_weight=0.1, ...)
+ ```
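+
+ The v4 environment added in this PR follows the same pattern; its constructor accepts the same keyword
+ arguments (`xml_file`, `ctrl_cost_weight`, `reset_noise_scale`, etc.), for example:
+
+ ```
+ env = gym.make('Ant-v4', ctrl_cost_weight=0.1, healthy_reward=1.0)
+ ```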
+
+ ### Version History
+ * v4: all mujoco environments now use the mujoco bindings in mujoco >= 2.1.3
+ * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
+ * v2: All continuous control environments now use mujoco_py >= 1.50
+ * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
+ * v0: Initial versions release (1.0.0)
+ """
+
+ def __init__(
+ self,
+ xml_file="ant.xml",
+ ctrl_cost_weight=0.5,
+ contact_cost_weight=5e-4,
+ healthy_reward=1.0,
+ terminate_when_unhealthy=True,
+ healthy_z_range=(0.2, 1.0),
+ contact_force_range=(-1.0, 1.0),
+ reset_noise_scale=0.1,
+ exclude_current_positions_from_observation=True,
+ ):
+ utils.EzPickle.__init__(**locals())
+
+ self._ctrl_cost_weight = ctrl_cost_weight
+ self._contact_cost_weight = contact_cost_weight
+
+ self._healthy_reward = healthy_reward
+ self._terminate_when_unhealthy = terminate_when_unhealthy
+ self._healthy_z_range = healthy_z_range
+
+ self._contact_force_range = contact_force_range
+
+ self._reset_noise_scale = reset_noise_scale
+
+ self._exclude_current_positions_from_observation = (
+ exclude_current_positions_from_observation
+ )
+
+ mujoco_env.MujocoEnv.__init__(self, xml_file, 5)
+
+ @property
+ def healthy_reward(self):
+ return (
+ float(self.is_healthy or self._terminate_when_unhealthy)
+ * self._healthy_reward
+ )
+
+ def control_cost(self, action):
+ control_cost = self._ctrl_cost_weight * np.sum(np.square(action))
+ return control_cost
+
+ @property
+ def contact_forces(self):
+ raw_contact_forces = self.data.cfrc_ext
+ min_value, max_value = self._contact_force_range
+ contact_forces = np.clip(raw_contact_forces, min_value, max_value)
+ return contact_forces
+
+ @property
+ def contact_cost(self):
+ contact_cost = self._contact_cost_weight * np.sum(
+ np.square(self.contact_forces)
+ )
+ return contact_cost
+
+ @property
+ def is_healthy(self):
+ state = self.state_vector()
+ min_z, max_z = self._healthy_z_range
+ is_healthy = np.isfinite(state).all() and min_z <= state[2] <= max_z
+ return is_healthy
+
+ @property
+ def done(self):
+ done = not self.is_healthy if self._terminate_when_unhealthy else False
+ return done
+
+ def step(self, action):
+ xy_position_before = self.get_body_com("torso")[:2].copy()
+ self.do_simulation(action, self.frame_skip)
+ xy_position_after = self.get_body_com("torso")[:2].copy()
+
+ xy_velocity = (xy_position_after - xy_position_before) / self.dt
+ x_velocity, y_velocity = xy_velocity
+
+ ctrl_cost = self.control_cost(action)
+ contact_cost = self.contact_cost
+
+ forward_reward = x_velocity
+ healthy_reward = self.healthy_reward
+
+ rewards = forward_reward + healthy_reward
+ costs = ctrl_cost + contact_cost
+
+ reward = rewards - costs
+ done = self.done
+ observation = self._get_obs()
+ info = {
+ "reward_forward": forward_reward,
+ "reward_ctrl": -ctrl_cost,
+ "reward_contact": -contact_cost,
+ "reward_survive": healthy_reward,
+ "x_position": xy_position_after[0],
+ "y_position": xy_position_after[1],
+ "distance_from_origin": np.linalg.norm(xy_position_after, ord=2),
+ "x_velocity": x_velocity,
+ "y_velocity": y_velocity,
+ "forward_reward": forward_reward,
+ }
+
+ return observation, reward, done, info
+
+ def _get_obs(self):
+ position = self.data.qpos.flat.copy()
+ velocity = self.data.qvel.flat.copy()
+ contact_force = self.contact_forces.flat.copy()
+
+ if self._exclude_current_positions_from_observation:
+ position = position[2:]
+
+ observations = np.concatenate((position, velocity, contact_force))
+
+ return observations
+
+ def reset_model(self):
+ noise_low = -self._reset_noise_scale
+ noise_high = self._reset_noise_scale
+
+ qpos = self.init_qpos + self.np_random.uniform(
+ low=noise_low, high=noise_high, size=self.model.nq
+ )
+ qvel = (
+ self.init_qvel
+ + self._reset_noise_scale * self.np_random.standard_normal(self.model.nv)
+ )
+ self.set_state(qpos, qvel)
+
+ observation = self._get_obs()
+
+ return observation
+
+ def viewer_setup(self):
+ for key, value in DEFAULT_CAMERA_CONFIG.items():
+ if isinstance(value, np.ndarray):
+ getattr(self.viewer.cam, key)[:] = value
+ else:
+ setattr(self.viewer.cam, key, value)
diff --git a/gym/envs/mujoco/half_cheetah.py b/gym/envs/mujoco/half_cheetah.py
index 53a206fb656..2c32de0b0a9 100644
--- a/gym/envs/mujoco/half_cheetah.py
+++ b/gym/envs/mujoco/half_cheetah.py
@@ -6,7 +6,9 @@
class HalfCheetahEnv(mujoco_env.MujocoEnv, utils.EzPickle):
def __init__(self):
- mujoco_env.MujocoEnv.__init__(self, "half_cheetah.xml", 5)
+ mujoco_env.MujocoEnv.__init__(
+ self, "half_cheetah.xml", 5, mujoco_bindings="mujoco_py"
+ )
utils.EzPickle.__init__(self)
def step(self, action):
diff --git a/gym/envs/mujoco/half_cheetah_v3.py b/gym/envs/mujoco/half_cheetah_v3.py
index 64104f867a8..c6da69f2cc7 100644
--- a/gym/envs/mujoco/half_cheetah_v3.py
+++ b/gym/envs/mujoco/half_cheetah_v3.py
@@ -10,125 +10,6 @@
class HalfCheetahEnv(mujoco_env.MujocoEnv, utils.EzPickle):
- """
- ### Description
-
- This environment is based on the work by P. Wawrzy´nski in
- ["A Cat-Like Robot Real-Time Learning to Run"](http://staff.elka.pw.edu.pl/~pwawrzyn/pub-s/0812_LSCLRR.pdf).
- The HalfCheetah is a 2-dimensional robot consisting of 9 links and 8
- joints connecting them (including two paws). The goal is to apply a torque
- on the joints to make the cheetah run forward (right) as fast as possible,
- with a positive reward allocated based on the distance moved forward and a
- negative reward allocated for moving backward. The torso and head of the
- cheetah are fixed, and the torque can only be applied on the other 6 joints
- over the front and back thighs (connecting to the torso), shins
- (connecting to the thighs) and feet (connecting to the shins).
-
- ### Action Space
- The action space is a `Box(-1, 1, (6,), float32)`. An action represents the torques applied between *links*.
-
- | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
- |-------|--------------------------------------|---------------|----------------|---------------------------------------|-------|------|
- | 0 | Torque applied on the back thigh rotor | -1 | 1 | bthigh | hinge | torque (N m) |
- | 1 | Torque applied on the back shin rotor | -1 | 1 | bshin | hinge | torque (N m) |
- | 2 | Torque applied on the back foot rotor | -1 | 1 | bfoot | hinge | torque (N m) |
- | 3 | Torque applied on the front thigh rotor| -1 | 1 | fthigh | hinge | torque (N m) |
- | 4 | Torque applied on the front shin rotor | -1 | 1 | fshin | hinge | torque (N m) |
- | 5 | Torque applied on the front foot rotor | -1 | 1 | ffoot | hinge | torque (N m) |
-
- ### Observation Space
-
- Observations consist of positional values of different body parts of the
- cheetah, followed by the velocities of those individual parts (their derivatives) with all the positions ordered before all the velocities.
-
- By default, observations do not include the x-coordinate of the cheetah's center of mass. It may
- be included by passing `exclude_current_positions_from_observation=False` during construction.
- In that case, the observation space will have 18 dimensions where the first dimension
- represents the x-coordinate of the cheetah's center of mass.
- Regardless of whether `exclude_current_positions_from_observation` was set to true or false, the x-coordinate
- will be returned in `info` with key `"x_position"`.
-
- However, by default, the observation is a `ndarray` with shape `(17,)` where the elements correspond to the following:
-
-
- | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
- |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
- | 0 | z-coordinate of the front tip | -Inf | Inf | rootz | slide | position (m) |
- | 1 | angle of the front tip | -Inf | Inf | rooty | hinge | angle (rad) |
- | 2 | angle of the second rotor | -Inf | Inf | bthigh | hinge | angle (rad) |
- | 3 | angle of the second rotor | -Inf | Inf | bshin | hinge | angle (rad) |
- | 4 | velocity of the tip along the x-axis | -Inf | Inf | bfoot | hinge | angle (rad) |
- | 5 | velocity of the tip along the y-axis | -Inf | Inf | fthigh | hinge | angle (rad) |
- | 6 | angular velocity of front tip | -Inf | Inf | fshin | hinge | angle (rad) |
- | 7 | angular velocity of second rotor | -Inf | Inf | ffoot | hinge | angle (rad) |
- | 8 | x-coordinate of the front tip | -Inf | Inf | rootx | slide | velocity (m/s) |
- | 9 | y-coordinate of the front tip | -Inf | Inf | rootz | slide | velocity (m/s) |
- | 10 | angle of the front tip | -Inf | Inf | rooty | hinge | angular velocity (rad/s) |
- | 11 | angle of the second rotor | -Inf | Inf | bthigh | hinge | angular velocity (rad/s) |
- | 12 | angle of the second rotor | -Inf | Inf | bshin | hinge | angular velocity (rad/s) |
- | 13 | velocity of the tip along the x-axis | -Inf | Inf | bfoot | hinge | angular velocity (rad/s) |
- | 14 | velocity of the tip along the y-axis | -Inf | Inf | fthigh | hinge |angular velocity (rad/s) |
- | 15 | angular velocity of front tip | -Inf | Inf | fshin | hinge | angular velocity (rad/s) |
- | 16 | angular velocity of second rotor | -Inf | Inf | ffoot | hinge | angular velocity (rad/s) |
-
- ### Rewards
- The reward consists of two parts:
- - *forward_reward*: A reward of moving forward which is measured
- as *`forward_reward_weight` * (x-coordinate before action - x-coordinate after action)/dt*. *dt* is
- the time between actions and is dependent on the frame_skip parameter
- (fixed to 5), where the frametime is 0.01 - making the
- default *dt = 5 * 0.01 = 0.05*. This reward would be positive if the cheetah
- runs forward (right).
- - *ctrl_cost*: A cost for penalising the cheetah if it takes
- actions that are too large. It is measured as *`ctrl_cost_weight` *
- sum(action2)* where *`ctrl_cost_weight`* is a parameter set for the
- control and has a default value of 0.1
-
- The total reward returned is ***reward*** *=* *forward_reward - ctrl_cost* and `info` will also contain the individual reward terms
-
- ### Starting State
- All observations start in state (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
- 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,) with a noise added to the
- initial state for stochasticity. As seen before, the first 8 values in the
- state are positional and the last 9 values are velocity. A uniform noise in
- the range of [-`reset_noise_scale`, `reset_noise_scale`] is added to the positional values while a standard
- normal noise with a mean of 0 and standard deviation of `reset_noise_scale` is added to the
- initial velocity values of all zeros.
-
- ### Episode Termination
- The episode terminates when the episode length is greater than 1000.
-
- ### Arguments
-
- No additional arguments are currently supported in v2 and lower.
-
- ```
- env = gym.make('HalfCheetah-v2')
- ```
-
- v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
-
- ```
- env = gym.make('HalfCheetah-v3', ctrl_cost_weight=0.1, ....)
- ```
-
- | Parameter | Type | Default |Description |
- |-------------------------|------------|----------------------|-------------------------------|
- | `xml_file` | **str** | `"half_cheetah.xml"` | Path to a MuJoCo model |
- | `forward_reward_weight` | **float** | `1.0` | Weight for *forward_reward* term (see section on reward) |
- | `ctrl_cost_weight` | **float** | `0.1` | Weight for *ctrl_cost* weight (see section on reward) |
- | `reset_noise_scale` | **float** | `0.1` | Scale of random perturbations of initial position and velocity (see section on Starting State) |
- | `exclude_current_positions_from_observation`| **bool** | `True` | Whether or not to omit the x-coordinate from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behavior in policies |
-
-
- ### Version History
-
- * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
- * v2: All continuous control environments now use mujoco_py >= 1.50
- * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
- * v0: Initial versions release (1.0.0)
- """
-
def __init__(
self,
xml_file="half_cheetah.xml",
@@ -149,7 +30,7 @@ def __init__(
exclude_current_positions_from_observation
)
- mujoco_env.MujocoEnv.__init__(self, xml_file, 5)
+ mujoco_env.MujocoEnv.__init__(self, xml_file, 5, mujoco_bindings="mujoco_py")
def control_cost(self, action):
control_cost = self._ctrl_cost_weight * np.sum(np.square(action))
diff --git a/gym/envs/mujoco/half_cheetah_v4.py b/gym/envs/mujoco/half_cheetah_v4.py
new file mode 100644
index 00000000000..e7454855f4c
--- /dev/null
+++ b/gym/envs/mujoco/half_cheetah_v4.py
@@ -0,0 +1,230 @@
+__credits__ = ["Rushiv Arora"]
+import numpy as np
+
+from gym import utils
+from gym.envs.mujoco import mujoco_env
+
+DEFAULT_CAMERA_CONFIG = {
+ "distance": 4.0,
+}
+
+
+class HalfCheetahEnv(mujoco_env.MujocoEnv, utils.EzPickle):
+ """
+ ### Description
+
+ This environment is based on the work by P. Wawrzy´nski in
+ ["A Cat-Like Robot Real-Time Learning to Run"](http://staff.elka.pw.edu.pl/~pwawrzyn/pub-s/0812_LSCLRR.pdf).
+ The HalfCheetah is a 2-dimensional robot consisting of 9 links and 8
+ joints connecting them (including two paws). The goal is to apply a torque
+ on the joints to make the cheetah run forward (right) as fast as possible,
+ with a positive reward allocated based on the distance moved forward and a
+ negative reward allocated for moving backward. The torso and head of the
+ cheetah are fixed, and the torque can only be applied on the other 6 joints
+ over the front and back thighs (connecting to the torso), shins
+ (connecting to the thighs) and feet (connecting to the shins).
+
+ ### Action Space
+    The action space is a `Box(-1, 1, (6,), float32)`. An action represents the torques applied between *links*.
+
+ | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
+ |-------|--------------------------------------|---------------|----------------|---------------------------------------|-------|------|
+ | 0 | Torque applied on the back thigh rotor | -1 | 1 | bthigh | hinge | torque (N m) |
+ | 1 | Torque applied on the back shin rotor | -1 | 1 | bshin | hinge | torque (N m) |
+ | 2 | Torque applied on the back foot rotor | -1 | 1 | bfoot | hinge | torque (N m) |
+ | 3 | Torque applied on the front thigh rotor| -1 | 1 | fthigh | hinge | torque (N m) |
+ | 4 | Torque applied on the front shin rotor | -1 | 1 | fshin | hinge | torque (N m) |
+ | 5 | Torque applied on the front foot rotor | -1 | 1 | ffoot | hinge | torque (N m) |
+
+ ### Observation Space
+
+ The state space consists of positional values of different body parts of the
+ cheetah, followed by the velocities of those individual parts (their derivatives) with all the positions ordered before all the velocities.
+
+    The full state, including the x-coordinate of the front tip, is a `ndarray` with shape `(18,)` where the elements correspond to the following:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
+ |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
+    | 0   | x-coordinate of the front tip | -Inf | Inf | rootx | slide | position (m) |
+    | 1   | z-coordinate of the front tip | -Inf | Inf | rootz | slide | position (m) |
+    | 2   | angle of the front tip | -Inf | Inf | rooty | hinge | angle (rad) |
+    | 3   | angle of the back thigh rotor | -Inf | Inf | bthigh | hinge | angle (rad) |
+    | 4   | angle of the back shin rotor | -Inf | Inf | bshin | hinge | angle (rad) |
+    | 5   | angle of the back foot rotor | -Inf | Inf | bfoot | hinge | angle (rad) |
+    | 6   | angle of the front thigh rotor | -Inf | Inf | fthigh | hinge | angle (rad) |
+    | 7   | angle of the front shin rotor | -Inf | Inf | fshin | hinge | angle (rad) |
+    | 8   | angle of the front foot rotor | -Inf | Inf | ffoot | hinge | angle (rad) |
+    | 9   | velocity of the front tip along the x-axis | -Inf | Inf | rootx | slide | velocity (m/s) |
+    | 10  | velocity of the front tip along the z-axis | -Inf | Inf | rootz | slide | velocity (m/s) |
+    | 11  | angular velocity of the front tip | -Inf | Inf | rooty | hinge | angular velocity (rad/s) |
+    | 12  | angular velocity of the back thigh rotor | -Inf | Inf | bthigh | hinge | angular velocity (rad/s) |
+    | 13  | angular velocity of the back shin rotor | -Inf | Inf | bshin | hinge | angular velocity (rad/s) |
+    | 14  | angular velocity of the back foot rotor | -Inf | Inf | bfoot | hinge | angular velocity (rad/s) |
+    | 15  | angular velocity of the front thigh rotor | -Inf | Inf | fthigh | hinge | angular velocity (rad/s) |
+    | 16  | angular velocity of the front shin rotor | -Inf | Inf | fshin | hinge | angular velocity (rad/s) |
+    | 17  | angular velocity of the front foot rotor | -Inf | Inf | ffoot | hinge | angular velocity (rad/s) |
+
+
+ **Note:**
+ In practice (and Gym implementation), the first positional element is
+ omitted from the state space since the reward function is calculated based
+ on that value. This value is hidden from the algorithm, which in turn has
+ to develop an abstract understanding of it from the observed rewards.
+    Therefore, the default observation space has shape `(17,)` and looks like:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
+ |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
+    | 0   | z-coordinate of the front tip | -Inf | Inf | rootz | slide | position (m) |
+    | 1   | angle of the front tip | -Inf | Inf | rooty | hinge | angle (rad) |
+    | 2   | angle of the back thigh rotor | -Inf | Inf | bthigh | hinge | angle (rad) |
+    | 3   | angle of the back shin rotor | -Inf | Inf | bshin | hinge | angle (rad) |
+    | 4   | angle of the back foot rotor | -Inf | Inf | bfoot | hinge | angle (rad) |
+    | 5   | angle of the front thigh rotor | -Inf | Inf | fthigh | hinge | angle (rad) |
+    | 6   | angle of the front shin rotor | -Inf | Inf | fshin | hinge | angle (rad) |
+    | 7   | angle of the front foot rotor | -Inf | Inf | ffoot | hinge | angle (rad) |
+    | 8   | velocity of the front tip along the x-axis | -Inf | Inf | rootx | slide | velocity (m/s) |
+    | 9   | velocity of the front tip along the z-axis | -Inf | Inf | rootz | slide | velocity (m/s) |
+    | 10  | angular velocity of the front tip | -Inf | Inf | rooty | hinge | angular velocity (rad/s) |
+    | 11  | angular velocity of the back thigh rotor | -Inf | Inf | bthigh | hinge | angular velocity (rad/s) |
+    | 12  | angular velocity of the back shin rotor | -Inf | Inf | bshin | hinge | angular velocity (rad/s) |
+    | 13  | angular velocity of the back foot rotor | -Inf | Inf | bfoot | hinge | angular velocity (rad/s) |
+    | 14  | angular velocity of the front thigh rotor | -Inf | Inf | fthigh | hinge | angular velocity (rad/s) |
+    | 15  | angular velocity of the front shin rotor | -Inf | Inf | fshin | hinge | angular velocity (rad/s) |
+    | 16  | angular velocity of the front foot rotor | -Inf | Inf | ffoot | hinge | angular velocity (rad/s) |
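+
+    As a sketch (mirroring `_get_obs` below), the default observation is built by dropping the
+    x-coordinate from the positions and concatenating the velocities:
+
+    ```
+    position = data.qpos.flat.copy()[1:]   # drop the x-coordinate (rootx)
+    velocity = data.qvel.flat.copy()
+    observation = np.concatenate((position, velocity)).ravel()   # shape (17,)
+    ```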
+
+ ### Rewards
+ The reward consists of two parts:
+    - *reward_run*: A reward for moving forward, measured
+    as *(x-coordinate after action - x-coordinate before action)/dt*. *dt* is
+    the time between actions and depends on the frame_skip parameter
+    (default is 5), where the frametime is 0.01 - making the
+    default *dt = 5 * 0.01 = 0.05*. This reward is positive if the cheetah
+    runs forward (right), as desired.
+    - *reward_control*: A negative reward for penalising the cheetah if it takes
+    actions that are too large. It is measured as *-coefficient *
+    sum(action^2)* where *coefficient* is a parameter set for the
+    control and has a default value of 0.1.
+
+ The total reward returned is ***reward*** *=* *reward_run + reward_control*
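+
+    A minimal sketch of the per-step reward, mirroring the `step` implementation below
+    (with the default `forward_reward_weight=1.0` and `ctrl_cost_weight=0.1`):
+
+    ```
+    x_velocity = (x_position_after - x_position_before) / dt   # dt = 0.05 by default
+    forward_reward = forward_reward_weight * x_velocity
+    ctrl_cost = ctrl_cost_weight * np.sum(np.square(action))
+    reward = forward_reward - ctrl_cost
+    ```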
+
+ ### Starting State
+ All observations start in state (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+ 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,) with a noise added to the
+ initial state for stochasticity. As seen before, the first 8 values in the
+ state are positional and the last 9 values are velocity. A uniform noise in
+ the range of [-0.1, 0.1] is added to the positional values while a standard
+ normal noise with a mean of 0 and standard deviation of 0.1 is added to the
+ initial velocity values of all zeros.
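+
+    A sketch of how the initial state is drawn (mirroring `reset_model` below, with the
+    default `reset_noise_scale=0.1`):
+
+    ```
+    qpos = init_qpos + np_random.uniform(low=-0.1, high=0.1, size=model.nq)
+    qvel = init_qvel + 0.1 * np_random.standard_normal(model.nv)
+    ```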
+
+ ### Episode Termination
+ The episode terminates when the episode length is greater than 1000.
+
+ ### Arguments
+
+ No additional arguments are currently supported (in v2 and lower), but
+ modifications can be made to the XML file in the assets folder at
+ `gym/envs/mujoco/assets/half_cheetah.xml` (or by changing the path to a
+ modified XML file in another folder).
+
+ ```
+ env = gym.make('HalfCheetah-v2')
+ ```
+
+ v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
+
+ ```
+ env = gym.make('HalfCheetah-v3', ctrl_cost_weight=0.1, ....)
+ ```
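+
+    Assuming this version is registered under the `HalfCheetah-v4` id as part of this change,
+    it accepts the same kwargs:
+
+    ```
+    env = gym.make('HalfCheetah-v4', ctrl_cost_weight=0.1, ....)
+    ```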
+
+
+ ### Version History
+
+    * v4: all mujoco environments now use the mujoco bindings in mujoco>=2.1.3
+ * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
+ * v2: All continuous control environments now use mujoco_py >= 1.50
+ * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
+ * v0: Initial versions release (1.0.0)
+ """
+
+ def __init__(
+ self,
+ xml_file="half_cheetah.xml",
+ forward_reward_weight=1.0,
+ ctrl_cost_weight=0.1,
+ reset_noise_scale=0.1,
+ exclude_current_positions_from_observation=True,
+ ):
+ utils.EzPickle.__init__(**locals())
+
+ self._forward_reward_weight = forward_reward_weight
+
+ self._ctrl_cost_weight = ctrl_cost_weight
+
+ self._reset_noise_scale = reset_noise_scale
+
+ self._exclude_current_positions_from_observation = (
+ exclude_current_positions_from_observation
+ )
+
+ mujoco_env.MujocoEnv.__init__(self, xml_file, 5)
+
+ def control_cost(self, action):
+ control_cost = self._ctrl_cost_weight * np.sum(np.square(action))
+ return control_cost
+
+ def step(self, action):
+ x_position_before = self.data.qpos[0]
+ self.do_simulation(action, self.frame_skip)
+ x_position_after = self.data.qpos[0]
+ x_velocity = (x_position_after - x_position_before) / self.dt
+
+ ctrl_cost = self.control_cost(action)
+
+ forward_reward = self._forward_reward_weight * x_velocity
+
+ observation = self._get_obs()
+ reward = forward_reward - ctrl_cost
+ done = False
+ info = {
+ "x_position": x_position_after,
+ "x_velocity": x_velocity,
+ "reward_run": forward_reward,
+ "reward_ctrl": -ctrl_cost,
+ }
+
+ return observation, reward, done, info
+
+ def _get_obs(self):
+ position = self.data.qpos.flat.copy()
+ velocity = self.data.qvel.flat.copy()
+
+ if self._exclude_current_positions_from_observation:
+ position = position[1:]
+
+ observation = np.concatenate((position, velocity)).ravel()
+ return observation
+
+ def reset_model(self):
+ noise_low = -self._reset_noise_scale
+ noise_high = self._reset_noise_scale
+
+ qpos = self.init_qpos + self.np_random.uniform(
+ low=noise_low, high=noise_high, size=self.model.nq
+ )
+ qvel = (
+ self.init_qvel
+ + self._reset_noise_scale * self.np_random.standard_normal(self.model.nv)
+ )
+
+ self.set_state(qpos, qvel)
+
+ observation = self._get_obs()
+ return observation
+
+ def viewer_setup(self):
+ for key, value in DEFAULT_CAMERA_CONFIG.items():
+ if isinstance(value, np.ndarray):
+ getattr(self.viewer.cam, key)[:] = value
+ else:
+ setattr(self.viewer.cam, key, value)
diff --git a/gym/envs/mujoco/hopper.py b/gym/envs/mujoco/hopper.py
index ad459bda2a2..34c69be720f 100644
--- a/gym/envs/mujoco/hopper.py
+++ b/gym/envs/mujoco/hopper.py
@@ -6,7 +6,9 @@
class HopperEnv(mujoco_env.MujocoEnv, utils.EzPickle):
def __init__(self):
- mujoco_env.MujocoEnv.__init__(self, "hopper.xml", 4)
+ mujoco_env.MujocoEnv.__init__(
+ self, "hopper.xml", 4, mujoco_bindings="mujoco_py"
+ )
utils.EzPickle.__init__(self)
def step(self, a):
diff --git a/gym/envs/mujoco/hopper_v3.py b/gym/envs/mujoco/hopper_v3.py
index 807b6976f1b..cd16b07bc91 100644
--- a/gym/envs/mujoco/hopper_v3.py
+++ b/gym/envs/mujoco/hopper_v3.py
@@ -14,130 +14,6 @@
class HopperEnv(mujoco_env.MujocoEnv, utils.EzPickle):
- """
- ### Description
-
- This environment is based on the work done by Erez, Tassa, and Todorov in
- ["Infinite Horizon Model Predictive Control for Nonlinear Periodic Tasks"](http://www.roboticsproceedings.org/rss07/p10.pdf). The environment aims to
- increase the number of independent state and control variables as compared to
- the classic control environments. The hopper is a two-dimensional
- one-legged figure that consist of four main body parts - the torso at the
- top, the thigh in the middle, the leg in the bottom, and a single foot on
- which the entire body rests. The goal is to make hops that move in the
- forward (right) direction by applying torques on the three hinges
- connecting the four body parts.
-
- ### Action Space
- The action space is a `Box(-1, 1, (3,), float32)`. An action represents the torques applied between *links*
-
- | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|------------------------------------|-------------|-------------|----------------------------------|-------|--------------|
- | 0 | Torque applied on the thigh rotor | -1 | 1 | thigh_joint | hinge | torque (N m) |
- | 1 | Torque applied on the leg rotor | -1 | 1 | leg_joint | hinge | torque (N m) |
- | 3 | Torque applied on the foot rotor | -1 | 1 | foot_joint | hinge | torque (N m) |
-
- ### Observation Space
-
- Observations consist of positional values of different body parts of the
- hopper, followed by the velocities of those individual parts
- (their derivatives) with all the positions ordered before all the velocities.
-
- By default, observations do not include the x-coordinate of the hopper. It may
- be included by passing `exclude_current_positions_from_observation=False` during construction.
- In that case, the observation space will have 12 dimensions where the first dimension
- represents the x-coordinate of the hopper.
- Regardless of whether `exclude_current_positions_from_observation` was set to true or false, the x-coordinate
- will be returned in `info` with key `"x_position"`.
-
- However, by default, the observation is a `ndarray` with shape `(11,)` where the elements
- correspond to the following:
-
- | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
- |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
- | 0 | z-coordinate of the top (height of hopper) | -Inf | Inf | rootz | slide | position (m) |
- | 1 | angle of the top | -Inf | Inf | rooty | hinge | angle (rad) |
- | 2 | angle of the thigh joint | -Inf | Inf | thigh_joint | hinge | angle (rad) |
- | 3 | angle of the leg joint | -Inf | Inf | leg_joint | hinge | angle (rad) |
- | 4 | angle of the foot joint | -Inf | Inf | foot_joint | hinge | angle (rad) |
- | 5 | velocity of the x-coordinate of the top | -Inf | Inf | rootx | slide | velocity (m/s) |
- | 6 | velocity of the z-coordinate (height) of the top | -Inf | Inf | rootz | slide | velocity (m/s) |
- | 7 | angular velocity of the angle of the top | -Inf | Inf | rooty | hinge | angular velocity (rad/s) |
- | 8 | angular velocity of the thigh hinge | -Inf | Inf | thigh_joint | hinge | angular velocity (rad/s) |
- | 9 | angular velocity of the leg hinge | -Inf | Inf | leg_joint | hinge | angular velocity (rad/s) |
- | 10 | angular velocity of the foot hinge | -Inf | Inf | foot_joint | hinge | angular velocity (rad/s) |
-
- ### Rewards
- The reward consists of three parts:
- - *healthy_reward*: Every timestep that the hopper is healthy (see definition in section "Episode Termination"), it gets a reward of fixed value `healthy_reward`.
- - *forward_reward*: A reward of hopping forward which is measured
- as *`forward_reward_weight` * (x-coordinate before action - x-coordinate after action)/dt*. *dt* is
- the time between actions and is dependent on the frame_skip parameter
- (fixed to 4), where the frametime is 0.002 - making the
- default *dt = 4 * 0.002 = 0.008*. This reward would be positive if the hopper
- hops forward (positive x direction).
- - *ctrl_cost*: A cost for penalising the hopper if it takes
- actions that are too large. It is measured as *`ctrl_cost_weight` *
- sum(action2)* where *`ctrl_cost_weight`* is a parameter set for the
- control and has a default value of 0.001
-
- The total reward returned is ***reward*** *=* *healthy_reward + forward_reward - ctrl_cost* and `info` will also contain the individual reward terms
-
- ### Starting State
- All observations start in state
- (0.0, 1.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) with a uniform noise
- in the range of [-`reset_noise_scale`, `reset_noise_scale`] added to the values for stochasticity.
-
- ### Episode Termination
- The hopper is said to be unhealthy if any of the following happens:
-
- 1. An element of `observation[1:]` (if `exclude_current_positions_from_observation=True`, else `observation[2:]`) is no longer contained in the closed interval specified by the argument `healthy_state_range`
- 2. The height of the hopper (`observation[0]` if `exclude_current_positions_from_observation=True`, else `observation[1]`) is no longer contained in the closed interval specified by the argument `healthy_z_range` (usually meaning that it has fallen)
- 3. The angle (`observation[1]` if `exclude_current_positions_from_observation=True`, else `observation[2]`) is no longer contained in the closed interval specified by the argument `healthy_angle_range`
-
- If `terminate_when_unhealthy=True` is passed during construction (which is the default),
- the episode terminates when any of the following happens:
-
- 1. The episode duration reaches a 1000 timesteps
- 2. The hopper is unhealthy
-
- If `terminate_when_unhealthy=False` is passed, the episode is terminated only when 1000 timesteps are exceeded.
-
- ### Arguments
-
- No additional arguments are currently supported in v2 and lower.
-
- ```
- env = gym.make('Hopper-v2')
- ```
-
- v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
-
- ```
- env = gym.make('Hopper-v3', ctrl_cost_weight=0.1, ....)
- ```
-
- | Parameter | Type | Default |Description |
- |-------------------------|------------|----------------|-------------------------------|
- | `xml_file` | **str** | `"hopper.xml"` | Path to a MuJoCo model |
- | `forward_reward_weight` | **float** | `1.0` | Weight for *forward_reward* term (see section on reward) |
- | `ctrl_cost_weight` | **float** | `0.001` | Weight for *ctrl_cost* reward (see section on reward) |
- | `healthy_reward` | **float** | `1` | Constant reward given if the ant is "healthy" after timestep |
- | `terminate_when_unhealthy` | **bool**| `True` | If true, issue a done signal if the hopper is no longer healthy |
- | `healthy_state_range` | **tuple** | `(-100, 100)` | The elements of `observation[1:]` (if `exclude_current_positions_from_observation=True`, else `observation[2:]`) must be in this range for the hopper to be considered healthy |
- | `healthy_z_range` | **tuple** | `(0.7, float("inf"))` | The z-coordinate must be in this range for the hopper to be considered healthy |
- | `healthy_angle_range` | **tuple** | `(-0.2, 0.2)` | The angle given by `observation[1]` (if `exclude_current_positions_from_observation=True`, else `observation[2]`) must be in this range for the hopper to be considered healthy |
- | `reset_noise_scale` | **float** | `5e-3` | Scale of random perturbations of initial position and velocity (see section on Starting State) |
- | `exclude_current_positions_from_observation`| **bool** | `True`| Whether or not to omit the x-coordinate from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behavior in policies |
-
-
- ### Version History
-
- * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
- * v2: All continuous control environments now use mujoco_py >= 1.50
- * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
- * v0: Initial versions release (1.0.0)
- """
-
def __init__(
self,
xml_file="hopper.xml",
@@ -170,7 +46,7 @@ def __init__(
exclude_current_positions_from_observation
)
- mujoco_env.MujocoEnv.__init__(self, xml_file, 4)
+ mujoco_env.MujocoEnv.__init__(self, xml_file, 4, mujoco_bindings="mujoco_py")
@property
def healthy_reward(self):
diff --git a/gym/envs/mujoco/hopper_v4.py b/gym/envs/mujoco/hopper_v4.py
new file mode 100644
index 00000000000..9c80e929320
--- /dev/null
+++ b/gym/envs/mujoco/hopper_v4.py
@@ -0,0 +1,263 @@
+import numpy as np
+
+from gym import utils
+from gym.envs.mujoco import mujoco_env
+
+DEFAULT_CAMERA_CONFIG = {
+ "trackbodyid": 2,
+ "distance": 3.0,
+ "lookat": np.array((0.0, 0.0, 1.15)),
+ "elevation": -20.0,
+}
+
+
+class HopperEnv(mujoco_env.MujocoEnv, utils.EzPickle):
+ """
+ ### Description
+
+ This environment is based on the work done by Erez, Tassa, and Todorov in
+ ["Infinite Horizon Model Predictive Control for Nonlinear Periodic Tasks"](http://www.roboticsproceedings.org/rss07/p10.pdf). The environment aims to
+ increase the number of independent state and control variables as compared to
+ the classic control environments. The hopper is a two-dimensional
+    one-legged figure that consists of four main body parts - the torso at the
+    top, the thigh in the middle, the leg at the bottom, and a single foot on
+ which the entire body rests. The goal is to make hops that move in the
+ forward (right) direction by applying torques on the three hinges
+ connecting the four body parts.
+
+ ### Action Space
+    The action space is a `Box(-1, 1, (3,), float32)`. An action represents the torques applied between *links*.
+
+ | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|------------------------------------|-------------|-------------|----------------------------------|-------|--------------|
+ | 0 | Torque applied on the thigh rotor | -1 | 1 | thigh_joint | hinge | torque (N m) |
+ | 1 | Torque applied on the leg rotor | -1 | 1 | leg_joint | hinge | torque (N m) |
+    | 2   | Torque applied on the foot rotor   | -1          | 1           | foot_joint                       | hinge | torque (N m) |
+
+ ### Observation Space
+
+ The state space consists of positional values of different body parts of the
+ hopper, followed by the velocities of those individual parts
+ (their derivatives) with all the positions ordered before all the velocities.
+
+    The full state, including the x-coordinate of the top, is a `ndarray` with shape `(12,)`
+    where the elements correspond to the following:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
+ |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
+ | 0 | x-coordinate of the top | -Inf | Inf | rootx | slide | position (m) |
+ | 1 | z-coordinate of the top (height of hopper) | -Inf | Inf | rootz | slide | position (m) |
+ | 2 | angle of the top | -Inf | Inf | rooty | hinge | angle (rad) |
+ | 3 | angle of the thigh joint | -Inf | Inf | thigh_joint | hinge | angle (rad) |
+ | 4 | angle of the leg joint | -Inf | Inf | leg_joint | hinge | angle (rad) |
+ | 5 | angle of the foot joint | -Inf | Inf | foot_joint | hinge | angle (rad) |
+ | 6 | velocity of the x-coordinate of the top | -Inf | Inf | rootx | slide | velocity (m/s) |
+ | 7 | velocity of the z-coordinate (height) of the top | -Inf | Inf | rootz | slide | velocity (m/s) |
+ | 8 | angular velocity of the angle of the top | -Inf | Inf | rooty | hinge | angular velocity (rad/s) |
+ | 9 | angular velocity of the thigh hinge | -Inf | Inf | thigh_joint | hinge | angular velocity (rad/s) |
+ | 10 | angular velocity of the leg hinge | -Inf | Inf | leg_joint | hinge | angular velocity (rad/s) |
+ | 11 | angular velocity of the foot hinge | -Inf | Inf | foot_joint | hinge | angular velocity (rad/s) |
+
+
+
+ **Note:**
+ In practice (and Gym implementation), the first positional element is
+ omitted from the state space since the reward function is calculated based
+ on that value. This value is hidden from the algorithm, which in turn has
+ to develop an abstract understanding of it from the observed rewards.
+ Therefore, observation space has shape `(11,)` instead of `(12,)` and looks like:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
+ |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
+ | 0 | z-coordinate of the top (height of hopper) | -Inf | Inf | rootz | slide | position (m) |
+ | 1 | angle of the top | -Inf | Inf | rooty | hinge | angle (rad) |
+ | 2 | angle of the thigh joint | -Inf | Inf | thigh_joint | hinge | angle (rad) |
+ | 3 | angle of the leg joint | -Inf | Inf | leg_joint | hinge | angle (rad) |
+ | 4 | angle of the foot joint | -Inf | Inf | foot_joint | hinge | angle (rad) |
+ | 5 | velocity of the x-coordinate of the top | -Inf | Inf | rootx | slide | velocity (m/s) |
+ | 6 | velocity of the z-coordinate (height) of the top | -Inf | Inf | rootz | slide | velocity (m/s) |
+ | 7 | angular velocity of the angle of the top | -Inf | Inf | rooty | hinge | angular velocity (rad/s) |
+ | 8 | angular velocity of the thigh hinge | -Inf | Inf | thigh_joint | hinge | angular velocity (rad/s) |
+ | 9 | angular velocity of the leg hinge | -Inf | Inf | leg_joint | hinge | angular velocity (rad/s) |
+ | 10 | angular velocity of the foot hinge | -Inf | Inf | foot_joint | hinge | angular velocity (rad/s) |
+
+ ### Rewards
+ The reward consists of three parts:
+ - *alive bonus*: Every timestep that the hopper is alive, it gets a reward of 1,
+    - *reward_forward*: A reward for hopping forward, measured
+    as *(x-coordinate after action - x-coordinate before action)/dt*. *dt* is
+    the time between actions and depends on the frame_skip parameter
+    (default is 4), where the frametime is 0.002 - making the
+    default *dt = 4 * 0.002 = 0.008*. This reward is positive if the hopper
+    hops forward (right), as desired.
+    - *reward_control*: A negative reward for penalising the hopper if it takes
+    actions that are too large. It is measured as *-coefficient *
+    sum(action^2)* where *coefficient* is a parameter set for the
+    control and has a default value of 0.001.
+
+ The total reward returned is ***reward*** *=* *alive bonus + reward_forward + reward_control*
+
+ ### Starting State
+ All observations start in state
+ (0.0, 1.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) with a uniform noise
+ in the range of [-0.005, 0.005] added to the values for stochasticity.
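+
+    A sketch of how the initial state is drawn (mirroring `reset_model` below, with the
+    default `reset_noise_scale=5e-3`):
+
+    ```
+    qpos = init_qpos + np_random.uniform(low=-5e-3, high=5e-3, size=model.nq)
+    qvel = init_qvel + np_random.uniform(low=-5e-3, high=5e-3, size=model.nv)
+    ```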
+
+    ### Episode Termination
+    The episode terminates when any of the following happens:
+
+    1. The episode duration reaches 1000 timesteps
+    2. The absolute value of any of the state variables (the angle and beyond) exceeds 100
+    3. The height of the hopper drops below 0.7 metres (the hopper has fallen)
+    4. The absolute value of the angle exceeds 0.2 radians (the hopper has tipped over)
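+
+    Conditions 2-4 make up the health check. A minimal sketch (mirroring the `is_healthy`
+    property below, with the default argument values):
+
+    ```
+    z, angle = data.qpos[1:3]
+    state = state_vector()[2:]
+    healthy_state = np.all(np.logical_and(-100.0 < state, state < 100.0))
+    healthy_z = 0.7 < z
+    healthy_angle = -0.2 < angle < 0.2
+    is_healthy = healthy_state and healthy_z and healthy_angle
+    ```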
+
+ ### Arguments
+
+ No additional arguments are currently supported (in v2 and lower), but
+ modifications can be made to the XML file in the assets folder
+ (or by changing the path to a modified XML file in another folder).
+
+ ```
+ env = gym.make('Hopper-v2')
+ ```
+
+ v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
+
+ ```
+ env = gym.make('Hopper-v3', ctrl_cost_weight=0.1, ....)
+ ```
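+
+    Assuming this version is registered under the `Hopper-v4` id as part of this change,
+    it accepts the same kwargs:
+
+    ```
+    env = gym.make('Hopper-v4', ctrl_cost_weight=0.1, ....)
+    ```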
+
+ ### Version History
+
+    * v4: all mujoco environments now use the mujoco bindings in mujoco>=2.1.3
+ * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
+ * v2: All continuous control environments now use mujoco_py >= 1.50
+ * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
+ * v0: Initial versions release (1.0.0)
+ """
+
+ def __init__(
+ self,
+ xml_file="hopper.xml",
+ forward_reward_weight=1.0,
+ ctrl_cost_weight=1e-3,
+ healthy_reward=1.0,
+ terminate_when_unhealthy=True,
+ healthy_state_range=(-100.0, 100.0),
+ healthy_z_range=(0.7, float("inf")),
+ healthy_angle_range=(-0.2, 0.2),
+ reset_noise_scale=5e-3,
+ exclude_current_positions_from_observation=True,
+ ):
+ utils.EzPickle.__init__(**locals())
+
+ self._forward_reward_weight = forward_reward_weight
+
+ self._ctrl_cost_weight = ctrl_cost_weight
+
+ self._healthy_reward = healthy_reward
+ self._terminate_when_unhealthy = terminate_when_unhealthy
+
+ self._healthy_state_range = healthy_state_range
+ self._healthy_z_range = healthy_z_range
+ self._healthy_angle_range = healthy_angle_range
+
+ self._reset_noise_scale = reset_noise_scale
+
+ self._exclude_current_positions_from_observation = (
+ exclude_current_positions_from_observation
+ )
+
+ mujoco_env.MujocoEnv.__init__(self, xml_file, 4)
+
+ @property
+ def healthy_reward(self):
+ return (
+ float(self.is_healthy or self._terminate_when_unhealthy)
+ * self._healthy_reward
+ )
+
+ def control_cost(self, action):
+ control_cost = self._ctrl_cost_weight * np.sum(np.square(action))
+ return control_cost
+
+ @property
+ def is_healthy(self):
+ z, angle = self.data.qpos[1:3]
+ state = self.state_vector()[2:]
+
+ min_state, max_state = self._healthy_state_range
+ min_z, max_z = self._healthy_z_range
+ min_angle, max_angle = self._healthy_angle_range
+
+ healthy_state = np.all(np.logical_and(min_state < state, state < max_state))
+ healthy_z = min_z < z < max_z
+ healthy_angle = min_angle < angle < max_angle
+
+ is_healthy = all((healthy_state, healthy_z, healthy_angle))
+
+ return is_healthy
+
+ @property
+ def done(self):
+ done = not self.is_healthy if self._terminate_when_unhealthy else False
+ return done
+
+ def _get_obs(self):
+ position = self.data.qpos.flat.copy()
+ velocity = np.clip(self.data.qvel.flat.copy(), -10, 10)
+
+ if self._exclude_current_positions_from_observation:
+ position = position[1:]
+
+ observation = np.concatenate((position, velocity)).ravel()
+ return observation
+
+ def step(self, action):
+ x_position_before = self.data.qpos[0]
+ self.do_simulation(action, self.frame_skip)
+ x_position_after = self.data.qpos[0]
+ x_velocity = (x_position_after - x_position_before) / self.dt
+
+ ctrl_cost = self.control_cost(action)
+
+ forward_reward = self._forward_reward_weight * x_velocity
+ healthy_reward = self.healthy_reward
+
+ rewards = forward_reward + healthy_reward
+ costs = ctrl_cost
+
+ observation = self._get_obs()
+ reward = rewards - costs
+ done = self.done
+ info = {
+ "x_position": x_position_after,
+ "x_velocity": x_velocity,
+ }
+
+ return observation, reward, done, info
+
+ def reset_model(self):
+ noise_low = -self._reset_noise_scale
+ noise_high = self._reset_noise_scale
+
+ qpos = self.init_qpos + self.np_random.uniform(
+ low=noise_low, high=noise_high, size=self.model.nq
+ )
+ qvel = self.init_qvel + self.np_random.uniform(
+ low=noise_low, high=noise_high, size=self.model.nv
+ )
+
+ self.set_state(qpos, qvel)
+
+ observation = self._get_obs()
+ return observation
+
+ def viewer_setup(self):
+ for key, value in DEFAULT_CAMERA_CONFIG.items():
+ if isinstance(value, np.ndarray):
+ getattr(self.viewer.cam, key)[:] = value
+ else:
+ setattr(self.viewer.cam, key, value)
diff --git a/gym/envs/mujoco/humanoid.py b/gym/envs/mujoco/humanoid.py
index d025419937e..d69c08eeb99 100644
--- a/gym/envs/mujoco/humanoid.py
+++ b/gym/envs/mujoco/humanoid.py
@@ -12,7 +12,9 @@ def mass_center(model, sim):
class HumanoidEnv(mujoco_env.MujocoEnv, utils.EzPickle):
def __init__(self):
- mujoco_env.MujocoEnv.__init__(self, "humanoid.xml", 5)
+ mujoco_env.MujocoEnv.__init__(
+ self, "humanoid.xml", 5, mujoco_bindings="mujoco_py"
+ )
utils.EzPickle.__init__(self)
def _get_obs(self):
diff --git a/gym/envs/mujoco/humanoid_v3.py b/gym/envs/mujoco/humanoid_v3.py
index a17887b0ff6..b1daaeaf711 100644
--- a/gym/envs/mujoco/humanoid_v3.py
+++ b/gym/envs/mujoco/humanoid_v3.py
@@ -18,198 +18,6 @@ def mass_center(model, sim):
class HumanoidEnv(mujoco_env.MujocoEnv, utils.EzPickle):
- """
- ### Description
-
- This environment is based on the environment introduced by Tassa, Erez and Todorov
- in ["Synthesis and stabilization of complex behaviors through online trajectory optimization"](https://ieeexplore.ieee.org/document/6386025).
- The 3D bipedal robot is designed to simulate a human. It has a torso (abdomen) with a pair of
- legs and arms. The legs each consist of two links, and so the arms (representing the knees and
- elbows respectively). The goal of the environment is to walk forward as fast as possible without falling over.
-
- ### Action Space
- The action space is a `Box(-1, 1, (17,), float32)`. An action represents the torques applied at the hinge joints.
-
- | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|----------------------|---------------|----------------|---------------------------------------|-------|------|
- | 0 | Torque applied on the hinge in the y-coordinate of the abdomen | -0.4 | 0.4 | hip_1 (front_left_leg) | hinge | torque (N m) |
- | 1 | Torque applied on the hinge in the z-coordinate of the abdomen | -0.4 | 0.4 | angle_1 (front_left_leg) | hinge | torque (N m) |
- | 2 | Torque applied on the hinge in the x-coordinate of the abdomen | -0.4 | 0.4 | hip_2 (front_right_leg) | hinge | torque (N m) |
- | 3 | Torque applied on the rotor between torso/abdomen and the right hip (x-coordinate) | -0.4 | 0.4 | right_hip_x (right_thigh) | hinge | torque (N m) |
- | 4 | Torque applied on the rotor between torso/abdomen and the right hip (z-coordinate) | -0.4 | 0.4 | right_hip_z (right_thigh) | hinge | torque (N m) |
- | 5 | Torque applied on the rotor between torso/abdomen and the right hip (y-coordinate) | -0.4 | 0.4 | right_hip_y (right_thigh) | hinge | torque (N m) |
- | 6 | Torque applied on the rotor between the right hip/thigh and the right shin | -0.4 | 0.4 | right_knee | hinge | torque (N m) |
- | 7 | Torque applied on the rotor between torso/abdomen and the left hip (x-coordinate) | -0.4 | 0.4 | left_hip_x (left_thigh) | hinge | torque (N m) |
- | 8 | Torque applied on the rotor between torso/abdomen and the left hip (z-coordinate) | -0.4 | 0.4 | left_hip_z (left_thigh) | hinge | torque (N m) |
- | 9 | Torque applied on the rotor between torso/abdomen and the left hip (y-coordinate) | -0.4 | 0.4 | left_hip_y (left_thigh) | hinge | torque (N m) |
- | 10 | Torque applied on the rotor between the left hip/thigh and the left shin | -0.4 | 0.4 | left_knee | hinge | torque (N m) |
- | 11 | Torque applied on the rotor between the torso and right upper arm (coordinate -1) | -0.4 | 0.4 | right_shoulder1 | hinge | torque (N m) |
- | 12 | Torque applied on the rotor between the torso and right upper arm (coordinate -2) | -0.4 | 0.4 | right_shoulder2 | hinge | torque (N m) |
- | 13 | Torque applied on the rotor between the right upper arm and right lower arm | -0.4 | 0.4 | right_elbow | hinge | torque (N m) |
- | 14 | Torque applied on the rotor between the torso and left upper arm (coordinate -1) | -0.4 | 0.4 | left_shoulder1 | hinge | torque (N m) |
- | 15 | Torque applied on the rotor between the torso and left upper arm (coordinate -2) | -0.4 | 0.4 | left_shoulder2 | hinge | torque (N m) |
- | 16 | Torque applied on the rotor between the left upper arm and left lower arm | -0.4 | 0.4 | left_elbow | hinge | torque (N m) |
-
- ### Observation Space
-
- Observations consist of positional values of different body parts of the Humanoid,
- followed by the velocities of those individual parts (their derivatives) with all the
- positions ordered before all the velocities.
-
- By default, observations do not include the x- and y-coordinates of the torso. These may
- be included by passing `exclude_current_positions_from_observation=False` during construction.
- In that case, the observation space will have 378 dimensions where the first two dimensions
- represent the x- and y-coordinates of the torso.
- Regardless of whether `exclude_current_positions_from_observation` was set to true or false, the x- and y-coordinates
- will be returned in `info` with keys `"x_position"` and `"y_position"`, respectively.
-
- However, by default, the observation is a `ndarray` with shape `(376,)` where the elements correspond to the following:
-
- | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|---------------------------------------------------------|----------------|-----------------|----------------------------------------|-------|------|
- | 0 | z-coordinate of the torso (centre) | -Inf | Inf | root | free | position (m) |
- | 1 | x-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
- | 2 | y-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
- | 3 | z-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
- | 4 | w-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
- | 5 | z-angle of the abdomen (in lower_waist) | -Inf | Inf | abdomen_z | hinge | angle (rad) |
- | 6 | y-angle of the abdomen (in lower_waist) | -Inf | Inf | abdomen_y | hinge | angle (rad) |
- | 7 | x-angle of the abdomen (in pelvis) | -Inf | Inf | abdomen_x | hinge | angle (rad) |
- | 8 | x-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_x | hinge | angle (rad) |
- | 9 | z-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_z | hinge | angle (rad) |
- | 19 | y-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_y | hinge | angle (rad) |
- | 11 | angle between right hip and the right shin (in right_knee) | -Inf | Inf | right_knee | hinge | angle (rad) |
- | 12 | x-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_x | hinge | angle (rad) |
- | 13 | z-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_z | hinge | angle (rad) |
- | 14 | y-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_y | hinge | angle (rad) |
- | 15 | angle between left hip and the left shin (in left_knee) | -Inf | Inf | left_knee | hinge | angle (rad) |
- | 16 | coordinate-1 (multi-axis) angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder1 | hinge | angle (rad) |
- | 17 | coordinate-2 (multi-axis) angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder2 | hinge | angle (rad) |
- | 18 | angle between right upper arm and right_lower_arm | -Inf | Inf | right_elbow | hinge | angle (rad) |
- | 19 | coordinate-1 (multi-axis) angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder1 | hinge | angle (rad) |
- | 20 | coordinate-2 (multi-axis) angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder2 | hinge | angle (rad) |
- | 21 | angle between left upper arm and left_lower_arm | -Inf | Inf | left_elbow | hinge | angle (rad) |
- | 22 | x-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
- | 23 | y-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
- | 24 | z-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
- | 25 | x-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | anglular velocity (rad/s) |
- | 26 | y-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | anglular velocity (rad/s) |
- | 27 | z-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | anglular velocity (rad/s) |
- | 28 | z-coordinate of angular velocity of the abdomen (in lower_waist) | -Inf | Inf | abdomen_z | hinge | anglular velocity (rad/s) |
- | 29 | y-coordinate of angular velocity of the abdomen (in lower_waist) | -Inf | Inf | abdomen_y | hinge | anglular velocity (rad/s) |
- | 30 | x-coordinate of angular velocity of the abdomen (in pelvis) | -Inf | Inf | abdomen_x | hinge | aanglular velocity (rad/s) |
- | 31 | x-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_x | hinge | anglular velocity (rad/s) |
- | 32 | z-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_z | hinge | anglular velocity (rad/s) |
- | 33 | y-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_y | hinge | anglular velocity (rad/s) |
- | 34 | angular velocity of the angle between right hip and the right shin (in right_knee) | -Inf | Inf | right_knee | hinge | anglular velocity (rad/s) |
- | 35 | x-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_x | hinge | anglular velocity (rad/s) |
- | 36 | z-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_z | hinge | anglular velocity (rad/s) |
- | 37 | y-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_y | hinge | anglular velocity (rad/s) |
- | 38 | angular velocity of the angle between left hip and the left shin (in left_knee) | -Inf | Inf | left_knee | hinge | anglular velocity (rad/s) |
- | 39 | coordinate-1 (multi-axis) of the angular velocity of the angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder1 | hinge | anglular velocity (rad/s) |
- | 40 | coordinate-2 (multi-axis) of the angular velocity of the angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder2 | hinge | anglular velocity (rad/s) |
- | 41 | angular velocity of the angle between right upper arm and right_lower_arm | -Inf | Inf | right_elbow | hinge | anglular velocity (rad/s) |
- | 42 | coordinate-1 (multi-axis) of the angular velocity of the angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder1 | hinge | anglular velocity (rad/s) |
- | 43 | coordinate-2 (multi-axis) of the angular velocity of the angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder2 | hinge | anglular velocity (rad/s) |
- | 44 | angular velocitty of the angle between left upper arm and left_lower_arm | -Inf | Inf | left_elbow | hinge | anglular velocity (rad/s) |
-
- Additionally, after all the positional and velocity based values in the table,
- the observation contains (in order):
- - *cinert:* Mass and inertia of a single rigid body relative to the center of mass
- (this is an intermediate result of transition). It has shape 14*10 (*nbody * 10*)
- and hence adds to another 140 elements in the state space.
- - *cvel:* Center of mass based velocity. It has shape 14 * 6 (*nbody * 6*) and hence
- adds another 84 elements in the state space
- - *qfrc_actuator:* Constraint force generated as the actuator force. This has shape
- `(23,)` *(nv * 1)* and hence adds another 23 elements to the state space.
- - *cfrc_ext:* This is the center of mass based external force on the body. It has shape
- 14 * 6 (*nbody * 6*) and hence adds to another 84 elements in the state space.
- where *nbody* stands for the number of bodies in the robot and *nv* stands for the
- number of degrees of freedom (*= dim(qvel)*)
-
- The (x,y,z) coordinates are translational DOFs while the orientations are rotational
- DOFs expressed as quaternions. One can read more about free joints on the
- [Mujoco Documentation](https://mujoco.readthedocs.io/en/latest/XMLreference.html).
-
- **Note:** There have been reported issues that using a Mujoco-Py version > 2.0
- results in the contact forces always being 0. As such we recommend to use a Mujoco-Py
- version < 2.0 when using the Humanoid environment if you would like to report results
- with contact forces (if contact forces are not used in your experiments, you can use
- version > 2.0).
-
- ### Rewards
- The reward consists of three parts:
- - *healthy_reward*: Every timestep that the humanoid is alive (see section Episode Termination for definition), it gets a reward of fixed value `healthy_reward`
- - *forward_reward*: A reward of walking forward which is measured as *`forward_reward_weight` *
- (average center of mass before action - average center of mass after action)/dt*.
- *dt* is the time between actions and is dependent on the frame_skip parameter
- (default is 5), where the frametime is 0.003 - making the default *dt = 5 * 0.003 = 0.015*.
- This reward would be positive if the humanoid walks forward (in positive x-direction). The calculation
- for the center of mass is defined in the `.py` file for the Humanoid.
- - *ctrl_cost*: A negative reward for penalising the humanoid if it has too
- large of a control force. If there are *nu* actuators/controls, then the control has
- shape `nu x 1`. It is measured as *`ctrl_cost_weight` * sum(control2)*.
- - *contact_cost*: A negative reward for penalising the humanoid if the external
- contact force is too large. It is calculated by clipping
- *`contact_cost_weight` * sum(external contact force2)* to the interval specified by `contact_cost_range`.
-
- The total reward returned is ***reward*** *=* *healthy_reward + forward_reward - ctrl_cost - contact_cost* and `info` will also contain the individual reward terms
-
- ### Starting State
- All observations start in state
- (0.0, 0.0, 1.4, 1.0, 0.0 ... 0.0) with a uniform noise in the range
- of [-`reset_noise_scale`, `reset_noise_scale`] added to the positional and velocity values (values in the table)
- for stochasticity. Note that the initial z coordinate is intentionally
- selected to be high, thereby indicating a standing up humanoid. The initial
- orientation is designed to make it face forward as well.
-
- ### Episode Termination
- The humanoid is said to be unhealthy if the z-position of the torso is no longer contained in the
- closed interval specified by the argument `healthy_z_range`.
-
- If `terminate_when_unhealthy=True` is passed during construction (which is the default),
- the episode terminates when any of the following happens:
-
- 1. The episode duration reaches a 1000 timesteps
- 3. The humanoid is unhealthy
-
- If `terminate_when_unhealthy=False` is passed, the episode is terminated only when 1000 timesteps are exceeded.
-
-
- ### Arguments
-
- No additional arguments are currently supported in v2 and lower.
-
- ```
- env = gym.make('Humanoid-v2')
- ```
-
- v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
-
- ```
- env = gym.make('Humanoid-v3', ctrl_cost_weight=0.1, ....)
- ```
-
- | Parameter | Type | Default |Description |
- |-------------------------|------------|-------------------|-------------------------------|
- | `xml_file` | **str** | `"humanoid.xml"` | Path to a MuJoCo model |
- | `forward_reward_weight` | **float** | `1.25` | Weight for *forward_reward* term (see section on reward) |
- | `ctrl_cost_weight` | **float** | `0.1` | Weight for *ctrl_cost* term (see section on reward) |
- | `contact_cost_weight` | **float** | `5e-7` | Weight for *contact_cost* term (see section on reward) |
- | `healthy_reward` | **float** | `5.0` | Constant reward given if the humanoid is "healthy" after timestep |
- | `terminate_when_unhealthy` | **bool**| `True` | If true, issue a done signal if the z-coordinate of the torso is no longer in the `healthy_z_range` |
- | `healthy_z_range` | **tuple** | `(1.0, 2.0)` | The humanoid is considered healthy if the z-coordinate of the torso is in this range |
- | `reset_noise_scale` | **float** | `1e-2` | Scale of random perturbations of initial position and velocity (see section on Starting State) |
- | `exclude_current_positions_from_observation`| **bool** | `True`| Whether or not to omit the x- and y-coordinates from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behavior in policies |
-
- ### Version History
-
- * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
- * v2: All continuous control environments now use mujoco_py >= 1.50
- * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
- * v0: Initial versions release (1.0.0)
- """
-
def __init__(
self,
xml_file="humanoid.xml",
@@ -239,7 +47,7 @@ def __init__(
exclude_current_positions_from_observation
)
- mujoco_env.MujocoEnv.__init__(self, xml_file, 5)
+ mujoco_env.MujocoEnv.__init__(self, xml_file, 5, mujoco_bindings="mujoco_py")
@property
def healthy_reward(self):
diff --git a/gym/envs/mujoco/humanoid_v4.py b/gym/envs/mujoco/humanoid_v4.py
new file mode 100644
index 00000000000..00fc992edf6
--- /dev/null
+++ b/gym/envs/mujoco/humanoid_v4.py
@@ -0,0 +1,346 @@
+import numpy as np
+
+from gym import utils
+from gym.envs.mujoco import mujoco_env
+
+DEFAULT_CAMERA_CONFIG = {
+ "trackbodyid": 1,
+ "distance": 4.0,
+ "lookat": np.array((0.0, 0.0, 2.0)),
+ "elevation": -20.0,
+}
+
+
+def mass_center(model, data):
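+    # Mass-weighted average of the per-body centre-of-mass positions (xipos),
+    # keeping only the (x, y) components; used to measure the torso's forward progress.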
+ mass = np.expand_dims(model.body_mass, axis=1)
+ xpos = data.xipos
+ return (np.sum(mass * xpos, axis=0) / np.sum(mass))[0:2].copy()
+
+
+class HumanoidEnv(mujoco_env.MujocoEnv, utils.EzPickle):
+ """
+ ### Description
+
+ This environment is based on the environment introduced by Tassa, Erez and Todorov
+ in ["Synthesis and stabilization of complex behaviors through online trajectory optimization"](https://ieeexplore.ieee.org/document/6386025).
+ The 3D bipedal robot is designed to simulate a human. It has a torso (abdomen) with a pair of
+    legs and arms. The legs each consist of two links, and so do the arms (representing the knees and
+    elbows respectively). The goal of the environment is to walk forward as fast as possible without falling over.
+
+ ### Action Space
+    The action space is a `Box(-0.4, 0.4, (17,), float32)`. An action represents the torques applied at the hinge joints.
+
+ | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|----------------------|---------------|----------------|---------------------------------------|-------|------|
+ | 0 | Torque applied on the hinge in the y-coordinate of the abdomen | -0.4 | 0.4 | hip_1 (front_left_leg) | hinge | torque (N m) |
+ | 1 | Torque applied on the hinge in the z-coordinate of the abdomen | -0.4 | 0.4 | angle_1 (front_left_leg) | hinge | torque (N m) |
+ | 2 | Torque applied on the hinge in the x-coordinate of the abdomen | -0.4 | 0.4 | hip_2 (front_right_leg) | hinge | torque (N m) |
+ | 3 | Torque applied on the rotor between torso/abdomen and the right hip (x-coordinate) | -0.4 | 0.4 | right_hip_x (right_thigh) | hinge | torque (N m) |
+ | 4 | Torque applied on the rotor between torso/abdomen and the right hip (z-coordinate) | -0.4 | 0.4 | right_hip_z (right_thigh) | hinge | torque (N m) |
+ | 5 | Torque applied on the rotor between torso/abdomen and the right hip (y-coordinate) | -0.4 | 0.4 | right_hip_y (right_thigh) | hinge | torque (N m) |
+ | 6 | Torque applied on the rotor between the right hip/thigh and the right shin | -0.4 | 0.4 | right_knee | hinge | torque (N m) |
+ | 7 | Torque applied on the rotor between torso/abdomen and the left hip (x-coordinate) | -0.4 | 0.4 | left_hip_x (left_thigh) | hinge | torque (N m) |
+ | 8 | Torque applied on the rotor between torso/abdomen and the left hip (z-coordinate) | -0.4 | 0.4 | left_hip_z (left_thigh) | hinge | torque (N m) |
+ | 9 | Torque applied on the rotor between torso/abdomen and the left hip (y-coordinate) | -0.4 | 0.4 | left_hip_y (left_thigh) | hinge | torque (N m) |
+ | 10 | Torque applied on the rotor between the left hip/thigh and the left shin | -0.4 | 0.4 | left_knee | hinge | torque (N m) |
+ | 11 | Torque applied on the rotor between the torso and right upper arm (coordinate -1) | -0.4 | 0.4 | right_shoulder1 | hinge | torque (N m) |
+ | 12 | Torque applied on the rotor between the torso and right upper arm (coordinate -2) | -0.4 | 0.4 | right_shoulder2 | hinge | torque (N m) |
+ | 13 | Torque applied on the rotor between the right upper arm and right lower arm | -0.4 | 0.4 | right_elbow | hinge | torque (N m) |
+ | 14 | Torque applied on the rotor between the torso and left upper arm (coordinate -1) | -0.4 | 0.4 | left_shoulder1 | hinge | torque (N m) |
+ | 15 | Torque applied on the rotor between the torso and left upper arm (coordinate -2) | -0.4 | 0.4 | left_shoulder2 | hinge | torque (N m) |
+ | 16 | Torque applied on the rotor between the left upper arm and left lower arm | -0.4 | 0.4 | left_elbow | hinge | torque (N m) |
+
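+ As a minimal usage sketch (illustrative only, assuming the `Humanoid-v4` registration added in
+ this version), a random 17-element torque vector can be sampled and applied as follows:
+
+ ```
+ import gym
+
+ env = gym.make('Humanoid-v4')
+ env.reset()
+ action = env.action_space.sample()        # random 17-element torque vector
+ observation, reward, done, info = env.step(action)
+ env.close()
+ ```
+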
+ ### Observation Space
+
+ The state space consists of positional values of different body parts of the Humanoid,
+ followed by the velocities of those individual parts (their derivatives) with all the
+ positions ordered before all the velocities.
+
+ The observation is a `ndarray` with shape `(376,)` where the elements correspond to the following:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|---------------------------------------------------------|----------------|-----------------|----------------------------------------|-------|------|
+ | 0 | x-coordinate of the torso (centre) | -Inf | Inf | root | free | position (m) |
+ | 1 | y-coordinate of the torso (centre) | -Inf | Inf | root | free | position (m) |
+ | 2 | z-coordinate of the torso (centre) | -Inf | Inf | root | free | position (m) |
+ | 3 | x-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
+ | 4 | y-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
+ | 5 | z-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
+ | 6 | w-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
+ | 7 | z-angle of the abdomen (in lower_waist) | -Inf | Inf | abdomen_z | hinge | angle (rad) |
+ | 8 | y-angle of the abdomen (in lower_waist) | -Inf | Inf | abdomen_y | hinge | angle (rad) |
+ | 9 | x-angle of the abdomen (in pelvis) | -Inf | Inf | abdomen_x | hinge | angle (rad) |
+ | 10 | x-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_x | hinge | angle (rad) |
+ | 11 | z-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_z | hinge | angle (rad) |
+ | 12 | y-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_y | hinge | angle (rad) |
+ | 13 | angle between right hip and the right shin (in right_knee) | -Inf | Inf | right_knee | hinge | angle (rad) |
+ | 14 | x-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_x | hinge | angle (rad) |
+ | 15 | z-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_z | hinge | angle (rad) |
+ | 16 | y-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_y | hinge | angle (rad) |
+ | 17 | angle between left hip and the left shin (in left_knee) | -Inf | Inf | left_knee | hinge | angle (rad) |
+ | 18 | coordinate-1 (multi-axis) angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder1 | hinge | angle (rad) |
+ | 19 | coordinate-2 (multi-axis) angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder2 | hinge | angle (rad) |
+ | 20 | angle between right upper arm and right_lower_arm | -Inf | Inf | right_elbow | hinge | angle (rad) |
+ | 21 | coordinate-1 (multi-axis) angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder1 | hinge | angle (rad) |
+ | 22 | coordinate-2 (multi-axis) angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder2 | hinge | angle (rad) |
+ | 23 | angle between left upper arm and left_lower_arm | -Inf | Inf | left_elbow | hinge | angle (rad) |
+ | 24 | x-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
+ | 25 | y-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
+ | 26 | z-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
+ | 27 | x-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | angular velocity (rad/s) |
+ | 28 | y-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | angular velocity (rad/s) |
+ | 29 | z-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | angular velocity (rad/s) |
+ | 30 | z-coordinate of angular velocity of the abdomen (in lower_waist) | -Inf | Inf | abdomen_z | hinge | angular velocity (rad/s) |
+ | 31 | y-coordinate of angular velocity of the abdomen (in lower_waist) | -Inf | Inf | abdomen_y | hinge | angular velocity (rad/s) |
+ | 32 | x-coordinate of angular velocity of the abdomen (in pelvis) | -Inf | Inf | abdomen_x | hinge | angular velocity (rad/s) |
+ | 33 | x-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_x | hinge | angular velocity (rad/s) |
+ | 34 | z-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_z | hinge | angular velocity (rad/s) |
+ | 35 | y-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_y | hinge | angular velocity (rad/s) |
+ | 36 | angular velocity of the angle between right hip and the right shin (in right_knee) | -Inf | Inf | right_knee | hinge | angular velocity (rad/s) |
+ | 37 | x-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_x | hinge | angular velocity (rad/s) |
+ | 38 | z-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_z | hinge | angular velocity (rad/s) |
+ | 39 | y-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_y | hinge | angular velocity (rad/s) |
+ | 40 | angular velocity of the angle between left hip and the left shin (in left_knee) | -Inf | Inf | left_knee | hinge | angular velocity (rad/s) |
+ | 41 | coordinate-1 (multi-axis) of the angular velocity of the angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder1 | hinge | angular velocity (rad/s) |
+ | 42 | coordinate-2 (multi-axis) of the angular velocity of the angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder2 | hinge | angular velocity (rad/s) |
+ | 43 | angular velocity of the angle between right upper arm and right_lower_arm | -Inf | Inf | right_elbow | hinge | angular velocity (rad/s) |
+ | 44 | coordinate-1 (multi-axis) of the angular velocity of the angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder1 | hinge | angular velocity (rad/s) |
+ | 45 | coordinate-2 (multi-axis) of the angular velocity of the angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder2 | hinge | angular velocity (rad/s) |
+ | 46 | angular velocity of the angle between left upper arm and left_lower_arm | -Inf | Inf | left_elbow | hinge | angular velocity (rad/s) |
+
+ Additionally, after all the positional and velocity based values in the table,
+ the state_space consists of (in order):
+ - *cinert:* Mass and inertia of a single rigid body relative to the center of mass
+ (this is an intermediate result of transition). It has shape 14*10 (*nbody * 10*)
+ and hence adds another 140 elements to the state space.
+ - *cvel:* Center of mass based velocity. It has shape 14 * 6 (*nbody * 6*) and hence
+ adds another 84 elements to the state space.
+ - *qfrc_actuator:* Constraint force generated as the actuator force. This has shape
+ `(23,)` *(nv * 1)* and hence adds another 23 elements to the state space.
+ - *cfrc_ext:* This is the center of mass based external force on the body. It has shape
+ 14 * 6 (*nbody * 6*) and hence adds another 84 elements to the state space.
+ where *nbody* stands for the number of bodies in the robot and *nv* stands for the
+ number of degrees of freedom (*= dim(qvel)*)
+
+ The (x,y,z) coordinates are translational DOFs while the orientations are rotational
+ DOFs expressed as quaternions. One can read more about free joints on the
+ [Mujoco Documentation](https://mujoco.readthedocs.io/en/latest/XMLreference.html).
+
+ **Note:** There are 47 elements in the table above - giving rise to `(378,)`
+ elements in the state space. In practice (and Gym implementation), the first two
+ positional elements are omitted from the state space since the reward function is
+ calculated based on the x-coordinate value. This value is hidden from the algorithm,
+ which in turn has to develop an abstract understanding of it from the observed rewards.
+ Therefore, the observation space has shape `(376,)` instead of `(378,)`, and the first two
+ rows of the table above (the x- and y-coordinates of the torso) are not part of the observation.
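+
+ A short sketch of what this means in practice (illustrative only; the `Humanoid-v4`
+ registration is assumed here):
+
+ ```
+ import gym
+
+ env = gym.make('Humanoid-v4')
+ observation = env.reset()
+ # 376 = 47 table elements - 2 excluded positions + 140 (cinert) + 84 (cvel)
+ #       + 23 (qfrc_actuator) + 84 (cfrc_ext)
+ assert observation.shape == (376,)
+ ```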
+
+ **Note:** The Humanoid-v4 environment no longer has the contact-forces issue described below.
+ For Humanoid versions prior to v4, there have been reported issues where using
+ a Mujoco-Py version > 2.0 results in the contact forces always being 0. As such, we recommend
+ using a Mujoco-Py version < 2.0 with those environment versions if you would like to report
+ results with contact forces (if contact forces are not used in your experiments, you can use
+ version > 2.0).
+
+ ### Rewards
+ The reward consists of four parts:
+ - *alive_bonus*: Every timestep that the humanoid is alive, it gets a reward of 5.
+ - *lin_vel_cost*: A reward for walking forward, which is measured as *1.25 *
+ (average center of mass after action - average center of mass before action)/dt*.
+ *dt* is the time between actions and is dependent on the frame_skip parameter
+ (default is 5), where the *dt* for one frame is 0.003 - making the default *dt = 5 * 0.003 = 0.015*.
+ This reward is positive when the humanoid walks forward (in the positive x-direction), as desired.
+ The calculation for the center of mass is defined in the `.py` file for the Humanoid.
+ - *quad_ctrl_cost*: A negative reward for penalising the humanoid if it has too
+ large of a control force. If there are *nu* actuators/controls, then the control has
+ shape `nu x 1`. It is measured as *0.1 * sum(control^2)*.
+ - *quad_impact_cost*: A negative reward for penalising the humanoid if the external
+ contact force is too large. It is calculated as
+ *min(0.5 * 0.000001 * sum(external contact force^2), 10)*.
+
+ The total reward returned is ***reward*** *=* *alive_bonus + lin_vel_cost - quad_ctrl_cost - quad_impact_cost*
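+
+ As a rough, illustrative sketch of how these terms combine with the default weights (the
+ authoritative computation is in `step` below; the stand-in values are arbitrary):
+
+ ```
+ import numpy as np
+
+ x_velocity = 0.8                 # stand-in forward velocity of the center of mass
+ ctrl = np.full(17, 0.2)          # stand-in applied torques
+ cfrc_ext = np.zeros((14, 6))     # stand-in external contact forces
+
+ alive_bonus = 5.0
+ lin_vel_cost = 1.25 * x_velocity
+ quad_ctrl_cost = 0.1 * np.sum(np.square(ctrl))
+ quad_impact_cost = min(0.5 * 0.000001 * np.sum(np.square(cfrc_ext)), 10.0)
+ reward = alive_bonus + lin_vel_cost - quad_ctrl_cost - quad_impact_cost  # 5.932
+ ```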
+
+ ### Starting State
+ All observations start in state
+ (0.0, 0.0, 1.4, 1.0, 0.0 ... 0.0) with a uniform noise in the range
+ of [-0.01, 0.01] added to the positional and velocity values (values in the table)
+ for stochasticity. Note that the initial z coordinate is intentionally
+ selected to be high, thereby indicating a humanoid standing upright. The initial
+ orientation is designed to make it face forward as well.
+
+ ### Episode Termination
+ The episode terminates when any of the following happens:
+
+ 1. The episode duration reaches 1000 timesteps
+ 2. Any of the state space values is no longer finite
+ 3. The z-coordinate of the torso (index 0 in state space OR index 2 in the table) is **not** in the range `[1.0, 2.0]` (the humanoid has fallen or is about to fall beyond recovery).
+
+ ### Arguments
+
+ No additional arguments are currently supported (in v2 and lower), but
+ modifications can be made to the XML file in the assets folder (or by changing
+ the path to a modified XML file in another folder).
+
+ ```
+ env = gym.make('Humanoid-v2')
+ ```
+
+ v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
+
+ ```
+ env = gym.make('Humanoid-v3', ctrl_cost_weight=0.1, ...)
+ ```
+
+ ### Version History
+
+ * v4: All mujoco environments now use the mujoco bindings in mujoco>=2.1.3
+ * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
+ * v2: All continuous control environments now use mujoco_py >= 1.50
+ * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
+ * v0: Initial versions release (1.0.0)
+
+ """
+
+ def __init__(
+ self,
+ xml_file="humanoid.xml",
+ forward_reward_weight=1.25,
+ ctrl_cost_weight=0.1,
+ contact_cost_weight=5e-7,
+ contact_cost_range=(-np.inf, 10.0),
+ healthy_reward=5.0,
+ terminate_when_unhealthy=True,
+ healthy_z_range=(1.0, 2.0),
+ reset_noise_scale=1e-2,
+ exclude_current_positions_from_observation=True,
+ ):
+ utils.EzPickle.__init__(**locals())
+
+ self._forward_reward_weight = forward_reward_weight
+ self._ctrl_cost_weight = ctrl_cost_weight
+ self._contact_cost_weight = contact_cost_weight
+ self._contact_cost_range = contact_cost_range
+ self._healthy_reward = healthy_reward
+ self._terminate_when_unhealthy = terminate_when_unhealthy
+ self._healthy_z_range = healthy_z_range
+
+ self._reset_noise_scale = reset_noise_scale
+
+ self._exclude_current_positions_from_observation = (
+ exclude_current_positions_from_observation
+ )
+
+ mujoco_env.MujocoEnv.__init__(self, xml_file, 5)
+
+ @property
+ def healthy_reward(self):
+ return (
+ float(self.is_healthy or self._terminate_when_unhealthy)
+ * self._healthy_reward
+ )
+
+ def control_cost(self, action):
+ control_cost = self._ctrl_cost_weight * np.sum(np.square(self.data.ctrl))
+ return control_cost
+
+ @property
+ def contact_cost(self):
+ contact_forces = self.data.cfrc_ext
+ contact_cost = self._contact_cost_weight * np.sum(np.square(contact_forces))
+ min_cost, max_cost = self._contact_cost_range
+ contact_cost = np.clip(contact_cost, min_cost, max_cost)
+ return contact_cost
+
+ @property
+ def is_healthy(self):
+ min_z, max_z = self._healthy_z_range
+ is_healthy = min_z < self.data.qpos[2] < max_z
+
+ return is_healthy
+
+ @property
+ def done(self):
+ done = (not self.is_healthy) if self._terminate_when_unhealthy else False
+ return done
+
+ def _get_obs(self):
+ position = self.data.qpos.flat.copy()
+ velocity = self.data.qvel.flat.copy()
+
+ com_inertia = self.data.cinert.flat.copy()
+ com_velocity = self.data.cvel.flat.copy()
+
+ actuator_forces = self.data.qfrc_actuator.flat.copy()
+ external_contact_forces = self.data.cfrc_ext.flat.copy()
+
+ if self._exclude_current_positions_from_observation:
+ position = position[2:]
+
+ return np.concatenate(
+ (
+ position,
+ velocity,
+ com_inertia,
+ com_velocity,
+ actuator_forces,
+ external_contact_forces,
+ )
+ )
+
+ def step(self, action):
+ xy_position_before = mass_center(self.model, self.data)
+ self.do_simulation(action, self.frame_skip)
+ xy_position_after = mass_center(self.model, self.data)
+
+ xy_velocity = (xy_position_after - xy_position_before) / self.dt
+ x_velocity, y_velocity = xy_velocity
+
+ ctrl_cost = self.control_cost(action)
+ contact_cost = self.contact_cost
+
+ forward_reward = self._forward_reward_weight * x_velocity
+ healthy_reward = self.healthy_reward
+
+ rewards = forward_reward + healthy_reward
+ costs = ctrl_cost + contact_cost
+
+ observation = self._get_obs()
+ reward = rewards - costs
+ done = self.done
+ info = {
+ "reward_linvel": forward_reward,
+ "reward_quadctrl": -ctrl_cost,
+ "reward_alive": healthy_reward,
+ "reward_impact": -contact_cost,
+ "x_position": xy_position_after[0],
+ "y_position": xy_position_after[1],
+ "distance_from_origin": np.linalg.norm(xy_position_after, ord=2),
+ "x_velocity": x_velocity,
+ "y_velocity": y_velocity,
+ "forward_reward": forward_reward,
+ }
+
+ return observation, reward, done, info
+
+ def reset_model(self):
+ noise_low = -self._reset_noise_scale
+ noise_high = self._reset_noise_scale
+
+ qpos = self.init_qpos + self.np_random.uniform(
+ low=noise_low, high=noise_high, size=self.model.nq
+ )
+ qvel = self.init_qvel + self.np_random.uniform(
+ low=noise_low, high=noise_high, size=self.model.nv
+ )
+ self.set_state(qpos, qvel)
+
+ observation = self._get_obs()
+ return observation
+
+ def viewer_setup(self):
+ for key, value in DEFAULT_CAMERA_CONFIG.items():
+ if isinstance(value, np.ndarray):
+ getattr(self.viewer.cam, key)[:] = value
+ else:
+ setattr(self.viewer.cam, key, value)
diff --git a/gym/envs/mujoco/humanoidstandup.py b/gym/envs/mujoco/humanoidstandup.py
index d1a6b47427d..217917b9b09 100644
--- a/gym/envs/mujoco/humanoidstandup.py
+++ b/gym/envs/mujoco/humanoidstandup.py
@@ -5,178 +5,10 @@
class HumanoidStandupEnv(mujoco_env.MujocoEnv, utils.EzPickle):
- """
- ### Description
-
- This environment is based on the environment introduced by Tassa, Erez and Todorov
- in ["Synthesis and stabilization of complex behaviors through online trajectory optimization"](https://ieeexplore.ieee.org/document/6386025).
- The 3D bipedal robot is designed to simulate a human. It has a torso (abdomen) with a
- pair of legs and arms. The legs each consist of two links, and so the arms (representing the
- knees and elbows respectively). The environment starts with the humanoid layiing on the ground,
- and then the goal of the environment is to make the humanoid standup and then keep it standing
- by applying torques on the various hinges.
-
- ### Action Space
- The agent take a 17-element vector for actions.
-
- The action space is a continuous `(action, ...)` all in `[-1, 1]`, where `action`
- represents the numerical torques applied at the hinge joints.
-
- | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|----------------------|---------------|----------------|---------------------------------------|-------|------|
- | 0 | Torque applied on the hinge in the y-coordinate of the abdomen | -0.4 | 0.4 | hip_1 (front_left_leg) | hinge | torque (N m) |
- | 1 | Torque applied on the hinge in the z-coordinate of the abdomen | -0.4 | 0.4 | angle_1 (front_left_leg) | hinge | torque (N m) |
- | 2 | Torque applied on the hinge in the x-coordinate of the abdomen | -0.4 | 0.4 | hip_2 (front_right_leg) | hinge | torque (N m) |
- | 3 | Torque applied on the rotor between torso/abdomen and the right hip (x-coordinate) | -0.4 | 0.4 | right_hip_x (right_thigh) | hinge | torque (N m) |
- | 4 | Torque applied on the rotor between torso/abdomen and the right hip (z-coordinate) | -0.4 | 0.4 | right_hip_z (right_thigh) | hinge | torque (N m) |
- | 5 | Torque applied on the rotor between torso/abdomen and the right hip (y-coordinate) | -0.4 | 0.4 | right_hip_y (right_thigh) | hinge | torque (N m) |
- | 6 | Torque applied on the rotor between the right hip/thigh and the right shin | -0.4 | 0.4 | right_knee | hinge | torque (N m) |
- | 7 | Torque applied on the rotor between torso/abdomen and the left hip (x-coordinate) | -0.4 | 0.4 | left_hip_x (left_thigh) | hinge | torque (N m) |
- | 8 | Torque applied on the rotor between torso/abdomen and the left hip (z-coordinate) | -0.4 | 0.4 | left_hip_z (left_thigh) | hinge | torque (N m) |
- | 9 | Torque applied on the rotor between torso/abdomen and the left hip (y-coordinate) | -0.4 | 0.4 | left_hip_y (left_thigh) | hinge | torque (N m) |
- | 10 | Torque applied on the rotor between the left hip/thigh and the left shin | -0.4 | 0.4 | left_knee | hinge | torque (N m) |
- | 11 | Torque applied on the rotor between the torso and right upper arm (coordinate -1) | -0.4 | 0.4 | right_shoulder1 | hinge | torque (N m) |
- | 12 | Torque applied on the rotor between the torso and right upper arm (coordinate -2) | -0.4 | 0.4 | right_shoulder2 | hinge | torque (N m) |
- | 13 | Torque applied on the rotor between the right upper arm and right lower arm | -0.4 | 0.4 | right_elbow | hinge | torque (N m) |
- | 14 | Torque applied on the rotor between the torso and left upper arm (coordinate -1) | -0.4 | 0.4 | left_shoulder1 | hinge | torque (N m) |
- | 15 | Torque applied on the rotor between the torso and left upper arm (coordinate -2) | -0.4 | 0.4 | left_shoulder2 | hinge | torque (N m) |
- | 16 | Torque applied on the rotor between the left upper arm and left lower arm | -0.4 | 0.4 | left_elbow | hinge | torque (N m) |
-
- ### Observation Space
-
- The state space consists of positional values of different body parts of the Humanoid,
- followed by the velocities of those individual parts (their derivatives) with all the positions ordered before all the velocities.
-
- **Note:** The x- and y-coordinates of the torso are being omitted to produce position-agnostic behavior in policies
-
- The observation is a `ndarray` with shape `(376,)` where the elements correspond to the following:
-
- | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|---------------------------------------------------------|----------------|-----------------|----------------------------------------|-------|------|
- | 0 | z-coordinate of the torso (centre) | -Inf | Inf | root | free | position (m) |
- | 1 | x-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
- | 2 | y-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
- | 3 | z-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
- | 4 | w-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
- | 5 | z-angle of the abdomen (in lower_waist) | -Inf | Inf | abdomen_z | hinge | angle (rad) |
- | 6 | y-angle of the abdomen (in lower_waist) | -Inf | Inf | abdomen_y | hinge | angle (rad) |
- | 7 | x-angle of the abdomen (in pelvis) | -Inf | Inf | abdomen_x | hinge | angle (rad) |
- | 8 | x-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_x | hinge | angle (rad) |
- | 9 | z-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_z | hinge | angle (rad) |
- | 10 | y-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_y | hinge | angle (rad) |
- | 11 | angle between right hip and the right shin (in right_knee) | -Inf | Inf | right_knee | hinge | angle (rad) |
- | 12 | x-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_x | hinge | angle (rad) |
- | 13 | z-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_z | hinge | angle (rad) |
- | 14 | y-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_y | hinge | angle (rad) |
- | 15 | angle between left hip and the left shin (in left_knee) | -Inf | Inf | left_knee | hinge | angle (rad) |
- | 16 | coordinate-1 (multi-axis) angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder1 | hinge | angle (rad) |
- | 17 | coordinate-2 (multi-axis) angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder2 | hinge | angle (rad) |
- | 18 | angle between right upper arm and right_lower_arm | -Inf | Inf | right_elbow | hinge | angle (rad) |
- | 19 | coordinate-1 (multi-axis) angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder1 | hinge | angle (rad) |
- | 20 | coordinate-2 (multi-axis) angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder2 | hinge | angle (rad) |
- | 21 | angle between left upper arm and left_lower_arm | -Inf | Inf | left_elbow | hinge | angle (rad) |
- | 22 | x-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
- | 23 | y-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
- | 24 | z-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
- | 25 | x-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | anglular velocity (rad/s) |
- | 26 | y-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | anglular velocity (rad/s) |
- | 27 | z-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | anglular velocity (rad/s) |
- | 28 | z-coordinate of angular velocity of the abdomen (in lower_waist) | -Inf | Inf | abdomen_z | hinge | anglular velocity (rad/s) |
- | 29 | y-coordinate of angular velocity of the abdomen (in lower_waist) | -Inf | Inf | abdomen_y | hinge | anglular velocity (rad/s) |
- | 30 | x-coordinate of angular velocity of the abdomen (in pelvis) | -Inf | Inf | abdomen_x | hinge | aanglular velocity (rad/s) |
- | 31 | x-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_x | hinge | anglular velocity (rad/s) |
- | 32 | z-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_z | hinge | anglular velocity (rad/s) |
- | 33 | y-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_y | hinge | anglular velocity (rad/s) |
- | 35 | angular velocity of the angle between right hip and the right shin (in right_knee) | -Inf | Inf | right_knee | hinge | anglular velocity (rad/s) |
- | 36 | x-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_x | hinge | anglular velocity (rad/s) |
- | 37 | z-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_z | hinge | anglular velocity (rad/s) |
- | 38 | y-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_y | hinge | anglular velocity (rad/s) |
- | 39 | angular velocity of the angle between left hip and the left shin (in left_knee) | -Inf | Inf | left_knee | hinge | anglular velocity (rad/s) |
- | 40 | coordinate-1 (multi-axis) of the angular velocity of the angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder1 | hinge | anglular velocity (rad/s) |
- | 41 | coordinate-2 (multi-axis) of the angular velocity of the angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder2 | hinge | anglular velocity (rad/s) |
- | 42 | angular velocity of the angle between right upper arm and right_lower_arm | -Inf | Inf | right_elbow | hinge | anglular velocity (rad/s) |
- | 43 | coordinate-1 (multi-axis) of the angular velocity of the angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder1 | hinge | anglular velocity (rad/s) |
- | 44 | coordinate-2 (multi-axis) of the angular velocity of the angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder2 | hinge | anglular velocity (rad/s) |
- | 45 | angular velocitty of the angle between left upper arm and left_lower_arm | -Inf | Inf | left_elbow | hinge | anglular velocity (rad/s) |
-
-
- Additionally, after all the positional and velocity based values in the table,
- the state_space consists of (in order):
- - *cinert:* Mass and inertia of a single rigid body relative to the center of mass
- (this is an intermediate result of transition). It has shape 14*10 (*nbody * 10*)
- and hence adds to another 140 elements in the state space.
- - *cvel:* Center of mass based velocity. It has shape 14 * 6 (*nbody * 6*) and hence
- adds another 84 elements in the state space
- - *qfrc_actuator:* Constraint force generated as the actuator force. This has shape
- `(23,)` *(nv * 1)* and hence adds another 23 elements to the state space.
- - *cfrc_ext:* This is the center of mass based external force on the body. It has shape
- 14 * 6 (*nbody * 6*) and hence adds to another 84 elements in the state space.
- where *nbody* stands for the number of bodies in the robot and *nv* stands for the number
- of degrees of freedom (*= dim(qvel)*)
-
- The (x,y,z) coordinates are translational DOFs while the orientations are rotational
- DOFs expressed as quaternions. One can read more about free joints on the
- [Mujoco Documentation](https://mujoco.readthedocs.io/en/latest/XMLreference.html).
-
- **Note:** There have been reported issues that using a Mujoco-Py version > 2.0 results
- in the contact forces always being 0. As such we recommend to use a Mujoco-Py version < 2.0
- when using the Humanoid environment if you would like to report results with contact forces
- (if contact forces are not used in your experiments, you can use version > 2.0).
-
- ### Rewards
- The reward consists of three parts:
- - *uph_cost*: A reward for moving upward (in an attempt to stand up). This is not a relative
- reward which meaures how much upward it has moved from the last timestep, but it is an
- absolute reward which measures how much upward the Humanoid has moved overall. It is
- measured as *(z coordinate after action - 0)/(atomic timestep)*, where *z coordinate after
- action* is index 0 in the state/index 2 in the table, and *atomic timestep* is the time for
- one frame of movement even though the simulation has a framerate of 5 (done in order to inflate
- rewards a little for fassteer learning).
- - *quad_ctrl_cost*: A negative reward for penalising the humanoid if it has too large of
- a control force. If there are *nu* actuators/controls, then the control has shape `nu x 1`.
- It is measured as *0.1 **x** sum(control2)*.
- - *quad_impact_cost*: A negative reward for penalising the humanoid if the external
- contact force is too large. It is calculated as *min(0.5 * 0.000001 * sum(external
- contact force2), 10)*.
-
- The total reward returned is ***reward*** *=* *uph_cost + 1 - quad_ctrl_cost - quad_impact_cost*
-
- ### Starting State
- All observations start in state
- (0.0, 0.0, 0.105, 1.0, 0.0 ... 0.0) with a uniform noise in the range of
- [-0.01, 0.01] added to the positional and velocity values (values in the table)
- for stochasticity. Note that the initial z coordinate is intentionally selected
- to be low, thereby indicating a laying down humanoid. The initial orientation is
- designed to make it face forward as well.
-
- ### Episode Termination
- The episode terminates when any of the following happens:
-
- 1. The episode duration reaches a 1000 timesteps
- 2. Any of the state space values is no longer finite
-
- ### Arguments
-
- No additional arguments are currently supported.
-
- ```
- env = gym.make('HumanoidStandup-v2')
- ```
-
- There is no v3 for HumanoidStandup, unlike the robot environments where a v3 and
- beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
-
-
- ### Version History
-
- * v2: All continuous control environments now use mujoco_py >= 1.50
- * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
- * v0: Initial versions release (1.0.0)
-
- """
-
def __init__(self):
- mujoco_env.MujocoEnv.__init__(self, "humanoidstandup.xml", 5)
+ mujoco_env.MujocoEnv.__init__(
+ self, "humanoidstandup.xml", 5, mujoco_bindings="mujoco_py"
+ )
utils.EzPickle.__init__(self)
def _get_obs(self):
diff --git a/gym/envs/mujoco/humanoidstandup_v4.py b/gym/envs/mujoco/humanoidstandup_v4.py
new file mode 100644
index 00000000000..3d28dec9f7a
--- /dev/null
+++ b/gym/envs/mujoco/humanoidstandup_v4.py
@@ -0,0 +1,247 @@
+import numpy as np
+
+from gym import utils
+from gym.envs.mujoco import mujoco_env
+
+
+class HumanoidStandupEnv(mujoco_env.MujocoEnv, utils.EzPickle):
+ """
+ ### Description
+
+ This environment is based on the environment introduced by Tassa, Erez and Todorov
+ in ["Synthesis and stabilization of complex behaviors through online trajectory optimization"](https://ieeexplore.ieee.org/document/6386025).
+ The 3D bipedal robot is designed to simulate a human. It has a torso (abdomen) with a
+ pair of legs and arms. The legs each consist of two links, and so do the arms (representing the
+ knees and elbows respectively). The environment starts with the humanoid lying on the ground,
+ and then the goal of the environment is to make the humanoid stand up and then keep it standing
+ by applying torques to the various hinges.
+
+ ### Action Space
+ The agent takes a 17-element vector for actions.
+
+ The action space is a continuous `(action, ...)` all in `[-0.4, 0.4]`, where `action`
+ represents the numerical torques applied at the hinge joints.
+
+ | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|----------------------|---------------|----------------|---------------------------------------|-------|------|
+ | 0 | Torque applied on the hinge in the y-coordinate of the abdomen | -0.4 | 0.4 | hip_1 (front_left_leg) | hinge | torque (N m) |
+ | 1 | Torque applied on the hinge in the z-coordinate of the abdomen | -0.4 | 0.4 | angle_1 (front_left_leg) | hinge | torque (N m) |
+ | 2 | Torque applied on the hinge in the x-coordinate of the abdomen | -0.4 | 0.4 | hip_2 (front_right_leg) | hinge | torque (N m) |
+ | 3 | Torque applied on the rotor between torso/abdomen and the right hip (x-coordinate) | -0.4 | 0.4 | right_hip_x (right_thigh) | hinge | torque (N m) |
+ | 4 | Torque applied on the rotor between torso/abdomen and the right hip (z-coordinate) | -0.4 | 0.4 | right_hip_z (right_thigh) | hinge | torque (N m) |
+ | 5 | Torque applied on the rotor between torso/abdomen and the right hip (y-coordinate) | -0.4 | 0.4 | right_hip_y (right_thigh) | hinge | torque (N m) |
+ | 6 | Torque applied on the rotor between the right hip/thigh and the right shin | -0.4 | 0.4 | right_knee | hinge | torque (N m) |
+ | 7 | Torque applied on the rotor between torso/abdomen and the left hip (x-coordinate) | -0.4 | 0.4 | left_hip_x (left_thigh) | hinge | torque (N m) |
+ | 8 | Torque applied on the rotor between torso/abdomen and the left hip (z-coordinate) | -0.4 | 0.4 | left_hip_z (left_thigh) | hinge | torque (N m) |
+ | 9 | Torque applied on the rotor between torso/abdomen and the left hip (y-coordinate) | -0.4 | 0.4 | left_hip_y (left_thigh) | hinge | torque (N m) |
+ | 10 | Torque applied on the rotor between the left hip/thigh and the left shin | -0.4 | 0.4 | left_knee | hinge | torque (N m) |
+ | 11 | Torque applied on the rotor between the torso and right upper arm (coordinate -1) | -0.4 | 0.4 | right_shoulder1 | hinge | torque (N m) |
+ | 12 | Torque applied on the rotor between the torso and right upper arm (coordinate -2) | -0.4 | 0.4 | right_shoulder2 | hinge | torque (N m) |
+ | 13 | Torque applied on the rotor between the right upper arm and right lower arm | -0.4 | 0.4 | right_elbow | hinge | torque (N m) |
+ | 14 | Torque applied on the rotor between the torso and left upper arm (coordinate -1) | -0.4 | 0.4 | left_shoulder1 | hinge | torque (N m) |
+ | 15 | Torque applied on the rotor between the torso and left upper arm (coordinate -2) | -0.4 | 0.4 | left_shoulder2 | hinge | torque (N m) |
+ | 16 | Torque applied on the rotor between the left upper arm and left lower arm | -0.4 | 0.4 | left_elbow | hinge | torque (N m) |
+
+ ### Observation Space
+
+ The state space consists of positional values of different body parts of the Humanoid,
+ followed by the velocities of those individual parts (their derivatives) with all the positions ordered before all the velocities.
+
+ The observation is a `ndarray` with shape `(376,)` where the elements correspond to the following:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|---------------------------------------------------------|----------------|-----------------|----------------------------------------|-------|------|
+ | 0 | x-coordinate of the torso (centre) | -Inf | Inf | root | free | position (m) |
+ | 1 | y-coordinate of the torso (centre) | -Inf | Inf | root | free | position (m) |
+ | 2 | z-coordinate of the torso (centre) | -Inf | Inf | root | free | position (m) |
+ | 3 | x-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
+ | 4 | y-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
+ | 5 | z-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
+ | 6 | w-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
+ | 7 | z-angle of the abdomen (in lower_waist) | -Inf | Inf | abdomen_z | hinge | angle (rad) |
+ | 8 | y-angle of the abdomen (in lower_waist) | -Inf | Inf | abdomen_y | hinge | angle (rad) |
+ | 9 | x-angle of the abdomen (in pelvis) | -Inf | Inf | abdomen_x | hinge | angle (rad) |
+ | 10 | x-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_x | hinge | angle (rad) |
+ | 11 | z-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_z | hinge | angle (rad) |
+ | 12 | y-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_y | hinge | angle (rad) |
+ | 13 | angle between right hip and the right shin (in right_knee) | -Inf | Inf | right_knee | hinge | angle (rad) |
+ | 14 | x-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_x | hinge | angle (rad) |
+ | 15 | z-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_z | hinge | angle (rad) |
+ | 16 | y-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_y | hinge | angle (rad) |
+ | 17 | angle between left hip and the left shin (in left_knee) | -Inf | Inf | left_knee | hinge | angle (rad) |
+ | 18 | coordinate-1 (multi-axis) angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder1 | hinge | angle (rad) |
+ | 19 | coordinate-2 (multi-axis) angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder2 | hinge | angle (rad) |
+ | 20 | angle between right upper arm and right_lower_arm | -Inf | Inf | right_elbow | hinge | angle (rad) |
+ | 21 | coordinate-1 (multi-axis) angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder1 | hinge | angle (rad) |
+ | 22 | coordinate-2 (multi-axis) angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder2 | hinge | angle (rad) |
+ | 23 | angle between left upper arm and left_lower_arm | -Inf | Inf | left_elbow | hinge | angle (rad) |
+ | 24 | x-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
+ | 25 | y-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
+ | 26 | z-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
+ | 27 | x-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | angular velocity (rad/s) |
+ | 28 | y-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | angular velocity (rad/s) |
+ | 29 | z-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | angular velocity (rad/s) |
+ | 30 | z-coordinate of angular velocity of the abdomen (in lower_waist) | -Inf | Inf | abdomen_z | hinge | angular velocity (rad/s) |
+ | 31 | y-coordinate of angular velocity of the abdomen (in lower_waist) | -Inf | Inf | abdomen_y | hinge | angular velocity (rad/s) |
+ | 32 | x-coordinate of angular velocity of the abdomen (in pelvis) | -Inf | Inf | abdomen_x | hinge | angular velocity (rad/s) |
+ | 33 | x-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_x | hinge | angular velocity (rad/s) |
+ | 34 | z-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_z | hinge | angular velocity (rad/s) |
+ | 35 | y-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_y | hinge | angular velocity (rad/s) |
+ | 36 | angular velocity of the angle between right hip and the right shin (in right_knee) | -Inf | Inf | right_knee | hinge | angular velocity (rad/s) |
+ | 37 | x-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_x | hinge | angular velocity (rad/s) |
+ | 38 | z-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_z | hinge | angular velocity (rad/s) |
+ | 39 | y-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_y | hinge | angular velocity (rad/s) |
+ | 40 | angular velocity of the angle between left hip and the left shin (in left_knee) | -Inf | Inf | left_knee | hinge | angular velocity (rad/s) |
+ | 41 | coordinate-1 (multi-axis) of the angular velocity of the angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder1 | hinge | angular velocity (rad/s) |
+ | 42 | coordinate-2 (multi-axis) of the angular velocity of the angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder2 | hinge | angular velocity (rad/s) |
+ | 43 | angular velocity of the angle between right upper arm and right_lower_arm | -Inf | Inf | right_elbow | hinge | angular velocity (rad/s) |
+ | 44 | coordinate-1 (multi-axis) of the angular velocity of the angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder1 | hinge | angular velocity (rad/s) |
+ | 45 | coordinate-2 (multi-axis) of the angular velocity of the angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder2 | hinge | angular velocity (rad/s) |
+ | 46 | angular velocity of the angle between left upper arm and left_lower_arm | -Inf | Inf | left_elbow | hinge | angular velocity (rad/s) |
+
+
+ Additionally, after all the positional and velocity based values in the table,
+ the state_space consists of (in order):
+ - *cinert:* Mass and inertia of a single rigid body relative to the center of mass
+ (this is an intermediate result of transition). It has shape 14*10 (*nbody * 10*)
+ and hence adds another 140 elements to the state space.
+ - *cvel:* Center of mass based velocity. It has shape 14 * 6 (*nbody * 6*) and hence
+ adds another 84 elements to the state space.
+ - *qfrc_actuator:* Constraint force generated as the actuator force. This has shape
+ `(23,)` *(nv * 1)* and hence adds another 23 elements to the state space.
+ - *cfrc_ext:* This is the center of mass based external force on the body. It has shape
+ 14 * 6 (*nbody * 6*) and hence adds another 84 elements to the state space.
+ where *nbody* stands for the number of bodies in the robot and *nv* stands for the number
+ of degrees of freedom (*= dim(qvel)*)
+
+ The (x,y,z) coordinates are translational DOFs while the orientations are rotational
+ DOFs expressed as quaternions. One can read more about free joints on the
+ [Mujoco Documentation](https://mujoco.readthedocs.io/en/latest/XMLreference.html).
+
+ **Note:** There are 47 elements in the table above - giving rise to `(378,)`
+ elements in the state space. In practice (and Gym implementation), the first two
+ positional elements are omitted from the state space since the reward function is
+ calculated based on the x-coordinate value. This value is hidden from the algorithm,
+ which in turn has to develop an abstract understanding of it from the observed rewards.
+ Therefore, the observation space has shape `(376,)` instead of `(378,)`, and the first two rows of the table above (the x- and y-coordinates of the torso) are not part of the observation.
+
+ **Note:** The HumanoidStandup-v4 environment no longer has the contact-forces issue described below.
+ For HumanoidStandup versions prior to v4, there have been reported issues where using a Mujoco-Py version > 2.0 results
+ in the contact forces always being 0. As such, we recommend using a Mujoco-Py version < 2.0
+ with those versions of the environment if you would like to report results with contact forces
+ (if contact forces are not used in your experiments, you can use version > 2.0).
+
+ ### Rewards
+ The reward consists of three parts:
+ - *uph_cost*: A reward for moving upward (in an attempt to stand up). This is not a relative
+ reward measuring how far upward it has moved from the last timestep, but an
+ absolute reward measuring how far upward the Humanoid has moved overall. It is
+ measured as *(z coordinate after action - 0)/(atomic timestep)*, where *z coordinate after
+ action* is index 0 in the state/index 2 in the table, and *atomic timestep* is the time for
+ one frame of movement even though the simulation uses a frame_skip of 5 (done in order to inflate
+ rewards a little for faster learning).
+ - *quad_ctrl_cost*: A negative reward for penalising the humanoid if it has too large of
+ a control force. If there are *nu* actuators/controls, then the control has shape `nu x 1`.
+ It is measured as *0.1 * sum(control^2)*.
+ - *quad_impact_cost*: A negative reward for penalising the humanoid if the external
+ contact force is too large. It is calculated as *min(0.5 * 0.000001 * sum(external
+ contact force^2), 10)*.
+
+ The total reward returned is ***reward*** *=* *uph_cost + 1 - quad_ctrl_cost - quad_impact_cost*
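+
+ A rough, illustrative sketch of the same computation with arbitrary stand-in values (the
+ authoritative version is in `step` below):
+
+ ```
+ import numpy as np
+
+ z_after = 0.2                    # stand-in torso height after the action
+ timestep = 0.003                 # atomic simulator timestep (model.opt.timestep)
+ ctrl = np.full(17, 0.2)          # stand-in applied torques
+ cfrc_ext = np.zeros((14, 6))     # stand-in external contact forces
+
+ uph_cost = (z_after - 0) / timestep
+ quad_ctrl_cost = 0.1 * np.sum(np.square(ctrl))
+ quad_impact_cost = min(0.5 * 0.000001 * np.sum(np.square(cfrc_ext)), 10.0)
+ reward = uph_cost + 1 - quad_ctrl_cost - quad_impact_cost  # ~67.6
+ ```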
+
+ ### Starting State
+ All observations start in state
+ (0.0, 0.0, 0.105, 1.0, 0.0 ... 0.0) with a uniform noise in the range of
+ [-0.01, 0.01] added to the positional and velocity values (values in the table)
+ for stochasticity. Note that the initial z coordinate is intentionally selected
+ to be low, thereby indicating a laying down humanoid. The initial orientation is
+ designed to make it face forward as well.
+
+ ### Episode Termination
+ The episode terminates when any of the following happens:
+
+ 1. The episode duration reaches 1000 timesteps
+ 2. Any of the state space values is no longer finite
+
+ ### Arguments
+
+ No additional arguments are currently supported (in v2 and lower), but
+ modifications can be made to the XML file in the assets folder
+ (or by changing the path to a modified XML file in another folder).
+
+ ```
+ env = gym.make('HumanoidStandup-v2')
+ ```
+
+ There is no v3 for HumanoidStandup, unlike the robot environments where a v3 and
+ beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
+
+
+ ### Version History
+
+ * v4: All mujoco environments now use the mujoco bindings in mujoco>=2.1.3
+ * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
+ * v2: All continuous control environments now use mujoco_py >= 1.50
+ * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
+ * v0: Initial versions release (1.0.0)
+
+ """
+
+ def __init__(self):
+ mujoco_env.MujocoEnv.__init__(self, "humanoidstandup.xml", 5)
+ utils.EzPickle.__init__(self)
+
+ def _get_obs(self):
+ data = self.data
+ return np.concatenate(
+ [
+ data.qpos.flat[2:],
+ data.qvel.flat,
+ data.cinert.flat,
+ data.cvel.flat,
+ data.qfrc_actuator.flat,
+ data.cfrc_ext.flat,
+ ]
+ )
+
+ def step(self, a):
+ self.do_simulation(a, self.frame_skip)
+ pos_after = self.data.qpos[2]
+ data = self.data
+ uph_cost = (pos_after - 0) / self.model.opt.timestep
+
+ quad_ctrl_cost = 0.1 * np.square(data.ctrl).sum()
+ quad_impact_cost = 0.5e-6 * np.square(data.cfrc_ext).sum()
+ quad_impact_cost = min(quad_impact_cost, 10)
+ reward = uph_cost - quad_ctrl_cost - quad_impact_cost + 1
+
+ done = bool(False)
+ return (
+ self._get_obs(),
+ reward,
+ done,
+ dict(
+ reward_linup=uph_cost,
+ reward_quadctrl=-quad_ctrl_cost,
+ reward_impact=-quad_impact_cost,
+ ),
+ )
+
+ def reset_model(self):
+ c = 0.01
+ self.set_state(
+ self.init_qpos + self.np_random.uniform(low=-c, high=c, size=self.model.nq),
+ self.init_qvel
+ + self.np_random.uniform(
+ low=-c,
+ high=c,
+ size=self.model.nv,
+ ),
+ )
+ return self._get_obs()
+
+ def viewer_setup(self):
+ self.viewer.cam.trackbodyid = 1
+ self.viewer.cam.distance = self.model.stat.extent * 1.0
+ self.viewer.cam.lookat[2] = 0.8925
+ self.viewer.cam.elevation = -20
diff --git a/gym/envs/mujoco/inverted_double_pendulum.py b/gym/envs/mujoco/inverted_double_pendulum.py
index f523d170e68..2f241c02b43 100644
--- a/gym/envs/mujoco/inverted_double_pendulum.py
+++ b/gym/envs/mujoco/inverted_double_pendulum.py
@@ -5,113 +5,10 @@
class InvertedDoublePendulumEnv(mujoco_env.MujocoEnv, utils.EzPickle):
- """
- ### Description
-
- This environment originates from control theory and builds on the cartpole
- environment based on the work done by Barto, Sutton, and Anderson in
- ["Neuronlike adaptive elements that can solve difficult learning control problems"](https://ieeexplore.ieee.org/document/6313077),
- powered by the Mujoco physics simulator - allowing for more complex experiments
- (such as varying the effects of gravity or constraints). This environment involves a cart that can
- moved linearly, with a pole fixed on it and a second pole fixed on the other end of the first one
- (leaving the second pole as the only one with one free end). The cart can be pushed left or right,
- and the goal is to balance the second pole on top of the first pole, which is in turn on top of the
- cart, by applying continuous forces on the cart.
-
- ### Action Space
- The agent take a 1-element vector for actions.
- The action space is a continuous `(action)` in `[-1, 1]`, where `action` represents the
- numerical force applied to the cart (with magnitude representing the amount of force and
- sign representing the direction)
-
- | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|---------------------------|-------------|-------------|----------------------------------|-------|-----------|
- | 0 | Force applied on the cart | -1 | 1 | slider | slide | Force (N) |
-
- ### Observation Space
-
- The state space consists of positional values of different body parts of the pendulum system,
- followed by the velocities of those individual parts (their derivatives) with all the
- positions ordered before all the velocities.
-
- The observation is a `ndarray` with shape `(11,)` where the elements correspond to the following:
-
- | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
- |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
- | 0 | position of the cart along the linear surface | -Inf | Inf | slider | slide | position (m) |
- | 1 | sine of the angle between the cart and the first pole | -Inf | Inf | sin(hinge) | hinge | unitless |
- | 2 | sine of the angle between the two poles | -Inf | Inf | sin(hinge2) | hinge | unitless |
- | 3 | cosine of the angle between the cart and the first pole | -Inf | Inf | cos(hinge) | hinge | unitless |
- | 4 | cosine of the angle between the two poles | -Inf | Inf | cos(hinge2) | hinge | unitless |
- | 5 | velocity of the cart | -Inf | Inf | slider | slide | velocity (m/s) |
- | 6 | angular velocity of the angle between the cart and the first pole | -Inf | Inf | hinge | hinge | angular velocity (rad/s) |
- | 7 | angular velocity of the angle between the two poles | -Inf | Inf | hinge2 | hinge | angular velocity (rad/s) |
- | 8 | constraint force - 1 | -Inf | Inf | | | Force (N) |
- | 9 | constraint force - 2 | -Inf | Inf | | | Force (N) |
- | 10 | constraint force - 3 | -Inf | Inf | | | Force (N) |
-
-
- There is physical contact between the robots and their environment - and Mujoco
- attempts at getting realisitic physics simulations for the possible physical contact
- dynamics by aiming for physical accuracy and computational efficiency.
-
- There is one constraint force for contacts for each degree of freedom (3).
- The approach and handling of constraints by Mujoco is unique to the simulator
- and is based on their research. Once can find more information in their
- [*documentation*](https://mujoco.readthedocs.io/en/latest/computation.html)
- or in their paper
- ["Analytically-invertible dynamics with contacts and constraints: Theory and implementation in MuJoCo"](https://homes.cs.washington.edu/~todorov/papers/TodorovICRA14.pdf).
-
-
- ### Rewards
-
- The reward consists of two parts:
- - *alive_bonus*: The goal is to make the second inverted pendulum stand upright
- (within a certain angle limit) as long as possible - as such a reward of +10 is awarded
- for each timestep that the second pole is upright.
- - *distance_penalty*: This reward is a measure of how far the *tip* of the second pendulum
- (the only free end) moves, and it is calculated as
- *0.01 * x2 + (y - 2)2*, where *x* is the x-coordinate of the tip
- and *y* is the y-coordinate of the tip of the second pole.
- - *velocity_penalty*: A negative reward for penalising the agent if it moves too
- fast *0.001 * v12 + 0.005 * v2 2*
-
- The total reward returned is ***reward*** *=* *alive_bonus - distance_penalty - velocity_penalty*
-
- ### Starting State
- All observations start in state
- (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) with a uniform noise in the range
- of [-0.1, 0.1] added to the positional values (cart position and pole angles) and standard
- normal force with a standard deviation of 0.1 added to the velocity values for stochasticity.
-
- ### Episode Termination
- The episode terminates when any of the following happens:
-
- 1. The episode duration reaches 1000 timesteps.
- 2. Any of the state space values is no longer finite.
- 3. The y_coordinate of the tip of the second pole *is less than or equal* to 1. The maximum standing height of the system is 1.196 m when all the parts are perpendicularly vertical on top of each other).
-
- ### Arguments
-
- No additional arguments are currently supported.
-
- ```
- env = gym.make('InvertedDoublePendulum-v2')
- ```
- There is no v3 for InvertedPendulum, unlike the robot environments where a v3 and
- beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
-
-
- ### Version History
-
- * v2: All continuous control environments now use mujoco_py >= 1.50
- * v1: max_time_steps raised to 1000 for robot based tasks (including inverted pendulum)
- * v0: Initial versions release (1.0.0)
-
- """
-
def __init__(self):
- mujoco_env.MujocoEnv.__init__(self, "inverted_double_pendulum.xml", 5)
+ mujoco_env.MujocoEnv.__init__(
+ self, "inverted_double_pendulum.xml", 5, mujoco_bindings="mujoco_py"
+ )
utils.EzPickle.__init__(self)
def step(self, action):
diff --git a/gym/envs/mujoco/inverted_double_pendulum_v4.py b/gym/envs/mujoco/inverted_double_pendulum_v4.py
new file mode 100644
index 00000000000..5c1d162ad0e
--- /dev/null
+++ b/gym/envs/mujoco/inverted_double_pendulum_v4.py
@@ -0,0 +1,156 @@
+import numpy as np
+
+from gym import utils
+from gym.envs.mujoco import mujoco_env
+
+
+class InvertedDoublePendulumEnv(mujoco_env.MujocoEnv, utils.EzPickle):
+ """
+ ### Description
+
+ This environment originates from control theory and builds on the cartpole
+ environment based on the work done by Barto, Sutton, and Anderson in
+ ["Neuronlike adaptive elements that can solve difficult learning control problems"](https://ieeexplore.ieee.org/document/6313077),
+ powered by the Mujoco physics simulator - allowing for more complex experiments
+ (such as varying the effects of gravity or constraints). This environment involves a cart that can be
+ moved linearly, with a pole fixed on it and a second pole fixed on the other end of the first one
+ (leaving the second pole as the only one with one free end). The cart can be pushed left or right,
+ and the goal is to balance the second pole on top of the first pole, which is in turn on top of the
+ cart, by applying continuous forces on the cart.
+
+ ### Action Space
+ The agent takes a 1-element vector for actions.
+ The action space is a continuous `(action)` in `[-1, 1]`, where `action` represents the
+ numerical force applied to the cart (with magnitude representing the amount of force and
+ sign representing the direction).
+
+ | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|---------------------------|-------------|-------------|----------------------------------|-------|-----------|
+ | 0 | Force applied on the cart | -1 | 1 | slider | slide | Force (N) |
+
+ ### Observation Space
+
+ The state space consists of positional values of different body parts of the pendulum system,
+ followed by the velocities of those individual parts (their derivatives) with all the
+ positions ordered before all the velocities.
+
+ The observation is a `ndarray` with shape `(11,)` where the elements correspond to the following:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
+ |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
+ | 0 | position of the cart along the linear surface | -Inf | Inf | slider | slide | position (m) |
+ | 1 | sine of the angle between the cart and the first pole | -Inf | Inf | sin(hinge) | hinge | unitless |
+ | 2 | sine of the angle between the two poles | -Inf | Inf | sin(hinge2) | hinge | unitless |
+ | 3 | cosine of the angle between the cart and the first pole | -Inf | Inf | cos(hinge) | hinge | unitless |
+ | 4 | cosine of the angle between the two poles | -Inf | Inf | cos(hinge2) | hinge | unitless |
+ | 5 | velocity of the cart | -Inf | Inf | slider | slide | velocity (m/s) |
+ | 6 | angular velocity of the angle between the cart and the first pole | -Inf | Inf | hinge | hinge | angular velocity (rad/s) |
+ | 7 | angular velocity of the angle between the two poles | -Inf | Inf | hinge2 | hinge | angular velocity (rad/s) |
+ | 8 | constraint force - 1 | -Inf | Inf | | | Force (N) |
+ | 9 | constraint force - 2 | -Inf | Inf | | | Force (N) |
+ | 10 | constraint force - 3 | -Inf | Inf | | | Force (N) |
+
+
+ There is physical contact between the robots and their environment - and Mujoco
+    attempts to get realistic physics simulations for the possible physical contact
+ dynamics by aiming for physical accuracy and computational efficiency.
+
+ There is one constraint force for contacts for each degree of freedom (3).
+ The approach and handling of constraints by Mujoco is unique to the simulator
+    and is based on their research. One can find more information in their
+ [*documentation*](https://mujoco.readthedocs.io/en/latest/computation.html)
+ or in their paper
+ ["Analytically-invertible dynamics with contacts and constraints: Theory and implementation in MuJoCo"](https://homes.cs.washington.edu/~todorov/papers/TodorovICRA14.pdf).
+
+
+ ### Rewards
+
+    The reward consists of three parts:
+ - *alive_bonus*: The goal is to make the second inverted pendulum stand upright
+ (within a certain angle limit) as long as possible - as such a reward of +10 is awarded
+ for each timestep that the second pole is upright.
+ - *distance_penalty*: This reward is a measure of how far the *tip* of the second pendulum
+ (the only free end) moves, and it is calculated as
+      *0.01 * x^2 + (y - 2)^2*, where *x* is the x-coordinate of the tip
+ and *y* is the y-coordinate of the tip of the second pole.
+    - *velocity_penalty*: A negative reward for penalising the agent if it moves too
+      fast, computed as *0.001 * v1^2 + 0.005 * v2^2*, where *v1* and *v2* are the
+      angular velocities of the two hinges.
+
+ The total reward returned is ***reward*** *=* *alive_bonus - distance_penalty - velocity_penalty*
+
+ ### Starting State
+ All observations start in state
+ (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) with a uniform noise in the range
+ of [-0.1, 0.1] added to the positional values (cart position and pole angles) and standard
+    normal noise with a standard deviation of 0.1 added to the velocity values for stochasticity.
+
+ ### Episode Termination
+ The episode terminates when any of the following happens:
+
+ 1. The episode duration reaches 1000 timesteps.
+ 2. Any of the state space values is no longer finite.
+    3. The y-coordinate of the tip of the second pole *is less than or equal to* 1. (The maximum standing height of the system is 1.196 m when all the parts are perpendicularly vertical on top of each other.)
+
+ ### Arguments
+
+ No additional arguments are currently supported (in v2 and lower), but modifications can
+ be made to the XML file in the assets folder (or by changing the path to a modified XML
+    file in another folder).
+
+ ```
+ env = gym.make('InvertedDoublePendulum-v2')
+ ```
+    There is no v3 for InvertedDoublePendulum, unlike the robot environments where a v3 and
+ beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
+
+
+ ### Version History
+
+    * v4: All mujoco environments now use the mujoco bindings in mujoco>=2.1.3
+ * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
+ * v2: All continuous control environments now use mujoco_py >= 1.50
+ * v1: max_time_steps raised to 1000 for robot based tasks (including inverted pendulum)
+ * v0: Initial versions release (1.0.0)
+
+ """
+
+ def __init__(self):
+ mujoco_env.MujocoEnv.__init__(self, "inverted_double_pendulum.xml", 5)
+ utils.EzPickle.__init__(self)
+
+ def step(self, action):
+ self.do_simulation(action, self.frame_skip)
+ ob = self._get_obs()
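+        # Reward terms from the docstring: a fixed alive bonus of +10, minus a
+        # penalty on the (weighted) distance of the second pole's tip from the
+        # upright target point (x = 0, y = 2), minus a penalty on the angular
+        # velocities of the two hinges.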
+ x, _, y = self.data.site_xpos[0]
+ dist_penalty = 0.01 * x**2 + (y - 2) ** 2
+ v1, v2 = self.data.qvel[1:3]
+ vel_penalty = 1e-3 * v1**2 + 5e-3 * v2**2
+ alive_bonus = 10
+ r = alive_bonus - dist_penalty - vel_penalty
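+        # Terminate once the tip of the second pole drops to a height of 1 or below.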
+ done = bool(y <= 1)
+ return ob, r, done, {}
+
+ def _get_obs(self):
+ return np.concatenate(
+ [
+ self.data.qpos[:1], # cart x pos
+ np.sin(self.data.qpos[1:]), # link angles
+ np.cos(self.data.qpos[1:]),
+                np.clip(self.data.qvel, -10, 10),  # joint velocities (clipped)
+                np.clip(self.data.qfrc_constraint, -10, 10),  # constraint forces (clipped)
+ ]
+ ).ravel()
+
+ def reset_model(self):
+ self.set_state(
+ self.init_qpos
+ + self.np_random.uniform(low=-0.1, high=0.1, size=self.model.nq),
+ self.init_qvel + self.np_random.standard_normal(self.model.nv) * 0.1,
+ )
+ return self._get_obs()
+
+ def viewer_setup(self):
+ v = self.viewer
+ v.cam.trackbodyid = 0
+ v.cam.distance = self.model.stat.extent * 0.5
+ v.cam.lookat[2] = 0.12250000000000005 # v.model.stat.center[2]
diff --git a/gym/envs/mujoco/inverted_pendulum.py b/gym/envs/mujoco/inverted_pendulum.py
index 46472cf43a9..342404b5e91 100644
--- a/gym/envs/mujoco/inverted_pendulum.py
+++ b/gym/envs/mujoco/inverted_pendulum.py
@@ -5,85 +5,11 @@
class InvertedPendulumEnv(mujoco_env.MujocoEnv, utils.EzPickle):
- """
- ### Description
-
- This environment is the cartpole environment based on the work done by
- Barto, Sutton, and Anderson in ["Neuronlike adaptive elements that can
- solve difficult learning control problems"](https://ieeexplore.ieee.org/document/6313077),
- just like in the classic environments but now powered by the Mujoco physics simulator -
- allowing for more complex experiments (such as varying the effects of gravity).
- This environment involves a cart that can moved linearly, with a pole fixed on it
- at one end and having another end free. The cart can be pushed left or right, and the
- goal is to balance the pole on the top of the cart by applying forces on the cart.
-
- ### Action Space
- The agent take a 1-element vector for actions.
-
- The action space is a continuous `(action)` in `[-3, 3]`, where `action` represents
- the numerical force applied to the cart (with magnitude representing the amount of
- force and sign representing the direction)
-
- | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|---------------------------|-------------|-------------|----------------------------------|-------|-----------|
- | 0 | Force applied on the cart | -3 | 3 | slider | slide | Force (N) |
-
- ### Observation Space
-
- The state space consists of positional values of different body parts of
- the pendulum system, followed by the velocities of those individual parts (their derivatives)
- with all the positions ordered before all the velocities.
-
- The observation is a `ndarray` with shape `(4,)` where the elements correspond to the following:
-
- | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
- |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
- | 0 | position of the cart along the linear surface | -Inf | Inf | slider | slide | position (m) |
- | 1 | vertical angle of the pole on the cart | -Inf | Inf | hinge | hinge | angle (rad) |
- | 2 | linear velocity of the cart | -Inf | Inf | slider | slide | velocity (m/s) |
- | 3 | angular velocity of the pole on the cart | -Inf | Inf | hinge | hinge | anglular velocity (rad/s) |
-
-
- ### Rewards
-
- The goal is to make the inverted pendulum stand upright (within a certain angle limit)
- as long as possible - as such a reward of +1 is awarded for each timestep that
- the pole is upright.
-
- ### Starting State
- All observations start in state
- (0.0, 0.0, 0.0, 0.0) with a uniform noise in the range
- of [-0.01, 0.01] added to the values for stochasticity.
-
- ### Episode Termination
- The episode terminates when any of the following happens:
-
- 1. The episode duration reaches 1000 timesteps.
- 2. Any of the state space values is no longer finite.
- 3. The absolutely value of the vertical angle between the pole and the cart is greater than 0.2 radian.
-
- ### Arguments
-
- No additional arguments are currently supported.
-
- ```
- env = gym.make('InvertedPendulum-v2')
- ```
- There is no v3 for InvertedPendulum, unlike the robot environments where a
- v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
-
-
- ### Version History
-
- * v2: All continuous control environments now use mujoco_py >= 1.50
- * v1: max_time_steps raised to 1000 for robot based tasks (including inverted pendulum)
- * v0: Initial versions release (1.0.0)
-
- """
-
def __init__(self):
utils.EzPickle.__init__(self)
- mujoco_env.MujocoEnv.__init__(self, "inverted_pendulum.xml", 2)
+ mujoco_env.MujocoEnv.__init__(
+ self, "inverted_pendulum.xml", 2, mujoco_bindings="mujoco_py"
+ )
def step(self, a):
reward = 1.0
diff --git a/gym/envs/mujoco/inverted_pendulum_v4.py b/gym/envs/mujoco/inverted_pendulum_v4.py
new file mode 100644
index 00000000000..7d3fa72200e
--- /dev/null
+++ b/gym/envs/mujoco/inverted_pendulum_v4.py
@@ -0,0 +1,116 @@
+import numpy as np
+
+from gym import utils
+from gym.envs.mujoco import mujoco_env
+
+
+class InvertedPendulumEnv(mujoco_env.MujocoEnv, utils.EzPickle):
+ """
+ ### Description
+
+ This environment is the cartpole environment based on the work done by
+ Barto, Sutton, and Anderson in ["Neuronlike adaptive elements that can
+ solve difficult learning control problems"](https://ieeexplore.ieee.org/document/6313077),
+ just like in the classic environments but now powered by the Mujoco physics simulator -
+ allowing for more complex experiments (such as varying the effects of gravity).
+    This environment involves a cart that can be moved linearly, with a pole fixed on it
+    at one end and the other end free. The cart can be pushed left or right, and the
+ goal is to balance the pole on the top of the cart by applying forces on the cart.
+
+ ### Action Space
+    The agent takes a 1-element vector for actions.
+
+ The action space is a continuous `(action)` in `[-3, 3]`, where `action` represents
+ the numerical force applied to the cart (with magnitude representing the amount of
+ force and sign representing the direction)
+
+ | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|---------------------------|-------------|-------------|----------------------------------|-------|-----------|
+ | 0 | Force applied on the cart | -3 | 3 | slider | slide | Force (N) |
+
+ ### Observation Space
+
+ The state space consists of positional values of different body parts of
+ the pendulum system, followed by the velocities of those individual parts (their derivatives)
+ with all the positions ordered before all the velocities.
+
+    The observation is an `ndarray` with shape `(4,)` where the elements correspond to the following:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
+ |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
+ | 0 | position of the cart along the linear surface | -Inf | Inf | slider | slide | position (m) |
+ | 1 | vertical angle of the pole on the cart | -Inf | Inf | hinge | hinge | angle (rad) |
+ | 2 | linear velocity of the cart | -Inf | Inf | slider | slide | velocity (m/s) |
+    | 3   | angular velocity of the pole on the cart | -Inf | Inf | hinge | hinge | angular velocity (rad/s) |
+
+
+ ### Rewards
+
+ The goal is to make the inverted pendulum stand upright (within a certain angle limit)
+ as long as possible - as such a reward of +1 is awarded for each timestep that
+ the pole is upright.
+
+ ### Starting State
+ All observations start in state
+ (0.0, 0.0, 0.0, 0.0) with a uniform noise in the range
+ of [-0.01, 0.01] added to the values for stochasticity.
+
+ ### Episode Termination
+ The episode terminates when any of the following happens:
+
+ 1. The episode duration reaches 1000 timesteps.
+ 2. Any of the state space values is no longer finite.
+    3. The absolute value of the vertical angle between the pole and the cart is greater than 0.2 radians.
+
+ ### Arguments
+
+ No additional arguments are currently supported (in v2 and lower),
+ but modifications can be made to the XML file in the assets folder
+ (or by changing the path to a modified XML file in another folder).
+
+ ```
+ env = gym.make('InvertedPendulum-v2')
+ ```
+ There is no v3 for InvertedPendulum, unlike the robot environments where a
+ v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
+
+
+ ### Version History
+
+    * v4: All mujoco environments now use the mujoco bindings in mujoco>=2.1.3
+ * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
+ * v2: All continuous control environments now use mujoco_py >= 1.50
+ * v1: max_time_steps raised to 1000 for robot based tasks (including inverted pendulum)
+ * v0: Initial versions release (1.0.0)
+
+ """
+
+ def __init__(self):
+ utils.EzPickle.__init__(self)
+ mujoco_env.MujocoEnv.__init__(self, "inverted_pendulum.xml", 2)
+
+ def step(self, a):
+ reward = 1.0
+ self.do_simulation(a, self.frame_skip)
+ ob = self._get_obs()
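+        # Terminate when any observation is non-finite or the pole's angle
+        # (ob[1]) leaves the interval [-0.2, 0.2] radians.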
+ notdone = np.isfinite(ob).all() and (np.abs(ob[1]) <= 0.2)
+ done = not notdone
+ return ob, reward, done, {}
+
+ def reset_model(self):
+ qpos = self.init_qpos + self.np_random.uniform(
+ size=self.model.nq, low=-0.01, high=0.01
+ )
+ qvel = self.init_qvel + self.np_random.uniform(
+ size=self.model.nv, low=-0.01, high=0.01
+ )
+ self.set_state(qpos, qvel)
+ return self._get_obs()
+
+ def _get_obs(self):
+ return np.concatenate([self.data.qpos, self.data.qvel]).ravel()
+
+ def viewer_setup(self):
+ v = self.viewer
+ v.cam.trackbodyid = 0
+ v.cam.distance = self.model.stat.extent
diff --git a/gym/envs/mujoco/mujoco_env.py b/gym/envs/mujoco/mujoco_env.py
index 1d04f2f9061..7bf3b260439 100644
--- a/gym/envs/mujoco/mujoco_env.py
+++ b/gym/envs/mujoco/mujoco_env.py
@@ -1,4 +1,3 @@
-import os
from collections import OrderedDict
from os import path
from typing import Optional
@@ -6,19 +5,9 @@
import numpy as np
import gym
-from gym import error, spaces
-from gym.utils import seeding
-
-try:
- import mujoco_py
-except ImportError as e:
- raise error.DependencyNotInstalled(
- "{}. (HINT: you need to install mujoco_py, and also perform the setup instructions here: https://github.com/openai/mujoco-py/.)".format(
- e
- )
- )
+from gym import error, logger, spaces
-DEFAULT_SIZE = 500
+DEFAULT_SIZE = 480
def convert_observation_to_space(observation):
@@ -44,28 +33,64 @@ def convert_observation_to_space(observation):
class MujocoEnv(gym.Env):
"""Superclass for all MuJoCo environments."""
- def __init__(self, model_path, frame_skip):
+ def __init__(self, model_path, frame_skip, mujoco_bindings="mujoco"):
+
if model_path.startswith("/"):
fullpath = model_path
else:
- fullpath = os.path.join(os.path.dirname(__file__), "assets", model_path)
+ fullpath = path.join(path.dirname(__file__), "assets", model_path)
if not path.exists(fullpath):
raise OSError(f"File {fullpath} does not exist")
+
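+        # Select the physics bindings: the legacy "mujoco_py" package (used by
+        # the v2/v3 environments) or the new official "mujoco" package, which is
+        # the default and is used by the v4 environments.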
+ if mujoco_bindings == "mujoco_py":
+ logger.warn(
+ "This version of the mujoco environments depends "
+ "on the mujoco-py bindings, which are no longer maintained "
+ "and may stop working. Please upgrade to the v4 versions of "
+ "the environments (which depend on the mujoco python bindings instead), unless "
+                "you are trying to precisely replicate previous works."
+ )
+ try:
+ import mujoco_py
+
+ self._mujoco_bindings = mujoco_py
+ except ImportError as e:
+ raise error.DependencyNotInstalled(
+ "{}. (HINT: you need to install mujoco_py, and also perform the setup instructions here: https://github.com/openai/mujoco-py/.)".format(
+ e
+ )
+ )
+
+ self.model = self._mujoco_bindings.load_model_from_path(fullpath)
+ self.sim = self._mujoco_bindings.MjSim(self.model)
+ self.data = self.sim.data
+
+ elif mujoco_bindings == "mujoco":
+ try:
+ import mujoco
+
+ self._mujoco_bindings = mujoco
+
+ except ImportError as e:
+ raise error.DependencyNotInstalled(
+ f"{e}. (HINT: you need to install mujoco)"
+ )
+ self.model = self._mujoco_bindings.MjModel.from_xml_path(fullpath)
+ self.data = self._mujoco_bindings.MjData(self.model)
+
+ self.init_qpos = self.data.qpos.ravel().copy()
+ self.init_qvel = self.data.qvel.ravel().copy()
+ self._viewers = {}
+
self.frame_skip = frame_skip
- self.model = mujoco_py.load_model_from_path(fullpath)
- self.sim = mujoco_py.MjSim(self.model)
- self.data = self.sim.data
+
self.viewer = None
- self._viewers = {}
self.metadata = {
"render_modes": ["human", "rgb_array", "depth_array"],
"render_fps": int(np.round(1.0 / self.dt)),
}
- self.init_qpos = self.sim.data.qpos.ravel().copy()
- self.init_qvel = self.sim.data.qvel.ravel().copy()
-
self._set_action_space()
action = self.action_space.sample()
@@ -112,7 +137,12 @@ def reset(
options: Optional[dict] = None,
):
super().reset(seed=seed)
- self.sim.reset()
+
+ if self._mujoco_bindings.__name__ == "mujoco_py":
+ self.sim.reset()
+ else:
+ self._mujoco_bindings.mj_resetData(self.model, self.data)
+
ob = self.reset_model()
if not return_info:
return ob
@@ -121,12 +151,19 @@ def reset(
def set_state(self, qpos, qvel):
assert qpos.shape == (self.model.nq,) and qvel.shape == (self.model.nv,)
- old_state = self.sim.get_state()
- new_state = mujoco_py.MjSimState(
- old_state.time, qpos, qvel, old_state.act, old_state.udd_state
- )
- self.sim.set_state(new_state)
- self.sim.forward()
+ if self._mujoco_bindings.__name__ == "mujoco_py":
+ state = self.sim.get_state()
+ state = self._mujoco_bindings.MjSimState(
+ state.time, qpos, qvel, state.act, state.udd_state
+ )
+ self.sim.set_state(state)
+ self.sim.forward()
+ else:
+ self.data.qpos[:] = np.copy(qpos)
+ self.data.qvel[:] = np.copy(qvel)
+ if self.model.na == 0:
+ self.data.act[:] = None
+ self._mujoco_bindings.mj_forward(self.model, self.data)
@property
def dt(self):
@@ -136,9 +173,21 @@ def do_simulation(self, ctrl, n_frames):
if np.array(ctrl).shape != self.action_space.shape:
raise ValueError("Action dimension mismatch")
- self.sim.data.ctrl[:] = ctrl
+ if self._mujoco_bindings.__name__ == "mujoco_py":
+ self.sim.data.ctrl[:] = ctrl
+ else:
+ self.data.ctrl[:] = ctrl
for _ in range(n_frames):
- self.sim.step()
+ if self._mujoco_bindings.__name__ == "mujoco_py":
+ self.sim.step()
+ else:
+ self._mujoco_bindings.mj_step(self.model, self.data)
+
+ # As of MuJoCo 2.0, force-related quantities like cacc are not computed
+ # unless there's a force sensor in the model.
+ # See https://github.com/openai/gym/issues/1541
+ if self._mujoco_bindings.__name__ != "mujoco_py":
+ self._mujoco_bindings.mj_rnePostConstraint(self.model, self.data)
def render(
self,
@@ -159,19 +208,25 @@ def render(
if no_camera_specified:
camera_name = "track"
- if camera_id is None and camera_name in self.model._camera_name2id:
- camera_id = self.model.camera_name2id(camera_name)
+ if camera_id is None:
+ if self._mujoco_bindings.__name__ == "mujoco_py":
+ if camera_name in self.model._camera_name2id:
+ camera_id = self.model.camera_name2id(camera_name)
+ else:
+ camera_id = self._mujoco_bindings.mj_name2id(
+ self.model,
+ self._mujoco_bindings.mjtObj.mjOBJ_CAMERA,
+ camera_name,
+ )
- self._get_viewer(mode).render(width, height, camera_id=camera_id)
+ self._get_viewer(mode).render(width, height, camera_id=camera_id)
if mode == "rgb_array":
- # window size used for old mujoco-py:
data = self._get_viewer(mode).read_pixels(width, height, depth=False)
# original image is upside-down, so flip it
return data[::-1, :, :]
elif mode == "depth_array":
self._get_viewer(mode).render(width, height)
- # window size used for old mujoco-py:
# Extract depth part of the read_pixels() tuple
data = self._get_viewer(mode).read_pixels(width, height, depth=True)[1]
# original image is upside-down, so flip it
@@ -181,24 +236,40 @@ def render(
def close(self):
if self.viewer is not None:
- # self.viewer.finish()
self.viewer = None
self._viewers = {}
- def _get_viewer(self, mode):
+ def _get_viewer(self, mode, width=DEFAULT_SIZE, height=DEFAULT_SIZE):
self.viewer = self._viewers.get(mode)
if self.viewer is None:
if mode == "human":
- self.viewer = mujoco_py.MjViewer(self.sim)
+ if self._mujoco_bindings.__name__ == "mujoco_py":
+ self.viewer = self._mujoco_bindings.MjViewer(self.sim)
+ else:
+ from gym.envs.mujoco.mujoco_rendering import Viewer
+
+ self.viewer = Viewer(self.model, self.data)
elif mode == "rgb_array" or mode == "depth_array":
- self.viewer = mujoco_py.MjRenderContextOffscreen(self.sim, -1)
+ if self._mujoco_bindings.__name__ == "mujoco_py":
+ self.viewer = self._mujoco_bindings.MjRenderContextOffscreen(
+ self.sim, -1
+ )
+ else:
+ from gym.envs.mujoco.mujoco_rendering import RenderContextOffscreen
+
+ self.viewer = RenderContextOffscreen(
+ width, height, self.model, self.data
+ )
self.viewer_setup()
self._viewers[mode] = self.viewer
return self.viewer
def get_body_com(self, body_name):
- return self.data.get_body_xpos(body_name)
+ if self._mujoco_bindings.__name__ == "mujoco_py":
+ return self.data.get_body_xpos(body_name)
+ else:
+ return self.data.body(body_name).xpos
def state_vector(self):
- return np.concatenate([self.sim.data.qpos.flat, self.sim.data.qvel.flat])
+ return np.concatenate([self.data.qpos.flat, self.data.qvel.flat])
diff --git a/gym/envs/mujoco/mujoco_rendering.py b/gym/envs/mujoco/mujoco_rendering.py
new file mode 100644
index 00000000000..97f7ffcb71a
--- /dev/null
+++ b/gym/envs/mujoco/mujoco_rendering.py
@@ -0,0 +1,556 @@
+import collections
+import os
+import sys
+import time
+from threading import Lock
+
+import glfw
+import imageio
+import mujoco
+import numpy as np
+
+
+def _import_egl(width, height):
+ from mujoco.egl import GLContext
+
+ return GLContext(width, height)
+
+
+def _import_glfw(width, height):
+ from mujoco.glfw import GLContext
+
+ return GLContext(width, height)
+
+
+def _import_osmesa(width, height):
+ from mujoco.osmesa import GLContext
+
+ return GLContext(width, height)
+
+
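+# Offscreen OpenGL context factories keyed by the accepted values of the
+# MUJOCO_GL environment variable; when MUJOCO_GL is unset, they are tried in
+# this order until one can be created.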
+_ALL_RENDERERS = collections.OrderedDict(
+ [
+ ("glfw", _import_glfw),
+ ("egl", _import_egl),
+ ("osmesa", _import_osmesa),
+ ]
+)
+
+
+class RenderContext:
+ """Render context superclass for offscreen and window rendering."""
+
+ def __init__(self, model, data, offscreen=True):
+
+ self.model = model
+ self.data = data
+ self.offscreen = offscreen
+ max_geom = 1000
+
+ mujoco.mj_forward(self.model, self.data)
+
+ self.scn = mujoco.MjvScene(self.model, max_geom)
+ self.cam = mujoco.MjvCamera()
+ self.vopt = mujoco.MjvOption()
+ self.pert = mujoco.MjvPerturb()
+ self.con = mujoco.MjrContext(self.model, mujoco.mjtFontScale.mjFONTSCALE_150)
+
+ self._markers = []
+ self._overlays = {}
+
+ self._init_camera()
+ self._set_mujoco_buffers()
+
+ def _set_mujoco_buffers(self):
+ if self.offscreen:
+ mujoco.mjr_setBuffer(mujoco.mjtFramebuffer.mjFB_OFFSCREEN, self.con)
+ if self.con.currentBuffer != mujoco.mjtFramebuffer.mjFB_OFFSCREEN:
+ raise RuntimeError("Offscreen rendering not supported")
+ else:
+ mujoco.mjr_setBuffer(mujoco.mjtFramebuffer.mjFB_WINDOW, self.con)
+ if self.con.currentBuffer != mujoco.mjtFramebuffer.mjFB_WINDOW:
+ raise RuntimeError("Window rendering not supported")
+
+ def update_offscreen_size(self, width, height):
+ if width != self.con.offWidth or height != self.con.offHeight:
+ self.model.vis.global_.offwidth = width
+ self.model.vis.global_.offheight = height
+ self.con.free()
+ self._set_mujoco_buffers()
+
+ def render(self, width, height, camera_id=None, segmentation=False):
+ rect = mujoco.MjrRect(left=0, bottom=0, width=width, height=height)
+
+ # Sometimes buffers are too small.
+ if width > self.con.offWidth or height > self.con.offHeight:
+ new_width = max(width, self.model.vis.global_.offwidth)
+ new_height = max(height, self.model.vis.global_.offheight)
+ self.update_offscreen_size(new_width, new_height)
+
+ if camera_id is not None:
+ if camera_id == -1:
+ self.cam.type = mujoco.mjtCamera.mjCAMERA_FREE
+ else:
+ self.cam.type = mujoco.mjtCamera.mjCAMERA_FIXED
+ self.cam.fixedcamid = camera_id
+
+ mujoco.mjv_updateScene(
+ self.model,
+ self.data,
+ self.vopt,
+ self.pert,
+ self.cam,
+ mujoco.mjtCatBit.mjCAT_ALL,
+ self.scn,
+ )
+
+ if segmentation:
+ self.scn.flags[mujoco.mjtRndFlag.mjRND_SEGMENT] = 1
+ self.scn.flags[mujoco.mjtRndFlag.mjRND_IDCOLOR] = 1
+
+ for marker_params in self._markers:
+ self._add_marker_to_scene(marker_params)
+
+ mujoco.mjr_render(rect, self.scn, self.con)
+
+ for gridpos, (text1, text2) in self._overlays.items():
+ mujoco.mjr_overlay(
+ mujoco.mjtFontScale.mjFONTSCALE_150,
+ gridpos,
+ rect,
+ text1.encode(),
+ text2.encode(),
+ self.con,
+ )
+
+ if segmentation:
+ self.scn.flags[mujoco.mjtRndFlag.mjRND_SEGMENT] = 0
+ self.scn.flags[mujoco.mjtRndFlag.mjRND_IDCOLOR] = 0
+
+ def read_pixels(self, width, height, depth=True, segmentation=False):
+ rect = mujoco.MjrRect(left=0, bottom=0, width=width, height=height)
+
+ rgb_arr = np.zeros(3 * rect.width * rect.height, dtype=np.uint8)
+ depth_arr = np.zeros(rect.width * rect.height, dtype=np.float32)
+
+ mujoco.mjr_readPixels(rgb_arr, depth_arr, rect, self.con)
+ rgb_img = rgb_arr.reshape(rect.height, rect.width, 3)
+
+ ret_img = rgb_img
+ if segmentation:
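+            # The segmentation pass encodes an integer id in the RGB channels
+            # (id = R + G * 2^8 + B * 2^16); decode it and map each id to its
+            # (objtype, objid) pair via the scene geoms.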
+ seg_img = (
+ rgb_img[:, :, 0]
+ + rgb_img[:, :, 1] * (2**8)
+ + rgb_img[:, :, 2] * (2**16)
+ )
+ seg_img[seg_img >= (self.scn.ngeom + 1)] = 0
+ seg_ids = np.full((self.scn.ngeom + 1, 2), fill_value=-1, dtype=np.int32)
+
+ for i in range(self.scn.ngeom):
+ geom = self.scn.geoms[i]
+ if geom.segid != -1:
+ seg_ids[geom.segid + 1, 0] = geom.objtype
+ seg_ids[geom.segid + 1, 1] = geom.objid
+ ret_img = seg_ids[seg_img]
+
+ if depth:
+ depth_img = depth_arr.reshape(rect.height, rect.width)
+ return (ret_img, depth_img)
+ else:
+ return ret_img
+
+ def _init_camera(self):
+ self.cam.type = mujoco.mjtCamera.mjCAMERA_FREE
+ self.cam.fixedcamid = -1
+ for i in range(3):
+ self.cam.lookat[i] = np.median(self.data.geom_xpos[:, i])
+ self.cam.distance = self.model.stat.extent
+
+ def add_overlay(self, gridpos: int, text1: str, text2: str):
+ """Overlays text on the scene."""
+ if gridpos not in self._overlays:
+ self._overlays[gridpos] = ["", ""]
+ self._overlays[gridpos][0] += text1 + "\n"
+ self._overlays[gridpos][1] += text2 + "\n"
+
+ def add_marker(self, **marker_params):
+ self._markers.append(marker_params)
+
+ def _add_marker_to_scene(self, marker):
+ if self.scn.ngeom >= self.scn.maxgeom:
+ raise RuntimeError("Ran out of geoms. maxgeom: %d" % self.scn.maxgeom)
+
+ g = self.scn.geoms[self.scn.ngeom]
+ # default values.
+ g.dataid = -1
+ g.objtype = mujoco.mjtObj.mjOBJ_UNKNOWN
+ g.objid = -1
+ g.category = mujoco.mjtCatBit.mjCAT_DECOR
+ g.texid = -1
+ g.texuniform = 0
+ g.texrepeat[0] = 1
+ g.texrepeat[1] = 1
+ g.emission = 0
+ g.specular = 0.5
+ g.shininess = 0.5
+ g.reflectance = 0
+ g.type = mujoco.mjtGeom.mjGEOM_BOX
+ g.size[:] = np.ones(3) * 0.1
+ g.mat[:] = np.eye(3)
+ g.rgba[:] = np.ones(4)
+
+ for key, value in marker.items():
+ if isinstance(value, (int, float, mujoco._enums.mjtGeom)):
+ setattr(g, key, value)
+ elif isinstance(value, (tuple, list, np.ndarray)):
+ attr = getattr(g, key)
+ attr[:] = np.asarray(value).reshape(attr.shape)
+ elif isinstance(value, str):
+ assert key == "label", "Only label is a string in mjtGeom."
+ if value is None:
+ g.label[0] = 0
+ else:
+ g.label = value
+ elif hasattr(g, key):
+ raise ValueError(
+ "mjtGeom has attr {} but type {} is invalid".format(
+ key, type(value)
+ )
+ )
+ else:
+ raise ValueError("mjtGeom doesn't have field %s" % key)
+
+ self.scn.ngeom += 1
+
+
+class RenderContextOffscreen(RenderContext):
+ """Offscreen rendering class with opengl context."""
+
+ def __init__(self, width, height, model, data):
+
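+        # Create an offscreen OpenGL context (EGL, GLFW or OSMesa) before the
+        # superclass constructor allocates MuJoCo's render buffers.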
+ self._get_opengl_backend(width, height)
+ self.opengl_context.make_current()
+
+ super().__init__(model, data, offscreen=True)
+
+ def _get_opengl_backend(self, width, height):
+
+ backend = os.environ.get("MUJOCO_GL")
+ if backend is not None:
+ try:
+ self.opengl_context = _ALL_RENDERERS[backend](width, height)
+ except KeyError:
+ raise RuntimeError(
+ "Environment variable {} must be one of {!r}: got {!r}.".format(
+ "MUJOCO_GL", _ALL_RENDERERS.keys(), backend
+ )
+ )
+
+ else:
+ for name, import_func in _ALL_RENDERERS.items():
+ try:
+                    self.opengl_context = import_func(width, height)
+ backend = name
+ break
+                except Exception:
+ pass
+ if backend is None:
+ raise RuntimeError(
+ "No OpenGL backend could be imported. Attempting to create a "
+ "rendering context will result in a RuntimeError."
+ )
+
+
+class Viewer(RenderContext):
+ """Class for window rendering in all MuJoCo environments."""
+
+ def __init__(self, model, data):
+ self._gui_lock = Lock()
+ self._button_left_pressed = False
+ self._button_right_pressed = False
+ self._last_mouse_x = 0
+ self._last_mouse_y = 0
+ self._paused = False
+ self._transparent = False
+ self._contacts = False
+ self._render_every_frame = True
+ self._image_idx = 0
+ self._image_path = "/tmp/frame_%07d.png"
+ self._time_per_render = 1 / 60.0
+ self._run_speed = 1.0
+ self._loop_count = 0
+ self._advance_by_one_step = False
+ self._hide_menu = False
+
+ # glfw init
+ glfw.init()
+ width, height = glfw.get_video_mode(glfw.get_primary_monitor()).size
+ self.window = glfw.create_window(width // 2, height // 2, "mujoco", None, None)
+ glfw.make_context_current(self.window)
+ glfw.swap_interval(1)
+
+ framebuffer_width, framebuffer_height = glfw.get_framebuffer_size(self.window)
+ window_width, _ = glfw.get_window_size(self.window)
+ self._scale = framebuffer_width * 1.0 / window_width
+
+ # set callbacks
+ glfw.set_cursor_pos_callback(self.window, self._cursor_pos_callback)
+ glfw.set_mouse_button_callback(self.window, self._mouse_button_callback)
+ glfw.set_scroll_callback(self.window, self._scroll_callback)
+ glfw.set_key_callback(self.window, self._key_callback)
+
+ # get viewport
+ self.viewport = mujoco.MjrRect(0, 0, framebuffer_width, framebuffer_height)
+
+ super().__init__(model, data, offscreen=False)
+
+ def _key_callback(self, window, key, scancode, action, mods):
+ if action != glfw.RELEASE:
+ return
+ # Switch cameras
+ elif key == glfw.KEY_TAB:
+ self.cam.fixedcamid += 1
+ self.cam.type = mujoco.mjtCamera.mjCAMERA_FIXED
+ if self.cam.fixedcamid >= self.model.ncam:
+ self.cam.fixedcamid = -1
+ self.cam.type = mujoco.mjtCamera.mjCAMERA_FREE
+ # Pause simulation
+ elif key == glfw.KEY_SPACE and self._paused is not None:
+ self._paused = not self._paused
+ # Advances simulation by one step.
+ elif key == glfw.KEY_RIGHT and self._paused is not None:
+ self._advance_by_one_step = True
+ self._paused = True
+ # Slows down simulation
+ elif key == glfw.KEY_S:
+ self._run_speed /= 2.0
+ # Speeds up simulation
+ elif key == glfw.KEY_F:
+ self._run_speed *= 2.0
+ # Turn off / turn on rendering every frame.
+ elif key == glfw.KEY_D:
+ self._render_every_frame = not self._render_every_frame
+ # Capture screenshot
+ elif key == glfw.KEY_T:
+ img = np.zeros(
+ (
+ glfw.get_framebuffer_size(self.window)[1],
+ glfw.get_framebuffer_size(self.window)[0],
+ 3,
+ ),
+ dtype=np.uint8,
+ )
+ mujoco.mjr_readPixels(img, None, self.viewport, self.con)
+ imageio.imwrite(self._image_path % self._image_idx, np.flipud(img))
+ self._image_idx += 1
+ # Display contact forces
+ elif key == glfw.KEY_C:
+ self._contacts = not self._contacts
+ self.vopt.flags[mujoco.mjtVisFlag.mjVIS_CONTACTPOINT] = self._contacts
+ self.vopt.flags[mujoco.mjtVisFlag.mjVIS_CONTACTFORCE] = self._contacts
+ # Display coordinate frames
+ elif key == glfw.KEY_E:
+ self.vopt.frame = 1 - self.vopt.frame
+ # Hide overlay menu
+ elif key == glfw.KEY_H:
+ self._hide_menu = not self._hide_menu
+ # Make transparent
+ elif key == glfw.KEY_R:
+ self._transparent = not self._transparent
+ if self._transparent:
+ self.model.geom_rgba[:, 3] /= 5.0
+ else:
+ self.model.geom_rgba[:, 3] *= 5.0
+ # Geom group visibility
+ elif key in (glfw.KEY_0, glfw.KEY_1, glfw.KEY_2, glfw.KEY_3, glfw.KEY_4):
+ self.vopt.geomgroup[key - glfw.KEY_0] ^= 1
+ # Quit
+ if key == glfw.KEY_ESCAPE:
+ print("Pressed ESC")
+ print("Quitting.")
+ glfw.terminate()
+ sys.exit(0)
+
+ def _cursor_pos_callback(self, window, xpos, ypos):
+ if not (self._button_left_pressed or self._button_right_pressed):
+ return
+
+ mod_shift = (
+ glfw.get_key(window, glfw.KEY_LEFT_SHIFT) == glfw.PRESS
+ or glfw.get_key(window, glfw.KEY_RIGHT_SHIFT) == glfw.PRESS
+ )
+ if self._button_right_pressed:
+ action = (
+ mujoco.mjtMouse.mjMOUSE_MOVE_H
+ if mod_shift
+ else mujoco.mjtMouse.mjMOUSE_MOVE_V
+ )
+ elif self._button_left_pressed:
+ action = (
+ mujoco.mjtMouse.mjMOUSE_ROTATE_H
+ if mod_shift
+ else mujoco.mjtMouse.mjMOUSE_ROTATE_V
+ )
+ else:
+ action = mujoco.mjtMouse.mjMOUSE_ZOOM
+
+ dx = int(self._scale * xpos) - self._last_mouse_x
+ dy = int(self._scale * ypos) - self._last_mouse_y
+ width, height = glfw.get_framebuffer_size(window)
+
+ with self._gui_lock:
+ mujoco.mjv_moveCamera(
+ self.model, action, dx / height, dy / height, self.scn, self.cam
+ )
+
+ self._last_mouse_x = int(self._scale * xpos)
+ self._last_mouse_y = int(self._scale * ypos)
+
+ def _mouse_button_callback(self, window, button, act, mods):
+ self._button_left_pressed = (
+ glfw.get_mouse_button(window, glfw.MOUSE_BUTTON_LEFT) == glfw.PRESS
+ )
+ self._button_right_pressed = (
+ glfw.get_mouse_button(window, glfw.MOUSE_BUTTON_RIGHT) == glfw.PRESS
+ )
+
+ x, y = glfw.get_cursor_pos(window)
+ self._last_mouse_x = int(self._scale * x)
+ self._last_mouse_y = int(self._scale * y)
+
+ def _scroll_callback(self, window, x_offset, y_offset):
+ with self._gui_lock:
+ mujoco.mjv_moveCamera(
+ self.model,
+ mujoco.mjtMouse.mjMOUSE_ZOOM,
+ 0,
+ -0.05 * y_offset,
+ self.scn,
+ self.cam,
+ )
+
+ def _create_overlay(self):
+ topleft = mujoco.mjtGridPos.mjGRID_TOPLEFT
+ topright = mujoco.mjtGridPos.mjGRID_TOPRIGHT
+ bottomleft = mujoco.mjtGridPos.mjGRID_BOTTOMLEFT
+ bottomright = mujoco.mjtGridPos.mjGRID_BOTTOMRIGHT
+
+ if self._render_every_frame:
+ self.add_overlay(topleft, "", "")
+ else:
+ self.add_overlay(
+ topleft,
+ "Run speed = %.3f x real time" % self._run_speed,
+ "[S]lower, [F]aster",
+ )
+ self.add_overlay(
+ topleft, "Ren[d]er every frame", "On" if self._render_every_frame else "Off"
+ )
+ self.add_overlay(
+ topleft,
+ "Switch camera (#cams = %d)" % (self.model.ncam + 1),
+ "[Tab] (camera ID = %d)" % self.cam.fixedcamid,
+ )
+ self.add_overlay(topleft, "[C]ontact forces", "On" if self._contacts else "Off")
+ self.add_overlay(topleft, "T[r]ansparent", "On" if self._transparent else "Off")
+ if self._paused is not None:
+ if not self._paused:
+ self.add_overlay(topleft, "Stop", "[Space]")
+ else:
+ self.add_overlay(topleft, "Start", "[Space]")
+ self.add_overlay(
+ topleft, "Advance simulation by one step", "[right arrow]"
+ )
+ self.add_overlay(
+ topleft, "Referenc[e] frames", "On" if self.vopt.frame == 1 else "Off"
+ )
+ self.add_overlay(topleft, "[H]ide Menu", "")
+ if self._image_idx > 0:
+ fname = self._image_path % (self._image_idx - 1)
+ self.add_overlay(topleft, "Cap[t]ure frame", "Saved as %s" % fname)
+ else:
+ self.add_overlay(topleft, "Cap[t]ure frame", "")
+ self.add_overlay(topleft, "Toggle geomgroup visibility", "0-4")
+
+ self.add_overlay(bottomleft, "FPS", "%d%s" % (1 / self._time_per_render, ""))
+ self.add_overlay(
+ bottomleft, "Solver iterations", str(self.data.solver_iter + 1)
+ )
+ self.add_overlay(
+ bottomleft, "Step", str(round(self.data.time / self.model.opt.timestep))
+ )
+ self.add_overlay(bottomleft, "timestep", "%.5f" % self.model.opt.timestep)
+
+ def render(self):
+ # mjv_updateScene, mjr_render, mjr_overlay
+ def update():
+ # fill overlay items
+ self._create_overlay()
+
+ render_start = time.time()
+ if self.window is None:
+ return
+ elif glfw.window_should_close(self.window):
+ glfw.terminate()
+ sys.exit(0)
+ self.viewport.width, self.viewport.height = glfw.get_framebuffer_size(
+ self.window
+ )
+ with self._gui_lock:
+ # update scene
+ mujoco.mjv_updateScene(
+ self.model,
+ self.data,
+ self.vopt,
+ mujoco.MjvPerturb(),
+ self.cam,
+ mujoco.mjtCatBit.mjCAT_ALL.value,
+ self.scn,
+ )
+ # marker items
+ for marker in self._markers:
+ self._add_marker_to_scene(marker)
+ # render
+ mujoco.mjr_render(self.viewport, self.scn, self.con)
+ # overlay items
+ if not self._hide_menu:
+ for gridpos, [t1, t2] in self._overlays.items():
+ mujoco.mjr_overlay(
+ mujoco.mjtFontScale.mjFONTSCALE_150,
+ gridpos,
+ self.viewport,
+ t1,
+ t2,
+ self.con,
+ )
+ glfw.swap_buffers(self.window)
+ glfw.poll_events()
+ self._time_per_render = 0.9 * self._time_per_render + 0.1 * (
+ time.time() - render_start
+ )
+
+ # clear overlay
+ self._overlays.clear()
+
+ if self._paused:
+ while self._paused:
+ update()
+ if self._advance_by_one_step:
+ self._advance_by_one_step = False
+ break
+ else:
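+            # Advance enough simulation steps per rendered frame to keep the
+            # simulation roughly in sync with wall-clock time, scaled by the
+            # user-selected run speed.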
+ self._loop_count += self.model.opt.timestep / (
+ self._time_per_render * self._run_speed
+ )
+ if self._render_every_frame:
+ self._loop_count = 1
+ while self._loop_count > 0:
+ update()
+ self._loop_count -= 1
+
+ # clear markers
+ self._markers[:] = []
+
+ def close(self):
+ glfw.terminate()
+ sys.exit(0)
diff --git a/gym/envs/mujoco/pusher.py b/gym/envs/mujoco/pusher.py
index 3be870e0176..ee6da42be76 100644
--- a/gym/envs/mujoco/pusher.py
+++ b/gym/envs/mujoco/pusher.py
@@ -1,4 +1,3 @@
-import mujoco_py
import numpy as np
from gym import utils
@@ -6,131 +5,11 @@
class PusherEnv(mujoco_env.MujocoEnv, utils.EzPickle):
- """
- ### Description
- "Pusher" is a multi-jointed robot arm which is very similar to that of a human.
- The goal is to move a target cylinder (called *object*) to a goal position using the robot's end effector (called *fingertip*).
- The robot consists of shoulder, elbow, forearm, and wrist joints.
-
- ### Action Space
- The action space is a `Box(-2, 2, (7,), float32)`. An action `(a, b)` represents the torques applied at the hinge joints.
-
- | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|--------------------------------------------------------------------|-------------|-------------|----------------------------------|-------|--------------|
- | 0 | Rotation of the panning the shoulder | -2 | 2 | r_shoulder_pan_joint | hinge | torque (N m) |
- | 1 | Rotation of the shoulder lifting joint | -2 | 2 | r_shoulder_lift_joint | hinge | torque (N m) |
- | 2 | Rotation of the shoulder rolling joint | -2 | 2 | r_upper_arm_roll_joint | hinge | torque (N m) |
- | 3 | Rotation of hinge joint that flexed the elbow | -2 | 2 | r_elbow_flex_joint | hinge | torque (N m) |
- | 4 | Rotation of hinge that rolls the forearm | -2 | 2 | r_forearm_roll_joint | hinge | torque (N m) |
- | 5 | Rotation of flexing the wrist | -2 | 2 | r_wrist_flex_joint | hinge | torque (N m) |
- | 6 | Rotation of rolling the wrist | -2 | 2 | r_wrist_roll_joint | hinge | torque (N m) |
-
- ### Observation Space
-
- Observations consist of
-
- - Angle of rotational joints on the pusher
- - Anglular velocities of rotational joints on the pusher
- - The coordinates of the fingertip of the pusher
- - The coordinates of the object to be moved
- - The coordinates of the goal position
-
- The observation is a `ndarray` with shape `(23,)` where the elements correspond to the table below.
- An analogy can be drawn to a human arm in order to help understand the state space, with the words flex and roll meaning the
- same as human joints.
-
- | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
- |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
- | 0 | Rotation of the panning the shoulder | -Inf | Inf | r_shoulder_pan_joint | hinge | angle (rad) |
- | 1 | Rotation of the shoulder lifting joint | -Inf | Inf | r_shoulder_lift_joint | hinge | angle (rad) |
- | 2 | Rotation of the shoulder rolling joint | -Inf | Inf | r_upper_arm_roll_joint | hinge | angle (rad) |
- | 3 | Rotation of hinge joint that flexed the elbow | -Inf | Inf | r_elbow_flex_joint | hinge | angle (rad) |
- | 4 | Rotation of hinge that rolls the forearm | -Inf | Inf | r_forearm_roll_joint | hinge | angle (rad) |
- | 5 | Rotation of flexing the wrist | -Inf | Inf | r_wrist_flex_joint | hinge | angle (rad) |
- | 6 | Rotation of rolling the wrist | -Inf | Inf | r_wrist_roll_joint | hinge | angle (rad) |
- | 7 | Rotational velocity of the panning the shoulder | -Inf | Inf | r_shoulder_pan_joint | hinge | angular velocity (rad/s) |
- | 8 | Rotational velocity of the shoulder lifting joint | -Inf | Inf | r_shoulder_lift_joint | hinge | angular velocity (rad/s) |
- | 9 | Rotational velocity of the shoulder rolling joint | -Inf | Inf | r_upper_arm_roll_joint | hinge | angular velocity (rad/s) |
- | 10 | Rotational velocity of hinge joint that flexed the elbow | -Inf | Inf | r_elbow_flex_joint | hinge | angular velocity (rad/s) |
- | 11 | Rotational velocity of hinge that rolls the forearm | -Inf | Inf | r_forearm_roll_joint | hinge | angular velocity (rad/s) |
- | 12 | Rotational velocity of flexing the wrist | -Inf | Inf | r_wrist_flex_joint | hinge | angular velocity (rad/s) |
- | 13 | Rotational velocity of rolling the wrist | -Inf | Inf | r_wrist_roll_joint | hinge | angular velocity (rad/s) |
- | 14 | x-coordinate of the fingertip of the pusher | -Inf | Inf | tips_arm | slide | position (m) |
- | 15 | y-coordinate of the fingertip of the pusher | -Inf | Inf | tips_arm | slide | position (m) |
- | 16 | z-coordinate of the fingertip of the pusher | -Inf | Inf | tips_arm | slide | position (m) |
- | 17 | x-coordinate of the object to be moved | -Inf | Inf | object (obj_slidex) | slide | position (m) |
- | 18 | y-coordinate of the object to be moved | -Inf | Inf | object (obj_slidey) | slide | position (m) |
- | 19 | z-coordinate of the object to be moved | -Inf | Inf | object | cylinder | position (m) |
- | 20 | x-coordinate of the goal position of the object | -Inf | Inf | goal (goal_slidex) | slide | position (m) |
- | 21 | y-coordinate of the goal position of the object | -Inf | Inf | goal (goal_slidey) | slide | position (m) |
- | 22 | z-coordinate of the goal position of the object | -Inf | Inf | goal | sphere | position (m) |
-
-
- ### Rewards
- The reward consists of two parts:
- - *reward_near *: This reward is a measure of how far the *fingertip*
- of the pusher (the unattached end) is from the object, with a more negative
- value assigned for when the pusher's *fingertip* is further away from the
- target. It is calculated as the negative vector norm of (position of
- the fingertip - position of target), or *-norm("fingertip" - "target")*.
- - *reward_dist *: This reward is a measure of how far the object is from
- the target goal position, with a more negative value assigned for object is
- further away from the target. It is calculated as the negative vector norm of
- (position of the object - position of goal), or *-norm("object" - "target")*.
- - *reward_control*: A negative reward for penalising the pusher if
- it takes actions that are too large. It is measured as the negative squared
- Euclidean norm of the action, i.e. as *- sum(action2)*.
-
- The total reward returned is ***reward*** *=* *reward_dist + 0.1 * reward_ctrl + 0.5 * reward_near*
-
- Unlike other environments, Pusher does not allow you to specify weights for the individual reward terms.
- However, `info` does contain the keys *reward_dist* and *reward_ctrl*. Thus, if you'd like to weight the terms,
- you should create a wrapper that computes the weighted reward from `info`.
-
-
- ### Starting State
- All pusher (not including object and goal) states start in
- (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0). A uniform noise in the range
- [-0.005, 0.005] is added to the velocity attributes only. The velocities of
- the object and goal are permanently set to 0. The object's x-position is selected uniformly
- between [-0.3, 0] while the y-position is selected uniformly between [-0.2, 0.2], and this
- process is repeated until the vector norm between the object's (x,y) position and origin is not greater
- than 0.17. The goal always have the same position of (0.45, -0.05, -0.323).
-
- The default framerate is 5 with each frame lasting for 0.01, giving rise to a *dt = 5 * 0.01 = 0.05*
-
- ### Episode Termination
-
- The episode terminates when any of the following happens:
-
- 1. The episode duration reaches a 100 timesteps.
- 2. Any of the state space values is no longer finite.
-
- ### Arguments
-
- No additional arguments are currently supported (in v2 and lower),
- but modifications can be made to the XML file in the assets folder
- (or by changing the path to a modified XML file in another folder)..
-
- ```
- env = gym.make('Pusher-v2')
- ```
-
- There is no v3 for Pusher, unlike the robot environments where a v3 and
- beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
-
-
- ### Version History
-
- * v2: All continuous control environments now use mujoco_py >= 1.50
- * v1: max_time_steps raised to 1000 for robot based tasks (not including reacher, which has a max_time_steps of 50). Added reward_threshold to environments.
- * v0: Initial versions release (1.0.0)
-
- """
-
def __init__(self):
utils.EzPickle.__init__(self)
- mujoco_env.MujocoEnv.__init__(self, "pusher.xml", 5)
+ mujoco_env.MujocoEnv.__init__(
+ self, "pusher.xml", 5, mujoco_bindings="mujoco_py"
+ )
def step(self, a):
vec_1 = self.get_body_com("object") - self.get_body_com("tips_arm")
diff --git a/gym/envs/mujoco/pusher_v4.py b/gym/envs/mujoco/pusher_v4.py
new file mode 100644
index 00000000000..afb4b2efb82
--- /dev/null
+++ b/gym/envs/mujoco/pusher_v4.py
@@ -0,0 +1,185 @@
+import numpy as np
+
+from gym import utils
+from gym.envs.mujoco import mujoco_env
+
+
+class PusherEnv(mujoco_env.MujocoEnv, utils.EzPickle):
+ """
+ ### Description
+ "Pusher" is a multi-jointed robot arm which is very similar to that of a human.
+ The goal is to move a target cylinder (called *object*) to a goal position using the robot's end effector (called *fingertip*).
+ The robot consists of shoulder, elbow, forearm, and wrist joints.
+
+ ### Action Space
+    The action space is a `Box(-2, 2, (7,), float32)`. An action represents the torques applied at the seven hinge joints.
+
+ | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|--------------------------------------------------------------------|-------------|-------------|----------------------------------|-------|--------------|
+ | 0 | Rotation of the panning the shoulder | -2 | 2 | r_shoulder_pan_joint | hinge | torque (N m) |
+ | 1 | Rotation of the shoulder lifting joint | -2 | 2 | r_shoulder_lift_joint | hinge | torque (N m) |
+ | 2 | Rotation of the shoulder rolling joint | -2 | 2 | r_upper_arm_roll_joint | hinge | torque (N m) |
+ | 3 | Rotation of hinge joint that flexed the elbow | -2 | 2 | r_elbow_flex_joint | hinge | torque (N m) |
+ | 4 | Rotation of hinge that rolls the forearm | -2 | 2 | r_forearm_roll_joint | hinge | torque (N m) |
+ | 5 | Rotation of flexing the wrist | -2 | 2 | r_wrist_flex_joint | hinge | torque (N m) |
+ | 6 | Rotation of rolling the wrist | -2 | 2 | r_wrist_roll_joint | hinge | torque (N m) |
+
+ ### Observation Space
+
+ Observations consist of
+
+ - Angle of rotational joints on the pusher
+    - Angular velocities of rotational joints on the pusher
+ - The coordinates of the fingertip of the pusher
+ - The coordinates of the object to be moved
+ - The coordinates of the goal position
+
+    The observation is an `ndarray` with shape `(23,)` where the elements correspond to the table below.
+ An analogy can be drawn to a human arm in order to help understand the state space, with the words flex and roll meaning the
+ same as human joints.
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
+ |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
+ | 0 | Rotation of the panning the shoulder | -Inf | Inf | r_shoulder_pan_joint | hinge | angle (rad) |
+ | 1 | Rotation of the shoulder lifting joint | -Inf | Inf | r_shoulder_lift_joint | hinge | angle (rad) |
+ | 2 | Rotation of the shoulder rolling joint | -Inf | Inf | r_upper_arm_roll_joint | hinge | angle (rad) |
+ | 3 | Rotation of hinge joint that flexed the elbow | -Inf | Inf | r_elbow_flex_joint | hinge | angle (rad) |
+ | 4 | Rotation of hinge that rolls the forearm | -Inf | Inf | r_forearm_roll_joint | hinge | angle (rad) |
+ | 5 | Rotation of flexing the wrist | -Inf | Inf | r_wrist_flex_joint | hinge | angle (rad) |
+ | 6 | Rotation of rolling the wrist | -Inf | Inf | r_wrist_roll_joint | hinge | angle (rad) |
+ | 7 | Rotational velocity of the panning the shoulder | -Inf | Inf | r_shoulder_pan_joint | hinge | angular velocity (rad/s) |
+ | 8 | Rotational velocity of the shoulder lifting joint | -Inf | Inf | r_shoulder_lift_joint | hinge | angular velocity (rad/s) |
+ | 9 | Rotational velocity of the shoulder rolling joint | -Inf | Inf | r_upper_arm_roll_joint | hinge | angular velocity (rad/s) |
+ | 10 | Rotational velocity of hinge joint that flexed the elbow | -Inf | Inf | r_elbow_flex_joint | hinge | angular velocity (rad/s) |
+ | 11 | Rotational velocity of hinge that rolls the forearm | -Inf | Inf | r_forearm_roll_joint | hinge | angular velocity (rad/s) |
+ | 12 | Rotational velocity of flexing the wrist | -Inf | Inf | r_wrist_flex_joint | hinge | angular velocity (rad/s) |
+ | 13 | Rotational velocity of rolling the wrist | -Inf | Inf | r_wrist_roll_joint | hinge | angular velocity (rad/s) |
+ | 14 | x-coordinate of the fingertip of the pusher | -Inf | Inf | tips_arm | slide | position (m) |
+ | 15 | y-coordinate of the fingertip of the pusher | -Inf | Inf | tips_arm | slide | position (m) |
+ | 16 | z-coordinate of the fingertip of the pusher | -Inf | Inf | tips_arm | slide | position (m) |
+ | 17 | x-coordinate of the object to be moved | -Inf | Inf | object (obj_slidex) | slide | position (m) |
+ | 18 | y-coordinate of the object to be moved | -Inf | Inf | object (obj_slidey) | slide | position (m) |
+ | 19 | z-coordinate of the object to be moved | -Inf | Inf | object | cylinder | position (m) |
+ | 20 | x-coordinate of the goal position of the object | -Inf | Inf | goal (goal_slidex) | slide | position (m) |
+ | 21 | y-coordinate of the goal position of the object | -Inf | Inf | goal (goal_slidey) | slide | position (m) |
+ | 22 | z-coordinate of the goal position of the object | -Inf | Inf | goal | sphere | position (m) |
+
+
+ ### Rewards
+    The reward consists of three parts:
+    - *reward_near*: This reward is a measure of how far the *fingertip*
+      of the pusher (the unattached end) is from the object, with a more negative
+      value assigned when the pusher's *fingertip* is further away from the
+      object. It is calculated as the negative vector norm of (position of
+      the fingertip - position of the object), or *-norm("fingertip" - "object")*.
+    - *reward_dist*: This reward is a measure of how far the object is from
+      the target goal position, with a more negative value assigned when the object is
+      further away from the target. It is calculated as the negative vector norm of
+      (position of the object - position of goal), or *-norm("object" - "goal")*.
+ - *reward_control*: A negative reward for penalising the pusher if
+ it takes actions that are too large. It is measured as the negative squared
+      Euclidean norm of the action, i.e. as *-sum(action^2)*.
+
+ The total reward returned is ***reward*** *=* *reward_dist + 0.1 * reward_ctrl + 0.5 * reward_near*
+
+ Unlike other environments, Pusher does not allow you to specify weights for the individual reward terms.
+ However, `info` does contain the keys *reward_dist* and *reward_ctrl*. Thus, if you'd like to weight the terms,
+ you should create a wrapper that computes the weighted reward from `info`.
+
+
+ ### Starting State
+ All pusher (not including object and goal) states start in
+ (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0). A uniform noise in the range
+ [-0.005, 0.005] is added to the velocity attributes only. The velocities of
+ the object and goal are permanently set to 0. The object's x-position is selected uniformly
+ between [-0.3, 0] while the y-position is selected uniformly between [-0.2, 0.2], and this
+    process is repeated until the vector norm between the object's (x,y) position and origin is greater
+    than 0.17. The goal always has the same position of (0.45, -0.05, -0.323).
+
+    The default frame skip is 5, with each frame lasting 0.01 seconds, giving rise to *dt = 5 * 0.01 = 0.05* seconds.
+
+ ### Episode Termination
+
+ The episode terminates when any of the following happens:
+
+    1. The episode duration reaches 100 timesteps.
+ 2. Any of the state space values is no longer finite.
+
+ ### Arguments
+
+ No additional arguments are currently supported (in v2 and lower),
+ but modifications can be made to the XML file in the assets folder
+    (or by changing the path to a modified XML file in another folder).
+
+ ```
+ env = gym.make('Pusher-v2')
+ ```
+
+ There is no v3 for Pusher, unlike the robot environments where a v3 and
+ beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
+
+
+ ### Version History
+
+    * v4: All mujoco environments now use the mujoco bindings in mujoco>=2.1.3
+ * v2: All continuous control environments now use mujoco_py >= 1.50
+ * v1: max_time_steps raised to 1000 for robot based tasks (not including reacher, which has a max_time_steps of 50). Added reward_threshold to environments.
+ * v0: Initial versions release (1.0.0)
+
+ """
+
+ def __init__(self):
+ utils.EzPickle.__init__(self)
+ mujoco_env.MujocoEnv.__init__(self, "pusher.xml", 5)
+
+ def step(self, a):
+ vec_1 = self.get_body_com("object") - self.get_body_com("tips_arm")
+ vec_2 = self.get_body_com("object") - self.get_body_com("goal")
+
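+        # reward_near penalises the fingertip-to-object distance, reward_dist the
+        # object-to-goal distance, and reward_ctrl the squared magnitude of the action.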
+ reward_near = -np.linalg.norm(vec_1)
+ reward_dist = -np.linalg.norm(vec_2)
+ reward_ctrl = -np.square(a).sum()
+ reward = reward_dist + 0.1 * reward_ctrl + 0.5 * reward_near
+
+ self.do_simulation(a, self.frame_skip)
+ ob = self._get_obs()
+ done = False
+ return ob, reward, done, dict(reward_dist=reward_dist, reward_ctrl=reward_ctrl)
+
+ def viewer_setup(self):
+ self.viewer.cam.trackbodyid = -1
+ self.viewer.cam.distance = 4.0
+
+ def reset_model(self):
+ qpos = self.init_qpos
+
+ self.goal_pos = np.asarray([0, 0])
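+        # Rejection-sample the object's (x, y) position until it lies more than
+        # 0.17 m from the origin, per the docstring's Starting State section.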
+ while True:
+ self.cylinder_pos = np.concatenate(
+ [
+ self.np_random.uniform(low=-0.3, high=0, size=1),
+ self.np_random.uniform(low=-0.2, high=0.2, size=1),
+ ]
+ )
+ if np.linalg.norm(self.cylinder_pos - self.goal_pos) > 0.17:
+ break
+
+ qpos[-4:-2] = self.cylinder_pos
+ qpos[-2:] = self.goal_pos
+ qvel = self.init_qvel + self.np_random.uniform(
+ low=-0.005, high=0.005, size=self.model.nv
+ )
+ qvel[-4:] = 0
+ self.set_state(qpos, qvel)
+ return self._get_obs()
+
+ def _get_obs(self):
+ return np.concatenate(
+ [
+ self.data.qpos.flat[:7],
+ self.data.qvel.flat[:7],
+ self.get_body_com("tips_arm"),
+ self.get_body_com("object"),
+ self.get_body_com("goal"),
+ ]
+ )
diff --git a/gym/envs/mujoco/reacher.py b/gym/envs/mujoco/reacher.py
index 0df3a974bd3..5fd1834bcc5 100644
--- a/gym/envs/mujoco/reacher.py
+++ b/gym/envs/mujoco/reacher.py
@@ -5,121 +5,11 @@
class ReacherEnv(mujoco_env.MujocoEnv, utils.EzPickle):
- """
- ### Description
- "Reacher" is a two-jointed robot arm. The goal is to move the robot's end effector (called *fingertip*) close to a
- target that is spawned at a random position.
-
- ### Action Space
- The action space is a `Box(-1, 1, (2,), float32)`. An action `(a, b)` represents the torques applied at the hinge joints.
-
- | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|---------------------------------------------------------------------------------|-------------|-------------|--------------------------|-------|------|
- | 0 | Torque applied at the first hinge (connecting the link to the point of fixture) | -1 | 1 | joint0 | hinge | torque (N m) |
- | 1 | Torque applied at the second hinge (connecting the two links) | -1 | 1 | joint1 | hinge | torque (N m) |
-
- ### Observation Space
-
- Observations consist of
-
- - The cosine of the angles of the two arms
- - The sine of the angles of the two arms
- - The coordinates of the target
- - The angular velocities of the arms
- - The vector between the target and the reacher's fingertip (3 dimensional with the last element being 0)
-
- The observation is a `ndarray` with shape `(11,)` where the elements correspond to the following:
-
- | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
- |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
- | 0 | cosine of the angle of the first arm | -Inf | Inf | cos(joint0) | hinge | unitless |
- | 1 | cosine of the angle of the second arm | -Inf | Inf | cos(joint1) | hinge | unitless |
- | 2 | sine of the angle of the first arm | -Inf | Inf | cos(joint0) | hinge | unitless |
- | 3 | sine of the angle of the second arm | -Inf | Inf | cos(joint1) | hinge | unitless |
- | 4 | x-coorddinate of the target | -Inf | Inf | target_x | slide | position (m) |
- | 5 | y-coorddinate of the target | -Inf | Inf | target_y | slide | position (m) |
- | 6 | angular velocity of the first arm | -Inf | Inf | joint0 | hinge | angular velocity (rad/s) |
- | 7 | angular velocity of the second arm | -Inf | Inf | joint1 | hinge | angular velocity (rad/s) |
- | 8 | x-value of position_fingertip - position_target | -Inf | Inf | NA | slide | position (m) |
- | 9 | y-value of position_fingertip - position_target | -Inf | Inf | NA | slide | position (m) |
- | 10 | z-value of position_fingertip - position_target (0 since reacher is 2d and z is same for both) | -Inf | Inf | NA | slide | position (m) |
-
-
- Most Gym environments just return the positions and velocity of the
- joints in the `.xml` file as the state of the environment. However, in
- reacher the state is created by combining only certain elements of the
- position and velocity, and performing some function transformations on them.
- If one is to read the `.xml` for reacher then they will find 4 joints:
-
- | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|-----------------------------|----------|----------|----------------------------------|-------|--------------------|
- | 0 | angle of the first arm | -Inf | Inf | joint0 | hinge | angle (rad) |
- | 1 | angle of the second arm | -Inf | Inf | joint1 | hinge | angle (rad) |
- | 2 | x-coordinate of the target | -Inf | Inf | target_x | slide | position (m) |
- | 3 | y-coordinate of the target | -Inf | Inf | target_y | slide | position (m) |
-
-
- ### Rewards
- The reward consists of two parts:
- - *reward_distance*: This reward is a measure of how far the *fingertip*
- of the reacher (the unattached end) is from the target, with a more negative
- value assigned for when the reacher's *fingertip* is further away from the
- target. It is calculated as the negative vector norm of (position of
- the fingertip - position of target), or *-norm("fingertip" - "target")*.
- - *reward_control*: A negative reward for penalising the walker if
- it takes actions that are too large. It is measured as the negative squared
- Euclidean norm of the action, i.e. as *- sum(action2)*.
-
- The total reward returned is ***reward*** *=* *reward_distance + reward_control*
-
- Unlike other environments, Reacher does not allow you to specify weights for the individual reward terms.
- However, `info` does contain the keys *reward_dist* and *reward_ctrl*. Thus, if you'd like to weight the terms,
- you should create a wrapper that computes the weighted reward from `info`.
-
-
- ### Starting State
- All observations start in state
- (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
- with a noise added for stochasticity. A uniform noise in the range
- [-0.1, 0.1] is added to the positional attributes, while the target position
- is selected uniformly at random in a disk of radius 0.2 around the origin.
- Independent, uniform noise in the
- range of [-0.005, 0.005] is added to the velocities, and the last
- element ("fingertip" - "target") is calculated at the end once everything
- is set. The default setting has a framerate of 2 and a *dt = 2 * 0.01 = 0.02*
-
- ### Episode Termination
-
- The episode terminates when any of the following happens:
-
- 1. The episode duration reaches a 50 timesteps (with a new random target popping up if the reacher's fingertip reaches it before 50 timesteps)
- 2. Any of the state space values is no longer finite.
-
- ### Arguments
-
- No additional arguments are currently supported (in v2 and lower),
- but modifications can be made to the XML file in the assets folder
- (or by changing the path to a modified XML file in another folder)..
-
- ```
- env = gym.make('Reacher-v2')
- ```
-
- There is no v3 for Reacher, unlike the robot environments where a v3 and
- beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
-
-
- ### Version History
-
- * v2: All continuous control environments now use mujoco_py >= 1.50
- * v1: max_time_steps raised to 1000 for robot based tasks (not including reacher, which has a max_time_steps of 50). Added reward_threshold to environments.
- * v0: Initial versions release (1.0.0)
-
- """
-
def __init__(self):
utils.EzPickle.__init__(self)
- mujoco_env.MujocoEnv.__init__(self, "reacher.xml", 2)
+ mujoco_env.MujocoEnv.__init__(
+ self, "reacher.xml", 2, mujoco_bindings="mujoco_py"
+ )
def step(self, a):
vec = self.get_body_com("fingertip") - self.get_body_com("target")
diff --git a/gym/envs/mujoco/reacher_v4.py b/gym/envs/mujoco/reacher_v4.py
new file mode 100644
index 00000000000..152ba86bbd4
--- /dev/null
+++ b/gym/envs/mujoco/reacher_v4.py
@@ -0,0 +1,165 @@
+import numpy as np
+
+from gym import utils
+from gym.envs.mujoco import mujoco_env
+
+
+class ReacherEnv(mujoco_env.MujocoEnv, utils.EzPickle):
+ """
+ ### Description
+ "Reacher" is a two-jointed robot arm. The goal is to move the robot's end effector (called *fingertip*) close to a
+ target that is spawned at a random position.
+
+ ### Action Space
+ The action space is a `Box(-1, 1, (2,), float32)`. An action `(a, b)` represents the torques applied at the hinge joints.
+
+ | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|---------------------------------------------------------------------------------|-------------|-------------|--------------------------|-------|------|
+ | 0 | Torque applied at the first hinge (connecting the link to the point of fixture) | -1 | 1 | joint0 | hinge | torque (N m) |
+ | 1 | Torque applied at the second hinge (connecting the two links) | -1 | 1 | joint1 | hinge | torque (N m) |
+
+ ### Observation Space
+
+ Observations consist of
+
+ - The cosine of the angles of the two arms
+ - The sine of the angles of the two arms
+ - The coordinates of the target
+ - The angular velocities of the arms
+ - The vector between the target and the reacher's fingertip (3 dimensional with the last element being 0)
+
+ The observation is a `ndarray` with shape `(11,)` where the elements correspond to the following:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
+ |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
+ | 0 | cosine of the angle of the first arm | -Inf | Inf | cos(joint0) | hinge | unitless |
+ | 1 | cosine of the angle of the second arm | -Inf | Inf | cos(joint1) | hinge | unitless |
+ | 2 | sine of the angle of the first arm | -Inf | Inf | sin(joint0) | hinge | unitless |
+ | 3 | sine of the angle of the second arm | -Inf | Inf | sin(joint1) | hinge | unitless |
+ | 4 | x-coordinate of the target | -Inf | Inf | target_x | slide | position (m) |
+ | 5 | y-coordinate of the target | -Inf | Inf | target_y | slide | position (m) |
+ | 6 | angular velocity of the first arm | -Inf | Inf | joint0 | hinge | angular velocity (rad/s) |
+ | 7 | angular velocity of the second arm | -Inf | Inf | joint1 | hinge | angular velocity (rad/s) |
+ | 8 | x-value of position_fingertip - position_target | -Inf | Inf | NA | slide | position (m) |
+ | 9 | y-value of position_fingertip - position_target | -Inf | Inf | NA | slide | position (m) |
+ | 10 | z-value of position_fingertip - position_target (0 since reacher is 2d and z is same for both) | -Inf | Inf | NA | slide | position (m) |
+
+
+ Most Gym environments just return the positions and velocity of the
+ joints in the `.xml` file as the state of the environment. However, in
+ reacher the state is created by combining only certain elements of the
+ position and velocity, and performing some function transformations on them.
+ If one reads the `.xml` for reacher, they will find 4 joints:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|-----------------------------|----------|----------|----------------------------------|-------|--------------------|
+ | 0 | angle of the first arm | -Inf | Inf | joint0 | hinge | angle (rad) |
+ | 1 | angle of the second arm | -Inf | Inf | joint1 | hinge | angle (rad) |
+ | 2 | x-coordinate of the target | -Inf | Inf | target_x | slide | position (m) |
+ | 3 | y-coordinate of the target | -Inf | Inf | target_y | slide | position (m) |
+
+
+ ### Rewards
+ The reward consists of two parts:
+ - *reward_distance*: This reward is a measure of how far the *fingertip*
+ of the reacher (the unattached end) is from the target, with a more negative
+ value assigned for when the reacher's *fingertip* is further away from the
+ target. It is calculated as the negative vector norm of (position of
+ the fingertip - position of target), or *-norm("fingertip" - "target")*.
+ - *reward_control*: A negative reward for penalising the reacher if
+ it takes actions that are too large. It is measured as the negative squared
+ Euclidean norm of the action, i.e. as *-sum(action²)*.
+
+ The total reward returned is ***reward*** *=* *reward_distance + reward_control*
+
+ Unlike other environments, Reacher does not allow you to specify weights for the individual reward terms.
+ However, `info` does contain the keys *reward_dist* and *reward_ctrl*. Thus, if you'd like to weight the terms,
+ you should create a wrapper that computes the weighted reward from `info`.
+
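+ A minimal sketch of such a wrapper (the class name and weights below are
+ illustrative, not part of Gym):
+
+ ```
+ import gym
+
+ class WeightedReacherReward(gym.Wrapper):  # hypothetical helper
+     def __init__(self, env, dist_weight=1.0, ctrl_weight=0.1):
+         super().__init__(env)
+         self.dist_weight = dist_weight
+         self.ctrl_weight = ctrl_weight
+
+     def step(self, action):
+         obs, _, done, info = self.env.step(action)
+         # Recombine the unweighted terms reported in `info`.
+         reward = (self.dist_weight * info["reward_dist"]
+                   + self.ctrl_weight * info["reward_ctrl"])
+         return obs, reward, done, info
+ ```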
+
+ ### Starting State
+ All observations start in state
+ (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
+ with noise added for stochasticity. A uniform noise in the range
+ [-0.1, 0.1] is added to the positional attributes, while the target position
+ is selected uniformly at random in a disk of radius 0.2 around the origin.
+ Independent, uniform noise in the
+ range of [-0.005, 0.005] is added to the velocities, and the last
+ element ("fingertip" - "target") is calculated at the end once everything
+ is set. The default frame skip is 2, with each frame lasting 0.01 seconds, giving a *dt = 2 * 0.01 = 0.02*.
+
+ ### Episode Termination
+
+ The episode terminates when any of the following happens:
+
+ 1. The episode duration reaches 50 timesteps (with a new random target popping up if the reacher's fingertip reaches it before 50 timesteps)
+ 2. Any of the state space values is no longer finite.
+
+ ### Arguments
+
+ No additional arguments are currently supported (in v2 and lower),
+ but modifications can be made to the XML file in the assets folder
+ (or by changing the path to a modified XML file in another folder).
+
+ ```
+ env = gym.make('Reacher-v2')
+ ```
+
+ There is no v3 for Reacher, unlike the robot environments where a v3 and
+ beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
+
+
+ ### Version History
+
+ * v4: All MuJoCo environments now use the mujoco bindings in mujoco >= 2.1.3
+ * v2: All continuous control environments now use mujoco_py >= 1.50
+ * v1: max_time_steps raised to 1000 for robot based tasks (not including reacher, which has a max_time_steps of 50). Added reward_threshold to environments.
+ * v0: Initial versions release (1.0.0)
+
+ """
+
+ def __init__(self):
+ utils.EzPickle.__init__(self)
+ mujoco_env.MujocoEnv.__init__(self, "reacher.xml", 2)
+
+ def step(self, a):
+ vec = self.get_body_com("fingertip") - self.get_body_com("target")
+ reward_dist = -np.linalg.norm(vec)
+ reward_ctrl = -np.square(a).sum()
+ reward = reward_dist + reward_ctrl
+ self.do_simulation(a, self.frame_skip)
+ ob = self._get_obs()
+ done = False
+ return ob, reward, done, dict(reward_dist=reward_dist, reward_ctrl=reward_ctrl)
+
+ def viewer_setup(self):
+ self.viewer.cam.trackbodyid = 0
+
+ def reset_model(self):
+ qpos = (
+ self.np_random.uniform(low=-0.1, high=0.1, size=self.model.nq)
+ + self.init_qpos
+ )
+ while True:
+ self.goal = self.np_random.uniform(low=-0.2, high=0.2, size=2)
+ if np.linalg.norm(self.goal) < 0.2:
+ break
+ qpos[-2:] = self.goal
+ qvel = self.init_qvel + self.np_random.uniform(
+ low=-0.005, high=0.005, size=self.model.nv
+ )
+ qvel[-2:] = 0
+ self.set_state(qpos, qvel)
+ return self._get_obs()
+
+ def _get_obs(self):
+ theta = self.data.qpos.flat[:2]
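+ # Observation layout matches the table in the docstring: cos/sin of both
+ # joint angles, target (x, y) position, joint angular velocities, and
+ # the fingertip-to-target vector.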
+ return np.concatenate(
+ [
+ np.cos(theta),
+ np.sin(theta),
+ self.data.qpos.flat[2:],
+ self.data.qvel.flat[:2],
+ self.get_body_com("fingertip") - self.get_body_com("target"),
+ ]
+ )
diff --git a/gym/envs/mujoco/swimmer.py b/gym/envs/mujoco/swimmer.py
index 429852f79a8..bfa6a8895fc 100644
--- a/gym/envs/mujoco/swimmer.py
+++ b/gym/envs/mujoco/swimmer.py
@@ -6,7 +6,9 @@
class SwimmerEnv(mujoco_env.MujocoEnv, utils.EzPickle):
def __init__(self):
- mujoco_env.MujocoEnv.__init__(self, "swimmer.xml", 4)
+ mujoco_env.MujocoEnv.__init__(
+ self, "swimmer.xml", 4, mujoco_bindings="mujoco_py"
+ )
utils.EzPickle.__init__(self)
def step(self, a):
diff --git a/gym/envs/mujoco/swimmer_v3.py b/gym/envs/mujoco/swimmer_v3.py
index db07f238f28..18f848c3678 100644
--- a/gym/envs/mujoco/swimmer_v3.py
+++ b/gym/envs/mujoco/swimmer_v3.py
@@ -9,119 +9,6 @@
class SwimmerEnv(mujoco_env.MujocoEnv, utils.EzPickle):
- """
- ### Description
-
- This environment corresponds to the Swimmer environment described in Rémi Coulom's PhD thesis
- ["Reinforcement Learning Using Neural Networks, with Applications to Motor Control"](https://tel.archives-ouvertes.fr/tel-00003985/document).
- The environment aims to increase the number of independent state and control
- variables as compared to the classic control environments. The swimmers
- consist of three or more segments ('***links***') and one less articulation
- joints ('***rotors***') - one rotor joint connecting exactly two links to
- form a linear chain. The swimmer is suspended in a two dimensional pool and
- always starts in the same position (subject to some deviation drawn from an
- uniform distribution), and the goal is to move as fast as possible towards
- the right by applying torque on the rotors and using the fluids friction.
-
- ### Notes
-
- The problem parameters are:
- Problem parameters:
- * *n*: number of body parts
- * *mi*: mass of part *i* (*i* ∈ {1...n})
- * *li*: length of part *i* (*i* ∈ {1...n})
- * *k*: viscous-friction coefficient
-
- While the default environment has *n* = 3, *li* = 0.1,
- and *k* = 0.1. It is possible to pass a custom MuJoCo XML file during construction to increase the
- number of links, or to tweak any of the parameters.
-
- ### Action Space
- The action space is a `Box(-1, 1, (2,), float32)`. An action represents the torques applied between *links*
-
- | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|------------------------------------|-------------|-------------|----------------------------------|-------|--------------|
- | 0 | Torque applied on the first rotor | -1 | 1 | rot2 | hinge | torque (N m) |
- | 1 | Torque applied on the second rotor | -1 | 1 | rot3 | hinge | torque (N m) |
-
- ### Observation Space
-
- By default, observations consists of:
- * θi: angle of part *i* with respect to the *x* axis
- * θi': its derivative with respect to time (angular velocity)
-
- In the default case, observations do not include the x- and y-coordinates of the front tip. These may
- be included by passing `exclude_current_positions_from_observation=False` during construction.
- Then, the observation space will have 10 dimensions where the first two dimensions
- represent the x- and y-coordinates of the front tip.
- Regardless of whether `exclude_current_positions_from_observation` was set to true or false, the x- and y-coordinates
- will be returned in `info` with keys `"x_position"` and `"y_position"`, respectively.
-
- By default, the observation is a `ndarray` with shape `(8,)` where the elements correspond to the following:
-
- | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
- |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
- | 0 | angle of the front tip | -Inf | Inf | rot | hinge | angle (rad) |
- | 1 | angle of the second rotor | -Inf | Inf | rot2 | hinge | angle (rad) |
- | 2 | angle of the second rotor | -Inf | Inf | rot3 | hinge | angle (rad) |
- | 3 | velocity of the tip along the x-axis | -Inf | Inf | slider1 | slide | velocity (m/s) |
- | 4 | velocity of the tip along the y-axis | -Inf | Inf | slider2 | slide | velocity (m/s) |
- | 5 | angular velocity of front tip | -Inf | Inf | rot | hinge | angular velocity (rad/s) |
- | 6 | angular velocity of second rotor | -Inf | Inf | rot2 | hinge | angular velocity (rad/s) |
- | 7 | angular velocity of third rotor | -Inf | Inf | rot3 | hinge | angular velocity (rad/s) |
-
- ### Rewards
- The reward consists of two parts:
- - *forward_reward*: A reward of moving forward which is measured
- as *`forward_reward_weight` * (x-coordinate before action - x-coordinate after action)/dt*. *dt* is
- the time between actions and is dependent on the frame_skip parameter
- (default is 4), where the frametime is 0.01 - making the
- default *dt = 4 * 0.01 = 0.04*. This reward would be positive if the swimmer
- swims right as desired.
- - *ctrl_cost*: A cost for penalising the swimmer if it takes
- actions that are too large. It is measured as *`ctrl_cost_weight` *
- sum(action2)* where *`ctrl_cost_weight`* is a parameter set for the
- control and has a default value of 1e-4
-
- The total reward returned is ***reward*** *=* *forward_reward - ctrl_cost* and `info` will also contain the individual reward terms
-
- ### Starting State
- All observations start in state (0,0,0,0,0,0,0,0) with a Uniform noise in the range of [-`reset_noise_scale`, `reset_noise_scale`] is added to the initial state for stochasticity.
-
- ### Episode Termination
- The episode terminates when the episode length is greater than 1000.
-
- ### Arguments
-
- No additional arguments are currently supported in v2 and lower.
-
- ```
- gym.make('Swimmer-v2')
- ```
-
- v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
-
- ```
- env = gym.make('Swimmer-v3', ctrl_cost_weight=0.1, ....)
- ```
-
- | Parameter | Type | Default |Description |
- |-------------------------|------------|--------------|-------------------------------|
- | `xml_file` | **str** | `"swimmer.xml"`| Path to a MuJoCo model |
- | `forward_reward_weight` | **float** | `1.0` | Weight for *forward_reward* term (see section on reward) |
- | `ctrl_cost_weight` | **float** | `1e-4` | Weight for *ctrl_cost* term (see section on reward) |
- | `reset_noise_scale` | **float** | `0.1` | Scale of random perturbations of initial position and velocity (see section on Starting State) |
- | `exclude_current_positions_from_observation`| **bool** | `True`| Whether or not to omit the x- and y-coordinates from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behavior in policies |
-
-
- ### Version History
-
- * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
- * v2: All continuous control environments now use mujoco_py >= 1.50
- * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
- * v0: Initial versions release (1.0.0)
- """
-
def __init__(
self,
xml_file="swimmer.xml",
@@ -141,7 +28,7 @@ def __init__(
exclude_current_positions_from_observation
)
- mujoco_env.MujocoEnv.__init__(self, xml_file, 4)
+ mujoco_env.MujocoEnv.__init__(self, xml_file, 4, mujoco_bindings="mujoco_py")
def control_cost(self, action):
control_cost = self._ctrl_cost_weight * np.sum(np.square(action))
diff --git a/gym/envs/mujoco/swimmer_v4.py b/gym/envs/mujoco/swimmer_v4.py
new file mode 100644
index 00000000000..76c380eb428
--- /dev/null
+++ b/gym/envs/mujoco/swimmer_v4.py
@@ -0,0 +1,217 @@
+__credits__ = ["Rushiv Arora"]
+
+import numpy as np
+
+from gym import utils
+from gym.envs.mujoco import mujoco_env
+
+DEFAULT_CAMERA_CONFIG = {}
+
+
+class SwimmerEnv(mujoco_env.MujocoEnv, utils.EzPickle):
+ """
+ ### Description
+
+ This environment corresponds to the Swimmer environment described in Rémi Coulom's PhD thesis
+ ["Reinforcement Learning Using Neural Networks, with Applications to Motor Control"](https://tel.archives-ouvertes.fr/tel-00003985/document).
+ The environment aims to increase the number of independent state and control
+ variables as compared to the classic control environments. The swimmers
+ consist of three or more segments ('***links***') and one fewer articulation
+ joints ('***rotors***') - one rotor joint connecting exactly two links to
+ form a linear chain. The swimmer is suspended in a two-dimensional pool and
+ always starts in the same position (subject to some deviation drawn from a
+ uniform distribution), and the goal is to move as fast as possible towards
+ the right by applying torque on the rotors and using the fluid's friction.
+
+ ### Notes
+
+ The problem parameters are:
+ * *n*: number of body parts
+ * *mi*: mass of part *i* (*i* ∈ {1...n})
+ * *li*: length of part *i* (*i* ∈ {1...n})
+ * *k*: viscous-friction coefficient
+
+ The default environment has *n* = 3, *li* = 0.1,
+ and *k* = 0.1. It is possible to tweak the MuJoCo XML files to increase the
+ number of links, or to tweak any of the parameters.
+
+ ### Action Space
+ The agent takes a 2-element vector for actions.
+ The action space is a continuous `(action, action)` in `[-1, 1]`, where
+ each `action` represents the numerical torque applied between *links*
+
+ | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|------------------------------------|-------------|-------------|----------------------------------|-------|--------------|
+ | 0 | Torque applied on the first rotor | -1 | 1 | rot2 | hinge | torque (N m) |
+ | 1 | Torque applied on the second rotor | -1 | 1 | rot3 | hinge | torque (N m) |
+
+ ### Observation Space
+
+ The state space consists of:
+ * A0: position of first point
+ * θi: angle of part *i* with respect to the *x* axis
+ * A0, θi: their derivatives with respect to time (velocity and angular velocity)
+
+ The state is a `ndarray` with shape `(10,)` where the elements correspond to the following:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
+ |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
+ | 0 | x-coordinate of the front tip | -Inf | Inf | slider1 | slide | position (m) |
+ | 1 | y-coordinate of the front tip | -Inf | Inf | slider2 | slide | position (m) |
+ | 2 | angle of the front tip | -Inf | Inf | rot | hinge | angle (rad) |
+ | 3 | angle of the second rotor | -Inf | Inf | rot2 | hinge | angle (rad) |
+ | 4 | angle of the third rotor | -Inf | Inf | rot3 | hinge | angle (rad) |
+ | 5 | velocity of the tip along the x-axis | -Inf | Inf | slider1 | slide | velocity (m/s) |
+ | 6 | velocity of the tip along the y-axis | -Inf | Inf | slider2 | slide | velocity (m/s) |
+ | 7 | angular velocity of front tip | -Inf | Inf | rot | hinge | angular velocity (rad/s) |
+ | 8 | angular velocity of second rotor | -Inf | Inf | rot2 | hinge | angular velocity (rad/s) |
+ | 9 | angular velocity of third rotor | -Inf | Inf | rot3 | hinge | angular velocity (rad/s) |
+
+ **Note:**
+ In practice (and Gym implementation), the first two positional elements are
+ omitted from the state space since the reward function is calculated based
+ on those values. Therefore, observation space has shape `(8,)` and looks like:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint| Unit |
+ |-----|-----------------------|----------------------|--------------------|----------------------|--------------------|--------------------|
+ | 0 | angle of the front tip | -Inf | Inf | rot | hinge | angle (rad) |
+ | 1 | angle of the second rotor | -Inf | Inf | rot2 | hinge | angle (rad) |
+ | 2 | angle of the third rotor | -Inf | Inf | rot3 | hinge | angle (rad) |
+ | 3 | velocity of the tip along the x-axis | -Inf | Inf | slider1 | slide | velocity (m/s) |
+ | 4 | velocity of the tip along the y-axis | -Inf | Inf | slider2 | slide | velocity (m/s) |
+ | 5 | angular velocity of front tip | -Inf | Inf | rot | hinge | angular velocity (rad/s) |
+ | 6 | angular velocity of second rotor | -Inf | Inf | rot2 | hinge | angular velocity (rad/s) |
+ | 7 | angular velocity of third rotor | -Inf | Inf | rot3 | hinge | angular velocity (rad/s) |
+
+ ### Rewards
+ The reward consists of two parts:
+ - *reward_front*: A reward of moving forward which is measured
+ as *(x-coordinate after action - x-coordinate before action)/dt*. *dt* is
+ the time between actions and is dependent on the frame_skip parameter
+ (default is 4), where the *dt* for one frame is 0.01 - making the
+ default *dt = 4 * 0.01 = 0.04*. This reward would be positive if the swimmer
+ swims right as desired.
+ - *reward_control*: A negative reward for penalising the swimmer if it takes
+ actions that are too large. It is measured as *-coefficient ×
+ sum(action²)* where *coefficient* is a parameter set for the
+ control and has a default value of 0.0001
+
+ The total reward returned is ***reward*** *=* *reward_front + reward_control*
+
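+ For reference, a minimal interaction sketch that recovers the two terms
+ from `info` (assuming the environment is registered as `Swimmer-v4`; the
+ key names follow the `step` implementation below):
+
+ ```
+ import gym
+
+ env = gym.make("Swimmer-v4")
+ obs = env.reset()
+ obs, reward, done, info = env.step(env.action_space.sample())
+ # reward == info["reward_fwd"] + info["reward_ctrl"]
+ ```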
+ ### Starting State
+ All observations start in state (0,0,0,0,0,0,0,0), with uniform noise in the range of [-0.1, 0.1] added to the initial state for stochasticity.
+
+ ### Episode Termination
+ The episode terminates when the episode length is greater than 1000.
+
+ ### Arguments
+
+ No additional arguments are currently supported (in v2 and lower), but
+ modifications can be made to the XML file in the assets folder
+ (or by changing the path to a modified XML file in another folder).
+
+ ```
+ gym.make('Swimmer-v2')
+ ```
+
+ v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
+
+ ```
+ env = gym.make('Swimmer-v3', ctrl_cost_weight=0.1, ....)
+ ```
+
+ ### Version History
+
+ * v4: All MuJoCo environments now use the mujoco bindings in mujoco >= 2.1.3
+ * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
+ * v2: All continuous control environments now use mujoco_py >= 1.50
+ * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
+ * v0: Initial versions release (1.0.0)
+ """
+
+ def __init__(
+ self,
+ xml_file="swimmer.xml",
+ forward_reward_weight=1.0,
+ ctrl_cost_weight=1e-4,
+ reset_noise_scale=0.1,
+ exclude_current_positions_from_observation=True,
+ ):
+ utils.EzPickle.__init__(**locals())
+
+ self._forward_reward_weight = forward_reward_weight
+ self._ctrl_cost_weight = ctrl_cost_weight
+
+ self._reset_noise_scale = reset_noise_scale
+
+ self._exclude_current_positions_from_observation = (
+ exclude_current_positions_from_observation
+ )
+
+ mujoco_env.MujocoEnv.__init__(self, xml_file, 4)
+
+ def control_cost(self, action):
+ control_cost = self._ctrl_cost_weight * np.sum(np.square(action))
+ return control_cost
+
+ def step(self, action):
+ xy_position_before = self.data.qpos[0:2].copy()
+ self.do_simulation(action, self.frame_skip)
+ xy_position_after = self.data.qpos[0:2].copy()
+
+ xy_velocity = (xy_position_after - xy_position_before) / self.dt
+ x_velocity, y_velocity = xy_velocity
+
+ forward_reward = self._forward_reward_weight * x_velocity
+
+ ctrl_cost = self.control_cost(action)
+
+ observation = self._get_obs()
+ reward = forward_reward - ctrl_cost
+ done = False
+ info = {
+ "reward_fwd": forward_reward,
+ "reward_ctrl": -ctrl_cost,
+ "x_position": xy_position_after[0],
+ "y_position": xy_position_after[1],
+ "distance_from_origin": np.linalg.norm(xy_position_after, ord=2),
+ "x_velocity": x_velocity,
+ "y_velocity": y_velocity,
+ "forward_reward": forward_reward,
+ }
+
+ return observation, reward, done, info
+
+ def _get_obs(self):
+ position = self.data.qpos.flat.copy()
+ velocity = self.data.qvel.flat.copy()
+
+ if self._exclude_current_positions_from_observation:
+ position = position[2:]
+
+ observation = np.concatenate([position, velocity]).ravel()
+ return observation
+
+ def reset_model(self):
+ noise_low = -self._reset_noise_scale
+ noise_high = self._reset_noise_scale
+
+ qpos = self.init_qpos + self.np_random.uniform(
+ low=noise_low, high=noise_high, size=self.model.nq
+ )
+ qvel = self.init_qvel + self.np_random.uniform(
+ low=noise_low, high=noise_high, size=self.model.nv
+ )
+
+ self.set_state(qpos, qvel)
+
+ observation = self._get_obs()
+ return observation
+
+ def viewer_setup(self):
+ for key, value in DEFAULT_CAMERA_CONFIG.items():
+ if isinstance(value, np.ndarray):
+ getattr(self.viewer.cam, key)[:] = value
+ else:
+ setattr(self.viewer.cam, key, value)
diff --git a/gym/envs/mujoco/walker2d.py b/gym/envs/mujoco/walker2d.py
index 915ff45f7f1..fc35d633bd0 100644
--- a/gym/envs/mujoco/walker2d.py
+++ b/gym/envs/mujoco/walker2d.py
@@ -6,7 +6,9 @@
class Walker2dEnv(mujoco_env.MujocoEnv, utils.EzPickle):
def __init__(self):
- mujoco_env.MujocoEnv.__init__(self, "walker2d.xml", 4)
+ mujoco_env.MujocoEnv.__init__(
+ self, "walker2d.xml", 4, mujoco_bindings="mujoco_py"
+ )
utils.EzPickle.__init__(self)
def step(self, a):
diff --git a/gym/envs/mujoco/walker2d_v3.py b/gym/envs/mujoco/walker2d_v3.py
index 2c091810f28..a03ef972453 100644
--- a/gym/envs/mujoco/walker2d_v3.py
+++ b/gym/envs/mujoco/walker2d_v3.py
@@ -12,137 +12,6 @@
class Walker2dEnv(mujoco_env.MujocoEnv, utils.EzPickle):
- """
- ### Description
-
- This environment builds on the hopper environment based on the work done by Erez, Tassa, and Todorov
- in ["Infinite Horizon Model Predictive Control for Nonlinear Periodic Tasks"](http://www.roboticsproceedings.org/rss07/p10.pdf)
- by adding another set of legs making it possible for the robot to walker forward instead of
- hop. Like other Mujoco environments, this environment aims to increase the number of independent state
- and control variables as compared to the classic control environments. The walker is a
- two-dimensional two-legged figure that consist of four main body parts - a single torso at the top
- (with the two legs splitting after the torso), two thighs in the middle below the torso, two legs
- in the bottom below the thighs, and two feet attached to the legs on which the entire body rests.
- The goal is to make coordinate both sets of feet, legs, and thighs to move in the forward (right)
- direction by applying torques on the six hinges connecting the six body parts.
-
- ### Action Space
- The action space is a `Box(-1, 1, (6,), float32)`. An action represents the torques applied at the hinge joints.
-
- | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|----------------------------------------|-------------|-------------|----------------------------------|-------|--------------|
- | 0 | Torque applied on the thigh rotor | -1 | 1 | thigh_joint | hinge | torque (N m) |
- | 1 | Torque applied on the leg rotor | -1 | 1 | leg_joint | hinge | torque (N m) |
- | 2 | Torque applied on the foot rotor | -1 | 1 | foot_joint | hinge | torque (N m) |
- | 3 | Torque applied on the left thigh rotor | -1 | 1 | thigh_left_joint | hinge | torque (N m) |
- | 4 | Torque applied on the left leg rotor | -1 | 1 | leg_left_joint | hinge | torque (N m) |
- | 5 | Torque applied on the left foot rotor | -1 | 1 | foot_left_joint | hinge | torque (N m) |
-
- ### Observation Space
-
- Observations consist of positional values of different body parts of the walker,
- followed by the velocities of those individual parts (their derivatives) with all the positions ordered before all the velocities.
-
- By default, observations do not include the x-coordinate of the top. It may
- be included by passing `exclude_current_positions_from_observation=False` during construction.
- In that case, the observation space will have 18 dimensions where the first dimension
- represent the x-coordinates of the top of the walker.
- Regardless of whether `exclude_current_positions_from_observation` was set to true or false, the x-coordinate
- of the top will be returned in `info` with key `"x_position"`.
-
- By default, observation is a `ndarray` with shape `(17,)` where the elements correspond to the following:
-
- | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
- |-----|--------------------------------------------------------|----------------|-----------------|----------------------------------------|-------|------|
- | 0 | z-coordinate of the top (height of hopper) | -Inf | Inf | rootz (torso) | slide | position (m) |
- | 1 | angle of the top | -Inf | Inf | rooty (torso) | hinge | angle (rad) |
- | 2 | angle of the thigh joint | -Inf | Inf | thigh_joint | hinge | angle (rad) |
- | 3 | angle of the leg joint | -Inf | Inf | leg_joint | hinge | angle (rad) |
- | 4 | angle of the foot joint | -Inf | Inf | foot_joint | hinge | angle (rad) |
- | 5 | angle of the left thigh joint | -Inf | Inf | thigh_left_joint | hinge | angle (rad) |
- | 6 | angle of the left leg joint | -Inf | Inf | leg_left_joint | hinge | angle (rad) |
- | 7 | angle of the left foot joint | -Inf | Inf | foot_left_joint | hinge | angle (rad) |
- | 8 | velocity of the x-coordinate of the top | -Inf | Inf | rootx | slide | velocity (m/s) |
- | 9 | velocity of the z-coordinate (height) of the top | -Inf | Inf | rootz | slide | velocity (m/s) |
- | 10 | angular velocity of the angle of the top | -Inf | Inf | rooty | hinge | angular velocity (rad/s) |
- | 11 | angular velocity of the thigh hinge | -Inf | Inf | thigh_joint | hinge | angular velocity (rad/s) |
- | 12 | angular velocity of the leg hinge | -Inf | Inf | leg_joint | hinge | angular velocity (rad/s) |
- | 13 | angular velocity of the foot hinge | -Inf | Inf | foot_joint | hinge | angular velocity (rad/s) |
- | 14 | angular velocity of the thigh hinge | -Inf | Inf | thigh_left_joint | hinge | angular velocity (rad/s) |
- | 15 | angular velocity of the leg hinge | -Inf | Inf | leg_left_joint | hinge | angular velocity (rad/s) |
- | 16 | angular velocity of the foot hinge | -Inf | Inf | foot_left_joint | hinge | angular velocity (rad/s) |
-
- ### Rewards
- The reward consists of three parts:
- - *healthy_reward*: Every timestep that the walker is alive, it receives a fixed reward of value `healthy_reward`,
- - *forward_reward*: A reward of walking forward which is measured as
- *`forward_reward_weight` * (x-coordinate before action - x-coordinate after action)/dt*.
- *dt* is the time between actions and is dependeent on the frame_skip parameter
- (default is 4), where the frametime is 0.002 - making the default
- *dt = 4 * 0.002 = 0.008*. This reward would be positive if the walker walks forward (right) desired.
- - *ctrl_cost*: A cost for penalising the walker if it
- takes actions that are too large. It is measured as
- *`ctrl_cost_weight` * sum(action2)* where *`ctrl_cost_weight`* is
- a parameter set for the control and has a default value of 0.001
-
- The total reward returned is ***reward*** *=* *healthy_reward bonus + forward_reward - ctrl_cost* and `info` will also contain the individual reward terms
-
- ### Starting State
- All observations start in state
- (0.0, 1.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
- with a uniform noise in the range of [-`reset_noise_scale`, `reset_noise_scale`] added to the values for stochasticity.
-
- ### Episode Termination
- The walker is said to be unhealthy if any of the following happens:
-
- 1. Any of the state space values is no longer finite
- 2. The height of the walker is ***not*** in the closed interval specified by `healthy_z_range`
- 3. The absolute value of the angle (`observation[1]` if `exclude_current_positions_from_observation=False`, else `observation[2]`) is ***not*** in the closed interval specified by `healthy_angle_range`
-
- If `terminate_when_unhealthy=True` is passed during construction (which is the default),
- the episode terminates when any of the following happens:
-
- 1. The episode duration reaches a 1000 timesteps
- 2. The walker is unhealthy
-
- If `terminate_when_unhealthy=False` is passed, the episode is terminated only when 1000 timesteps are exceeded.
-
- ### Arguments
-
- No additional arguments are currently supported in v2 and lower.
-
- ```
- env = gym.make('Walker2d-v2')
- ```
-
- v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
-
- ```
- env = gym.make('Walker2d-v3', ctrl_cost_weight=0.1, ....)
- ```
-
- | Parameter | Type | Default |Description |
- |-------------------------|------------|--------------|-------------------------------|
- | `xml_file` | **str** | `"walker2d.xml"` | Path to a MuJoCo model |
- | `forward_reward_weight` | **float** | `1.0` | Weight for *forward_reward* term (see section on reward) |
- | `ctrl_cost_weight` | **float** | `1e-3` | Weight for *ctr_cost* term (see section on reward) |
- | `healthy_reward` | **float** | `1.0` | Constant reward given if the ant is "healthy" after timestep |
- | `terminate_when_unhealthy` | **bool**| `True` | If true, issue a done signal if the z-coordinate of the walker is no longer healthy |
- | `healthy_z_range` | **tuple** | `(0.8, 2)` | The z-coordinate of the top of the walker must be in this range to be considered healthy |
- | `healthy_angle_range` | **tuple** | `(-1, 1)` | The angle must be in this range to be considered healthy|
- | `reset_noise_scale` | **float** | `5e-3` | Scale of random perturbations of initial position and velocity (see section on Starting State) |
- | `exclude_current_positions_from_observation`| **bool** | `True`| Whether or not to omit the x-coordinate from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behavior in policies |
-
-
- ### Version History
-
- * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
- * v2: All continuous control environments now use mujoco_py >= 1.50
- * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
- * v0: Initial versions release (1.0.0)
-
- """
-
def __init__(
self,
xml_file="walker2d.xml",
@@ -172,7 +41,7 @@ def __init__(
exclude_current_positions_from_observation
)
- mujoco_env.MujocoEnv.__init__(self, xml_file, 4)
+ mujoco_env.MujocoEnv.__init__(self, xml_file, 4, mujoco_bindings="mujoco_py")
@property
def healthy_reward(self):
diff --git a/gym/envs/mujoco/walker2d_v4.py b/gym/envs/mujoco/walker2d_v4.py
new file mode 100644
index 00000000000..c4256b3227f
--- /dev/null
+++ b/gym/envs/mujoco/walker2d_v4.py
@@ -0,0 +1,278 @@
+import numpy as np
+
+from gym import utils
+from gym.envs.mujoco import mujoco_env
+
+DEFAULT_CAMERA_CONFIG = {
+ "trackbodyid": 2,
+ "distance": 4.0,
+ "lookat": np.array((0.0, 0.0, 1.15)),
+ "elevation": -20.0,
+}
+
+
+class Walker2dEnv(mujoco_env.MujocoEnv, utils.EzPickle):
+ """
+ ### Description
+
+ This environment builds on the hopper environment based on the work done by Erez, Tassa, and Todorov
+ in ["Infinite Horizon Model Predictive Control for Nonlinear Periodic Tasks"](http://www.roboticsproceedings.org/rss07/p10.pdf)
+ by adding another set of legs, making it possible for the robot to walk forward instead of
+ hop. Like other Mujoco environments, this environment aims to increase the number of independent state
+ and control variables as compared to the classic control environments. The walker is a
+ two-dimensional two-legged figure that consists of four main body parts - a single torso at the top
+ (with the two legs splitting after the torso), two thighs in the middle below the torso, two legs
+ in the bottom below the thighs, and two feet attached to the legs on which the entire body rests.
+ The goal is to coordinate both sets of feet, legs, and thighs to move in the forward (right)
+ direction by applying torques on the six hinges connecting the six body parts.
+
+ ### Action Space
+ The agent takes a 6-element vector for actions.
+ The action space is a continuous `(action, action, action, action, action, action)`
+ all in `[-1, 1]`, where each `action` represents the numerical torque applied at the hinge joints.
+
+ | Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|----------------------------------------|-------------|-------------|----------------------------------|-------|--------------|
+ | 0 | Torque applied on the thigh rotor | -1 | 1 | thigh_joint | hinge | torque (N m) |
+ | 1 | Torque applied on the leg rotor | -1 | 1 | leg_joint | hinge | torque (N m) |
+ | 2 | Torque applied on the foot rotor | -1 | 1 | foot_joint | hinge | torque (N m) |
+ | 3 | Torque applied on the left thigh rotor | -1 | 1 | thigh_left_joint | hinge | torque (N m) |
+ | 4 | Torque applied on the left leg rotor | -1 | 1 | leg_left_joint | hinge | torque (N m) |
+ | 5 | Torque applied on the left foot rotor | -1 | 1 | foot_left_joint | hinge | torque (N m) |
+
+ ### Observation Space
+
+ The state space consists of positional values of different body parts of the walker,
+ followed by the velocities of those individual parts (their derivatives) with all the positions ordered before all the velocities.
+
+ The full state is a `ndarray` with shape `(18,)` where the elements correspond to the following:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|--------------------------------------------------------|----------------|-----------------|----------------------------------------|-------|------|
+ | 0 | x-coordinate of the top | -Inf | Inf | rootx (torso) | slide | position (m) |
+ | 1 | z-coordinate of the top (height of hopper) | -Inf | Inf | rootz (torso) | slide | position (m) |
+ | 2 | angle of the top | -Inf | Inf | rooty (torso) | hinge | angle (rad) |
+ | 3 | angle of the thigh joint | -Inf | Inf | thigh_joint | hinge | angle (rad) |
+ | 4 | angle of the leg joint | -Inf | Inf | leg_joint | hinge | angle (rad) |
+ | 5 | angle of the foot joint | -Inf | Inf | foot_joint | hinge | angle (rad) |
+ | 6 | angle of the left thigh joint | -Inf | Inf | thigh_left_joint | hinge | angle (rad) |
+ | 7 | angle of the left leg joint | -Inf | Inf | leg_left_joint | hinge | angle (rad) |
+ | 8 | angle of the left foot joint | -Inf | Inf | foot_left_joint | hinge | angle (rad) |
+ | 9 | velocity of the x-coordinate of the top | -Inf | Inf | rootx | slide | velocity (m/s) |
+ | 10 | velocity of the z-coordinate (height) of the top | -Inf | Inf | rootz | slide | velocity (m/s) |
+ | 11 | angular velocity of the angle of the top | -Inf | Inf | rooty | hinge | angular velocity (rad/s) |
+ | 12 | angular velocity of the thigh hinge | -Inf | Inf | thigh_joint | hinge | angular velocity (rad/s) |
+ | 13 | angular velocity of the leg hinge | -Inf | Inf | leg_joint | hinge | angular velocity (rad/s) |
+ | 14 | angular velocity of the foot hinge | -Inf | Inf | foot_joint | hinge | angular velocity (rad/s) |
+ | 15 | angular velocity of the thigh hinge | -Inf | Inf | thigh_left_joint | hinge | angular velocity (rad/s) |
+ | 16 | angular velocity of the leg hinge | -Inf | Inf | leg_left_joint | hinge | angular velocity (rad/s) |
+ | 17 | angular velocity of the foot hinge | -Inf | Inf | foot_left_joint | hinge | angular velocity (rad/s) |
+
+
+
+ **Note:**
+ In practice (and Gym implementation), the first positional element is omitted from the
+ state space since the reward function is calculated based on that value. This value is
+ hidden from the algorithm, which in turn has to develop an abstract understanding of it
+ from the observed rewards. Therefore, observation space has shape `(17,)`
+ instead of `(18,)` and looks like:
+
+ | Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
+ |-----|--------------------------------------------------------|----------------|-----------------|----------------------------------------|-------|------|
+ | 0 | z-coordinate of the top (height of hopper) | -Inf | Inf | rootz (torso) | slide | position (m) |
+ | 1 | angle of the top | -Inf | Inf | rooty (torso) | hinge | angle (rad) |
+ | 2 | angle of the thigh joint | -Inf | Inf | thigh_joint | hinge | angle (rad) |
+ | 3 | angle of the leg joint | -Inf | Inf | leg_joint | hinge | angle (rad) |
+ | 4 | angle of the foot joint | -Inf | Inf | foot_joint | hinge | angle (rad) |
+ | 5 | angle of the left thigh joint | -Inf | Inf | thigh_left_joint | hinge | angle (rad) |
+ | 6 | angle of the left leg joint | -Inf | Inf | leg_left_joint | hinge | angle (rad) |
+ | 7 | angle of the left foot joint | -Inf | Inf | foot_left_joint | hinge | angle (rad) |
+ | 8 | velocity of the x-coordinate of the top | -Inf | Inf | rootx | slide | velocity (m/s) |
+ | 9 | velocity of the z-coordinate (height) of the top | -Inf | Inf | rootz | slide | velocity (m/s) |
+ | 10 | angular velocity of the angle of the top | -Inf | Inf | rooty | hinge | angular velocity (rad/s) |
+ | 11 | angular velocity of the thigh hinge | -Inf | Inf | thigh_joint | hinge | angular velocity (rad/s) |
+ | 12 | angular velocity of the leg hinge | -Inf | Inf | leg_joint | hinge | angular velocity (rad/s) |
+ | 13 | angular velocity of the foot hinge | -Inf | Inf | foot_joint | hinge | angular velocity (rad/s) |
+ | 14 | angular velocity of the thigh hinge | -Inf | Inf | thigh_left_joint | hinge | angular velocity (rad/s) |
+ | 15 | angular velocity of the leg hinge | -Inf | Inf | leg_left_joint | hinge | angular velocity (rad/s) |
+ | 16 | angular velocity of the foot hinge | -Inf | Inf | foot_left_joint | hinge | angular velocity (rad/s) |
+
+ ### Rewards
+ The reward consists of three parts:
+ - *alive bonus*: Every timestep that the walker is alive, it gets a reward of 1,
+ - *reward_forward*: A reward of walking forward which is measured as
+ *(x-coordinate after action - x-coordinate before action)/dt*.
+ *dt* is the time between actions and is dependent on the frame_skip parameter
+ (default is 4), where the *dt* for one frame is 0.002 - making the default
+ *dt = 4 * 0.002 = 0.008*. This reward would be positive if the walker walks forward (right) as desired.
+ - *reward_control*: A negative reward for penalising the walker if it
+ takes actions that are too large. It is measured as
+ *-coefficient × sum(action²)* where *coefficient* is
+ a parameter set for the control and has a default value of 0.001
+
+ The total reward returned is ***reward*** *=* *alive bonus + reward_forward + reward_control*
+
+ ### Starting State
+ All observations start in state
+ (0.0, 1.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
+ with a uniform noise in the range of [-0.005, 0.005] added to the values for stochasticity.
+
+ ### Episode Termination
+ The episode terminates when any of the following happens:
+
+ 1. The episode duration reaches 1000 timesteps
+ 2. Any of the state space values is no longer finite
+ 3. The height of the walker (index 1) is ***not*** in the range `[0.8, 2]`
+ 4. The absolute value of the angle (index 2) is ***not*** in the range `[-1, 1]`
+
+ ### Arguments
+
+ No additional arguments are currently supported (in v2 and lower),
+ but modifications can be made to the XML file in the assets folder
+ (or by changing the path to a modified XML file in another folder).
+
+ ```
+ env = gym.make('Walker2d-v2')
+ ```
+
+ v3 and beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
+
+ ```
+ env = gym.make('Walker2d-v3', ctrl_cost_weight=0.1, ....)
+ ```
+
+ ### Version History
+
+ * v4: All MuJoCo environments now use the mujoco bindings in mujoco >= 2.1.3
+ * v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
+ * v2: All continuous control environments now use mujoco_py >= 1.50
+ * v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
+ * v0: Initial versions release (1.0.0)
+
+ """
+
+ def __init__(
+ self,
+ xml_file="walker2d.xml",
+ forward_reward_weight=1.0,
+ ctrl_cost_weight=1e-3,
+ healthy_reward=1.0,
+ terminate_when_unhealthy=True,
+ healthy_z_range=(0.8, 2.0),
+ healthy_angle_range=(-1.0, 1.0),
+ reset_noise_scale=5e-3,
+ exclude_current_positions_from_observation=True,
+ ):
+ utils.EzPickle.__init__(**locals())
+
+ self._forward_reward_weight = forward_reward_weight
+ self._ctrl_cost_weight = ctrl_cost_weight
+
+ self._healthy_reward = healthy_reward
+ self._terminate_when_unhealthy = terminate_when_unhealthy
+
+ self._healthy_z_range = healthy_z_range
+ self._healthy_angle_range = healthy_angle_range
+
+ self._reset_noise_scale = reset_noise_scale
+
+ self._exclude_current_positions_from_observation = (
+ exclude_current_positions_from_observation
+ )
+
+ mujoco_env.MujocoEnv.__init__(self, xml_file, 4)
+
+ @property
+ def healthy_reward(self):
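+ # The alive bonus is granted on every step while the walker is healthy;
+ # when terminate_when_unhealthy is True it is granted unconditionally,
+ # since an unhealthy state ends the episode anyway.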
+ return (
+ float(self.is_healthy or self._terminate_when_unhealthy)
+ * self._healthy_reward
+ )
+
+ def control_cost(self, action):
+ control_cost = self._ctrl_cost_weight * np.sum(np.square(action))
+ return control_cost
+
+ @property
+ def is_healthy(self):
+ z, angle = self.data.qpos[1:3]
+
+ min_z, max_z = self._healthy_z_range
+ min_angle, max_angle = self._healthy_angle_range
+
+ healthy_z = min_z < z < max_z
+ healthy_angle = min_angle < angle < max_angle
+ is_healthy = healthy_z and healthy_angle
+
+ return is_healthy
+
+ @property
+ def done(self):
+ done = not self.is_healthy if self._terminate_when_unhealthy else False
+ return done
+
+ def _get_obs(self):
+ position = self.data.qpos.flat.copy()
+ velocity = np.clip(self.data.qvel.flat.copy(), -10, 10)
+
+ if self._exclude_current_positions_from_observation:
+ position = position[1:]
+
+ observation = np.concatenate((position, velocity)).ravel()
+ return observation
+
+ def step(self, action):
+ x_position_before = self.data.qpos[0]
+ self.do_simulation(action, self.frame_skip)
+ x_position_after = self.data.qpos[0]
+ x_velocity = (x_position_after - x_position_before) / self.dt
+
+ ctrl_cost = self.control_cost(action)
+
+ forward_reward = self._forward_reward_weight * x_velocity
+ healthy_reward = self.healthy_reward
+
+ rewards = forward_reward + healthy_reward
+ costs = ctrl_cost
+
+ observation = self._get_obs()
+ reward = rewards - costs
+ done = self.done
+ info = {
+ "x_position": x_position_after,
+ "x_velocity": x_velocity,
+ }
+
+ return observation, reward, done, info
+
+ def reset_model(self):
+ noise_low = -self._reset_noise_scale
+ noise_high = self._reset_noise_scale
+
+ qpos = self.init_qpos + self.np_random.uniform(
+ low=noise_low, high=noise_high, size=self.model.nq
+ )
+ qvel = self.init_qvel + self.np_random.uniform(
+ low=noise_low, high=noise_high, size=self.model.nv
+ )
+
+ self.set_state(qpos, qvel)
+
+ observation = self._get_obs()
+ return observation
+
+ def viewer_setup(self):
+ for key, value in DEFAULT_CAMERA_CONFIG.items():
+ if isinstance(value, np.ndarray):
+ getattr(self.viewer.cam, key)[:] = value
+ else:
+ setattr(self.viewer.cam, key, value)
diff --git a/gym/envs/registration.py b/gym/envs/registration.py
index f53ac621205..09471c62ded 100644
--- a/gym/envs/registration.py
+++ b/gym/envs/registration.py
@@ -90,6 +90,7 @@ class EnvSpec:
nondeterministic: Whether this environment is non-deterministic even after seeding
max_episode_steps: The maximum number of steps that an episode can consist of
order_enforce: Whether to wrap the environment in an orderEnforcing wrapper
+ autoreset: Whether the environment should automatically reset when it reaches the done state
kwargs: The kwargs to pass to the environment class
"""
@@ -100,6 +101,7 @@ class EnvSpec:
nondeterministic: bool = field(default=False)
max_episode_steps: Optional[int] = field(default=None)
order_enforce: bool = field(default=True)
+ autoreset: bool = field(default=False)
kwargs: dict = field(default_factory=dict)
namespace: Optional[str] = field(init=False)
name: str = field(init=False)
@@ -132,6 +134,10 @@ def make(self, **kwargs) -> Env:
_kwargs = self.kwargs.copy()
_kwargs.update(kwargs)
+ if "autoreset" in _kwargs:
+ self.autoreset = _kwargs["autoreset"]
+ del _kwargs["autoreset"]
+
if callable(self.entry_point):
env = self.entry_point(**_kwargs)
else:
@@ -142,15 +148,23 @@ def make(self, **kwargs) -> Env:
spec = copy.deepcopy(self)
spec.kwargs = _kwargs
env.unwrapped.spec = spec
+
if self.order_enforce:
from gym.wrappers.order_enforcing import OrderEnforcing
env = OrderEnforcing(env)
+
assert env.spec is not None, "expected spec to be set to the unwrapped env."
if env.spec.max_episode_steps is not None:
from gym.wrappers.time_limit import TimeLimit
env = TimeLimit(env, max_episode_steps=env.spec.max_episode_steps)
+
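+ # If the spec (or make) was given autoreset=True, wrap the env so that
+ # reset() is called automatically whenever a step returns done=True.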
+ if self.autoreset:
+ from gym.wrappers.autoreset import AutoResetWrapper
+
+ env = AutoResetWrapper(env)
+
return env
@@ -635,7 +649,7 @@ def make(id: Literal["LunarLander-v2", "LunarLanderContinuous-v2"], **kwargs) ->
@overload
def make(id: Literal["BipedalWalker-v3", "BipedalWalkerHardcore-v3"], **kwargs) -> Env[np.ndarray, np.ndarray | Sequence[SupportsFloat]]: ...
@overload
-def make(id: Literal["CarRacing-v0"], **kwargs) -> Env[np.ndarray, np.ndarray | Sequence[SupportsFloat]]: ...
+def make(id: Literal["CarRacing-v1", "CarRacingDomainRandomize-v1"], **kwargs) -> Env[np.ndarray, np.ndarray | Sequence[SupportsFloat]]: ...
# Toy Text
# ----------------------------------------
diff --git a/gym/envs/toy_text/blackjack.py b/gym/envs/toy_text/blackjack.py
index 723f77012a0..fa384961579 100644
--- a/gym/envs/toy_text/blackjack.py
+++ b/gym/envs/toy_text/blackjack.py
@@ -94,12 +94,18 @@ class BlackjackEnv(gym.Env):
### Arguments
```
- gym.make('Blackjack-v1', natural=False)
+ gym.make('Blackjack-v1', natural=False, sab=False)
```
- `natural`: Whether to give an additional reward for
+ `natural=False`: Whether to give an additional reward for
starting with a natural blackjack, i.e. starting with an ace and ten (sum is 21).
+ `sab=False`: Whether to follow the exact rules outlined in the book by
+ Sutton and Barto. If `sab` is `True`, the keyword argument `natural` will be ignored.
+ If the player achieves a natural blackjack and the dealer does not, the player
+ will win (i.e. get a reward of +1). The reverse rule does not apply.
+ If both the player and the dealer get a natural, it will be a draw (i.e. reward 0).
+
### Version History
* v0: Initial versions release (1.0.0)
"""
diff --git a/gym/envs/toy_text/frozen_lake.py b/gym/envs/toy_text/frozen_lake.py
index 07a29a8faec..f08d60b2f59 100644
--- a/gym/envs/toy_text/frozen_lake.py
+++ b/gym/envs/toy_text/frozen_lake.py
@@ -278,12 +278,11 @@ def _render_gui(self, desc, mode):
]
self.elf_images = [pygame.image.load(f_name) for f_name in elfs]
- board = pygame.Surface(self.window_size, flags=SRCALPHA)
cell_width = self.window_size[0] // self.ncol
cell_height = self.window_size[1] // self.nrow
smaller_cell_scale = 0.6
- small_cell_w = smaller_cell_scale * cell_width
- small_cell_h = smaller_cell_scale * cell_height
+ small_cell_w = int(smaller_cell_scale * cell_width)
+ small_cell_h = int(smaller_cell_scale * cell_height)
# prepare images
last_action = self.lastaction if self.lastaction is not None else 1
@@ -321,7 +320,7 @@ def _render_gui(self, desc, mode):
else:
self.window_surface.blit(ice_img, (rect[0], rect[1]))
- pygame.draw.rect(board, (180, 200, 230), rect, 1)
+ pygame.draw.rect(self.window_surface, (180, 200, 230), rect, 1)
# paint the elf
bot_row, bot_col = self.s // self.ncol, self.s % self.ncol
@@ -337,7 +336,6 @@ def _render_gui(self, desc, mode):
elf_rect = self._center_small_rect(cell_rect, elf_img.get_size())
self.window_surface.blit(elf_img, elf_rect)
- self.window_surface.blit(board, board.get_rect())
if mode == "human":
pygame.event.pump()
pygame.display.update()
diff --git a/gym/envs/toy_text/taxi.py b/gym/envs/toy_text/taxi.py
index 3403276a5ef..bbe6fbf0946 100644
--- a/gym/envs/toy_text/taxi.py
+++ b/gym/envs/toy_text/taxi.py
@@ -241,7 +241,7 @@ def _render_gui(self, mode):
pygame.display.set_caption("Taxi")
if mode == "human":
self.window = pygame.display.set_mode(WINDOW_SIZE)
- else: # rgb_array
+ else: # "rgb_array"
self.window = pygame.Surface(WINDOW_SIZE)
if self.clock is None:
self.clock = pygame.time.Clock()
@@ -266,7 +266,6 @@ def _render_gui(self, mode):
self.destination_img = pygame.transform.scale(
pygame.image.load(file_name), self.cell_size
)
- self.destination_img = self.destination_img.convert_alpha()
self.destination_img.set_alpha(170)
if self.median_horiz is None:
file_names = [
diff --git a/gym/spaces/box.py b/gym/spaces/box.py
index 8c959fc34f3..c8d8c93d8aa 100644
--- a/gym/spaces/box.py
+++ b/gym/spaces/box.py
@@ -29,12 +29,12 @@ class Box(Space[np.ndarray]):
There are two common use cases:
* Identical bound for each dimension::
-
+
>>> Box(low=-1.0, high=2.0, shape=(3, 4), dtype=np.float32)
Box(3, 4)
* Independent bound for each dimension::
-
+
>>> Box(low=np.array([-1.0, -2.0]), high=np.array([2.0, 4.0]), dtype=np.float32)
Box(2,)
@@ -66,9 +66,9 @@ def __init__(
# Capture the boundedness information before replacing np.inf with get_inf
_low = np.full(shape, low, dtype=float) if np.isscalar(low) else low
- self.bounded_below = -np.inf < _low
+ self.bounded_below = -np.inf < _low # type: ignore
_high = np.full(shape, high, dtype=float) if np.isscalar(high) else high
- self.bounded_above = np.inf > _high
+ self.bounded_above = np.inf > _high # type: ignore
low = _broadcast(low, dtype, shape, inf_sign="-") # type: ignore
high = _broadcast(high, dtype, shape, inf_sign="+") # type: ignore
diff --git a/gym/spaces/dict.py b/gym/spaces/dict.py
index 68e06a86981..e717a11db4f 100644
--- a/gym/spaces/dict.py
+++ b/gym/spaces/dict.py
@@ -18,7 +18,7 @@ class Dict(Space[TypingDict[str, Space]], Mapping):
self.observation_space = spaces.Dict({"position": spaces.Discrete(2), "velocity": spaces.Discrete(3)})
Example usage [nested]::
-
+
self.nested_observation_space = spaces.Dict({
'sensors': spaces.Dict({
'position': spaces.Box(low=-100, high=100, shape=(3,)),
diff --git a/gym/spaces/multi_binary.py b/gym/spaces/multi_binary.py
index 2d8eacb354f..8174d87b75c 100644
--- a/gym/spaces/multi_binary.py
+++ b/gym/spaces/multi_binary.py
@@ -31,11 +31,11 @@ def __init__(
):
if isinstance(n, (Sequence, np.ndarray)):
self.n = input_n = tuple(int(i) for i in n)
+ assert (np.asarray(input_n) > 0).all() # n (counts) have to be positive
else:
self.n = n = int(n)
input_n = (n,)
-
- assert (np.asarray(input_n) > 0).all(), "n (counts) have to be positive"
+ assert (np.asarray(input_n) > 0).all() # n (counts) have to be positive
super().__init__(input_n, np.int8, seed)
diff --git a/gym/spaces/multi_discrete.py b/gym/spaces/multi_discrete.py
index 87ae8bce1ef..c3dbc4e902f 100644
--- a/gym/spaces/multi_discrete.py
+++ b/gym/spaces/multi_discrete.py
@@ -1,6 +1,7 @@
from __future__ import annotations
from collections.abc import Sequence
+from typing import Iterable
import numpy as np
@@ -13,9 +14,9 @@
class MultiDiscrete(Space[np.ndarray]):
"""
The multi-discrete action space consists of a series of discrete action spaces with different number of actions in each. It is useful to represent game controllers or keyboards where each key can be represented as a discrete action space. It is parametrized by passing an array of positive integers specifying number of actions for each discrete action space.
-
- Note:
-
+
+ Note:
+
Some environment wrappers assume a value of 0 always represents the NOOP action.
e.g. Nintendo Game Controller - Can be conceptualized as 3 discrete action spaces:
@@ -52,7 +53,7 @@ def contains(self, x) -> bool:
# is within correct bounds for space dtype (even though x does not have to be unsigned)
return bool(x.shape == self.shape and (0 <= x).all() and (x < self.nvec).all())
- def to_jsonable(self, sample_n):
+ def to_jsonable(self, sample_n: Iterable[np.ndarray]):
return [sample.tolist() for sample in sample_n]
def from_jsonable(self, sample_n):
@@ -66,7 +67,7 @@ def __getitem__(self, index):
if nvec.ndim == 0:
subspace = Discrete(nvec)
else:
- subspace = MultiDiscrete(nvec, self.dtype)
+ subspace = MultiDiscrete(nvec, self.dtype) # type: ignore
subspace.np_random.bit_generator.state = self.np_random.bit_generator.state
return subspace
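A sketch of the controller example from the MultiDiscrete docstring (the sub-space sizes are illustrative):

    from gym.spaces import MultiDiscrete

    # e.g. arrow keys (5 options), button A (2 options), button B (2 options)
    space = MultiDiscrete([5, 2, 2])
    sample = space.sample()                   # np.ndarray of shape (3,)
    json_ready = space.to_jsonable([sample])  # list of plain Python lists
    first = space[0]                          # __getitem__ yields Discrete(5)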
diff --git a/gym/spaces/tuple.py b/gym/spaces/tuple.py
index eeb71bee2f2..fa9e2599834 100644
--- a/gym/spaces/tuple.py
+++ b/gym/spaces/tuple.py
@@ -72,7 +72,7 @@ def contains(self, x) -> bool:
def __repr__(self) -> str:
return "Tuple(" + ", ".join([str(s) for s in self.spaces]) + ")"
- def to_jsonable(self, sample_n) -> list:
+ def to_jsonable(self, sample_n: Sequence) -> list:
# serialize as list-repr of tuple of vectors
return [
space.to_jsonable([sample[i] for sample in sample_n])
diff --git a/gym/spaces/utils.py b/gym/spaces/utils.py
index fa7f3081cc3..1d0faac7b43 100644
--- a/gym/spaces/utils.py
+++ b/gym/spaces/utils.py
@@ -3,7 +3,7 @@
import operator as op
from collections import OrderedDict
from functools import reduce, singledispatch
-from typing import TypeVar, Union
+from typing import TypeVar, Union, cast
import numpy as np
@@ -19,7 +19,7 @@ def flatdim(space: Space) -> int:
the space is not defined in ``gym.spaces``.
Example usage::
-
+
>>> s = spaces.Dict({"position": spaces.Discrete(2), "velocity": spaces.Discrete(3)})
>>> spaces.flatdim(s)
5
@@ -133,7 +133,7 @@ def _unflatten_multidiscrete(space: MultiDiscrete, x: np.ndarray) -> np.ndarray:
offsets = np.zeros((space.nvec.size + 1,), dtype=space.dtype)
offsets[1:] = np.cumsum(space.nvec.flatten())
- (indices,) = np.nonzero(x)
+ (indices,) = cast(type(offsets[:-1]), np.nonzero(x))
return np.asarray(indices - offsets[:-1], dtype=space.dtype).reshape(space.shape)
diff --git a/gym/utils/play.py b/gym/utils/play.py
index 74667820581..f04ccae4a62 100644
--- a/gym/utils/play.py
+++ b/gym/utils/play.py
@@ -1,12 +1,16 @@
-import argparse
+from typing import Callable, Dict, Optional, Tuple
-import matplotlib
import pygame
+from numpy.typing import NDArray
+from pygame import Surface
+from pygame.event import Event
import gym
-from gym import logger
+from gym import Env, logger
try:
+ import matplotlib
+
matplotlib.use("TkAgg")
import matplotlib.pyplot as plt
except ImportError as e:
@@ -18,7 +22,71 @@
from pygame.locals import VIDEORESIZE
-def display_arr(screen, arr, video_size, transpose):
+class MissingKeysToAction(Exception):
+ """Raised when the environment does not have
+ a default keys_to_action mapping
+ """
+
+
+class PlayableGame:
+ def __init__(
+ self,
+ env: Env,
+ keys_to_action: Optional[Dict[Tuple[int], int]] = None,
+ zoom: Optional[float] = None,
+ ):
+ self.env = env
+ self.relevant_keys = self._get_relevant_keys(keys_to_action)
+ self.video_size = self._get_video_size(zoom)
+ self.screen = pygame.display.set_mode(self.video_size)
+ self.pressed_keys = []
+ self.running = True
+
+ def _get_relevant_keys(
+ self, keys_to_action: Optional[Dict[Tuple[int], int]] = None
+ ) -> set:
+ if keys_to_action is None:
+ if hasattr(self.env, "get_keys_to_action"):
+ keys_to_action = self.env.get_keys_to_action()
+ elif hasattr(self.env.unwrapped, "get_keys_to_action"):
+ keys_to_action = self.env.unwrapped.get_keys_to_action()
+ else:
+ raise MissingKeysToAction(
+ "%s does not have explicit key to action mapping, "
+ "please specify one manually" % self.env.spec.id
+ )
+ relevant_keys = set(sum((list(k) for k in keys_to_action.keys()), []))
+ return relevant_keys
+
+ def _get_video_size(self, zoom: Optional[float] = None) -> Tuple[int, int]:
+ # TODO: this needs to be updated when the render API change goes through
+ rendered = self.env.render(mode="rgb_array")
+ video_size = [rendered.shape[1], rendered.shape[0]]
+
+ if zoom is not None:
+ video_size = int(video_size[0] * zoom), int(video_size[1] * zoom)
+
+ return video_size
+
+ def process_event(self, event: Event) -> None:
+ if event.type == pygame.KEYDOWN:
+ if event.key in self.relevant_keys:
+ self.pressed_keys.append(event.key)
+ elif event.key == pygame.K_ESCAPE:
+ self.running = False
+ elif event.type == pygame.KEYUP:
+ if event.key in self.relevant_keys:
+ self.pressed_keys.remove(event.key)
+ elif event.type == pygame.QUIT:
+ self.running = False
+ elif event.type == VIDEORESIZE:
+ self.video_size = event.size
+ self.screen = pygame.display.set_mode(self.video_size)
+
+
+def display_arr(
+ screen: Surface, arr: NDArray, video_size: Tuple[int, int], transpose: bool
+):
arr_min, arr_max = arr.min(), arr.max()
arr = 255.0 * (arr - arr_min) / (arr_max - arr_min)
pyg_img = pygame.surfarray.make_surface(arr.swapaxes(0, 1) if transpose else arr)
@@ -26,7 +94,15 @@ def display_arr(screen, arr, video_size, transpose):
screen.blit(pyg_img, (0, 0))
-def play(env, transpose=True, fps=30, zoom=None, callback=None, keys_to_action=None):
+def play(
+ env: Env,
+ transpose: Optional[bool] = True,
+ fps: Optional[int] = 30,
+ zoom: Optional[float] = None,
+ callback: Optional[Callable] = None,
+ keys_to_action: Optional[Dict[Tuple[int], int]] = None,
+ seed: Optional[int] = None,
+):
"""Allows one to play the game using keyboard.
To simply play the game use:
@@ -81,65 +157,35 @@ def callback(obs_t, obs_tp1, action, rew, done, info):
# ...
}
If None, the default keys_to_action mapping for that env is used, if provided.
+ seed: int or None
+ Random seed used when resetting the environment. If None, no seed is used.
"""
- env.reset()
- rendered = env.render(mode="rgb_array")
-
- if keys_to_action is None:
- if hasattr(env, "get_keys_to_action"):
- keys_to_action = env.get_keys_to_action()
- elif hasattr(env.unwrapped, "get_keys_to_action"):
- keys_to_action = env.unwrapped.get_keys_to_action()
- else:
- assert False, (
- env.spec.id
- + " does not have explicit key to action mapping, "
- + "please specify one manually"
- )
- relevant_keys = set(sum(map(list, keys_to_action.keys()), []))
-
- video_size = [rendered.shape[1], rendered.shape[0]]
- if zoom is not None:
- video_size = int(video_size[0] * zoom), int(video_size[1] * zoom)
+ env.reset(seed=seed)
+ game = PlayableGame(env, keys_to_action, zoom)
- pressed_keys = []
- running = True
- env_done = True
-
- screen = pygame.display.set_mode(video_size)
+ done = True
clock = pygame.time.Clock()
- while running:
- if env_done:
- env_done = False
- obs = env.reset()
+ while game.running:
+ if done:
+ done = False
+ obs = env.reset(seed=seed)
else:
- action = keys_to_action.get(tuple(sorted(pressed_keys)), 0)
+ action = keys_to_action.get(tuple(sorted(game.pressed_keys)), 0)
prev_obs = obs
- obs, rew, env_done, info = env.step(action)
+ obs, rew, done, info = env.step(action)
if callback is not None:
- callback(prev_obs, obs, action, rew, env_done, info)
+ callback(prev_obs, obs, action, rew, done, info)
if obs is not None:
+ # TODO: this needs to be updated when the render API change goes through
rendered = env.render(mode="rgb_array")
- display_arr(screen, rendered, transpose=transpose, video_size=video_size)
+ display_arr(
+ game.screen, rendered, transpose=transpose, video_size=game.video_size
+ )
# process pygame events
for event in pygame.event.get():
- # test events, set key states
- if event.type == pygame.KEYDOWN:
- if event.key in relevant_keys:
- pressed_keys.append(event.key)
- elif event.key == 27:
- running = False
- elif event.type == pygame.KEYUP:
- if event.key in relevant_keys:
- pressed_keys.remove(event.key)
- elif event.type == pygame.QUIT:
- running = False
- elif event.type == VIDEORESIZE:
- video_size = event.size
- screen = pygame.display.set_mode(video_size)
- print(video_size)
+ game.process_event(event)
pygame.display.flip()
clock.tick(fps)
@@ -180,20 +226,3 @@ def callback(self, obs_t, obs_tp1, action, rew, done, info):
)
self.ax[i].set_xlim(xmin, xmax)
plt.pause(0.000001)
-
-
-def main():
- parser = argparse.ArgumentParser()
- parser.add_argument(
- "--env",
- type=str,
- default="MontezumaRevengeNoFrameskip-v4",
- help="Define Environment",
- )
- args = parser.parse_args()
- env = gym.make(args.env)
- play(env, zoom=4, fps=60)
-
-
-if __name__ == "__main__":
- main()
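A minimal sketch of the refactored play() entry point with the new seed argument (the env id and key mapping are illustrative; many envs supply their own mapping via get_keys_to_action):

    import gym
    from gym.utils.play import play

    # tuples of pressed key codes map to environment actions
    mapping = {(ord("a"),): 0, (ord("d"),): 1}
    env = gym.make("CartPole-v1")
    play(env, keys_to_action=mapping, fps=30, zoom=2.0, seed=42)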
diff --git a/py.Dockerfile b/py.Dockerfile
index f6221a9f11f..ce20b8430b9 100644
--- a/py.Dockerfile
+++ b/py.Dockerfile
@@ -6,15 +6,14 @@ RUN apt-get -y update && apt-get install -y unzip libglu1-mesa-dev libgl1-mesa-d
# Download mujoco
RUN mkdir /root/.mujoco && \
cd /root/.mujoco && \
- curl -O https://www.roboti.us/download/mjpro150_linux.zip && \
- unzip mjpro150_linux.zip && \
- echo DUMMY_KEY > /root/.mujoco/mjkey.txt
+ wget https://github.com/deepmind/mujoco/releases/download/2.1.0/mujoco210-linux-x86_64.tar.gz &&\
+ tar -xf mujoco210-linux-x86_64.tar.gz
-ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/.mujoco/mjpro150/bin
+ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/.mujoco/mujoco210/bin
COPY . /usr/local/gym/
WORKDIR /usr/local/gym/
-RUN pip install .[nomujoco] && pip install -r test_requirements.txt
+RUN pip install .[noatari] && pip install -r test_requirements.txt
ENTRYPOINT ["/usr/local/gym/bin/docker_entrypoint"]
diff --git a/requirements.txt b/requirements.txt
index 5055547c90a..0e5e5f91c15 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,7 +1,6 @@
ale-py~=0.7
opencv-python>=3.0
box2d-py==2.3.5
-mujoco_py>=1.50, <2.0
scipy>=1.4.1
numpy>=1.18.0
pyglet>=1.4.0
diff --git a/setup.py b/setup.py
index a4317a7bb32..7734872bcc3 100644
--- a/setup.py
+++ b/setup.py
@@ -14,19 +14,25 @@
"accept-rom-license": ["autorom[accept-rom-license]~=0.4.2"],
"box2d": ["box2d-py==2.3.5", "pygame==2.1.0"],
"classic_control": ["pygame==2.1.0"],
- "mujoco": ["mujoco_py>=1.50, <2.0"],
+ "mujoco_py": ["mujoco_py<2.2,>=2.1"],
+ "mujoco": ["mujoco==2.1.5", "imageio>=2.14.1"],
"toy_text": ["pygame==2.1.0", "scipy>=1.4.1"],
"other": ["lz4>=3.1.0", "opencv-python>=3.0"],
}
# Meta dependency groups.
-nomujoco_blacklist = {"mujoco", "accept-rom-license", "atari"}
+nomujoco_blacklist = {"mujoco_py", "mujoco", "accept-rom-license", "atari"}
nomujoco_groups = set(extras.keys()) - nomujoco_blacklist
extras["nomujoco"] = list(
itertools.chain.from_iterable(map(lambda group: extras[group], nomujoco_groups))
)
+noatari_blacklist = {"accept-rom-license", "atari"}
+noatari_groups = set(extras.keys()) - noatari_blacklist
+extras["noatari"] = list(
+ itertools.chain.from_iterable(map(lambda group: extras[group], noatari_groups))
+)
all_blacklist = {"accept-rom-license"}
all_groups = set(extras.keys()) - all_blacklist
diff --git a/tests/envs/spec_list.py b/tests/envs/spec_list.py
index 11c816f6bd1..7e0672e3ab6 100644
--- a/tests/envs/spec_list.py
+++ b/tests/envs/spec_list.py
@@ -2,18 +2,15 @@
from gym import envs, logger
-SKIP_MUJOCO_WARNING_MESSAGE = (
- "Cannot run mujoco test (either license key not found or mujoco not"
- "installed properly)."
+SKIP_MUJOCO_V3_WARNING_MESSAGE = (
+ "Cannot run mujoco test because mujoco-py is not installed"
)
-
-skip_mujoco = not (os.environ.get("MUJOCO_KEY"))
-if not skip_mujoco:
- try:
- import mujoco_py
- except ImportError:
- skip_mujoco = True
+skip_mujoco_v3 = False
+try:
+ import mujoco_py
+except ImportError:
+ skip_mujoco_v3 = True
def should_skip_env_spec_for_tests(spec):
@@ -21,7 +18,7 @@ def should_skip_env_spec_for_tests(spec):
# troublesome to run frequently
ep = spec.entry_point
# Skip mujoco tests for pull request CI
- if skip_mujoco and ep.startswith("gym.envs.mujoco"):
+ if skip_mujoco_v3 and ep.startswith("gym.envs.mujoco"):
return True
try:
import gym.envs.atari
@@ -48,8 +45,19 @@ def should_skip_env_spec_for_tests(spec):
return False
+def skip_mujoco_py_env_for_test(spec):
+ ep = spec.entry_point
+ version = spec.version
+ if ep.startswith("gym.envs.mujoco") and version < 4:
+ return True
+ return False
+
+
spec_list = [
spec
for spec in sorted(envs.registry.all(), key=lambda x: x.id)
if spec.entry_point is not None and not should_skip_env_spec_for_tests(spec)
]
+spec_list_no_mujoco_py = [
+ spec for spec in spec_list if not skip_mujoco_py_env_for_test(spec)
+]
diff --git a/tests/envs/test_action_dim_check.py b/tests/envs/test_action_dim_check.py
index 448847cd6ec..054fcb35687 100644
--- a/tests/envs/test_action_dim_check.py
+++ b/tests/envs/test_action_dim_check.py
@@ -3,12 +3,12 @@
import pytest
from gym import envs
-from tests.envs.spec_list import SKIP_MUJOCO_WARNING_MESSAGE, skip_mujoco
+from tests.envs.spec_list import SKIP_MUJOCO_V3_WARNING_MESSAGE, skip_mujoco_v3
ENVIRONMENT_IDS = ("HalfCheetah-v2",)
-@pytest.mark.skipif(skip_mujoco, reason=SKIP_MUJOCO_WARNING_MESSAGE)
+@pytest.mark.skipif(skip_mujoco_v3, reason=SKIP_MUJOCO_V3_WARNING_MESSAGE)
@pytest.mark.parametrize("environment_id", ENVIRONMENT_IDS)
def test_serialize_deserialize(environment_id):
env = envs.make(environment_id)
diff --git a/tests/envs/test_determinism.py b/tests/envs/test_determinism.py
index bb5e1fbd5cd..fef13aacd69 100644
--- a/tests/envs/test_determinism.py
+++ b/tests/envs/test_determinism.py
@@ -4,7 +4,7 @@
from tests.envs.spec_list import spec_list
-@pytest.mark.parametrize("spec", spec_list)
+@pytest.mark.parametrize("spec", spec_list, ids=[spec.id for spec in spec_list])
def test_env(spec):
# Note that this precludes running this test in multiple
# threads. However, we probably already can't do multithreading
diff --git a/tests/envs/test_envs.py b/tests/envs/test_envs.py
index 6d59057a617..b0a6701de1f 100644
--- a/tests/envs/test_envs.py
+++ b/tests/envs/test_envs.py
@@ -4,7 +4,7 @@
from gym import envs
from gym.spaces import Box
from gym.utils.env_checker import check_env
-from tests.envs.spec_list import spec_list
+from tests.envs.spec_list import spec_list, spec_list_no_mujoco_py
# This runs a smoketest on each official registered env. We may want
@@ -13,7 +13,9 @@
@pytest.mark.filterwarnings(
"ignore:.*We recommend you to use a symmetric and normalized Box action space.*"
)
-@pytest.mark.parametrize("spec", spec_list)
+@pytest.mark.parametrize(
+ "spec", spec_list_no_mujoco_py, ids=[spec.id for spec in spec_list_no_mujoco_py]
+)
def test_env(spec):
# Capture warnings
with pytest.warns(None) as warnings:
@@ -47,18 +49,20 @@ def test_env(spec):
assert (
observation.dtype == ob_space.dtype
), f"Step observation dtype: {ob.dtype}, expected: {ob_space.dtype}"
-
for mode in env.metadata.get("render_modes", []):
- env.render(mode=mode)
+ if not (mode == "human" and spec.entry_point.startswith("gym.envs.mujoco")):
+ env.render(mode=mode)
# Make sure we can render the environment after close.
for mode in env.metadata.get("render_modes", []):
- env.render(mode=mode)
+ if not (mode == "human" and spec.entry_point.startswith("gym.envs.mujoco")):
+ env.render(mode=mode)
env.close()
-@pytest.mark.parametrize("spec", spec_list)
+@pytest.mark.parametrize("spec", spec_list, ids=[spec.id for spec in spec_list])
def test_reset_info(spec):
with pytest.warns(None) as warnings:
diff --git a/tests/envs/test_lunar_lander.py b/tests/envs/test_lunar_lander.py
index 5aeab73f594..b15645859d9 100644
--- a/tests/envs/test_lunar_lander.py
+++ b/tests/envs/test_lunar_lander.py
@@ -18,6 +18,16 @@ def test_lunar_lander_continuous():
_test_lander(LunarLander(continuous=True), seed=0)
+@pytest.mark.skipif(Box2D is None, reason="Box2D not installed")
+def test_lunar_lander_wind():
+ _test_lander(LunarLander(enable_wind=True), seed=0)
+
+
+@pytest.mark.skipif(Box2D is None, reason="Box2D not installed")
+def test_lunar_lander_wind_continuous():
+ _test_lander(LunarLander(continuous=True, enable_wind=True), seed=0)
+
+
@pytest.mark.skipif(Box2D is None, reason="Box2D not installed")
def _test_lander(env, seed=None, render=False):
total_reward = demo_heuristic_lander(env, seed=seed, render=render)
diff --git a/tests/envs/test_mujoco_v2_to_v3_conversion.py b/tests/envs/test_mujoco_v2_to_v3_conversion.py
index 201d497667e..db540024388 100644
--- a/tests/envs/test_mujoco_v2_to_v3_conversion.py
+++ b/tests/envs/test_mujoco_v2_to_v3_conversion.py
@@ -3,7 +3,7 @@
import numpy as np
from gym import envs
-from tests.envs.spec_list import SKIP_MUJOCO_WARNING_MESSAGE, skip_mujoco
+from tests.envs.spec_list import SKIP_MUJOCO_V3_WARNING_MESSAGE, skip_mujoco_v3
def verify_environments_match(
@@ -31,7 +31,7 @@ def verify_environments_match(
np.testing.assert_allclose(old_info[key], new_info[key], atol=eps)
-@unittest.skipIf(skip_mujoco, SKIP_MUJOCO_WARNING_MESSAGE)
+@unittest.skipIf(skip_mujoco_v3, SKIP_MUJOCO_V3_WARNING_MESSAGE)
class Mujocov2Tov3ConversionTest(unittest.TestCase):
def test_environments_match(self):
test_cases = (
diff --git a/tests/utils/test_play.py b/tests/utils/test_play.py
new file mode 100644
index 00000000000..0f0ee1e46eb
--- /dev/null
+++ b/tests/utils/test_play.py
@@ -0,0 +1,213 @@
+from dataclasses import dataclass, field
+from typing import Callable, Optional, Tuple
+
+import numpy as np
+import pygame
+import pytest
+from pygame import KEYDOWN, KEYUP, QUIT, event
+from pygame.event import Event
+
+import gym
+from gym.utils.play import MissingKeysToAction, PlayableGame, play
+
+RELEVANT_KEY_1 = ord("a") # 97
+RELEVANT_KEY_2 = ord("d") # 100
+IRRELEVANT_KEY = 1
+
+
+@dataclass
+class DummyEnvSpec:
+ id: str
+
+
+class DummyPlayEnv(gym.Env):
+ def step(self, action):
+ obs = np.zeros((1, 1))
+ rew, done, info = 1, False, {}
+ return obs, rew, done, info
+
+ def reset(self, seed=None):
+ ...
+
+ def render(self, mode="rgb_array"):
+ return np.zeros((1, 1))
+
+
+class PlayStatus:
+ def __init__(self, callback: Callable):
+ self.data_callback = callback
+ self.cumulative_reward = 0
+ self.last_observation = None
+
+ def callback(self, obs_t, obs_tp1, action, rew, done, info):
+ _, obs_tp1, _, rew, _, _ = self.data_callback(
+ obs_t, obs_tp1, action, rew, done, info
+ )
+ self.cumulative_reward += rew
+ self.last_observation = obs_tp1
+
+
+def dummy_keys_to_action():
+ return {(RELEVANT_KEY_1,): 0, (RELEVANT_KEY_2,): 1}
+
+
+@pytest.fixture(autouse=True)
+def close_pygame():
+ yield
+ pygame.quit()
+
+
+def test_play_relevant_keys():
+ env = DummyPlayEnv()
+ game = PlayableGame(env, dummy_keys_to_action())
+ assert game.relevant_keys == {RELEVANT_KEY_1, RELEVANT_KEY_2}
+
+
+def test_play_relevant_keys_no_mapping():
+ env = DummyPlayEnv()
+ env.spec = DummyEnvSpec("DummyPlayEnv")
+
+ with pytest.raises(MissingKeysToAction) as info:
+ PlayableGame(env)
+
+
+def test_play_relevant_keys_with_env_attribute():
+ """Env has a keys_to_action attribute"""
+ env = DummyPlayEnv()
+ env.get_keys_to_action = dummy_keys_to_action
+ game = PlayableGame(env)
+ assert game.relevant_keys == {97, 100}
+
+
+def test_video_size_no_zoom():
+ env = DummyPlayEnv()
+ game = PlayableGame(env, dummy_keys_to_action())
+ assert game.video_size == list(env.render().shape)
+
+
+def test_video_size_zoom():
+ env = DummyPlayEnv()
+ zoom = 2.2
+ game = PlayableGame(env, dummy_keys_to_action(), zoom)
+ assert game.video_size == tuple(int(shape * zoom) for shape in env.render().shape)
+
+
+def test_keyboard_quit_event():
+ env = DummyPlayEnv()
+ game = PlayableGame(env, dummy_keys_to_action())
+ event = Event(pygame.KEYDOWN, {"key": pygame.K_ESCAPE})
+ assert game.running
+ game.process_event(event)
+ assert not game.running
+
+
+def test_pygame_quit_event():
+ env = DummyPlayEnv()
+ game = PlayableGame(env, dummy_keys_to_action())
+ event = Event(pygame.QUIT)
+ assert game.running
+ game.process_event(event)
+ assert not game.running
+
+
+def test_keyboard_relevant_keydown_event():
+ env = DummyPlayEnv()
+ game = PlayableGame(env, dummy_keys_to_action())
+ event = Event(pygame.KEYDOWN, {"key": RELEVANT_KEY_1})
+ game.process_event(event)
+ assert game.pressed_keys == [RELEVANT_KEY_1]
+
+
+def test_keyboard_irrelevant_keydown_event():
+ env = DummyPlayEnv()
+ game = PlayableGame(env, dummy_keys_to_action())
+ event = Event(pygame.KEYDOWN, {"key": IRRELEVANT_KEY})
+ game.process_event(event)
+ assert game.pressed_keys == []
+
+
+def test_keyboard_keyup_event():
+ env = DummyPlayEnv()
+ game = PlayableGame(env, dummy_keys_to_action())
+ event = Event(pygame.KEYDOWN, {"key": RELEVANT_KEY_1})
+ game.process_event(event)
+ event = Event(pygame.KEYUP, {"key": RELEVANT_KEY_1})
+ game.process_event(event)
+ assert game.pressed_keys == []
+
+
+def test_play_loop():
+ # set of key events to inject into the play loop as callback
+ callback_events = [
+ Event(KEYDOWN, {"key": RELEVANT_KEY_1}),
+ Event(KEYDOWN, {"key": RELEVANT_KEY_1}),
+ Event(QUIT),
+ ]
+
+ def callback(obs_t, obs_tp1, action, rew, done, info):
+ event.post(callback_events.pop(0))
+ return obs_t, obs_tp1, action, rew, done, info
+
+ env = DummyPlayEnv()
+ cumulative_env_reward = 0
+ for s in range(
+ len(callback_events)
+ ): # we run the same number of steps executed with play()
+ _, rew, _, _ = env.step(None)
+ cumulative_env_reward += rew
+
+ env_play = DummyPlayEnv()
+ status = PlayStatus(callback)
+ play(env_play, callback=status.callback, keys_to_action=dummy_keys_to_action())
+
+ assert status.cumulative_reward == cumulative_env_reward
+
+
+def test_play_loop_real_env():
+ SEED = 42
+ ENV = "CartPole-v1"
+
+ # set of key events to inject into the play loop as callback
+ callback_events = [
+ Event(KEYDOWN, {"key": RELEVANT_KEY_1}),
+ Event(KEYUP, {"key": RELEVANT_KEY_1}),
+ Event(KEYDOWN, {"key": RELEVANT_KEY_2}),
+ Event(KEYUP, {"key": RELEVANT_KEY_2}),
+ Event(KEYDOWN, {"key": RELEVANT_KEY_1}),
+ Event(KEYUP, {"key": RELEVANT_KEY_1}),
+ Event(KEYDOWN, {"key": RELEVANT_KEY_1}),
+ Event(KEYUP, {"key": RELEVANT_KEY_1}),
+ Event(KEYDOWN, {"key": RELEVANT_KEY_2}),
+ Event(KEYUP, {"key": RELEVANT_KEY_2}),
+ Event(QUIT),
+ ]
+ keydown_events = [k for k in callback_events if k.type == KEYDOWN]
+
+ def callback(obs_t, obs_tp1, action, rew, done, info):
+ pygame_event = callback_events.pop(0)
+ event.post(pygame_event)
+
+ # after releasing a key, post new events until
+ # we have one keydown
+ while pygame_event.type == KEYUP:
+ pygame_event = callback_events.pop(0)
+ event.post(pygame_event)
+
+ return obs_t, obs_tp1, action, rew, done, info
+
+ env = gym.make(ENV)
+ env.reset(seed=SEED)
+ keys_to_action = dummy_keys_to_action()
+
+ # first action is 0 because at the first iteration
+ # we cannot inject a callback event into play()
+ env.step(0)
+ for e in keydown_events:
+ action = keys_to_action[(e.key,)]
+ obs, _, _, _ = env.step(action)
+
+ env_play = gym.make(ENV)
+ status = PlayStatus(callback)
+ play(env_play, callback=status.callback, keys_to_action=keys_to_action, seed=SEED)
+
+ assert (status.last_observation == obs).all()
diff --git a/tests/wrappers/test_autoreset.py b/tests/wrappers/test_autoreset.py
index f003a7280df..76c035f87dd 100644
--- a/tests/wrappers/test_autoreset.py
+++ b/tests/wrappers/test_autoreset.py
@@ -1,10 +1,13 @@
+import types
from typing import Optional
+from unittest.mock import MagicMock
import numpy as np
import pytest
import gym
from gym.wrappers import AutoResetWrapper
+from tests.envs.spec_list import spec_list
class DummyResetEnv(gym.Env):
@@ -62,6 +65,49 @@ def test_autoreset_reset_info():
assert isinstance(info, dict)
+@pytest.mark.parametrize("spec", spec_list, ids=[spec.id for spec in spec_list])
+def test_make_autoreset_true(spec):
+ """
+ Note: This test assumes that the outermost wrapper is AutoResetWrapper,
+ so if that changes in the future, this test will break and will need
+ to be updated.
+ Note: This test assumes that all first-party environments will terminate in a finite
+ amount of time with random actions, which is true as of the time of adding this test.
+ """
+ env = None
+ with pytest.warns(None) as warnings:
+ env = spec.make(autoreset=True)
+
+ ob_space = env.observation_space
+ obs = env.reset(seed=0)
+ env.action_space.seed(0)
+
+ env.unwrapped.reset = MagicMock(side_effect=env.unwrapped.reset)
+
+ done = False
+ while not done:
+ obs, reward, done, info = env.step(env.action_space.sample())
+
+ assert isinstance(env, AutoResetWrapper)
+ assert env.unwrapped.reset.called
+
+
+@pytest.mark.parametrize("spec", spec_list, ids=[spec.id for spec in spec_list])
+def test_make_autoreset_false(spec):
+ env = None
+ with pytest.warns(None) as warnings:
+ env = spec.make(autoreset=False)
+ assert not isinstance(env, AutoResetWrapper)
+
+
+@pytest.mark.parametrize("spec", spec_list, ids=[spec.id for spec in spec_list])
+def test_make_autoreset_default_false(spec):
+ env = None
+ with pytest.warns(None) as warnings:
+ env = spec.make()
+ assert not isinstance(env, AutoResetWrapper)
+
+
def test_autoreset_autoreset():
env = DummyResetEnv()
env = AutoResetWrapper(env)