Environments

In mushroom_rl we distinguish between two different types of environment classes:

proper environments
generators

While environments directly implement the Environment interface, generators are a set of methods used to generate finite markov chains that represent a specific environment e.g., grid worlds.

Environments

Atari

class Atari(name, width=84, height=84, full_action_space=False, repeat_action_probability=0.25, frameskip=4, headless=False)[source]

Bases: Environment

The Atari environment as presented in: “Human-level control through deep reinforcement learning”. Mnih et. al.. 2015.

__init__(name, width=84, height=84, full_action_space=False, repeat_action_probability=0.25, frameskip=4, headless=False)[source]

Constructor.

Parameters:

name (str) – id name of the Atari game in Gymnasium;
width (int, 84) – width of the screen;
height (int, 84) – height of the screen;
frameskip (int, 4) – number of frames to skip (repeat action) per step;
headless (bool, False) – if True, the rendering is forced to be headless.

seed(seed)[source]

Set the seed of the environment.

Parameters:: seed (float) – the value of the seed.

reset(state=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

render(record=False)[source]

Render the environment to screen.

Parameters:: record (bool, False) – whether the visualized image should be returned or not.
Returns:: The visualized image, or None if the record flag is set to false.

stop()[source]: Method used to stop an env. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Car on hill

class CarOnHill(horizon=100, gamma=0.95)[source]

Bases: Environment

The Car On Hill environment as presented in: “Tree-Based Batch Mode Reinforcement Learning”. Ernst D. et al.. 2005.

__init__(horizon=100, gamma=0.95)[source]

Constructor.

Parameters:

horizon (int, 100) – horizon of the problem;
gamma (float, .95) – discount factor.

reset(state=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

render(record=False)[source]

Render the environment to screen.

Parameters:: record (bool, False) – whether the visualized image should be returned or not.
Returns:: The visualized image, or None if the record flag is set to false.

stop()[source]: Method used to stop an env. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

DeepMind Control Suite

class DMControl(domain_name, task_name, horizon=None, gamma=0.99, task_kwargs=None, dt=0.01, width_screen=480, height_screen=480, camera_id=0, use_pixels=False, pixels_width=64, pixels_height=64)[source]

Bases: Environment

Interface for dm_control suite Mujoco environments. It makes it possible to use every dm_control suite Mujoco environment just providing the necessary information.

__init__(domain_name, task_name, horizon=None, gamma=0.99, task_kwargs=None, dt=0.01, width_screen=480, height_screen=480, camera_id=0, use_pixels=False, pixels_width=64, pixels_height=64)[source]

Constructor.

Parameters:

domain_name (str) – name of the environment;
task_name (str) – name of the task of the environment;
horizon (int) – the horizon;
gamma (float) – the discount factor;
task_kwargs (dict, None) – parameters of the task;
dt (float, .01) – duration of a control step;
width_screen (int, 480) – width of the screen;
height_screen (int, 480) – height of the screen;
camera_id (int, 0) – position of camera to render the environment;
use_pixels (bool, False) – if True, pixel observations are used rather than the state vector;
pixels_width (int, 64) – width of the pixel observation;
pixels_height (int, 64) – height of the pixel observation;

reset(state=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

render(record=False)[source]

Render the environment to screen.

Parameters:: record (bool, False) – whether the visualized image should be returned or not.
Returns:: The visualized image, or None if the record flag is set to false.

stop()[source]: Method used to stop an env. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Finite MDP

class FiniteMDP(p, rew, mu=None, gamma=0.9, horizon=inf, dt=0.1)[source]

Bases: Environment

Finite Markov Decision Process.

__init__(p, rew, mu=None, gamma=0.9, horizon=inf, dt=0.1)[source]

Constructor.

Parameters:

p (np.ndarray) – transition probability matrix;
rew (np.ndarray) – reward matrix;
mu (np.ndarray, None) – initial state probability distribution;
gamma (float, .9) – discount factor;
horizon (int, np.inf) – the horizon;
dt (float, 1e-1) – the control timestep of the environment.

reset(state=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

Grid World

class AbstractGridWorld(mdp_info, height, width, start, goal)[source]

Bases: Environment

Abstract class to build a grid world.

__init__(mdp_info, height, width, start, goal)[source]

Constructor.

Parameters:

height (int) – height of the grid;
width (int) – width of the grid;
start (tuple) – x-y coordinates of the goal;
goal (tuple) – x-y coordinates of the goal.

reset(state=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

render(record=False)[source]

Render the environment to screen.

Parameters:: record (bool, False) – whether the visualized image should be returned or not.
Returns:: The visualized image, or None if the record flag is set to false.

stop()[source]: Method used to stop an env. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

class GridWorld(height, width, goal, start=(0, 0), dt=0.1)[source]

Bases: AbstractGridWorld

Standard grid world.

__init__(height, width, goal, start=(0, 0), dt=0.1)[source]

Constructor

Parameters:

height (int) – height of the grid;
width (int) – width of the grid;
goal (tuple) – 2D coordinates of the goal state;
start (tuple, (0, 0)) – 2D coordinates of the starting state;
dt (float, 0.1) – the control timestep of the environment.

class GridWorldVanHasselt(height=3, width=3, goal=(0, 2), start=(2, 0), dt=0.1)[source]

Bases: AbstractGridWorld

A variant of the grid world as presented in: “Double Q-Learning”. Hasselt H. V.. 2010.

__init__(height=3, width=3, goal=(0, 2), start=(2, 0), dt=0.1)[source]

Constructor

Parameters:

height (int, 3) – height of the grid;
width (int, 3) – width of the grid;
goal (tuple, (0, 2)) – 2D coordinates of the goal state;
start (tuple, (2, 0)) – 2D coordinates of the starting state;
dt (float, 0.1) – the control timestep of the environment.

Gymnasium

class Gymnasium(name, horizon=None, gamma=0.99, headless=False, wrappers=None, wrappers_args=None, **env_args)[source]

Bases: Environment

Interface for Gymnasium environments. It makes it possible to use every Gymnasium environment just providing the id, except for the Atari games that are managed in a separate class.

__init__(name, horizon=None, gamma=0.99, headless=False, wrappers=None, wrappers_args=None, **env_args)[source]

Constructor.

Parameters:

name (str) – gymnasium id of the environment;
horizon (int) – the horizon. If None, use the one from Gymnasium;
gamma (float, 0.99) – the discount factor;
headless (bool, False) – If True, the rendering is forced to be headless.
wrappers – list of wrappers to apply over the environment. It is possible to pass arguments to the wrappers by providing a tuple with two elements: the gym wrapper class and a dictionary containing the parameters needed by the wrapper constructor;

seed(seed)[source]

Set the seed of the environment.

Parameters:: seed (float) – the value of the seed.

reset(state=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

render(record=False)[source]

Render the environment to screen.

Parameters:: record (bool, False) – whether the visualized image should be returned or not.
Returns:: The visualized image, or None if the record flag is set to false.

stop()[source]: Method used to stop an env. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Inverted pendulum

class InvertedPendulum(random_start=False, m=1.0, l=1.0, g=9.8, mu=0.01, max_u=5.0, horizon=5000, gamma=0.99)[source]

Bases: Environment

The Inverted Pendulum environment (continuous version) as presented in: “Reinforcement Learning In Continuous Time and Space”. Doya K.. 2000. “Off-Policy Actor-Critic”. Degris T. et al.. 2012. “Deterministic Policy Gradient Algorithms”. Silver D. et al. 2014.

__init__(random_start=False, m=1.0, l=1.0, g=9.8, mu=0.01, max_u=5.0, horizon=5000, gamma=0.99)[source]

Constructor.

Parameters:

random_start (bool, False) – whether to start from a random position or from the horizontal one;
m (float, 1.0) – mass of the pendulum;
l (float, 1.0) – length of the pendulum;
g (float, 9.8) – gravity acceleration constant;
mu (float, 1e-2) – friction constant of the pendulum;
max_u (float, 5.0) – maximum allowed input torque;
horizon (int, 5000) – horizon of the problem;
gamma (int, .99) – discount factor.

reset(state=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

render(record=False)[source]

Render the environment to screen.

Parameters:: record (bool, False) – whether the visualized image should be returned or not.
Returns:: The visualized image, or None if the record flag is set to false.

stop()[source]: Method used to stop an env. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Cart Pole

class CartPole(m=2.0, M=8.0, l=0.5, g=9.8, mu=0.01, max_u=50.0, noise_u=10.0, horizon=200, gamma=0.95)[source]

Bases: Environment

The Inverted Pendulum on a Cart environment as presented in: “Least-Squares Policy Iteration”. Lagoudakis M. G. and Parr R.. 2003.

__init__(m=2.0, M=8.0, l=0.5, g=9.8, mu=0.01, max_u=50.0, noise_u=10.0, horizon=200, gamma=0.95)[source]

Constructor.

Parameters:

m (float, 2.0) – mass of the pendulum;
M (float, 8.0) – mass of the cart;
l (float, .5) – length of the pendulum;
g (float, 9.8) – gravity acceleration constant;
max_u (float, 50.) – maximum allowed input torque;
noise_u (float, 10.) – maximum noise on the action;
horizon (int, 3000) – horizon of the problem;
gamma (float, .95) – discount factor.

reset(state=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

render(record=False)[source]

Render the environment to screen.

Parameters:: record (bool, False) – whether the visualized image should be returned or not.
Returns:: The visualized image, or None if the record flag is set to false.

stop()[source]: Method used to stop an env. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

LQR

class LQR(A, B, Q, R, max_pos=inf, max_action=inf, random_init=False, episodic=False, gamma=0.9, horizon=50, initial_state=None, dt=0.1)[source]

Bases: Environment

This class implements a Linear-Quadratic Regulator. This task aims to minimize the undesired deviations from nominal values of some controller settings in control problems. The system equations in this task are:

\[x_{t+1} = Ax_t + Bu_t\]

where x is the state and u is the control signal.

The reward function is given by:

\[r_t = -\left( x_t^TQx_t + u_t^TRu_t \right)\]

“Policy gradient approaches for multi-objective sequential decision making”. Parisi S., Pirotta M., Smacchia N., Bascetta L., Restelli M.. 2014

__init__(A, B, Q, R, max_pos=inf, max_action=inf, random_init=False, episodic=False, gamma=0.9, horizon=50, initial_state=None, dt=0.1)[source]

Constructor.

Args:
A (np.ndarray): the state dynamics matrix; B (np.ndarray): the action dynamics matrix; Q (np.ndarray): reward weight matrix for state; R (np.ndarray): reward weight matrix for action; max_pos (float, np.inf): maximum value of the state; max_action (float, np.inf): maximum value of the action; random_init (bool, False): start from a random state; episodic (bool, False): end the episode when the state goes over the threshold; gamma (float, 0.9): discount factor; horizon (int, 50): horizon of the env; dt (float, 0.1): the control timestep of the environment.

static generate(dimensions=None, s_dim=None, a_dim=None, max_pos=inf, max_action=inf, eps=0.1, index=0, scale=1.0, random_init=False, episodic=False, gamma=0.9, horizon=50, initial_state=None)[source]

Factory method that generates an lqr with identity dynamics and symmetric reward matrices.

Parameters:

dimensions (int) – number of state-action dimensions;
s_dim (int) – number of state dimensions;
a_dim (int) – number of action dimensions;
max_pos (float, np.inf) – maximum value of the state;
max_action (float, np.inf) – maximum value of the action;
eps (double, .1) – reward matrix weights specifier;
index (int, 0) – selector for the principal state;
scale (float, 1.0) – scaling factor for the reward function;
random_init (bool, False) – start from a random state;
episodic (bool, False) – end the episode when the state goes over the threshold;
gamma (float, .9) – discount factor;
horizon (int, 50) – horizon of the env.

reset(state=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

Minigrid

class MiniGridBase(name, horizon, gamma, fixed_seed, headless, wrappers, obs_high)[source]

Bases: Gymnasium

__init__(name, horizon, gamma, fixed_seed, headless, wrappers, obs_high)[source]

Constructor.

Parameters:

name (str) – gymnasium id of the environment;
horizon (int) – the horizon. If None, use the one from Gymnasium;
gamma (float, 0.99) – the discount factor;
headless (bool, False) – If True, the rendering is forced to be headless.
wrappers – list of wrappers to apply over the environment. It is possible to pass arguments to the wrappers by providing a tuple with two elements: the gym wrapper class and a dictionary containing the parameters needed by the wrapper constructor;

reset(state=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

class MiniGrid(name, horizon=None, gamma=0.99, fixed_seed=None, headless=False)[source]

Bases: MiniGridBase

Interface for MiniGrid environments using the symbolic 7x7x3 observation. Each cell is encoded as (object_type, color, state), returned as a (3, H, W) array. Suitable for environments where color and door state matter.

__init__(name, horizon=None, gamma=0.99, fixed_seed=None, headless=False)[source]

Constructor.

Parameters:

name (str) – name of the environment;
horizon (int, None) – the horizon;
gamma (float, 0.99) – the discount factor;
fixed_seed (int, None) – if passed, fixes the seed at every reset;
headless (bool, False) – if True, the rendering is forced to be headless.

class MiniGridRGB(name, horizon=None, gamma=0.99, fixed_seed=None, headless=False)[source]

Bases: MiniGridBase

Interface for MiniGrid environments using pixel observations. The 56x56x3 RGB partial observation is converted to grayscale, suitable for CNN-based agents such as DQN.

__init__(name, horizon=None, gamma=0.99, fixed_seed=None, headless=False)[source]

Constructor.

Parameters:

name (str) – name of the environment;
horizon (int, None) – the horizon;
gamma (float, 0.99) – the discount factor;
fixed_seed (int, None) – if passed, fixes the seed at every reset;
headless (bool, False) – if True, the rendering is forced to be headless.

Mujoco

class MuJoCo(xml_file, actuation_spec, observation_spec, gamma, horizon, timestep=None, n_substeps=1, n_intermediate_steps=1, additional_data_spec=None, collision_groups=None, max_joint_vel=None, **viewer_params)[source]

Bases: Environment

Class to create a Mushroom environment using the MuJoCo simulator.

__init__(xml_file, actuation_spec, observation_spec, gamma, horizon, timestep=None, n_substeps=1, n_intermediate_steps=1, additional_data_spec=None, collision_groups=None, max_joint_vel=None, **viewer_params)[source]

Constructor.

Parameters:

xml_file (str/xml handle) – A string with a path to the xml or an Mujoco xml handle.
actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent. Can be left empty when all actuators should be used;
observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). They are combined with a key, which is used to access the data. An entry in the list is given by: (key, name, type). The name can later be used to retrieve specific observations;
gamma (float) – The discounting factor of the environment;
horizon (int) – The maximum horizon for the environment;
timestep (float) – The timestep used by the MuJoCo simulator. If None, the default timestep specified in the XML will be used;
n_substeps (int, 1) – The number of substeps to use by the MuJoCo simulator. An action given by the agent will be applied for n_substeps before the agent receives the next observation and can act accordingly;
n_intermediate_steps (int, 1) – The number of steps between every action taken by the agent. Similar to n_substeps but allows the user to modify, control and access intermediate states.
additional_data_spec (list, None) – A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, name, type) key is a string for later referencing in the “read_data” and “write_data” methods. The name is the name of the object in the XML specification and the type is the ObservationType;
collision_groups (list, None) – A list containing groups of geoms for which collisions should be checked during simulation via check_collision. The entries are given as: (key, geom_names), where key is a string for later referencing in the “check_collision” method, and geom_names is a list of geom names in the XML specification.
max_joint_vel (list, None) – A list with the maximum joint velocities which are provided in the mdp_info. The list has to define a maximum velocity for every occurrence of JOINT_VEL in the observation_spec. The velocity will not be limited in mujoco
**viewer_params – other parameters to be passed to the viewer. See MujocoViewer documentation for the available options.

seed(seed)[source]

Set the seed of the environment.

Parameters:: seed (float) – the value of the seed.

reset(obs=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

render(record=False)[source]

Render the environment to screen.

Parameters:: record (bool, False) – whether the visualized image should be returned or not.
Returns:: The visualized image, or None if the record flag is set to false.

stop()[source]: Method used to stop an env. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

_modify_mdp_info(mdp_info)[source]

This method can be overridden to modify the automatically generated MDPInfo data structure. By default, returns the given mdp_info structure unchanged.

Parameters:: mdp_info (MDPInfo) – the MDPInfo structure automatically computed by the environment.
Returns:: The modified MDPInfo data structure.

_create_observation(obs)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via obs_help.add_obs(self, name, o_type, length, min_value, max_value)

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

_create_info_dictionary(obs, action)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The information dictionary.

_modify_observation(obs)[source]

This method can be overridden to edit the created observation. This is done after the reward and absorbing functions are evaluated. Especially useful to transform the observation into different frames. If the original observation order is not preserved, the helper functions in ObervationHelper breaks.

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

_preprocess_action(action)[source]

Compute a transformation of the action provided to the environment.

Parameters:: action (np.ndarray) – numpy array with the actions provided to the environment.
Returns:: The action to be used for the current step

_step_init(obs, action)[source]: Allows information to be initialized at the start of a step.

_compute_action(obs, action)[source]

Compute a transformation of the action at every intermediate step. Useful to add control signals simulated directly in python.

Parameters:

obs (np.ndarray) – numpy array with the current state of teh simulation;
action (np.ndarray) – numpy array with the actions, provided at every step.

Returns:

The action to be set in the actual pybullet simulation.

_simulation_pre_step()[source]

Allows information to be accesed and changed at every intermediate step before taking a step in the mujoco simulation. Can be usefull to apply an external force/torque to the specified bodies.

ex: apply a force over X to the torso: force = [200, 0, 0] torque = [0, 0, 0] self.sim.data.xfrc_applied[self.sim.model._body_name2id[“torso”],:] = force + torque

_simulation_post_step()[source]: Allows information to be accesed at every intermediate step after taking a step in the mujoco simulation. Can be usefull to average forces over all intermediate steps.

_step_finalize()[source]: Allows information to be accesed at the end of a step.

_read_data(name)[source]

Read data form the MuJoCo data structure.

Parameters:: name (string) – A name referring to an entry contained the additional_data_spec list handed to the constructor.
Returns:: The desired data as a one-dimensional numpy array.

_write_data(name, value)[source]

Write data to the MuJoCo data structure.

Parameters:

name (string) – A name referring to an entry contained in the additional_data_spec list handed to the constructor;
value (ndarray) – The data that should be written.

_check_collision(group1, group2)[source]

Check for collision between the specified groups.

Parameters:

group1 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor;
group2 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor.

Returns:

A flag indicating whether a collision occurred between the given groups or not.

_get_collision_force(group1, group2)[source]

Returns the collision force and torques between the specified groups.

Parameters:

group1 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor;
group2 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor.

Returns:

A 6D vector specifying the collision forces/torques[3D force + 3D torque] between the given groups. Vector of 0’s in case there was no collision. http://mujoco.org/book/programming.html#siContact

reward(obs, action, next_obs, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

is_absorbing(obs)[source]

Check whether the given state is an absorbing state or not.

Parameters:: obs (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

get_all_observation_keys()[source]

A function that returns all observation keys defined in the observation specification.

Returns:: A list of observation keys.

static get_action_indices(model, data, actuation_spec)[source]

Returns the action indices given the MuJoCo model, data, and actuation_spec.

Parameters:

model – MuJoCo model.
data – MuJoCo data structure.
actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent. Can be left empty when all actuators should be used;

Returns:

A list of actuator indices.

static get_action_space(action_indices, model)[source]

Returns the action space bounding box given the action_indices and the model.

Args:
action_indices (list): A list of actuator indices. model: MuJoCo model.

Returns:
A bounding box for the action space.

static user_warning_raise_exception(warning)[source]

Detects warnings in Mujoco and raises the respective exception.

Parameters:: warning – Mujoco warning.

static load_model(xml_file)[source]

Takes an xml_file and compiles and loads the model.

Parameters:: xml_file (str/xml handle) – A string with a path to the xml or an Mujoco xml handle.
Returns:: Mujoco model.

class MultiMuJoCo(xml_files, actuation_spec, observation_spec, gamma, horizon, timestep=None, n_substeps=1, n_intermediate_steps=1, additional_data_spec=None, collision_groups=None, max_joint_vel=None, random_env_reset=True, **viewer_params)[source]

Bases: MuJoCo

Class to create N environments at the same time using the MuJoCo simulator. This class is not meant to run N environments in parallel, but to load and create N environments, and randomly sample one of the environment every episode.

__init__(xml_files, actuation_spec, observation_spec, gamma, horizon, timestep=None, n_substeps=1, n_intermediate_steps=1, additional_data_spec=None, collision_groups=None, max_joint_vel=None, random_env_reset=True, **viewer_params)[source]

Constructor.

Parameters:

xml_files (str/xml handle) – A list containing strings with a path to the xml or Mujoco xml handles; actuation_spec (list): A list specifying the names of the joints which should be controllable by the agent. Can be left empty when all actuators should be used;
observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). They are combined with a key, which is used to access the data. An entry in the list is given by: (key, name, type);
gamma (float) – The discounting factor of the environment;
horizon (int) – The maximum horizon for the environment;
timestep (float) – The timestep used by the MuJoCo simulator. If None, the default timestep specified in the XML will be used;
n_substeps (int, 1) – The number of substeps to use by the MuJoCo simulator. An action given by the agent will be applied for n_substeps before the agent receives the next observation and can act accordingly;
n_intermediate_steps (int, 1) – The number of steps between every action taken by the agent. Similar to
modify (n_substeps but allows the user to) – additional_data_spec (list, None): A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, name, type) key is a string for later referencing in the “read_data” and “write_data” methods. The name is the name of the object in the XML specification and the type is the ObservationType;
states. (control and access intermediate) – additional_data_spec (list, None): A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, name, type) key is a string for later referencing in the “read_data” and “write_data” methods. The name is the name of the object in the XML specification and the type is the ObservationType;
collision_groups (list, None) – A list containing groups of geoms for which collisions should be checked during simulation via check_collision. The entries are given as: (key, geom_names), where key is a string for later referencing in the “check_collision” method, and geom_names is a list of geom names in the XML specification.
max_joint_vel (list, None) – A list with the maximum joint velocities which are provided in the mdp_info. The list has to define a maximum velocity for every occurrence of JOINT_VEL in the observation_spec. The velocity will not be limited in mujoco.
random_env_reset (bool) – If True, a random environment/model is chosen after each episode. If False, it is
list. (sequentially iterated through the environment/model)
**viewer_params – other parameters to be passed to the viewer. See MujocoViewer documentation for the available options.

reset(obs=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

static _get_env_id_map(current_model_idx, n_models)[source]

Retuns a binary vector to identify environment. This can be passed to the observation space.

Parameters:

current_model_idx (int) – index of the current model.
n_models (int) – total number of models.

Returns:

ndarray containing a binary vector identifying the current environment.

Air Hockey

class AirHockeyBase(n_agents=1, env_noise=False, obs_noise=False, gamma=0.99, horizon=500, timestep=0.004166666666666667, n_substeps=1, n_intermediate_steps=1, default_camera_mode='top_static', **viewer_params)[source]

Bases: MuJoCo

Abstract class for all AirHockey Environments.

__init__(n_agents=1, env_noise=False, obs_noise=False, gamma=0.99, horizon=500, timestep=0.004166666666666667, n_substeps=1, n_intermediate_steps=1, default_camera_mode='top_static', **viewer_params)[source]

Constructor.

Parameters:

n_agents (int, 1) – number of agent to be used in the environment (one or two)
env_noise (bool, False) – if True, the environment uses noisy dynamics.
obs_noise (bool, False) – if True, the environment uses noisy observations.

_simulation_pre_step()[source]

Allows information to be accesed and changed at every intermediate step before taking a step in the mujoco simulation. Can be usefull to apply an external force/torque to the specified bodies.

ex: apply a force over X to the torso: force = [200, 0, 0] torque = [0, 0, 0] self.sim.data.xfrc_applied[self.sim.model._body_name2id[“torso”],:] = force + torque

is_absorbing(obs)[source]

Check whether the given state is an absorbing state or not.

Parameters:: obs (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

class AirHockeySingle(gamma=0.99, horizon=120, env_noise=False, obs_noise=False, timestep=0.004166666666666667, n_intermediate_steps=1, **viewer_params)[source]

Bases: AirHockeyBase

Base class for single agent air hockey tasks.

__init__(gamma=0.99, horizon=120, env_noise=False, obs_noise=False, timestep=0.004166666666666667, n_intermediate_steps=1, **viewer_params)[source]: Constructor.

get_puck(obs)[source]

Getting the puck properties from the observations :param obs: The current observation

Returns:: ([pos_x, pos_y], [lin_vel_x, lin_vel_y], ang_vel_z)

get_ee()[source]

Getting the ee properties from the current internal state

Returns:: ([pos_x, pos_y, pos_z], [ang_vel_x, ang_vel_y, ang_vel_z, lin_vel_x, lin_vel_y, lin_vel_z])

_modify_observation(obs)[source]

This method can be overridden to edit the created observation. This is done after the reward and absorbing functions are evaluated. Especially useful to transform the observation into different frames. If the original observation order is not preserved, the helper functions in ObervationHelper breaks.

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

_simulation_post_step()[source]: Allows information to be accesed at every intermediate step after taking a step in the mujoco simulation. Can be usefull to average forces over all intermediate steps.

_create_observation(state)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via obs_help.add_obs(self, name, o_type, length, min_value, max_value)

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

_create_info_dictionary(obs, action)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The information dictionary.

class AirHockeyDouble(gamma=0.99, horizon=120, env_noise=False, obs_noise=False, timestep=0.004166666666666667, n_intermediate_steps=1, **viewer_params)[source]

Bases: AirHockeyBase

Base class for two agents air hockey tasks.

__init__(gamma=0.99, horizon=120, env_noise=False, obs_noise=False, timestep=0.004166666666666667, n_intermediate_steps=1, **viewer_params)[source]: Constructor.

_modify_observation(obs)[source]

This method can be overridden to edit the created observation. This is done after the reward and absorbing functions are evaluated. Especially useful to transform the observation into different frames. If the original observation order is not preserved, the helper functions in ObervationHelper breaks.

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

reward(state, action, next_state, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

_simulation_post_step()[source]: Allows information to be accesed at every intermediate step after taking a step in the mujoco simulation. Can be usefull to average forces over all intermediate steps.

_create_observation(state)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via obs_help.add_obs(self, name, o_type, length, min_value, max_value)

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

_create_info_dictionary(obs, action)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The information dictionary.

class AirHockeyHit(random_init=False, action_penalty=0.001, init_robot_state='right', gamma=0.99, horizon=120, env_noise=False, obs_noise=False, timestep=0.004166666666666667, n_intermediate_steps=1, **viewer_params)[source]

Bases: AirHockeySingle

Class for the air hockey hitting task. The agent tries to get close to the puck if the hitting does not happen. And will get bonus reward if the robot scores a goal.

__init__(random_init=False, action_penalty=0.001, init_robot_state='right', gamma=0.99, horizon=120, env_noise=False, obs_noise=False, timestep=0.004166666666666667, n_intermediate_steps=1, **viewer_params)[source]

Constructor

Parameters:

random_init (bool, False) – If true, initialize the puck at random position.
action_penalty (float, 1e-3) – The penalty of the action on the reward at each time step
init_robot_state (string, "right") – The configuration in which the robot is initialized. “right”, “left”, “random” available.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

reward(state, action, next_state, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

class AirHockeyDefend(random_init=False, action_penalty=0.001, init_velocity_range=(1, 2.2), gamma=0.99, horizon=500, env_noise=False, obs_noise=False, timestep=0.004166666666666667, n_intermediate_steps=1, **viewer_params)[source]

Bases: AirHockeySingle

Class for the air hockey defending task. The agent tries to stop the puck at the line x=-0.6. If the puck get into the goal, it will get a punishment.

__init__(random_init=False, action_penalty=0.001, init_velocity_range=(1, 2.2), gamma=0.99, horizon=500, env_noise=False, obs_noise=False, timestep=0.004166666666666667, n_intermediate_steps=1, **viewer_params)[source]

Constructor

Parameters:

random_init (bool, False) – If true, initialize the puck at random position .
action_penalty (float, 1e-3) – The penalty of the action on the reward at each time step
init_velocity_range ((float, float), (1, 2.2)) – The range in which the initial velocity is initialized

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

reward(state, action, next_state, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

is_absorbing(state)[source]

Check whether the given state is an absorbing state or not.

Parameters:: obs (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

class AirHockeyPrepare(random_init=False, action_penalty=0.001, sub_problem='side', gamma=0.99, horizon=500, env_noise=False, obs_noise=False, timestep=0.004166666666666667, n_intermediate_steps=1, **viewer_params)[source]

Bases: AirHockeySingle

Class for the air hockey preparation task. The agent tries to improve the puck position to y = 0. If the agent looses control of the puck, it will get a punishment.

__init__(random_init=False, action_penalty=0.001, sub_problem='side', gamma=0.99, horizon=500, env_noise=False, obs_noise=False, timestep=0.004166666666666667, n_intermediate_steps=1, **viewer_params)[source]

Constructor

Parameters:

random_init (bool, False) – If true, initialize the puck at random position .
action_penalty (float, 1e-3) – The penalty of the action on the reward at each time step
sub_problem (string, "side") – determines which area is considered for the initial puck position. Currently “side” and “bottom” are available.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

reward(state, action, next_state, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

is_absorbing(state)[source]

Check whether the given state is an absorbing state or not.

Parameters:: obs (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

class AirHockeyRepel(random_init=False, action_penalty=0.001, init_velocity_range=(1, 2.2), gamma=0.99, horizon=500, env_noise=False, obs_noise=False, timestep=0.004166666666666667, n_intermediate_steps=1, **viewer_params)[source]

Bases: AirHockeySingle

Class for the air hockey repel task. The agent tries repel the puck to the opponent. If the puck get into the goal, it will get a punishment.

__init__(random_init=False, action_penalty=0.001, init_velocity_range=(1, 2.2), gamma=0.99, horizon=500, env_noise=False, obs_noise=False, timestep=0.004166666666666667, n_intermediate_steps=1, **viewer_params)[source]

Constructor

Parameters:

random_init (bool, False) – If true, initialize the puck at random position .
action_penalty (float, 1e-3) – The penalty of the action on the reward at each time step
init_velocity_range ((float, float), (1, 2.2)) – The range in which the initial velocity is initialized

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

reward(state, action, next_state, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

is_absorbing(state)[source]

Check whether the given state is an absorbing state or not.

Parameters:: obs (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

Ball In A Cup

class BallInACup[source]

Bases: MuJoCo

Mujoco simulation of Ball In A Cup task, using Barret WAM robot.

__init__()[source]: Constructor.

reward(cur_obs, action, obs, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

is_absorbing(state)[source]

Check whether the given state is an absorbing state or not.

Parameters:: obs (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

Locomotion

class Ant(gamma=0.99, horizon=1000, forward_reward_weight=1.0, ctrl_cost_weight=0.5, contact_cost_weight=0.0005, healthy_reward=1.0, terminate_when_unhealthy=True, healthy_z_range=(0.2, 1.0), contact_force_range=(-1.0, 1.0), reset_noise_scale=0.1, n_substeps=5, exclude_current_positions_from_observation=True, use_contact_forces=False, **viewer_params)[source]

Bases: MuJoCo

The Ant MuJoCo environment as presented in: “High-Dimensional Continuous Control Using Generalized Advantage Estimation”. John Schulman et. al.. 2015. and implemented in Gymnasium

__init__(gamma=0.99, horizon=1000, forward_reward_weight=1.0, ctrl_cost_weight=0.5, contact_cost_weight=0.0005, healthy_reward=1.0, terminate_when_unhealthy=True, healthy_z_range=(0.2, 1.0), contact_force_range=(-1.0, 1.0), reset_noise_scale=0.1, n_substeps=5, exclude_current_positions_from_observation=True, use_contact_forces=False, **viewer_params)[source]: Constructor.

_modify_mdp_info(mdp_info)[source]

This method can be overridden to modify the automatically generated MDPInfo data structure. By default, returns the given mdp_info structure unchanged.

Parameters:: mdp_info (MDPInfo) – the MDPInfo structure automatically computed by the environment.
Returns:: The modified MDPInfo data structure.

_create_observation(obs)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via obs_help.add_obs(self, name, o_type, length, min_value, max_value)

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

is_absorbing(obs)[source]

Check whether the given state is an absorbing state or not.

Parameters:: obs (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

reward(obs, action, next_obs, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

_create_info_dictionary(obs, action)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The information dictionary.

get_states()[source]: Return the position and velocity joint states of the model

class HalfCheetah(gamma=0.99, horizon=1000, forward_reward_weight=1.0, ctrl_cost_weight=0.1, reset_noise_scale=0.1, n_substeps=5, exclude_current_positions_from_observation=True, **viewer_params)[source]

Bases: MuJoCo

The HalfCheetah MuJoCo environment as presented in: “A Cat-Like Robot Real-Time Learning to Run”. Pawel Wawrzynski. 2009.

__init__(gamma=0.99, horizon=1000, forward_reward_weight=1.0, ctrl_cost_weight=0.1, reset_noise_scale=0.1, n_substeps=5, exclude_current_positions_from_observation=True, **viewer_params)[source]: Constructor.

_modify_mdp_info(mdp_info)[source]

This method can be overridden to modify the automatically generated MDPInfo data structure. By default, returns the given mdp_info structure unchanged.

Parameters:: mdp_info (MDPInfo) – the MDPInfo structure automatically computed by the environment.
Returns:: The modified MDPInfo data structure.

_create_observation(obs)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via obs_help.add_obs(self, name, o_type, length, min_value, max_value)

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

is_absorbing(obs)[source]

Check whether the given state is an absorbing state or not.

Parameters:: obs (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

reward(obs, action, next_obs, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

_create_info_dictionary(obs, action)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The information dictionary.

class Hopper(gamma=0.99, horizon=1000, forward_reward_weight=1.0, ctrl_cost_weight=0.001, healthy_reward=1.0, terminate_when_unhealthy=True, healthy_state_range=(-100.0, 100.0), healthy_z_range=(0.7, inf), healthy_angle_range=(-0.2, 0.2), reset_noise_scale=0.005, n_substeps=4, exclude_current_positions_from_observation=True, **viewer_params)[source]

Bases: MuJoCo

The Hopper MuJoCo environment as presented in: “Infinite-Horizon Model Predictive Control for Periodic Tasks with Contacts”. Tom Erez et. al.. 2012.

__init__(gamma=0.99, horizon=1000, forward_reward_weight=1.0, ctrl_cost_weight=0.001, healthy_reward=1.0, terminate_when_unhealthy=True, healthy_state_range=(-100.0, 100.0), healthy_z_range=(0.7, inf), healthy_angle_range=(-0.2, 0.2), reset_noise_scale=0.005, n_substeps=4, exclude_current_positions_from_observation=True, **viewer_params)[source]: Constructor.

_modify_mdp_info(mdp_info)[source]

This method can be overridden to modify the automatically generated MDPInfo data structure. By default, returns the given mdp_info structure unchanged.

Parameters:: mdp_info (MDPInfo) – the MDPInfo structure automatically computed by the environment.
Returns:: The modified MDPInfo data structure.

_create_observation(obs)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via obs_help.add_obs(self, name, o_type, length, min_value, max_value)

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

_is_within_state_range()[source]: Check if state variables are within the healthy range.

_is_within_z_range(obs)[source]: Check if Z position of torso is within the healthy range.

_is_within_angle_range(obs)[source]: Check if Y angle of torso is within the healthy range.

_is_healthy(obs)[source]: Check if the agent is healthy.

is_absorbing(obs)[source]: Return True if the agent is unhealthy and terminate_when_unhealthy is True.

_get_healthy_reward(obs)[source]: Return the healthy reward if the agent is healthy, else 0.

_get_ctrl_cost(action)[source]: Return the control cost.

reward(obs, action, next_obs, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

_create_info_dictionary(obs, action)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The information dictionary.

get_states()[source]: Return the position and velocity joint states of the model

class Walker2D(gamma=0.99, horizon=1000, forward_reward_weight=1.0, ctrl_cost_weight=0.001, healthy_reward=1.0, terminate_when_unhealthy=True, healthy_z_range=(0.8, 2.0), healthy_angle_range=(-1.0, 1.0), reset_noise_scale=0.005, exclude_current_positions_from_observation=True, n_substeps=4, **viewer_params)[source]

Bases: MuJoCo

Mujoco simulation of Walker2d task based on the Hopper environment.

__init__(gamma=0.99, horizon=1000, forward_reward_weight=1.0, ctrl_cost_weight=0.001, healthy_reward=1.0, terminate_when_unhealthy=True, healthy_z_range=(0.8, 2.0), healthy_angle_range=(-1.0, 1.0), reset_noise_scale=0.005, exclude_current_positions_from_observation=True, n_substeps=4, **viewer_params)[source]: Constructor.

_modify_mdp_info(mdp_info)[source]

This method can be overridden to modify the automatically generated MDPInfo data structure. By default, returns the given mdp_info structure unchanged.

Parameters:: mdp_info (MDPInfo) – the MDPInfo structure automatically computed by the environment.
Returns:: The modified MDPInfo data structure.

_create_observation(obs)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via obs_help.add_obs(self, name, o_type, length, min_value, max_value)

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

_is_within_z_range(obs)[source]: Check if Z position of torso is within the healthy range.

_is_within_angle_range(obs)[source]: Check if y-angle of torso is within the healthy range.

is_absorbing(obs)[source]

Check whether the given state is an absorbing state or not.

Parameters:: obs (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

_get_healthy_reward(obs)[source]: Return the healthy reward if the agent is healthy, else 0.

_get_ctrl_cost(action)[source]: Return the control cost.

reward(obs, action, next_obs, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

_create_info_dictionary(obs, action)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The information dictionary.

Manipulation

class Panda(xml_path, gamma, horizon, n_substeps, additional_data_spec=None, collision_groups=None, actuation_spec=None, keyframe='home', **viewer_params)[source]

Bases: MuJoCo

__init__(xml_path, gamma, horizon, n_substeps, additional_data_spec=None, collision_groups=None, actuation_spec=None, keyframe='home', **viewer_params)[source]

Constructor.

Parameters:

xml_file (str/xml handle) – A string with a path to the xml or an Mujoco xml handle.
actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent. Can be left empty when all actuators should be used;
observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). They are combined with a key, which is used to access the data. An entry in the list is given by: (key, name, type). The name can later be used to retrieve specific observations;
gamma (float) – The discounting factor of the environment;
horizon (int) – The maximum horizon for the environment;
timestep (float) – The timestep used by the MuJoCo simulator. If None, the default timestep specified in the XML will be used;
n_substeps (int, 1) – The number of substeps to use by the MuJoCo simulator. An action given by the agent will be applied for n_substeps before the agent receives the next observation and can act accordingly;
n_intermediate_steps (int, 1) – The number of steps between every action taken by the agent. Similar to n_substeps but allows the user to modify, control and access intermediate states.
additional_data_spec (list, None) – A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, name, type) key is a string for later referencing in the “read_data” and “write_data” methods. The name is the name of the object in the XML specification and the type is the ObservationType;
collision_groups (list, None) – A list containing groups of geoms for which collisions should be checked during simulation via check_collision. The entries are given as: (key, geom_names), where key is a string for later referencing in the “check_collision” method, and geom_names is a list of geom names in the XML specification.
max_joint_vel (list, None) – A list with the maximum joint velocities which are provided in the mdp_info. The list has to define a maximum velocity for every occurrence of JOINT_VEL in the observation_spec. The velocity will not be limited in mujoco
**viewer_params – other parameters to be passed to the viewer. See MujocoViewer documentation for the available options.

_modify_mdp_info(mdp_info)[source]

This method can be overridden to modify the automatically generated MDPInfo data structure. By default, returns the given mdp_info structure unchanged.

Parameters:: mdp_info (MDPInfo) – the MDPInfo structure automatically computed by the environment.
Returns:: The modified MDPInfo data structure.

_create_observation(obs)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via obs_help.add_obs(self, name, o_type, length, min_value, max_value)

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

_simulation_pre_step()[source]

Allows information to be accesed and changed at every intermediate step before taking a step in the mujoco simulation. Can be usefull to apply an external force/torque to the specified bodies.

ex: apply a force over X to the torso: force = [200, 0, 0] torque = [0, 0, 0] self.sim.data.xfrc_applied[self.sim.model._body_name2id[“torso”],:] = force + torque

class Reach(gamma=0.99, horizon=200, gripper_goal_distance_reward_weight=-2.0, gripper_goal_rotation_reward_weight=-1.0, ctrl_cost_weight=-0.0001, n_substeps=5, **viewer_params)[source]

Bases: Panda

__init__(gamma=0.99, horizon=200, gripper_goal_distance_reward_weight=-2.0, gripper_goal_rotation_reward_weight=-1.0, ctrl_cost_weight=-0.0001, n_substeps=5, **viewer_params)[source]

Constructor.

Parameters:

xml_file (str/xml handle) – A string with a path to the xml or an Mujoco xml handle.
actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent. Can be left empty when all actuators should be used;
observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). They are combined with a key, which is used to access the data. An entry in the list is given by: (key, name, type). The name can later be used to retrieve specific observations;
gamma (float) – The discounting factor of the environment;
horizon (int) – The maximum horizon for the environment;
timestep (float) – The timestep used by the MuJoCo simulator. If None, the default timestep specified in the XML will be used;
n_substeps (int, 1) – The number of substeps to use by the MuJoCo simulator. An action given by the agent will be applied for n_substeps before the agent receives the next observation and can act accordingly;
n_intermediate_steps (int, 1) – The number of steps between every action taken by the agent. Similar to n_substeps but allows the user to modify, control and access intermediate states.
additional_data_spec (list, None) – A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, name, type) key is a string for later referencing in the “read_data” and “write_data” methods. The name is the name of the object in the XML specification and the type is the ObservationType;
collision_groups (list, None) – A list containing groups of geoms for which collisions should be checked during simulation via check_collision. The entries are given as: (key, geom_names), where key is a string for later referencing in the “check_collision” method, and geom_names is a list of geom names in the XML specification.
max_joint_vel (list, None) – A list with the maximum joint velocities which are provided in the mdp_info. The list has to define a maximum velocity for every occurrence of JOINT_VEL in the observation_spec. The velocity will not be limited in mujoco
**viewer_params – other parameters to be passed to the viewer. See MujocoViewer documentation for the available options.

_modify_mdp_info(mdp_info)[source]

This method can be overridden to modify the automatically generated MDPInfo data structure. By default, returns the given mdp_info structure unchanged.

Parameters:: mdp_info (MDPInfo) – the MDPInfo structure automatically computed by the environment.
Returns:: The modified MDPInfo data structure.

_create_observation(obs)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via obs_help.add_obs(self, name, o_type, length, min_value, max_value)

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

reward(obs, action, next_obs, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

is_absorbing(obs)[source]

Check whether the given state is an absorbing state or not.

Parameters:: obs (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

_create_info_dictionary(obs, action)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The information dictionary.

class Pick(gamma=0.99, horizon=200, gripper_cube_distance_reward_weight=1.0, cube_goal_distance_reward_weight=20.0, cube_goal_rotation_reward_weight=10.0, ctrl_cost_weight=-0.0001, contact_cost_weight=-0.0001, n_substeps=5, contact_force_range=(-1.0, 1.0), **viewer_params)[source]

Bases: Panda

__init__(gamma=0.99, horizon=200, gripper_cube_distance_reward_weight=1.0, cube_goal_distance_reward_weight=20.0, cube_goal_rotation_reward_weight=10.0, ctrl_cost_weight=-0.0001, contact_cost_weight=-0.0001, n_substeps=5, contact_force_range=(-1.0, 1.0), **viewer_params)[source]

Constructor.

Parameters:

xml_file (str/xml handle) – A string with a path to the xml or an Mujoco xml handle.
actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent. Can be left empty when all actuators should be used;
observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). They are combined with a key, which is used to access the data. An entry in the list is given by: (key, name, type). The name can later be used to retrieve specific observations;
gamma (float) – The discounting factor of the environment;
horizon (int) – The maximum horizon for the environment;
timestep (float) – The timestep used by the MuJoCo simulator. If None, the default timestep specified in the XML will be used;
n_substeps (int, 1) – The number of substeps to use by the MuJoCo simulator. An action given by the agent will be applied for n_substeps before the agent receives the next observation and can act accordingly;
n_intermediate_steps (int, 1) – The number of steps between every action taken by the agent. Similar to n_substeps but allows the user to modify, control and access intermediate states.
additional_data_spec (list, None) – A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, name, type) key is a string for later referencing in the “read_data” and “write_data” methods. The name is the name of the object in the XML specification and the type is the ObservationType;
collision_groups (list, None) – A list containing groups of geoms for which collisions should be checked during simulation via check_collision. The entries are given as: (key, geom_names), where key is a string for later referencing in the “check_collision” method, and geom_names is a list of geom names in the XML specification.
max_joint_vel (list, None) – A list with the maximum joint velocities which are provided in the mdp_info. The list has to define a maximum velocity for every occurrence of JOINT_VEL in the observation_spec. The velocity will not be limited in mujoco
**viewer_params – other parameters to be passed to the viewer. See MujocoViewer documentation for the available options.

_modify_mdp_info(mdp_info)[source]

This method can be overridden to modify the automatically generated MDPInfo data structure. By default, returns the given mdp_info structure unchanged.

Parameters:: mdp_info (MDPInfo) – the MDPInfo structure automatically computed by the environment.
Returns:: The modified MDPInfo data structure.

_create_observation(obs)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via obs_help.add_obs(self, name, o_type, length, min_value, max_value)

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

reward(obs, action, next_obs, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

is_absorbing(obs)[source]

Check whether the given state is an absorbing state or not.

Parameters:: obs (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

_create_info_dictionary(obs, action)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The information dictionary.

class Push(gamma=0.99, horizon=200, gripper_cube_distance_reward_weight=-1.0, cube_goal_distance_reward_weight=-2.0, ctrl_cost_weight=-0.0001, contact_cost_weight=-0.0001, n_substeps=5, contact_force_range=(-1.0, 1.0), **viewer_params)[source]

Bases: Panda

__init__(gamma=0.99, horizon=200, gripper_cube_distance_reward_weight=-1.0, cube_goal_distance_reward_weight=-2.0, ctrl_cost_weight=-0.0001, contact_cost_weight=-0.0001, n_substeps=5, contact_force_range=(-1.0, 1.0), **viewer_params)[source]

Constructor.

Parameters:

xml_file (str/xml handle) – A string with a path to the xml or an Mujoco xml handle.
actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent. Can be left empty when all actuators should be used;
observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). They are combined with a key, which is used to access the data. An entry in the list is given by: (key, name, type). The name can later be used to retrieve specific observations;
gamma (float) – The discounting factor of the environment;
horizon (int) – The maximum horizon for the environment;
timestep (float) – The timestep used by the MuJoCo simulator. If None, the default timestep specified in the XML will be used;
n_substeps (int, 1) – The number of substeps to use by the MuJoCo simulator. An action given by the agent will be applied for n_substeps before the agent receives the next observation and can act accordingly;
n_intermediate_steps (int, 1) – The number of steps between every action taken by the agent. Similar to n_substeps but allows the user to modify, control and access intermediate states.
additional_data_spec (list, None) – A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, name, type) key is a string for later referencing in the “read_data” and “write_data” methods. The name is the name of the object in the XML specification and the type is the ObservationType;
collision_groups (list, None) – A list containing groups of geoms for which collisions should be checked during simulation via check_collision. The entries are given as: (key, geom_names), where key is a string for later referencing in the “check_collision” method, and geom_names is a list of geom names in the XML specification.
max_joint_vel (list, None) – A list with the maximum joint velocities which are provided in the mdp_info. The list has to define a maximum velocity for every occurrence of JOINT_VEL in the observation_spec. The velocity will not be limited in mujoco
**viewer_params – other parameters to be passed to the viewer. See MujocoViewer documentation for the available options.

_modify_mdp_info(mdp_info)[source]

This method can be overridden to modify the automatically generated MDPInfo data structure. By default, returns the given mdp_info structure unchanged.

Parameters:: mdp_info (MDPInfo) – the MDPInfo structure automatically computed by the environment.
Returns:: The modified MDPInfo data structure.

_create_observation(obs)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via obs_help.add_obs(self, name, o_type, length, min_value, max_value)

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

reward(obs, action, next_obs, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

is_absorbing(obs)[source]

Check whether the given state is an absorbing state or not.

Parameters:: obs (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

_create_info_dictionary(obs, action)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The information dictionary.

class PegInsertion(gamma=0.99, horizon=300, alignment_reward_weight=1.0, insertion_reward_weight=15.0, rotation_reward_weight=2.0, ctrl_cost_weight=-0.0001, contact_cost_weight=0, n_substeps=5, contact_force_range=(-1.0, 1.0), **viewer_params)[source]

Bases: Panda

__init__(gamma=0.99, horizon=300, alignment_reward_weight=1.0, insertion_reward_weight=15.0, rotation_reward_weight=2.0, ctrl_cost_weight=-0.0001, contact_cost_weight=0, n_substeps=5, contact_force_range=(-1.0, 1.0), **viewer_params)[source]

Constructor.

Parameters:

xml_file (str/xml handle) – A string with a path to the xml or an Mujoco xml handle.
actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent. Can be left empty when all actuators should be used;
observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). They are combined with a key, which is used to access the data. An entry in the list is given by: (key, name, type). The name can later be used to retrieve specific observations;
gamma (float) – The discounting factor of the environment;
horizon (int) – The maximum horizon for the environment;
timestep (float) – The timestep used by the MuJoCo simulator. If None, the default timestep specified in the XML will be used;
n_substeps (int, 1) – The number of substeps to use by the MuJoCo simulator. An action given by the agent will be applied for n_substeps before the agent receives the next observation and can act accordingly;
n_intermediate_steps (int, 1) – The number of steps between every action taken by the agent. Similar to n_substeps but allows the user to modify, control and access intermediate states.
additional_data_spec (list, None) – A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, name, type) key is a string for later referencing in the “read_data” and “write_data” methods. The name is the name of the object in the XML specification and the type is the ObservationType;
collision_groups (list, None) – A list containing groups of geoms for which collisions should be checked during simulation via check_collision. The entries are given as: (key, geom_names), where key is a string for later referencing in the “check_collision” method, and geom_names is a list of geom names in the XML specification.
max_joint_vel (list, None) – A list with the maximum joint velocities which are provided in the mdp_info. The list has to define a maximum velocity for every occurrence of JOINT_VEL in the observation_spec. The velocity will not be limited in mujoco
**viewer_params – other parameters to be passed to the viewer. See MujocoViewer documentation for the available options.

_modify_mdp_info(mdp_info)[source]

This method can be overridden to modify the automatically generated MDPInfo data structure. By default, returns the given mdp_info structure unchanged.

Parameters:: mdp_info (MDPInfo) – the MDPInfo structure automatically computed by the environment.
Returns:: The modified MDPInfo data structure.

_create_observation(obs)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via obs_help.add_obs(self, name, o_type, length, min_value, max_value)

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The environment observation.

reward(obs, action, next_obs, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

obs (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_obs (np.array) – the state reached after applying the given action.
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

is_absorbing(obs)[source]

Check whether the given state is an absorbing state or not.

Parameters:: obs (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

setup(obs)[source]: A function that allows to execute setup code after an environment reset.

_create_info_dictionary(obs, action)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray) – the generated observation
Returns:: The information dictionary.

Puddle World

class PuddleWorld(start=None, goal=None, goal_threshold=0.1, noise_step=0.025, noise_reward=0, reward_goal=0.0, thrust=0.05, puddle_center=None, puddle_width=None, gamma=0.99, horizon=5000)[source]

Bases: Environment

Puddle world as presented in: “Off-Policy Actor-Critic”. Degris T. et al.. 2012.

__init__(start=None, goal=None, goal_threshold=0.1, noise_step=0.025, noise_reward=0, reward_goal=0.0, thrust=0.05, puddle_center=None, puddle_width=None, gamma=0.99, horizon=5000)[source]

Constructor.

Parameters:

start (np.array, None) – starting position of the agent;
goal (np.array, None) – goal position;
goal_threshold (float, .1) – distance threshold of the agent from the goal to consider it reached;
noise_step (float, .025) – noise in actions;
noise_reward (float, 0) – standard deviation of gaussian noise in reward;
reward_goal (float, 0) – reward obtained reaching goal state;
thrust (float, .05) – distance walked during each action;
puddle_center (np.array, None) – center of the puddle;
puddle_width (np.array, None) – width of the puddle;
gamma (float, .99) – discount factor.
horizon (int, 5000) – horizon of the problem;

reset(state=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

render(record=False)[source]

Render the environment to screen.

Parameters:: record (bool, False) – whether the visualized image should be returned or not.
Returns:: The visualized image, or None if the record flag is set to false.

stop()[source]: Method used to stop an env. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Pybullet

class PyBullet(files, actuation_spec, observation_spec, gamma, horizon, timestep=0.004166666666666667, n_intermediate_steps=1, enforce_joint_velocity_limits=False, debug_gui=False, **viewer_params)[source]

Bases: Environment

Class to create a Mushroom environment using the PyBullet simulator.

__init__(files, actuation_spec, observation_spec, gamma, horizon, timestep=0.004166666666666667, n_intermediate_steps=1, enforce_joint_velocity_limits=False, debug_gui=False, **viewer_params)[source]

Constructor.

Parameters:

files (dict) – dictionary of the URDF/MJCF/SDF files to load (key) and parameters dictionary (value);
actuation_spec (list) – A list of tuples specifying the names of the joints which should be controllable by the agent and their control mode. Can be left empty when all actuators should be used in position control;
observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). An entry in the list is given by: (name, type);
gamma (float) – The discounting factor of the environment;
horizon (int) – The maximum horizon for the environment;
timestep (float, 0.00416666666) – The timestep used by the PyBullet simulator;
n_intermediate_steps (int) – The number of steps between every action taken by the agent. Allows the user to modify, control and access intermediate states;
enforce_joint_velocity_limits (bool, False) – flag to enforce the velocity limits;
debug_gui (bool, False) – flag to activate the default pybullet visualizer, that can be used for debug purposes;
**viewer_params – other parameters to be passed to the viewer. See PyBulletViewer documentation for the available options.

seed(seed)[source]

Set the seed of the environment.

Parameters:: seed (float) – the value of the seed.

reset(state=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

render(record=False)[source]

Render the environment to screen.

Parameters:: record (bool, False) – whether the visualized image should be returned or not.
Returns:: The visualized image, or None if the record flag is set to false.

stop()[source]: Method used to stop an env. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

get_sim_state(obs, name, obs_type)[source]

Returns a specific observation value

Parameters:

obs (np.ndarray) – the observation vector;
name (str) – the name of the object to consider;
obs_type (PyBulletObservationType) – the type of observation to be used.

Returns:

The required elements of the input state vector.

_modify_mdp_info(mdp_info)[source]

This method can be overridden to modify the automatically generated MDPInfo data structure. By default, returns the given mdp_info structure unchanged.

Parameters:: mdp_info (MDPInfo) – the MDPInfo structure automatically computed by the environment.
Returns:: The modified MDPInfo data structure.

_create_observation(state)[source]

This method can be overridden to ctreate an observation vector from the simulator state vector. By default, returns the simulator state vector unchanged.

Parameters:: state (np.ndarray) – the simulator state vector.
Returns:: The environment observation.

_preprocess_action(action)[source]

Compute a transformation of the action provided to the environment.

Parameters:: action (np.ndarray) – numpy array with the actions provided to the environment.
Returns:: The action to be used for the current step

_step_init(state, action)[source]: Allows information to be initialized at the start of a step.

_compute_action(state, action)[source]

Compute a transformation of the action at every intermediate step. Useful to add control signals simulated directly in python.

Parameters:

state (np.ndarray) – numpy array with the current state of teh simulation;
action (np.ndarray) – numpy array with the actions, provided at every step.

Returns:

The action to be set in the actual pybullet simulation.

_simulation_pre_step()[source]: Allows information to be accesed and changed at every intermediate step before taking a step in the pybullet simulation. Can be usefull to apply an external force/torque to the specified bodies.

_simulation_post_step()[source]: Allows information to be accesed at every intermediate step after taking a step in the pybullet simulation. Can be usefull to average forces over all intermediate steps.

_step_finalize()[source]: Allows information to be accesed at the end of a step.

_custom_load_models()[source]

Allows to custom load a set of objects in the simulation

Returns:: A dictionary with the names and the ids of the loaded objects

reward(state, action, next_state, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

state (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_state (np.array) – the state reached after applying the given action;
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

is_absorbing(state)[source]

Check whether the given state is an absorbing state or not.

Parameters:: state (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

setup(state)[source]

A function that allows to execute setup code after an environment reset.

Parameters:

state (np.ndarray) – the state to be restored. If the state should be
environment (chosen by the)
this (state is None. Environments can ignore)
programmatically. (value if the initial state cannot be set)

Air Hockey

class AirHockeyBaseBullet(gamma=0.99, horizon=500, n_agents=1, env_noise=False, obs_noise=False, obs_delay=False, torque_control=True, step_action_function=None, timestep=0.004166666666666667, n_intermediate_steps=1, debug_gui=False, table_boundary_terminate=False)[source]

Bases: PyBullet

Base class for air hockey environment. The environment is designed for 3 joints planar robot playing Air-Hockey

__init__(gamma=0.99, horizon=500, n_agents=1, env_noise=False, obs_noise=False, obs_delay=False, torque_control=True, step_action_function=None, timestep=0.004166666666666667, n_intermediate_steps=1, debug_gui=False, table_boundary_terminate=False)[source]

Constructor.

Parameters:

gamma (float, 0.99) – discount factor;
horizon (int, 500) – horizon of the task;
n_agents (int, 1) – number of agents;
env_noise (bool, False) – If true, the puck’s movement is affected by the air-flow noise;
obs_noise (bool, False) – If true, the noise is added in the observation;
obs_delay (bool, False) – If true, velocity is observed by the low-pass filter;
control (bool, True) – If false, the robot in position control mode;
step_action_function (object, None) – A callable function to warp-up the policy action to environment command.
table_boundary_terminate (bool, False) – Episode terminates if the mallet is outside the boundary

_compute_action(state, action)[source]

Compute a transformation of the action at every intermediate step. Useful to add control signals simulated directly in python.

Parameters:

state (np.ndarray) – numpy array with the current state of teh simulation;
action (np.ndarray) – numpy array with the actions, provided at every step.

Returns:

The action to be set in the actual pybullet simulation.

_simulation_pre_step()[source]: Allows information to be accesed and changed at every intermediate step before taking a step in the pybullet simulation. Can be usefull to apply an external force/torque to the specified bodies.

is_absorbing(state)[source]

Check whether the given state is an absorbing state or not.

Parameters:: state (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

class AirHockeySingleBullet(gamma=0.99, horizon=120, env_noise=False, obs_noise=False, obs_delay=False, torque_control=True, step_action_function=None, timestep=0.004166666666666667, n_intermediate_steps=1, debug_gui=False, table_boundary_terminate=False, number_flags=0)[source]

Bases: AirHockeyBaseBullet

Base class for single agent air hockey tasks.

__init__(gamma=0.99, horizon=120, env_noise=False, obs_noise=False, obs_delay=False, torque_control=True, step_action_function=None, timestep=0.004166666666666667, n_intermediate_steps=1, debug_gui=False, table_boundary_terminate=False, number_flags=0)[source]

Constructor.

Parameters:: number_flags (int, 0) – Amount of flags which are added to the observation space

_modify_mdp_info(mdp_info)[source]: puck position indexes: [0, 1] puck velocity indexes: [7, 8, 9] joint position indexes: [13, 14, 15] joint velocity indexes: [16, 17, 18]

_create_observation(state)[source]

This method can be overridden to ctreate an observation vector from the simulator state vector. By default, returns the simulator state vector unchanged.

Parameters:: state (np.ndarray) – the simulator state vector.
Returns:: The environment observation.

class AirHockeyHitBullet(gamma=0.99, horizon=120, env_noise=False, obs_noise=False, obs_delay=False, torque_control=True, step_action_function=None, timestep=0.004166666666666667, n_intermediate_steps=1, debug_gui=False, random_init=False, action_penalty=0.001, table_boundary_terminate=False, init_robot_state='right')[source]

Bases: AirHockeySingleBullet

Class for the air hockey hitting task. The agent tries to get close to the puck if the hitting does not happen. And will get bonus reward if the robot scores a goal.

__init__(gamma=0.99, horizon=120, env_noise=False, obs_noise=False, obs_delay=False, torque_control=True, step_action_function=None, timestep=0.004166666666666667, n_intermediate_steps=1, debug_gui=False, random_init=False, action_penalty=0.001, table_boundary_terminate=False, init_robot_state='right')[source]

Constructor

Parameters:

random_init (bool, False) – If true, initialize the puck at random position.
action_penalty (float, 1e-3) – The penalty of the action on the reward at each time step
init_robot_state (string, "right") – The configuration in which the robot is initialized. “right”, “left”, “random” available

setup(state)[source]

A function that allows to execute setup code after an environment reset.

Parameters:

state (np.ndarray) – the state to be restored. If the state should be
environment (chosen by the)
this (state is None. Environments can ignore)
programmatically. (value if the initial state cannot be set)

reward(state, action, next_state, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

state (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_state (np.array) – the state reached after applying the given action;
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

is_absorbing(state)[source]

Check whether the given state is an absorbing state or not.

Parameters:: state (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

_simulation_post_step()[source]: Allows information to be accesed at every intermediate step after taking a step in the pybullet simulation. Can be usefull to average forces over all intermediate steps.

_create_observation(state)[source]

This method can be overridden to ctreate an observation vector from the simulator state vector. By default, returns the simulator state vector unchanged.

Parameters:: state (np.ndarray) – the simulator state vector.
Returns:: The environment observation.

class AirHockeyDefendBullet(gamma=0.99, horizon=500, env_noise=False, obs_noise=False, obs_delay=False, torque_control=True, step_action_function=None, timestep=0.004166666666666667, n_intermediate_steps=1, debug_gui=False, random_init=False, action_penalty=0.001, table_boundary_terminate=False, init_velocity_range=(1, 2.2))[source]

Bases: AirHockeySingleBullet

Class for the air hockey defending task. The agent tries to stop the puck at the line x=-0.6. If the puck get into the goal, it will get a punishment.

__init__(gamma=0.99, horizon=500, env_noise=False, obs_noise=False, obs_delay=False, torque_control=True, step_action_function=None, timestep=0.004166666666666667, n_intermediate_steps=1, debug_gui=False, random_init=False, action_penalty=0.001, table_boundary_terminate=False, init_velocity_range=(1, 2.2))[source]

Constructor

Parameters:

random_init (bool, False) – If true, initialize the puck at random position .
action_penalty (float, 1e-3) – The penalty of the action on the reward at each time step
init_velocity_range ((float, float), (1, 2.2)) – The range in which the initial velocity is initialized

setup(state=None)[source]

A function that allows to execute setup code after an environment reset.

Parameters:

state (np.ndarray) – the state to be restored. If the state should be
environment (chosen by the)
this (state is None. Environments can ignore)
programmatically. (value if the initial state cannot be set)

reward(state, action, next_state, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

state (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_state (np.array) – the state reached after applying the given action;
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

is_absorbing(state)[source]

Check whether the given state is an absorbing state or not.

Parameters:: state (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

_simulation_post_step()[source]: Allows information to be accesed at every intermediate step after taking a step in the pybullet simulation. Can be usefull to average forces over all intermediate steps.

_create_observation(state)[source]

This method can be overridden to ctreate an observation vector from the simulator state vector. By default, returns the simulator state vector unchanged.

Parameters:: state (np.ndarray) – the simulator state vector.
Returns:: The environment observation.

class AirHockeyPrepareBullet(gamma=0.99, horizon=500, env_noise=False, obs_noise=False, obs_delay=False, torque_control=True, step_action_function=None, timestep=0.004166666666666667, n_intermediate_steps=1, debug_gui=False, random_init=False, action_penalty=0.001, table_boundary_terminate=False, sub_problem='side')[source]

Bases: AirHockeySingleBullet

Class for the air hockey preparation task. The agent tries to improve the puck position to y = 0. If the agent looses control of the puck, it will get a punishment.

__init__(gamma=0.99, horizon=500, env_noise=False, obs_noise=False, obs_delay=False, torque_control=True, step_action_function=None, timestep=0.004166666666666667, n_intermediate_steps=1, debug_gui=False, random_init=False, action_penalty=0.001, table_boundary_terminate=False, sub_problem='side')[source]

Constructor

Parameters:

random_init (bool, False) – If true, initialize the puck at random position .
action_penalty (float, 1e-3) – The penalty of the action on the reward at each time step
sub_problem (string, "side") – determines which area is considered for the initial puck position. Currently “side” and “bottom” are available

setup(state)[source]

A function that allows to execute setup code after an environment reset.

Parameters:

state (np.ndarray) – the state to be restored. If the state should be
environment (chosen by the)
this (state is None. Environments can ignore)
programmatically. (value if the initial state cannot be set)

reward(state, action, next_state, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

state (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_state (np.array) – the state reached after applying the given action;
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

is_absorbing(state)[source]

Check whether the given state is an absorbing state or not.

Parameters:: state (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

_simulation_post_step()[source]: Allows information to be accesed at every intermediate step after taking a step in the pybullet simulation. Can be usefull to average forces over all intermediate steps.

_create_observation(state)[source]

This method can be overridden to ctreate an observation vector from the simulator state vector. By default, returns the simulator state vector unchanged.

Parameters:: state (np.ndarray) – the simulator state vector.
Returns:: The environment observation.

class AirHockeyRepelBullet(gamma=0.99, horizon=500, env_noise=False, obs_noise=False, obs_delay=False, torque_control=True, step_action_function=None, timestep=0.004166666666666667, n_intermediate_steps=1, debug_gui=False, random_init=False, action_penalty=0.001, table_boundary_terminate=False, init_velocity_range=(1, 2.2))[source]

Bases: AirHockeySingleBullet

Class for the air hockey repel task. The agent tries repel the puck to the opponent. If the puck get into the goal, it will get a punishment.

__init__(gamma=0.99, horizon=500, env_noise=False, obs_noise=False, obs_delay=False, torque_control=True, step_action_function=None, timestep=0.004166666666666667, n_intermediate_steps=1, debug_gui=False, random_init=False, action_penalty=0.001, table_boundary_terminate=False, init_velocity_range=(1, 2.2))[source]

Constructor

Parameters:

random_init (bool, False) – If true, initialize the puck at random position .
action_penalty (float, 1e-3) – The penalty of the action on the reward at each time step
init_velocity_range ((float, float), (1, 2.2)) – The range in which the initial velocity is initialized

setup(state=None)[source]

A function that allows to execute setup code after an environment reset.

Parameters:

state (np.ndarray) – the state to be restored. If the state should be
environment (chosen by the)
this (state is None. Environments can ignore)
programmatically. (value if the initial state cannot be set)

reward(state, action, next_state, absorbing)[source]

Compute the reward based on the given transition.

Parameters:

state (np.array) – the current state of the system;
action (np.array) – the action that is applied in the current state;
next_state (np.array) – the state reached after applying the given action;
absorbing (bool) – whether next_state is an absorbing state or not.

Returns:

The reward as a floating point scalar value.

is_absorbing(state)[source]

Check whether the given state is an absorbing state or not.

Parameters:: state (np.array) – the state of the system.
Returns:: A boolean flag indicating whether this state is absorbing or not.

_simulation_post_step()[source]: Allows information to be accesed at every intermediate step after taking a step in the pybullet simulation. Can be usefull to average forces over all intermediate steps.

_create_observation(state)[source]

This method can be overridden to ctreate an observation vector from the simulator state vector. By default, returns the simulator state vector unchanged.

Parameters:: state (np.ndarray) – the simulator state vector.
Returns:: The environment observation.

Segway

class Segway(random_start=False)[source]

Bases: Environment

The Segway environment (continuous version) as presented in: “Deep Learning for Actor-Critic Reinforcement Learning”. Xueli Jia. 2015.

__init__(random_start=False)[source]

Constructor.

Parameters:: random_start (bool, False) – whether to start from a random position or from the horizontal one.

reset(state=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

render(record=False)[source]

Render the environment to screen.

Parameters:: record (bool, False) – whether the visualized image should be returned or not.
Returns:: The visualized image, or None if the record flag is set to false.

stop()[source]: Method used to stop an env. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Ship steering

class ShipSteering(small=True, n_steps_action=3)[source]

Bases: Environment

The Ship Steering environment as presented in: “Hierarchical Policy Gradient Algorithms”. Ghavamzadeh M. and Mahadevan S.. 2013.

__init__(small=True, n_steps_action=3)[source]

Constructor.

Parameters:

small (bool, True) – whether to use a small state space or not.
n_steps_action (int, 3) – number of integration intervals for each step of the env.

reset(state=None)[source]

Reset the environment to the initial state.

Parameters:: state (np.ndarray, None) – the state to set to the current state.
Returns:: The initial state and a dictionary containing the info for the episode.

step(action)[source]

Move the agent from its current state according to the action.

Parameters:: action (np.ndarray) – the action to execute.
Returns:: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also, an additional dictionary is returned (possibly empty).

render(record=False)[source]

Render the environment to screen.

Parameters:: record (bool, False) – whether the visualized image should be returned or not.
Returns:: The visualized image, or None if the record flag is set to false.

stop()[source]: Method used to stop an env. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Isaac Sim

class IsaacSim(usd_path, actuation_spec, observation_spec, backend, device, collision_between_envs, num_envs, env_spacing, gamma, horizon, timestep=None, n_substeps=1, n_intermediate_steps=1, additional_data_spec=None, collision_groups=None, action_type=ActionType.EFFORT, physics_material_spec=None, sim_params=None, solver_pos_it_count=None, solver_vel_it_count=None, ground_plane_friction=None, headless=True, camera_position=(5, 0, 4), camera_target=(0, 0, 0), render_product_size=(1280, 720))[source]

Bases: VectorizedEnvironment

Class to create a Mushroom environment using the Isaac Sim simulator.

__init__(usd_path, actuation_spec, observation_spec, backend, device, collision_between_envs, num_envs, env_spacing, gamma, horizon, timestep=None, n_substeps=1, n_intermediate_steps=1, additional_data_spec=None, collision_groups=None, action_type=ActionType.EFFORT, physics_material_spec=None, sim_params=None, solver_pos_it_count=None, solver_vel_it_count=None, ground_plane_friction=None, headless=True, camera_position=(5, 0, 4), camera_target=(0, 0, 0), render_product_size=(1280, 720))[source]

Constructor.

Parameters:

usd_path (str) – Path to usd file of the robot.
actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent.
observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). They are combined with a path, which is used to access the prim, and a list or a single string with name of the subelements of prim which should be accessed. For example a subbody or a joint of an ArticulationView. An entry in the list is given by: (key, name, type, element). The name can later be used to retrieve specific observations.
backend (str) – Backend for array operations.
device (str) – Compute device (e.g., ‘cuda:0’).
collision_between_envs (bool) – Whether inter-environment collisions are allowed.
num_envs (int) – Number of parallel environments.
env_spacing (float) – Distance between environments.
gamma (float) – The discounting factor of the environment.
horizon (int) – The maximum horizon for the environment.
timestep (float, None) – Simulation timestep.
n_substeps (int, None) – Number of substeps per simulation step. If None default timestep of isaac sim is used
n_intermediate_steps (int) – Number of intermediate control steps. Defaults to 1.
additional_data_spec (list, None) – A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, path, type) key is a string for later referencing in the “read_data” and “write_data” methods.
collision_groups (dict, None) – A list containing groups of prims for which collisions should be checked during simulation. The entries are given as (key, prim_paths), where key is a string for later reference and prim_paths is a list of paths to the prims.
action_type (ActionType) – Control type of the joints (effort, position, velocity).
physics_material_spec (list, None) – A list containing all data to create a custom physics material for each environment, which will be applied to all rigidbodies. The entries are given as the following tuples: (name, dynamic_friction, static_friction, restitution)
sim_params (dict) – Dictionary of simulation parameters for the physics context. Intended to set gpu_collision_stack_size, gpu_found_lost_aggregate_pairs_capacity, gpu_found_lost_pairs_capacity, gpu_heap_capacity, gpu_max_num_partitions, gpu_max_particle_contacts, gpu_max_rigid_contact_count, gpu_max_rigid_patch_count, gpu_max_soft_body_contacts, gpu_temp_buffer_capacity, gpu_total_aggregate_pairs_capacity.
solver_pos_it_count (torch, array) – An array with the same size as num_envs. Determines how accurately contacts, drives, and limits are resolved. Low values can lead to performance improvement;
solver_vel_it_count (torch, array) – An array with the same size as num_envs. Determines how accurately contacts, drives, and limits are resolved. Low values can lead to performance improvement
ground_plane_friction (tuple, None) – A tuple containing the static friciton, dynamic friction and restitution for the groundplane. The tuple should have the following format: (static_friction, dynamic_friction, restitution)
camera_position (tuple) – The position where the camera is placed.
camera_target (tuple) – The position the camera is aimed at.
headless (bool) – Whether to run in headless mode.
render_product_size (tuple) – (Width, Height) of the recorded and displayed image.

step_all(env_mask, action)[source]

Performs a simulation step for all active environments.

Parameters:

env_mask (torch.tensor, np.ndarray) – A boolean mask indicating which environments are active for this step.
action (torch.tensor, np.ndarray) – The actions to be applied to the active environments.

Returns:

The updated observations after the step. reward (torch.tensor, np.ndarray): The computed rewards for each environment. absorbing (torch.tensor, np.ndarray): A boolean tensor indicating if an environment is in an absorbing state. extra_info (dict): Additional information about the simulation step.

Return type:

cur_obs (torch.tensor, np.ndarray)

reset_all(env_mask, state=None)[source]

Resets the specified environments and initializes their states.

Parameters:

env_mask (torch.tensor, np.ndarray) – A boolean mask indicating which environments should be reset.
state (torch.tensor, np.ndarray) – The initial state to set for the reset environments. Defaults to None.

Returns:

The observations after resetting the environments. info (dict): Additional information about the reset environments.

Return type:

obs (torch.tensor, np.ndarray)

render_all(env_mask, record=False)[source]

Render all environments. Optionally record the frames.

Parameters:: record (bool, False) – If True, the function returns the rendered image data.

reward(obs, action, next_obs, absorbing)[source]

Compute the rewards based on the given transitions.

Parameters:

obs (torch.tensor, np.array) – the current states of the parallel environments.
action (torch.tensor, np.array) – the actions that are applied in the current states.
next_obs (torch.tensor, np.array) – the states reached after applying the given actions.
absorbing (torch.tensor, np.array) – whether next_state is an absorbing state or not.

Returns:

The rewards as an array or tensor.

is_absorbing(obs)[source]

Check whether the given states are an absorbing states or not.

Parameters:: obs (torch.tensor, np.array) – the states of the parallel environments.
Returns:: A tensor of booleans indicating whether the corresponding states are absorbing or not.

setup(env_indices, obs)[source]: A function that allows to execute setup code after an environment reset.

seed(seed=-1, torch_deterministic=False)[source]

Sets the random seed for a deterministic behavior.

Parameters:: seed (int, optional) – The seed value to set. If -1, a random seed is used. Defaults to -1.
Returns:: The seed value that was set.
Return type:: int

stop(soft=True)[source]

Resets simulation and closes viewer.

If soft is False, the function additionally clears consistent properties from the task before resetting the simulation.

Parameters:: soft (bool) – Defaults to True. - True: Performs soft reset of world. - False: Perform reset of world and clears consistent properties.

cleanup()[source]: Ends simulation.

_read_data(name, env_indices=None)[source]

Read data from isaac sim.

Parameters:: name (str) – A name referring to an entry contained in additional_data_spec or observation_spec.
Returns:: The desired data as a tensor or array.

_write_data(name, value, env_indices=None, reapply_after_reset=False)[source]

Writes data to isaac sim.

Parameters:

name (str) – A name referring to an entry contained in additional_data_spec or observation_spec.
value (torch.tensor, np.ndarra) – The data that should be written.
reapply_after_reset (bool) – Whether the written property should be reapplied after a world reset. Defaults to False.

_create_simulation_app(headless)[source]: Starts IsaacSim.

_apply_carb_settings()[source]: Apply settings for optimization.

_create_world(timestep, custom_sim_params=None)[source]

Create and configure the simulation world.

Parameters:

timestep (float, None) – The physics timestep. the default physics timestep is used.
custom_sim_params (dict, None) – A dictionary of simulation parameters to override the default ones.

_set_task(usd_path, num_envs, env_spacing, collision_between_envs, observation_spec, actuation_spec, additional_data_spec, collision_groups, physics_material_spec, camera_position, camera_target, solver_pos_it_count, solver_vel_it_count, ground_plane_friction)[source]: Set up the simulation task.

_check_collision(group1, group2, threshold=0.0, selector=None, dt=1.0)[source]

Checks whether the collision force between two collision groups exceeds a given threshold.

Parameters:

group1 (str) – The name of the first collision group.
group2 (str) – The name of the second collision group.
threshold (float, torch.tensor, np.ndarray) – The threshold value to compare against. Can be a scalar or a tensor/array of the same shape as the computed forces.
selector (Callable[[torch.tensor | np.ndarray], torch.tensor | np.ndarray], optional) – A function that processes the collision force tensor or array. If None, a default selector is used. Defaults to None.
dt (float, optional) – The time step duration used for computing forces. The function uses impulses if the default dt is used

Returns:

A boolean tensor or array indicating where the computed forces exceed the given threshold.

_get_collision_force(group1, group2, selector=None, dt=1.0)[source]

Computes the collision forces or impulses between two collision groups.

Parameters:

group1 (str) – The name of the first collision group.
group2 (str) – The name of the second collision group.
selector (Callable[[torch.tensor | np.ndarray], torch.Tensor | np.ndarray], optional) – A function that processes the collision force tensor. Defaults to selecting the maximum force of each environment
dt (float, optional) – The time step duration used for computing forces. The function returns contact impulses if the default dt is used

Returns:

A tensor or array containing the computed collision forces between the groups, processed by the selector function.

_get_collision_count(group1, group2, threshold=0.0, selector=None, dt=1.0)[source]

Counts the number of collisions between two groups, considering at most one collision between each possible pair of objects from the two groups. For example, the maximum collision count between a group with 3 objects and a group with 4 objects would be 12.

Parameters:

group1 (str) – The name of the first collision group.
group2 (str) – The name of the second collision group.
threshold (float, torch.tensor, np.ndarray) – The threshold value for detecting collisions. Can be a scalar or a tensor/array of the same shape as the computed forces.
selector (Callable[[torch.tensor | np.ndarray], torch.tensor | np.ndarray], optional) – A function that processes the collision force tensor or array. If None, the default selector computes the norm along the collision dimension. Defaults to None.
dt (float, optional) – The time step duration used for computing forces. Defaults to 1.0.

Returns:

A tensor or array containing the count of collisions

_get_net_collision_forces(group, dt=1.0)[source]

Returns the net contact forces for all objects in a collision group (simliar to a pressure sensor)

Parameters:

group (str) – The name of the collision group.
dt (float, optional) – The time step duration used for computing forces. Defaults to 1.0.

_create_observation(obs)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via observation_helper.add_obs(self, name, length, min_value, max_value)

Parameters:: obs (np.ndarray, torch.tensor) – the generated observation
Returns:: The environment observation.

_modify_observation(obs)[source]

This method can be overridden to edit the created observation. This is done after the reward and absorbing functions are evaluated. Especially useful to transform the observation into different frames. If the original observation order is not preserved, the helper functions in ObervationHelper breaks.

Parameters:: obs (np.ndarray, torch.tensor) – the generated observation
Returns:: The environment observation.

_compute_action(action)[source]

Compute a transformation of the action at every intermediate step. Useful to add control signals simulated directly in python.

Parameters:

obs (np.ndarray, torch.tensor) – current state of the simulation;
action (np.ndarray, torch.tensor) – the actions, provided at every step.

Returns:

The action to be applied in the isaac sim simulation

_preprocess_action(action)[source]

Compute a transformation of the action provided to the environment.

Parameters:: action (np.ndarray, torch.tensor) – the actions provided to the environment.
Returns:: The action to be used for the current step

_modify_mdp_info(mdp_info)[source]

This method can be overridden to modify the automatically generated MDPInfo data structure. By default, returns the given mdp_info structure unchanged.

Parameters:: mdp_info (MDPInfo) – the MDPInfo structure automatically computed by the environment.
Returns:: The modified MDPInfo data structure.

_create_info_dictionary(obs)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray, torch.tensor) – the generated observation
Returns:: The information dictionary.

_simulation_pre_step()[source]: Allows information to be accesed and changed at every intermediate step before taking a step in the isaac sim simulation. Can be usefull to apply an external force/torque to the specified bodies.

_simulation_post_step()[source]: Allows information to be accesed at every intermediate step after taking a step in the isaac sim simulation. Can be usefull to average forces over all intermediate steps.

_step_finalize(env_indices)[source]: Allows information to be accesed at the end of a step.

class A1Walking(num_envs, horizon, headless, domain_randomization=True, camera_position=(105, 0, 4), camera_target=(95, 0, 0))[source]

Bases: IsaacSim

A learning environment for training the A1 quadroped to walk.

Resembles environment implemented by Rudin et al. for “Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning”

__init__(num_envs, horizon, headless, domain_randomization=True, camera_position=(105, 0, 4), camera_target=(95, 0, 0))[source]

Constructor.

Parameters:

usd_path (str) – Path to usd file of the robot.
actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent.
observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). They are combined with a path, which is used to access the prim, and a list or a single string with name of the subelements of prim which should be accessed. For example a subbody or a joint of an ArticulationView. An entry in the list is given by: (key, name, type, element). The name can later be used to retrieve specific observations.
backend (str) – Backend for array operations.
device (str) – Compute device (e.g., ‘cuda:0’).
collision_between_envs (bool) – Whether inter-environment collisions are allowed.
num_envs (int) – Number of parallel environments.
env_spacing (float) – Distance between environments.
gamma (float) – The discounting factor of the environment.
horizon (int) – The maximum horizon for the environment.
timestep (float, None) – Simulation timestep.
n_substeps (int, None) – Number of substeps per simulation step. If None default timestep of isaac sim is used
n_intermediate_steps (int) – Number of intermediate control steps. Defaults to 1.
additional_data_spec (list, None) – A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, path, type) key is a string for later referencing in the “read_data” and “write_data” methods.
collision_groups (dict, None) – A list containing groups of prims for which collisions should be checked during simulation. The entries are given as (key, prim_paths), where key is a string for later reference and prim_paths is a list of paths to the prims.
action_type (ActionType) – Control type of the joints (effort, position, velocity).
physics_material_spec (list, None) – A list containing all data to create a custom physics material for each environment, which will be applied to all rigidbodies. The entries are given as the following tuples: (name, dynamic_friction, static_friction, restitution)
sim_params (dict) – Dictionary of simulation parameters for the physics context. Intended to set gpu_collision_stack_size, gpu_found_lost_aggregate_pairs_capacity, gpu_found_lost_pairs_capacity, gpu_heap_capacity, gpu_max_num_partitions, gpu_max_particle_contacts, gpu_max_rigid_contact_count, gpu_max_rigid_patch_count, gpu_max_soft_body_contacts, gpu_temp_buffer_capacity, gpu_total_aggregate_pairs_capacity.
solver_pos_it_count (torch, array) – An array with the same size as num_envs. Determines how accurately contacts, drives, and limits are resolved. Low values can lead to performance improvement;
solver_vel_it_count (torch, array) – An array with the same size as num_envs. Determines how accurately contacts, drives, and limits are resolved. Low values can lead to performance improvement
ground_plane_friction (tuple, None) – A tuple containing the static friciton, dynamic friction and restitution for the groundplane. The tuple should have the following format: (static_friction, dynamic_friction, restitution)
camera_position (tuple) – The position where the camera is placed.
camera_target (tuple) – The position the camera is aimed at.
headless (bool) – Whether to run in headless mode.
render_product_size (tuple) – (Width, Height) of the recorded and displayed image.

is_absorbing(obs)[source]

Check whether the given states are an absorbing states or not.

Parameters:: obs (torch.tensor, np.array) – the states of the parallel environments.
Returns:: A tensor of booleans indicating whether the corresponding states are absorbing or not.

setup(env_indices, obs)[source]: A function that allows to execute setup code after an environment reset.

_step_finalize(env_indices)[source]: Allows information to be accesed at the end of a step.

_create_observation(obs)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via observation_helper.add_obs(self, name, length, min_value, max_value)

Parameters:: obs (np.ndarray, torch.tensor) – the generated observation
Returns:: The environment observation.

_modify_observation(obs)[source]

This method can be overridden to edit the created observation. This is done after the reward and absorbing functions are evaluated. Especially useful to transform the observation into different frames. If the original observation order is not preserved, the helper functions in ObervationHelper breaks.

Parameters:: obs (np.ndarray, torch.tensor) – the generated observation
Returns:: The environment observation.

_create_info_dictionary(obs)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray, torch.tensor) – the generated observation
Returns:: The information dictionary.

_preprocess_action(action)[source]

Compute a transformation of the action provided to the environment.

Parameters:: action (np.ndarray, torch.tensor) – the actions provided to the environment.
Returns:: The action to be used for the current step

_compute_action(action)[source]

Compute a transformation of the action at every intermediate step. Useful to add control signals simulated directly in python.

Parameters:

obs (np.ndarray, torch.tensor) – current state of the simulation;
action (np.ndarray, torch.tensor) – the actions, provided at every step.

Returns:

The action to be applied in the isaac sim simulation

reward(obs, action, next_obs, absorbing)[source]

Compute the rewards based on the given transitions.

Parameters:

obs (torch.tensor, np.array) – the current states of the parallel environments.
action (torch.tensor, np.array) – the actions that are applied in the current states.
next_obs (torch.tensor, np.array) – the states reached after applying the given actions.
absorbing (torch.tensor, np.array) – whether next_state is an absorbing state or not.

Returns:

The rewards as an array or tensor.

class CartPole(num_envs, headless=True, backend='torch', device='cuda:0', camera_pos=(20, 0, 4), camera_target=(10, 0, 0))[source]

Bases: IsaacSim

__init__(num_envs, headless=True, backend='torch', device='cuda:0', camera_pos=(20, 0, 4), camera_target=(10, 0, 0))[source]

Constructor.

Parameters:

usd_path (str) – Path to usd file of the robot.
actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent.
observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). They are combined with a path, which is used to access the prim, and a list or a single string with name of the subelements of prim which should be accessed. For example a subbody or a joint of an ArticulationView. An entry in the list is given by: (key, name, type, element). The name can later be used to retrieve specific observations.
backend (str) – Backend for array operations.
device (str) – Compute device (e.g., ‘cuda:0’).
collision_between_envs (bool) – Whether inter-environment collisions are allowed.
num_envs (int) – Number of parallel environments.
env_spacing (float) – Distance between environments.
gamma (float) – The discounting factor of the environment.
horizon (int) – The maximum horizon for the environment.
timestep (float, None) – Simulation timestep.
n_substeps (int, None) – Number of substeps per simulation step. If None default timestep of isaac sim is used
n_intermediate_steps (int) – Number of intermediate control steps. Defaults to 1.
additional_data_spec (list, None) – A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, path, type) key is a string for later referencing in the “read_data” and “write_data” methods.
collision_groups (dict, None) – A list containing groups of prims for which collisions should be checked during simulation. The entries are given as (key, prim_paths), where key is a string for later reference and prim_paths is a list of paths to the prims.
action_type (ActionType) – Control type of the joints (effort, position, velocity).
physics_material_spec (list, None) – A list containing all data to create a custom physics material for each environment, which will be applied to all rigidbodies. The entries are given as the following tuples: (name, dynamic_friction, static_friction, restitution)
sim_params (dict) – Dictionary of simulation parameters for the physics context. Intended to set gpu_collision_stack_size, gpu_found_lost_aggregate_pairs_capacity, gpu_found_lost_pairs_capacity, gpu_heap_capacity, gpu_max_num_partitions, gpu_max_particle_contacts, gpu_max_rigid_contact_count, gpu_max_rigid_patch_count, gpu_max_soft_body_contacts, gpu_temp_buffer_capacity, gpu_total_aggregate_pairs_capacity.
solver_pos_it_count (torch, array) – An array with the same size as num_envs. Determines how accurately contacts, drives, and limits are resolved. Low values can lead to performance improvement;
solver_vel_it_count (torch, array) – An array with the same size as num_envs. Determines how accurately contacts, drives, and limits are resolved. Low values can lead to performance improvement
ground_plane_friction (tuple, None) – A tuple containing the static friciton, dynamic friction and restitution for the groundplane. The tuple should have the following format: (static_friction, dynamic_friction, restitution)
camera_position (tuple) – The position where the camera is placed.
camera_target (tuple) – The position the camera is aimed at.
headless (bool) – Whether to run in headless mode.
render_product_size (tuple) – (Width, Height) of the recorded and displayed image.

reward(obs, action, next_obs, absorbing)[source]

Compute the rewards based on the given transitions.

Parameters:

obs (torch.tensor, np.array) – the current states of the parallel environments.
action (torch.tensor, np.array) – the actions that are applied in the current states.
next_obs (torch.tensor, np.array) – the states reached after applying the given actions.
absorbing (torch.tensor, np.array) – whether next_state is an absorbing state or not.

Returns:

The rewards as an array or tensor.

is_absorbing(obs)[source]

Check whether the given states are an absorbing states or not.

Parameters:: obs (torch.tensor, np.array) – the states of the parallel environments.
Returns:: A tensor of booleans indicating whether the corresponding states are absorbing or not.

setup(env_indices, obs)[source]: A function that allows to execute setup code after an environment reset.

_create_info_dictionary(obs)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray, torch.tensor) – the generated observation
Returns:: The information dictionary.

class HoneyBadgerWalking(num_envs, horizon, headless, domain_randomization=True, camera_pos=(105, 0, 4), camera_target=(95, 0, 0))[source]

Bases: IsaacSim

A learning environment for training the Honey Badger quadroped to walk. Honey Badger is a Robot from MAB Robotics: https://www.mabrobotics.pl/

__init__(num_envs, horizon, headless, domain_randomization=True, camera_pos=(105, 0, 4), camera_target=(95, 0, 0))[source]

Constructor.

Parameters:

usd_path (str) – Path to usd file of the robot.
actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent.
observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). They are combined with a path, which is used to access the prim, and a list or a single string with name of the subelements of prim which should be accessed. For example a subbody or a joint of an ArticulationView. An entry in the list is given by: (key, name, type, element). The name can later be used to retrieve specific observations.
backend (str) – Backend for array operations.
device (str) – Compute device (e.g., ‘cuda:0’).
collision_between_envs (bool) – Whether inter-environment collisions are allowed.
num_envs (int) – Number of parallel environments.
env_spacing (float) – Distance between environments.
gamma (float) – The discounting factor of the environment.
horizon (int) – The maximum horizon for the environment.
timestep (float, None) – Simulation timestep.
n_substeps (int, None) – Number of substeps per simulation step. If None default timestep of isaac sim is used
n_intermediate_steps (int) – Number of intermediate control steps. Defaults to 1.
additional_data_spec (list, None) – A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, path, type) key is a string for later referencing in the “read_data” and “write_data” methods.
collision_groups (dict, None) – A list containing groups of prims for which collisions should be checked during simulation. The entries are given as (key, prim_paths), where key is a string for later reference and prim_paths is a list of paths to the prims.
action_type (ActionType) – Control type of the joints (effort, position, velocity).
physics_material_spec (list, None) – A list containing all data to create a custom physics material for each environment, which will be applied to all rigidbodies. The entries are given as the following tuples: (name, dynamic_friction, static_friction, restitution)
sim_params (dict) – Dictionary of simulation parameters for the physics context. Intended to set gpu_collision_stack_size, gpu_found_lost_aggregate_pairs_capacity, gpu_found_lost_pairs_capacity, gpu_heap_capacity, gpu_max_num_partitions, gpu_max_particle_contacts, gpu_max_rigid_contact_count, gpu_max_rigid_patch_count, gpu_max_soft_body_contacts, gpu_temp_buffer_capacity, gpu_total_aggregate_pairs_capacity.
solver_pos_it_count (torch, array) – An array with the same size as num_envs. Determines how accurately contacts, drives, and limits are resolved. Low values can lead to performance improvement;
solver_vel_it_count (torch, array) – An array with the same size as num_envs. Determines how accurately contacts, drives, and limits are resolved. Low values can lead to performance improvement
ground_plane_friction (tuple, None) – A tuple containing the static friciton, dynamic friction and restitution for the groundplane. The tuple should have the following format: (static_friction, dynamic_friction, restitution)
camera_position (tuple) – The position where the camera is placed.
camera_target (tuple) – The position the camera is aimed at.
headless (bool) – Whether to run in headless mode.
render_product_size (tuple) – (Width, Height) of the recorded and displayed image.

is_absorbing(obs)[source]

Check whether the given states are an absorbing states or not.

Parameters:: obs (torch.tensor, np.array) – the states of the parallel environments.
Returns:: A tensor of booleans indicating whether the corresponding states are absorbing or not.

setup(env_indices, obs)[source]: A function that allows to execute setup code after an environment reset.

_step_finalize(env_indices)[source]: Allows information to be accesed at the end of a step.

_create_observation(obs)[source]

This method can be overridden to create a custom observation. Should be used to append observation which have been registered via observation_helper.add_obs(self, name, length, min_value, max_value)

Parameters:: obs (np.ndarray, torch.tensor) – the generated observation
Returns:: The environment observation.

_modify_observation(obs)[source]

This method can be overridden to edit the created observation. This is done after the reward and absorbing functions are evaluated. Especially useful to transform the observation into different frames. If the original observation order is not preserved, the helper functions in ObervationHelper breaks.

Parameters:: obs (np.ndarray, torch.tensor) – the generated observation
Returns:: The environment observation.

_create_info_dictionary(obs)[source]

This method can be overridden to create a custom info dictionary.

Parameters:: obs (np.ndarray, torch.tensor) – the generated observation
Returns:: The information dictionary.

_preprocess_action(action)[source]

Compute a transformation of the action provided to the environment.

Parameters:: action (np.ndarray, torch.tensor) – the actions provided to the environment.
Returns:: The action to be used for the current step

_compute_action(action)[source]

Compute a transformation of the action at every intermediate step. Useful to add control signals simulated directly in python.

Parameters:

obs (np.ndarray, torch.tensor) – current state of the simulation;
action (np.ndarray, torch.tensor) – the actions, provided at every step.

Returns:

The action to be applied in the isaac sim simulation

reward(obs, action, next_obs, absorbing)[source]

Compute the rewards based on the given transitions.

Parameters:

obs (torch.tensor, np.array) – the current states of the parallel environments.
action (torch.tensor, np.array) – the actions that are applied in the current states.
next_obs (torch.tensor, np.array) – the states reached after applying the given actions.
absorbing (torch.tensor, np.array) – whether next_state is an absorbing state or not.

Returns:

The rewards as an array or tensor.

class SilverBadgerWalking(num_envs, horizon, headless, domain_randomization=True, camera_pos=(105, 0, 4), camera_target=(95, 0, 0))[source]

Bases: HoneyBadgerWalking

A learning environment for training the Silver Badger quadroped to walk. Silver Badger is a Robot from MAB Robotics: https://www.mabrobotics.pl/

__init__(num_envs, horizon, headless, domain_randomization=True, camera_pos=(105, 0, 4), camera_target=(95, 0, 0))[source]

Constructor.

Parameters:

usd_path (str) – Path to usd file of the robot.
actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent.
observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). They are combined with a path, which is used to access the prim, and a list or a single string with name of the subelements of prim which should be accessed. For example a subbody or a joint of an ArticulationView. An entry in the list is given by: (key, name, type, element). The name can later be used to retrieve specific observations.
backend (str) – Backend for array operations.
device (str) – Compute device (e.g., ‘cuda:0’).
collision_between_envs (bool) – Whether inter-environment collisions are allowed.
num_envs (int) – Number of parallel environments.
env_spacing (float) – Distance between environments.
gamma (float) – The discounting factor of the environment.
horizon (int) – The maximum horizon for the environment.
timestep (float, None) – Simulation timestep.
n_substeps (int, None) – Number of substeps per simulation step. If None default timestep of isaac sim is used
n_intermediate_steps (int) – Number of intermediate control steps. Defaults to 1.
additional_data_spec (list, None) – A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, path, type) key is a string for later referencing in the “read_data” and “write_data” methods.
collision_groups (dict, None) – A list containing groups of prims for which collisions should be checked during simulation. The entries are given as (key, prim_paths), where key is a string for later reference and prim_paths is a list of paths to the prims.
action_type (ActionType) – Control type of the joints (effort, position, velocity).
physics_material_spec (list, None) – A list containing all data to create a custom physics material for each environment, which will be applied to all rigidbodies. The entries are given as the following tuples: (name, dynamic_friction, static_friction, restitution)
sim_params (dict) – Dictionary of simulation parameters for the physics context. Intended to set gpu_collision_stack_size, gpu_found_lost_aggregate_pairs_capacity, gpu_found_lost_pairs_capacity, gpu_heap_capacity, gpu_max_num_partitions, gpu_max_particle_contacts, gpu_max_rigid_contact_count, gpu_max_rigid_patch_count, gpu_max_soft_body_contacts, gpu_temp_buffer_capacity, gpu_total_aggregate_pairs_capacity.
solver_pos_it_count (torch, array) – An array with the same size as num_envs. Determines how accurately contacts, drives, and limits are resolved. Low values can lead to performance improvement;
solver_vel_it_count (torch, array) – An array with the same size as num_envs. Determines how accurately contacts, drives, and limits are resolved. Low values can lead to performance improvement
ground_plane_friction (tuple, None) – A tuple containing the static friciton, dynamic friction and restitution for the groundplane. The tuple should have the following format: (static_friction, dynamic_friction, restitution)
camera_position (tuple) – The position where the camera is placed.
camera_target (tuple) – The position the camera is aimed at.
headless (bool) – Whether to run in headless mode.
render_product_size (tuple) – (Width, Height) of the recorded and displayed image.

Omni Isaac Gym

class OmniIsaacGymEnv(cfg=None, headless=False, backend='torch')[source]

Bases: VectorizedEnvironment

Interface for OmniIsaacGymEnvs environments. It makes it possible to use every OmniIsaacGymEnvs environment just providing the task.

__init__(cfg=None, headless=False, backend='torch')[source]

Initializes RL and task parameters.

Parameters:

cfg (dict) – dictionary containing the parameters required to build the task;
headless (bool) – Whether to run training headless;
backend (str, 'torch') – The backend to be used by the environment.

seed(seed=-1)[source]

Set the seed of the environment.

Parameters:: seed (float) – the value of the seed.

reset_all(env_mask, state=None)[source]

Reset all the specified environments to the initial state.

Parameters:

env_mask – mask specifying which environments needs reset.
state – set of initial states to impose to the environment.

Returns:

The initial states of all environments and a listy of episode info dictionaries

step_all(env_mask, action)[source]

Move all the specified agents from their current state according to the actions.

Parameters:

env_mask – mask specifying which environments needs to do a step.
action – set of actions to execute.

Returns:

The initial states of all environments and a listy of step info dictionaries

render_all(env_mask, record=False)[source]

Render all the specified environments to screen.

Parameters:: record (bool, False) – whether the visualized images should be returned or not.
Returns:: The visualized images, or None if the record flag is set to false.

stop()[source]: Method used to stop an env. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Generators

Grid world

generate_grid_world(grid, prob, pos_rew, neg_rew, gamma=0.9, horizon=100)[source]

This Grid World generator requires a .txt file to specify the shape of the grid world and the cells. There are five types of cells: ‘S’ is the starting position where the agent is; ‘G’ is the goal state; ‘.’ is a normal cell; ‘*’ is a hole, when the agent steps on a hole, it receives a negative reward and the episode ends; ‘#’ is a wall, when the agent is supposed to step on a wall, it actually remains in its current state. The initial states distribution is uniform among all the initial states provided.

The grid is expected to be rectangular.

Parameters:

grid (str) – the path of the file containing the grid structure;
prob (float) – probability of success of an action;
pos_rew (float) – reward obtained in goal states;
neg_rew (float) – reward obtained in “hole” states;
gamma (float, .9) – discount factor;
horizon (int, 100) – the horizon.

Returns:

A FiniteMDP object built with the provided parameters.

parse_grid(grid)[source]

Parse the grid file:

Parameters:: grid (str) – the path of the file containing the grid structure;
Returns:: A list containing the grid structure.

compute_probabilities(grid_map, cell_list, prob)[source]

Compute the transition probability matrix.

Parameters:

grid_map (list) – list containing the grid structure;
cell_list (list) – list of non-wall cells;
prob (float) – probability of success of an action.

Returns:

The transition probability matrix;

compute_reward(grid_map, cell_list, pos_rew, neg_rew)[source]

Compute the reward matrix.

Parameters:

grid_map (list) – list containing the grid structure;
cell_list (list) – list of non-wall cells;
pos_rew (float) – reward obtained in goal states;
neg_rew (float) – reward obtained in “hole” states;

Returns:

The reward matrix.

compute_mu(grid_map, cell_list)[source]

Compute the initial states distribution.

Parameters:

grid_map (list) – list containing the grid structure;
cell_list (list) – list of non-wall cells.

Returns:

The initial states distribution.

Simple chain

generate_simple_chain(state_n, goal_states, prob, rew, mu=None, gamma=0.9, horizon=100)[source]

Simple chain generator.

Parameters:

state_n (int) – number of states;
goal_states (list) – list of goal states;
prob (float) – probability of success of an action;
rew (float) – reward obtained in goal states;
mu (np.ndarray) – initial state probability distribution;
gamma (float, .9) – discount factor;
horizon (int, 100) – the horizon.

Returns:

A FiniteMDP object built with the provided parameters.

compute_probabilities(state_n, prob)[source]

Compute the transition probability matrix.

Parameters:

state_n (int) – number of states;
prob (float) – probability of success of an action.

Returns:

The transition probability matrix;

compute_reward(state_n, goal_states, rew)[source]

Compute the reward matrix.

Parameters:

state_n (int) – number of states;
goal_states (list) – list of goal states;
rew (float) – reward obtained in goal states.

Returns:

The reward matrix.

Taxi

generate_taxi(grid, prob=0.9, rew=(0, 1, 3, 15), gamma=0.99, horizon=inf)[source]

This Taxi generator requires a .txt file to specify the shape of the grid world and the cells. There are five types of cells: ‘S’ is the starting where the agent is; ‘G’ is the goal state; ‘.’ is a normal cell; ‘F’ is a passenger, when the agent steps on a hole, it picks up it. ‘#’ is a wall, when the agent is supposed to step on a wall, it actually remains in its current state. The initial states distribution is uniform among all the initial states provided. The episode terminates when the agent reaches the goal state. The reward is always 0, except for the goal state where it depends on the number of collected passengers. Each action has a certain probability of success and, if it fails, the agent goes in a perpendicular direction from the supposed one.

The grid is expected to be rectangular.

This problem is inspired from: “Bayesian Q-Learning”. Dearden R. et al.. 1998.

Parameters:

grid (str) – the path of the file containing the grid structure;
prob (float, .9) – probability of success of an action;
rew (tuple, (0, 1, 3, 15)) – rewards obtained in goal states;
gamma (float, .99) – discount factor;
horizon (int, np.inf) – the horizon.

Returns:

A FiniteMDP object built with the provided parameters.

parse_grid(grid)[source]

Parse the grid file:

Parameters:: grid (str) – the path of the file containing the grid structure.
Returns:: A list containing the grid structure.

compute_probabilities(grid_map, cell_list, passenger_list, prob)[source]

Compute the transition probability matrix.

Parameters:

grid_map (list) – list containing the grid structure;
cell_list (list) – list of non-wall cells;
passenger_list (list) – list of passenger cells;
prob (float) – probability of success of an action.

Returns:

The transition probability matrix;

compute_reward(grid_map, cell_list, passenger_list, rew)[source]

Compute the reward matrix.

Parameters:

grid_map (list) – list containing the grid structure;
cell_list (list) – list of non-wall cells;
passenger_list (list) – list of passenger cells;
rew (tuple) – rewards obtained in goal states.

Returns:

The reward matrix.

compute_mu(grid_map, cell_list, passenger_list)[source]

Compute the initial states distribution.

Parameters:

grid_map (list) – list containing the grid structure;
cell_list (list) – list of non-wall cells;
passenger_list (list) – list of passenger cells.

Returns:

The initial states distribution.