# Environments

In mushroom_rl we distinguish between two different types of environment classes:

• proper environments
• generators

While environments directly implement the Environment interface, generators are a set of methods used to generate finite Markov chains that represent a specific environment, e.g., grid worlds.
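To make the idea of a generator concrete, here is a minimal sketch of a function that builds the transition probabilities of a small deterministic grid. The function name `generate_grid_transitions`, the action order, and the nested-list layout `p[s][a][s']` are assumptions made for illustration, not the library's own generator.

```python
def generate_grid_transitions(height, width):
    """Build p[s][a][s'] for a deterministic grid with four actions."""
    n_states = height * width
    # Assumed action order: up, down, left, right.
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    p = [[[0.0] * n_states for _ in moves] for _ in range(n_states)]
    for row in range(height):
        for col in range(width):
            s = row * width + col
            for a, (d_row, d_col) in enumerate(moves):
                # A move that would leave the grid keeps the agent in place.
                new_row = min(max(row + d_row, 0), height - 1)
                new_col = min(max(col + d_col, 0), width - 1)
                p[s][a][new_row * width + new_col] = 1.0
    return p


p = generate_grid_transitions(height=2, width=3)
```

A table of this shape, together with a reward table of the same layout, is the kind of input a finite MDP class could consume.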

## Environments

### Atari

class MaxAndSkip(env, skip, max_pooling=True)[source]

Bases: gym.core.Wrapper

__init__(env, skip, max_pooling=True)[source]

Initialize self. See help(type(self)) for accurate signature.

step(action)[source]

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters: action (object) – an action provided by the agent.
Returns: observation (object) – agent’s observation of the current environment; reward (float) – amount of reward returned after previous action; done (bool) – whether the episode has ended, in which case further step() calls will return undefined results; info (dict) – contains auxiliary diagnostic information (helpful for debugging, logging, and sometimes learning).
reset(**kwargs)[source]

Resets the environment to an initial state and returns an initial observation.

This method should also reset the environment’s random number generator(s) if seed is an integer or if the environment has not yet initialized a random number generator. If the environment already has a random number generator and reset is called with seed=None, the RNG should not be reset. Moreover, reset should (in the typical use case) be called with an integer seed right after initialization and then never again.

Returns: observation (object) – the initial observation; info (optional dictionary) – a dictionary containing extra information, only returned if return_info is set to true.
close()

Override close in your subclass to perform any necessary cleanup.

Environments will automatically close() themselves when garbage collected or when the program exits.

metadata

dict() -> new empty dictionary

dict(mapping) -> new dictionary initialized from a mapping object’s (key, value) pairs

dict(iterable) -> new dictionary initialized as if via:

    d = {}
    for k, v in iterable:
        d[k] = v

dict(**kwargs) -> new dictionary initialized with the name=value pairs in the keyword argument list. For example: dict(one=1, two=2)
np_random

Initializes the np_random field if not done already.

render(mode='human', **kwargs)

Renders the environment.

The set of supported modes varies per environment. (And some third-party environments may not support rendering at all.) By convention, if mode is:

• human: render to the current display or terminal and return nothing. Usually for human consumption.
• rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
• ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).

Note

Make sure that your class’s metadata ‘render.modes’ key includes the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
Parameters: mode (str) – the mode to render with

Example:

    class MyEnv(Env):

        def render(self, mode='human'):
            if mode == 'rgb_array':
                return np.array(...)  # return RGB frame suitable for video
            elif mode == 'human':
                ...  # pop up a window and render
            else:
                super(MyEnv, self).render(mode=mode)  # just raise an exception
reward_range

Built-in immutable sequence.

If no argument is given, the constructor returns an empty tuple. If iterable is specified the tuple is initialized from iterable’s items.

If the argument is a tuple, the return value is the same object.

seed(seed=None)

Sets the seed for this env’s random number generator(s).

Note

Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.

Returns: the list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
Return type: list
unwrapped

Completely unwrap this env.

Returns: the base non-wrapped gym.Env instance.
Return type: gym.Env
class Atari(name, width=84, height=84, ends_at_life=False, max_pooling=True, history_length=4, max_no_op_actions=30)[source]

The Atari environment as presented in: “Human-level control through deep reinforcement learning”. Mnih et al.. 2015.

__init__(name, width=84, height=84, ends_at_life=False, max_pooling=True, history_length=4, max_no_op_actions=30)[source]

Constructor.

Parameters: name (str) – id name of the Atari game in Gym; width (int, 84) – width of the screen; height (int, 84) – height of the screen; ends_at_life (bool, False) – whether the episode ends when a life is lost or not; max_pooling (bool, True) – whether to do max-pooling or average-pooling of the last two frames when using NoFrameskip; history_length (int, 4) – number of frames to form a state; max_no_op_actions (int, 30) – maximum number of no-op actions to execute at the beginning of an episode.
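The pooling option above can be illustrated with a short sketch. It assumes that an action is repeated for `skip` frames, that rewards are summed over those frames, and that the last two frames are combined pixel by pixel; the helper `max_and_skip` and the flat-list frame format are hypothetical and chosen only to keep the example self-contained.

```python
def max_and_skip(frames, rewards, skip, max_pooling=True):
    """Collapse `skip` consecutive frames into one observation and one reward."""
    # Only the last two of the skipped frames contribute to the observation.
    last_two = frames[:skip][-2:]
    if max_pooling:
        pooled = [max(pixels) for pixels in zip(*last_two)]
    else:
        pooled = [sum(pixels) / len(last_two) for pixels in zip(*last_two)]
    # Rewards collected during the skipped frames are accumulated.
    return pooled, sum(rewards[:skip])


frames = [[0, 5, 1], [3, 2, 1], [4, 0, 9], [1, 7, 2]]
rewards = [0.0, 1.0, 0.0, 2.0]
obs_max, reward = max_and_skip(frames, rewards, skip=4)
obs_avg, _ = max_and_skip(frames, rewards, skip=4, max_pooling=False)
```

Max-pooling keeps the brightest value of each pixel across the two frames, which is a common way to cope with sprites that flicker between frames.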
reset(state=None)[source]

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: the current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: the state reached by the agent executing action in its current state, the reward obtained in the transition, and a flag signaling whether the next state is absorbing. An additional dictionary (possibly empty) is also returned.
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: the bounded variable.
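Bounding a variable amounts to clipping it into a closed interval. The following is a minimal scalar sketch of that operation; the standalone function `bound` is a hypothetical stand-in and is not claimed to match the library's implementation.

```python
def bound(x, min_value, max_value):
    """Clip x into the closed interval [min_value, max_value]."""
    return max(min_value, min(x, max_value))


clipped_high = bound(7.5, -5.0, 5.0)
clipped_low = bound(-9.0, -5.0, 5.0)
unchanged = bound(1.2, -5.0, 5.0)
```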
info

Returns: an object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element selects the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: an instance of the constructed environment.
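The lookup described for make can be sketched with a small registry. Everything below is a hypothetical re-creation for illustration (the `_registry` attribute, the toy classes, and the environment names are assumptions); it only mirrors the behaviour stated above: split on ‘.’, prefer generate, fall back to the constructor.

```python
class Environment:
    _registry = {}  # hypothetical storage for registered classes

    @classmethod
    def register(cls):
        Environment._registry[cls.__name__] = cls

    @staticmethod
    def make(env_name, *args, **kwargs):
        # 'Name.a.b' selects 'Name' and passes 'a', 'b' as positional args.
        name, *positional = env_name.split('.')
        env_cls = Environment._registry[name]
        # Prefer `generate` when the class defines it, else the constructor.
        factory = getattr(env_cls, 'generate', env_cls)
        return factory(*positional, *args, **kwargs)


class ToyGym(Environment):
    def __init__(self, name, horizon=None):
        self.name, self.horizon = name, horizon


class ToyLQR(Environment):
    def __init__(self, dims):
        self.dims = dims

    @staticmethod
    def generate(dimensions):
        return ToyLQR(int(dimensions))


ToyGym.register()
ToyLQR.register()
gym_env = Environment.make('ToyGym.SomeTask-v0', horizon=200)
lqr_env = Environment.make('ToyLQR', dimensions=2)
```

Note that elements taken from the name arrive as strings, so a factory that expects numbers has to convert them itself.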
classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

set_episode_end(ends_at_life)[source]

Setter.

Parameters: ends_at_life (bool) – whether the episode ends when a life is lost or not.

### Car on hill

class CarOnHill(horizon=100, gamma=0.95)[source]

The Car On Hill environment as presented in: “Tree-Based Batch Mode Reinforcement Learning”. Ernst D. et al.. 2005.

__init__(horizon=100, gamma=0.95)[source]

Constructor.

Parameters: horizon (int, 100) – horizon of the problem; gamma (float, 0.95) – discount factor.
reset(state=None)[source]

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: the current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: the state reached by the agent executing action in its current state, the reward obtained in the transition, and a flag signaling whether the next state is absorbing. An additional dictionary (possibly empty) is also returned.
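The reset/step contract, together with the horizon and gamma parameters, is enough to write an episode loop. The sketch below uses a hypothetical toy environment (`CountdownEnv`) rather than a real one, and assumes that step returns a four-element tuple of state, reward, absorbing flag, and info dictionary.

```python
class CountdownEnv:
    """Toy environment: the state counts down to zero, which is absorbing."""

    def __init__(self, start=3):
        self._start = start
        self._state = start

    def reset(self, state=None):
        self._state = self._start if state is None else state
        return self._state

    def step(self, action):
        self._state -= 1
        absorbing = self._state == 0
        reward = 1.0 if absorbing else 0.0
        return self._state, reward, absorbing, {}


def run_episode(env, horizon, gamma):
    """Return the discounted return of one episode with a fixed action."""
    env.reset()
    discounted_return, discount = 0.0, 1.0
    for _ in range(horizon):
        _, reward, absorbing, _ = env.step(action=0)
        discounted_return += discount * reward
        discount *= gamma
        if absorbing:  # stop as soon as an absorbing state is reached
            break
    return discounted_return


J = run_episode(CountdownEnv(start=3), horizon=100, gamma=0.95)
J_short = run_episode(CountdownEnv(start=3), horizon=2, gamma=0.95)
```

With a horizon shorter than the path to the absorbing state, the episode is truncated before any reward is collected.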
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: the bounded variable.
info

Returns: an object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element selects the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: an instance of the constructed environment.
classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

### DeepMind Control Suite

class DMControl(domain_name, task_name, horizon=None, gamma=0.99, task_kwargs=None, dt=0.01, width_screen=480, height_screen=480, camera_id=0, use_pixels=False, pixels_width=64, pixels_height=64)[source]

Interface for dm_control suite Mujoco environments. It makes it possible to use every dm_control suite Mujoco environment by just providing the necessary information.

__init__(domain_name, task_name, horizon=None, gamma=0.99, task_kwargs=None, dt=0.01, width_screen=480, height_screen=480, camera_id=0, use_pixels=False, pixels_width=64, pixels_height=64)[source]

Constructor.

Parameters: domain_name (str) – name of the environment; task_name (str) – name of the task of the environment; horizon (int) – the horizon; gamma (float) – the discount factor; task_kwargs (dict, None) – parameters of the task; dt (float, 0.01) – duration of a control step; width_screen (int, 480) – width of the screen; height_screen (int, 480) – height of the screen; camera_id (int, 0) – position of camera to render the environment; use_pixels (bool, False) – if True, pixel observations are used rather than the state vector; pixels_width (int, 64) – width of the pixel observation; pixels_height (int, 64) – height of the pixel observation.
reset(state=None)[source]

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: the current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: the state reached by the agent executing action in its current state, the reward obtained in the transition, and a flag signaling whether the next state is absorbing. An additional dictionary (possibly empty) is also returned.
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: the bounded variable.
info

Returns: an object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element selects the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: an instance of the constructed environment.
classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.

### Finite MDP

class FiniteMDP(p, rew, mu=None, gamma=0.9, horizon=inf)[source]

Finite Markov Decision Process.

__init__(p, rew, mu=None, gamma=0.9, horizon=inf)[source]

Constructor.

Parameters: p (np.ndarray) – transition probability matrix; rew (np.ndarray) – reward matrix; mu (np.ndarray, None) – initial state probability distribution; gamma (float, 0.9) – discount factor; horizon (int, np.inf) – the horizon.
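A transition in a finite MDP can be sketched as sampling the next state from the row of p selected by the current state and action. The layout `p[s][a][s']` and `rew[s][a][s']`, the helper name `finite_mdp_step`, and the rule that a state with no outgoing probability mass is absorbing are all assumptions of this sketch.

```python
import random


def finite_mdp_step(p, rew, state, action, rng):
    """Sample one transition from tabular dynamics p and rewards rew."""
    probs = p[state][action]
    next_state = rng.choices(range(len(probs)), weights=probs)[0]
    reward = rew[state][action][next_state]
    # Assumption: a state with no outgoing probability mass is absorbing.
    absorbing = not any(prob > 0 for row in p[next_state] for prob in row)
    return next_state, reward, absorbing


# Two states, one action: state 0 always moves to state 1, which is terminal.
p = [[[0.0, 1.0]], [[0.0, 0.0]]]
rew = [[[0.0, 1.0]], [[0.0, 0.0]]]
next_state, reward, absorbing = finite_mdp_step(
    p, rew, state=0, action=0, rng=random.Random(0))
```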
reset(state=None)[source]

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: the current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: the state reached by the agent executing action in its current state, the reward obtained in the transition, and a flag signaling whether the next state is absorbing. An additional dictionary (possibly empty) is also returned.
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: the bounded variable.
info

Returns: an object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element selects the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: an instance of the constructed environment.
classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

### Grid World

class AbstractGridWorld(mdp_info, height, width, start, goal)[source]

Abstract class to build a grid world.

__init__(mdp_info, height, width, start, goal)[source]

Constructor.

Parameters: height (int) – height of the grid; width (int) – width of the grid; start (tuple) – x-y coordinates of the start; goal (tuple) – x-y coordinates of the goal.
reset(state=None)[source]

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: the current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: the state reached by the agent executing action in its current state, the reward obtained in the transition, and a flag signaling whether the next state is absorbing. An additional dictionary (possibly empty) is also returned.
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: the bounded variable.
info

Returns: an object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element selects the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: an instance of the constructed environment.
classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

class GridWorld(height, width, goal, start=(0, 0))[source]

Standard grid world.

__init__(height, width, goal, start=(0, 0))[source]

Constructor.

Parameters: height (int) – height of the grid; width (int) – width of the grid; start (tuple) – x-y coordinates of the start; goal (tuple) – x-y coordinates of the goal.
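How start and goal interact with height and width can be shown with a short walk across a grid. The action encoding (0 up, 1 down, 2 left, 3 right), the unit reward at the goal, and the helper `grid_step` are assumptions made for this sketch and may differ from the actual class.

```python
def grid_step(state, action, height, width, goal):
    """Move one cell on the grid; reaching the goal ends the episode."""
    row, col = state
    # Assumed action encoding: 0 up, 1 down, 2 left, 3 right.
    d_row, d_col = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
    row = min(max(row + d_row, 0), height - 1)
    col = min(max(col + d_col, 0), width - 1)
    next_state = (row, col)
    absorbing = next_state == goal
    reward = 1.0 if absorbing else 0.0  # illustrative reward values
    return next_state, reward, absorbing


state = (2, 0)  # start in the bottom-left corner of a 3x3 grid
trajectory = [state]
for action in [0, 0, 3, 3]:
    state, reward, absorbing = grid_step(state, action, 3, 3, goal=(0, 2))
    trajectory.append(state)
```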
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: the bounded variable.
info

Returns: an object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element selects the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: an instance of the constructed environment.
classmethod register()

Register an environment in the environment list.

reset(state=None)

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: the current state.
seed(seed)

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.
step(action)

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: the state reached by the agent executing action in its current state, the reward obtained in the transition, and a flag signaling whether the next state is absorbing. An additional dictionary (possibly empty) is also returned.
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

class GridWorldVanHasselt(height=3, width=3, goal=(0, 2), start=(2, 0))[source]

A variant of the grid world as presented in: “Double Q-Learning”. Hasselt H. V.. 2010.

__init__(height=3, width=3, goal=(0, 2), start=(2, 0))[source]

Constructor.

Parameters: height (int) – height of the grid; width (int) – width of the grid; start (tuple) – x-y coordinates of the start; goal (tuple) – x-y coordinates of the goal.
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: the bounded variable.
info

Returns: an object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element selects the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: an instance of the constructed environment.
classmethod register()

Register an environment in the environment list.

reset(state=None)

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: the current state.
seed(seed)

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.
step(action)

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: the state reached by the agent executing action in its current state, the reward obtained in the transition, and a flag signaling whether the next state is absorbing. An additional dictionary (possibly empty) is also returned.
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

### Gym

class Gym(name, horizon=None, gamma=0.99, wrappers=None, wrappers_args=None, **env_args)[source]

Interface for OpenAI Gym environments. It makes it possible to use every Gym environment by just providing the id, except for the Atari games, which are managed in a separate class.

__init__(name, horizon=None, gamma=0.99, wrappers=None, wrappers_args=None, **env_args)[source]

Constructor.

Parameters: name (str) – gym id of the environment; horizon (int) – the horizon. If None, use the one from Gym; gamma (float, 0.99) – the discount factor; wrappers – list of wrappers to apply over the environment. It is possible to pass arguments to the wrappers by providing a tuple with two elements: the gym wrapper class and a dictionary containing the parameters needed by the wrapper constructor.
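The two accepted forms of a wrappers entry, a bare class or a (class, dict) tuple, can be handled as in the sketch below. The function `apply_wrappers` and the toy wrapper classes are hypothetical and avoid any dependency on Gym itself.

```python
def apply_wrappers(env, wrappers):
    """Wrap env with each entry, which is a class or a (class, kwargs) tuple."""
    for wrapper in wrappers or []:
        if isinstance(wrapper, tuple):
            wrapper_cls, wrapper_kwargs = wrapper
            env = wrapper_cls(env, **wrapper_kwargs)
        else:
            env = wrapper(env)
    return env


class BaseEnv:
    def reward(self):
        return 1.0


class ScaleReward:  # toy wrapper that takes a constructor argument
    def __init__(self, env, factor):
        self.env, self.factor = env, factor

    def reward(self):
        return self.factor * self.env.reward()


class NegateReward:  # toy wrapper without arguments
    def __init__(self, env):
        self.env = env

    def reward(self):
        return -self.env.reward()


wrapped = apply_wrappers(BaseEnv(), [(ScaleReward, {'factor': 3.0}), NegateReward])
```

Wrappers are applied in list order, so the last entry ends up outermost.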
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: the bounded variable.
info

Returns: an object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element selects the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: an instance of the constructed environment.
classmethod register()

Register an environment in the environment list.

reset(state=None)[source]

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: the current state.
seed(seed)

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.
step(action)[source]

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: the state reached by the agent executing action in its current state, the reward obtained in the transition, and a flag signaling whether the next state is absorbing. An additional dictionary (possibly empty) is also returned.
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

### Inverted pendulum

class InvertedPendulum(random_start=False, m=1.0, l=1.0, g=9.8, mu=0.01, max_u=5.0, horizon=5000, gamma=0.99)[source]

The Inverted Pendulum environment (continuous version) as presented in: “Reinforcement Learning In Continuous Time and Space”. Doya K.. 2000. “Off-Policy Actor-Critic”. Degris T. et al.. 2012. “Deterministic Policy Gradient Algorithms”. Silver D. et al. 2014.

__init__(random_start=False, m=1.0, l=1.0, g=9.8, mu=0.01, max_u=5.0, horizon=5000, gamma=0.99)[source]

Constructor.

Parameters: random_start (bool, False) – whether to start from a random position or from the horizontal one; m (float, 1.0) – mass of the pendulum; l (float, 1.0) – length of the pendulum; g (float, 9.8) – gravity acceleration constant; mu (float, 1e-2) – friction constant of the pendulum; max_u (float, 5.0) – maximum allowed input torque; horizon (int, 5000) – horizon of the problem; gamma (float, 0.99) – discount factor.
reset(state=None)[source]

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: the current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: the state reached by the agent executing action in its current state, the reward obtained in the transition, and a flag signaling whether the next state is absorbing. An additional dictionary (possibly empty) is also returned.
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: the bounded variable.
info

Returns: an object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element selects the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: an instance of the constructed environment.
classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.

### Cart Pole

class CartPole(m=2.0, M=8.0, l=0.5, g=9.8, mu=0.01, max_u=50.0, noise_u=10.0, horizon=3000, gamma=0.95)[source]

The Inverted Pendulum on a Cart environment as presented in: “Least-Squares Policy Iteration”. Lagoudakis M. G. and Parr R.. 2003.

__init__(m=2.0, M=8.0, l=0.5, g=9.8, mu=0.01, max_u=50.0, noise_u=10.0, horizon=3000, gamma=0.95)[source]

Constructor.

Parameters: m (float, 2.0) – mass of the pendulum; M (float, 8.0) – mass of the cart; l (float, 0.5) – length of the pendulum; g (float, 9.8) – gravity acceleration constant; max_u (float, 50.) – maximum allowed input torque; noise_u (float, 10.) – maximum noise on the action; horizon (int, 3000) – horizon of the problem; gamma (float, 0.95) – discount factor.
reset(state=None)[source]

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: the current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: the state reached by the agent executing action in its current state, the reward obtained in the transition, and a flag signaling whether the next state is absorbing. An additional dictionary (possibly empty) is also returned.
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: the bounded variable.
info

Returns: an object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element selects the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: an instance of the constructed environment.
classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.

### LQR

class LQR(A, B, Q, R, max_pos=inf, max_action=inf, random_init=False, episodic=False, gamma=0.9, horizon=50, initial_state=None)[source]

This class implements a Linear-Quadratic Regulator. This task aims to minimize the undesired deviations from nominal values of some controller settings in control problems. The system equations in this task are:

$x_{t+1} = Ax_t + Bu_t$

where x is the state and u is the control signal.

The reward function is given by:

$r_t = -\left( x_t^TQx_t + u_t^TRu_t \right)$
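The two equations above translate directly into a single transition. The sketch below uses plain nested lists so that it has no dependencies; the helper names (`mat_vec`, `quad_form`, `lqr_step`) and the example matrices are illustrative assumptions, not values taken from the library.

```python
def mat_vec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def quad_form(v, M):
    """Quadratic form v^T M v."""
    return sum(v_i * w_i for v_i, w_i in zip(v, mat_vec(M, v)))


def lqr_step(A, B, Q, R, x, u):
    """Apply x_{t+1} = A x_t + B u_t and r_t = -(x_t^T Q x_t + u_t^T R u_t)."""
    reward = -(quad_form(x, Q) + quad_form(u, R))
    next_x = [ax + bu for ax, bu in zip(mat_vec(A, x), mat_vec(B, u))]
    return next_x, reward


A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
Q = [[0.9, 0.0], [0.0, 0.1]]
R = [[0.1, 0.0], [0.0, 0.9]]
next_x, r = lqr_step(A, B, Q, R, x=[1.0, 2.0], u=[-0.5, 0.0])
```

The reward is computed from the state and control at time t, before the dynamics are applied, which matches the indices in the reward equation.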

“Policy gradient approaches for multi-objective sequential decision making”. Parisi S., Pirotta M., Smacchia N., Bascetta L., Restelli M.. 2014

__init__(A, B, Q, R, max_pos=inf, max_action=inf, random_init=False, episodic=False, gamma=0.9, horizon=50, initial_state=None)[source]

Constructor.

Parameters: A (np.ndarray) – the state dynamics matrix; B (np.ndarray) – the action dynamics matrix; Q (np.ndarray) – reward weight matrix for state; R (np.ndarray) – reward weight matrix for action; max_pos (float, np.inf) – maximum value of the state; max_action (float, np.inf) – maximum value of the action; random_init (bool, False) – start from a random state; episodic (bool, False) – end the episode when the state goes over the threshold; gamma (float, 0.9) – discount factor; horizon (int, 50) – horizon of the mdp.
static generate(dimensions=None, s_dim=None, a_dim=None, max_pos=inf, max_action=inf, eps=0.1, index=0, scale=1.0, random_init=False, episodic=False, gamma=0.9, horizon=50, initial_state=None)[source]

Factory method that generates an LQR with identity dynamics and symmetric reward matrices.

Parameters: dimensions (int) – number of state-action dimensions; s_dim (int) – number of state dimensions; a_dim (int) – number of action dimensions; max_pos (float, np.inf) – maximum value of the state; max_action (float, np.inf) – maximum value of the action; eps (double, 0.1) – reward matrix weights specifier; index (int, 0) – selector for the principal state; scale (float, 1.0) – scaling factor for the reward function; random_init (bool, False) – start from a random state; episodic (bool, False) – end the episode when the state goes over the threshold; gamma (float, 0.9) – discount factor; horizon (int, 50) – horizon of the mdp.
reset(state=None)[source]

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. An additional dictionary (possibly empty) is also returned.
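To illustrate how the four values returned by step are typically consumed, here is a minimal episode loop. ToyEnvironment is a stand-in written for this sketch, not a mushroom_rl class; it only mimics the reset/step interface described above.

```python
class ToyEnvironment:
    """Stand-in environment: a counter that becomes absorbing at 3."""

    def reset(self, state=None):
        self._state = 0 if state is None else state
        return self._state

    def step(self, action):
        self._state += action
        reward = -1.0
        absorbing = self._state >= 3
        # Same return shape as described above: state, reward, flag, info dict.
        return self._state, reward, absorbing, {}


def rollout(env, gamma, horizon):
    """Run one episode with a fixed action and return the discounted return."""
    env.reset()
    discounted_return = 0.0
    for t in range(horizon):
        state, reward, absorbing, info = env.step(1)
        discounted_return += gamma ** t * reward
        if absorbing:
            # Stop as soon as an absorbing state is reached.
            break
    return discounted_return


result = rollout(ToyEnvironment(), gamma=0.9, horizon=50)
print(result)  # approximately -2.71
```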
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: The bounded variable.
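For a scalar, bounding a variable amounts to clipping it to a closed interval. The function below is a sketch of that idea, not the library's code; the actual method may operate on arrays.

```python
def bound(x, min_value, max_value):
    """Clip the scalar x to the closed interval [min_value, max_value]."""
    return max(min_value, min(x, max_value))


print(bound(5.0, -1.0, 1.0))   # 1.0
print(bound(-5.0, -1.0, 1.0))  # -1.0
print(bound(0.3, -1.0, 1.0))   # 0.3
```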
info

Returns: An object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: An instance of the constructed environment.
classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.
stop()

Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

### Mujoco¶

class ObservationType[source]

Bases: enum.Enum

An enum indicating the type of data that should be added to the observation of the environment. It can be joint, body, or site positions and velocities.

class MuJoCo(file_name, actuation_spec, observation_spec, gamma, horizon, n_substeps=1, n_intermediate_steps=1, additional_data_spec=None, collision_groups=None)[source]

Class to create a Mushroom environment using the MuJoCo simulator.

__init__(file_name, actuation_spec, observation_spec, gamma, horizon, n_substeps=1, n_intermediate_steps=1, additional_data_spec=None, collision_groups=None)[source]

Constructor.

Parameters: file_name (string) – The path to the XML file with which the environment should be created; actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent. Can be left empty when all actuators should be used; observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). An entry in the list is given by: (name, type); gamma (float) – The discounting factor of the environment; horizon (int) – The maximum horizon for the environment; n_substeps (int) – The number of substeps to use by the MuJoCo simulator. An action given by the agent will be applied for n_substeps before the agent receives the next observation and can act accordingly; n_intermediate_steps (int) – The number of steps between every action taken by the agent. Similar to n_substeps, but allows the user to modify, control and access intermediate states; additional_data_spec (list) – A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as tuples (key, name, type), where key is a string for later referencing in the “read_data” and “write_data” methods, name is the name of the object in the XML specification, and type is the ObservationType; collision_groups (list) – A list containing groups of geoms for which collisions should be checked during simulation via check_collision. The entries are given as (key, geom_names), where key is a string for later referencing in the “check_collision” method, and geom_names is a list of geom names in the XML specification.
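The tuple layouts described above can be pictured with a small sketch. Everything here is illustrative: ObservationTypeSketch is a stand-in enum whose member names are assumptions, and the joint, body and geom names are hypothetical and would have to match objects defined in the XML file.

```python
from enum import Enum


class ObservationTypeSketch(Enum):
    """Stand-in for ObservationType; the member names are assumptions."""
    BODY_POS = 0
    BODY_VEL = 1
    JOINT_POS = 2
    JOINT_VEL = 3
    SITE_POS = 4
    SITE_VEL = 5


# Hypothetical names of the controllable joints.
actuation_spec = ["hip_motor", "knee_motor"]

# Entries are (name, type).
observation_spec = [
    ("hip", ObservationTypeSketch.JOINT_POS),
    ("hip", ObservationTypeSketch.JOINT_VEL),
]

# Entries are (key, name, type); the key is used later to read or write the data.
additional_data_spec = [
    ("torso_pos", "torso", ObservationTypeSketch.BODY_POS),
]

# Entries are (key, geom_names); the key is used later for collision checks.
collision_groups = [
    ("floor", ["floor_geom"]),
    ("feet", ["left_foot_geom", "right_foot_geom"]),
]
```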
seed(seed)[source]

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.
reset(state=None)[source]

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. An additional dictionary (possibly empty) is also returned.
stop()[source]

Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

_preprocess_action(action)[source]

Compute a transformation of the action provided to the environment.

Parameters: action (np.ndarray) – numpy array with the actions provided to the environment.
Returns: The action to be used for the current step.
_step_init(state, action)[source]

Allows information to be initialized at the start of a step.

_compute_action(action)[source]

Compute a transformation of the action at every intermediate step. Useful to add control signals simulated directly in python.

Parameters: action (np.ndarray) – numpy array with the actions provided at every step.
Returns: The action to be set in the actual MuJoCo simulation.
_simulation_pre_step()[source]
Allows information to be accessed and changed at every intermediate step before taking a step in the MuJoCo simulation. Can be useful to apply an external force/torque to the specified bodies.

Example: apply a force along X to the torso:

    # 3D force followed by 3D torque, concatenated into one 6D vector
    force = [200, 0, 0]
    torque = [0, 0, 0]
    self.sim.data.xfrc_applied[self.sim.model._body_name2id["torso"], :] = force + torque
_simulation_post_step()[source]
Allows information to be accessed at every intermediate step after taking a step in the MuJoCo simulation. Can be useful to average forces over all intermediate steps.
_step_finalize()[source]

Allows information to be accessed at the end of a step.

_read_data(name)[source]

Read data from the MuJoCo data structure.

Parameters: name (string) – A name referring to an entry contained in the additional_data_spec list handed to the constructor.
Returns: The desired data as a one-dimensional numpy array.
_write_data(name, value)[source]

Write data to the MuJoCo data structure.

Parameters: name (string) – A name referring to an entry contained in the additional_data_spec list handed to the constructor; value (ndarray) – The data that should be written.
_check_collision(group1, group2)[source]

Check for collision between the specified groups.

Parameters: group1 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor; group2 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor.
Returns: A flag indicating whether a collision occurred between the given groups or not.
_get_collision_force(group1, group2)[source]

Returns the collision force and torques between the specified groups.

Parameters: group1 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor; group2 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor.
Returns: A 6D vector specifying the collision forces/torques (3D force + 3D torque) between the given groups. A vector of zeros is returned if there was no collision. See http://mujoco.org/book/programming.html#siContact
_reward(state, action, next_state)[source]

Compute the reward based on the given transition.

Parameters: state (np.array) – the current state of the system; action (np.array) – the action that is applied in the current state; next_state (np.array) – the state reached after applying the given action.
Returns: The reward as a floating point scalar value.
_is_absorbing(state)[source]

Check whether the given state is an absorbing state or not.

Parameters: state (np.array) – the state of the system.
Returns: A boolean flag indicating whether this state is absorbing or not.
_setup()[source]

A function that allows setup code to be executed after an environment reset.

_load_simulation(file_name, n_substeps)[source]

Parameters: file_name – The path to the XML file with which the environment should be created.
Returns: The loaded MuJoCo model.
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: The bounded variable.
info

Returns: An object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: An instance of the constructed environment.
classmethod register()

Register an environment in the environment list.

### Puddle World¶

class PuddleWorld(start=None, goal=None, goal_threshold=0.1, noise_step=0.025, noise_reward=0, reward_goal=0.0, thrust=0.05, puddle_center=None, puddle_width=None, gamma=0.99, horizon=5000)[source]

Puddle world as presented in: “Off-Policy Actor-Critic”. Degris T. et al.. 2012.

__init__(start=None, goal=None, goal_threshold=0.1, noise_step=0.025, noise_reward=0, reward_goal=0.0, thrust=0.05, puddle_center=None, puddle_width=None, gamma=0.99, horizon=5000)[source]

Constructor.

Parameters: start (np.array, None) – starting position of the agent; goal (np.array, None) – goal position; goal_threshold (float, 0.1) – distance threshold of the agent from the goal to consider it reached; noise_step (float, 0.025) – noise in actions; noise_reward (float, 0) – standard deviation of Gaussian noise in reward; reward_goal (float, 0.0) – reward obtained reaching goal state; thrust (float, 0.05) – distance walked during each action; puddle_center (np.array, None) – center of the puddle; puddle_width (np.array, None) – width of the puddle; gamma (float, 0.99) – discount factor; horizon (int, 5000) – horizon of the problem.
reset(state=None)[source]

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. An additional dictionary (possibly empty) is also returned.
stop()[source]

Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: The bounded variable.
info

Returns: An object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: An instance of the constructed environment.
classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.

### Segway¶

class Segway(random_start=False)[source]

The Segway environment (continuous version) as presented in: “Deep Learning for Actor-Critic Reinforcement Learning”. Xueli Jia. 2015.

__init__(random_start=False)[source]

Constructor.

Parameters: random_start (bool, False) – whether to start from a random position or from the horizontal one.
reset(state=None)[source]

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. An additional dictionary (possibly empty) is also returned.
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: The bounded variable.
info

Returns: An object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: An instance of the constructed environment.
classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.
stop()

Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

### Ship steering¶

class ShipSteering(small=True, n_steps_action=3)[source]

The Ship Steering environment as presented in: “Hierarchical Policy Gradient Algorithms”. Ghavamzadeh M. and Mahadevan S.. 2013.

__init__(small=True, n_steps_action=3)[source]

Constructor.

Parameters: small (bool, True) – whether to use a small state space or not; n_steps_action (int, 3) – number of integration intervals for each step of the mdp.
reset(state=None)[source]

Reset the current state.

Parameters: state (np.ndarray, None) – the state to set to the current state.
Returns: The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters: action (np.ndarray) – the action to execute.
Returns: The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. An additional dictionary (possibly empty) is also returned.
stop()[source]

Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters: x – the variable to bound; min_value – the minimum value; max_value – the maximum value.
Returns: The bounded variable.
info

Returns: An object containing the info of the environment.
static list_registered()

List registered environments.

Returns: The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the remaining elements are passed as positional parameters.

Parameters: env_name (str) – name of the environment; *args – positional arguments to be provided to the environment generator; **kwargs – keyword arguments to be provided to the environment generator.
Returns: An instance of the constructed environment.
classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters: seed (float) – the value of the seed.

## Generators¶

### Grid world¶

generate_grid_world(grid, prob, pos_rew, neg_rew, gamma=0.9, horizon=100)[source]

This Grid World generator requires a .txt file to specify the shape of the grid world and the cells. There are five types of cells: ‘S’ is the starting position where the agent is; ‘G’ is the goal state; ‘.’ is a normal cell; ‘*’ is a hole: when the agent steps on a hole, it receives a negative reward and the episode ends; ‘#’ is a wall: when the agent attempts to step on a wall, it remains in its current state. The initial state distribution is uniform among all the initial states provided.

The grid is expected to be rectangular.

Parameters: grid (str) – the path of the file containing the grid structure; prob (float) – probability of success of an action; pos_rew (float) – reward obtained in goal states; neg_rew (float) – reward obtained in “hole” states; gamma (float, 0.9) – discount factor; horizon (int, 100) – the horizon.
Returns: A FiniteMDP object built with the provided parameters.
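To make the grid file format concrete, the sketch below parses a small grid given as a string rather than a file path, so that it stays self-contained. The function name and its return format (rows of characters plus a list of non-wall (row, column) cells) are assumptions for illustration, not the generator's actual code.

```python
# A 2x4 grid using the cell symbols described above.
GRID_TEXT = """\
S..#
.*.G
"""


def parse_grid_text(text):
    """Return the grid as a list of rows and the list of non-wall cells."""
    grid_map = [list(line) for line in text.splitlines() if line]
    # The grid is expected to be rectangular: all rows share one length.
    if len({len(row) for row in grid_map}) != 1:
        raise ValueError("the grid is expected to be rectangular")
    cell_list = [
        (r, c)
        for r, row in enumerate(grid_map)
        for c, symbol in enumerate(row)
        if symbol != '#'
    ]
    return grid_map, cell_list


grid_map, cell_list = parse_grid_text(GRID_TEXT)
print(len(grid_map), len(grid_map[0]))  # 2 4
print(len(cell_list))                   # 7
```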
parse_grid(grid)[source]

Parse the grid file.

Parameters: grid (str) – the path of the file containing the grid structure.
Returns: A list containing the grid structure.
compute_probabilities(grid_map, cell_list, prob)[source]

Compute the transition probability matrix.

Parameters: grid_map (list) – list containing the grid structure; cell_list (list) – list of non-wall cells; prob (float) – probability of success of an action.
Returns: The transition probability matrix.
compute_reward(grid_map, cell_list, pos_rew, neg_rew)[source]

Compute the reward matrix.

Parameters: grid_map (list) – list containing the grid structure; cell_list (list) – list of non-wall cells; pos_rew (float) – reward obtained in goal states; neg_rew (float) – reward obtained in “hole” states.
Returns: The reward matrix.
compute_mu(grid_map, cell_list)[source]

Compute the initial states distribution.

Parameters: grid_map (list) – list containing the grid structure; cell_list (list) – list of non-wall cells.
Returns: The initial states distribution.
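A uniform distribution over the starting cells can be sketched as below. The function name and the representation of cells as (row, column) tuples are assumptions made for this example.

```python
def uniform_initial_distribution(grid_map, cell_list):
    """Put equal probability on every cell marked 'S' and zero elsewhere."""
    starts = [i for i, (r, c) in enumerate(cell_list) if grid_map[r][c] == 'S']
    mu = [0.0] * len(cell_list)
    for i in starts:
        mu[i] = 1.0 / len(starts)
    return mu


grid_map = [list("S.S"), list("#.G")]
cell_list = [(0, 0), (0, 1), (0, 2), (1, 1), (1, 2)]  # the wall at (1, 0) is excluded
mu = uniform_initial_distribution(grid_map, cell_list)
print(mu)  # [0.5, 0.0, 0.5, 0.0, 0.0]
```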

### Simple chain¶

generate_simple_chain(state_n, goal_states, prob, rew, mu=None, gamma=0.9, horizon=100)[source]

Simple chain generator.

Parameters: state_n (int) – number of states; goal_states (list) – list of goal states; prob (float) – probability of success of an action; rew (float) – reward obtained in goal states; mu (np.ndarray) – initial state probability distribution; gamma (float, 0.9) – discount factor; horizon (int, 100) – the horizon.
Returns: A FiniteMDP object built with the provided parameters.
compute_probabilities(state_n, prob)[source]

Compute the transition probability matrix.

Parameters: state_n (int) – number of states; prob (float) – probability of success of an action.
Returns: The transition probability matrix.
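One plausible shape for such a matrix is sketched below with nested lists indexed as p[state][action][next_state]. The details are assumptions, not taken from the generator's source: two actions (0 moves right, 1 moves left), a failed action leaves the agent in place, and moving past either end of the chain also leaves it in place.

```python
def chain_transition_matrix(state_n, prob):
    """Build p[state][action][next_state] for a chain with two actions."""
    p = [[[0.0] * state_n for _ in range(2)] for _ in range(state_n)]
    for s in range(state_n):
        # Assumed action semantics: action 0 -> +1 (right), action 1 -> -1 (left).
        for action, delta in enumerate((1, -1)):
            target = s + delta
            if 0 <= target < state_n:
                p[s][action][target] = prob
                p[s][action][s] = 1.0 - prob
            else:
                # No cell beyond the end of the chain: stay put.
                p[s][action][s] = 1.0
    return p


p = chain_transition_matrix(state_n=3, prob=0.75)
print(p[0][0])  # [0.25, 0.75, 0.0]
print(p[0][1])  # [1.0, 0.0, 0.0]
```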
compute_reward(state_n, goal_states, rew)[source]

Compute the reward matrix.

Parameters: state_n (int) – number of states; goal_states (list) – list of goal states; rew (float) – reward obtained in goal states.
Returns: The reward matrix.

### Taxi¶

generate_taxi(grid, prob=0.9, rew=(0, 1, 3, 15), gamma=0.99, horizon=inf)[source]

This Taxi generator requires a .txt file to specify the shape of the grid world and the cells. There are five types of cells: ‘S’ is the starting position where the agent is; ‘G’ is the goal state; ‘.’ is a normal cell; ‘F’ is a passenger: when the agent steps on a passenger cell, it picks the passenger up; ‘#’ is a wall: when the agent attempts to step on a wall, it remains in its current state. The initial state distribution is uniform among all the initial states provided. The episode terminates when the agent reaches the goal state. The reward is always 0, except for the goal state, where it depends on the number of collected passengers. Each action has a certain probability of success and, if it fails, the agent moves in a direction perpendicular to the intended one.

The grid is expected to be rectangular.

This problem is inspired by: “Bayesian Q-Learning”. Dearden R. et al.. 1998.

Parameters: grid (str) – the path of the file containing the grid structure; prob (float, 0.9) – probability of success of an action; rew (tuple, (0, 1, 3, 15)) – rewards obtained in goal states; gamma (float, 0.99) – discount factor; horizon (int, np.inf) – the horizon.
Returns: A FiniteMDP object built with the provided parameters.
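The action-failure rule of the Taxi problem can be illustrated with a short sketch that maps an intended direction to the directions actually taken. The function and direction names are hypothetical, and the equal split of the failure probability between the two perpendicular directions is an assumption.

```python
# Directions as (row delta, column delta).
DIRECTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def action_outcomes(action, prob):
    """Map each direction actually taken to its probability.

    Assumption: on failure, the two perpendicular directions are equally likely.
    """
    d_row, d_col = DIRECTIONS[action]
    outcomes = {}
    for name, (r, c) in DIRECTIONS.items():
        if name == action:
            outcomes[name] = prob
        elif r * d_row + c * d_col == 0:  # zero dot product: perpendicular
            outcomes[name] = (1.0 - prob) / 2
    return outcomes


print(action_outcomes("up", prob=0.5))  # {'up': 0.5, 'left': 0.25, 'right': 0.25}
```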
parse_grid(grid)[source]

Parse the grid file.

Parameters: grid (str) – the path of the file containing the grid structure.
Returns: A list containing the grid structure.
compute_probabilities(grid_map, cell_list, passenger_list, prob)[source]

Compute the transition probability matrix.

Parameters: grid_map (list) – list containing the grid structure; cell_list (list) – list of non-wall cells; passenger_list (list) – list of passenger cells; prob (float) – probability of success of an action.
Returns: The transition probability matrix.
compute_reward(grid_map, cell_list, passenger_list, rew)[source]

Compute the reward matrix.

Parameters: grid_map (list) – list containing the grid structure; cell_list (list) – list of non-wall cells; passenger_list (list) – list of passenger cells; rew (tuple) – rewards obtained in goal states.
Returns: The reward matrix.
compute_mu(grid_map, cell_list, passenger_list)[source]

Compute the initial states distribution.

Parameters: grid_map (list) – list containing the grid structure; cell_list (list) – list of non-wall cells; passenger_list (list) – list of passenger cells.
Returns: The initial states distribution.