Environments

class mushroom.environments.environment.MDPInfo(observation_space, action_space, gamma, horizon)[source]

Bases: object

This class is used to store the information of the environment.

__init__(observation_space, action_space, gamma, horizon)[source]

Constructor.

Parameters:
  • observation_space ([Box, Discrete]) – the state space;
  • action_space ([Box, Discrete]) – the action space;
  • gamma (float) – the discount factor;
  • horizon (int) – the horizon.
size

Returns:The sum of the number of discrete states and discrete actions. Only works for discrete spaces.

shape

Returns:The concatenation of the shape tuple of the state and action spaces.
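
A minimal usage sketch of MDPInfo is given below; the import path of the Box and Discrete space classes (mushroom.utils.spaces) is an assumption for illustration:

import numpy as np

from mushroom.environments.environment import MDPInfo
from mushroom.utils.spaces import Box, Discrete  # assumed import path

# Continuous two-dimensional observations, three discrete actions.
observation_space = Box(low=np.array([-1., -1.]), high=np.array([1., 1.]))
action_space = Discrete(3)

mdp_info = MDPInfo(observation_space, action_space, gamma=0.99, horizon=100)
print(mdp_info.shape)  # concatenation of the state and action shape tuples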

Atari

class mushroom.environments.atari.MaxAndSkip(env, skip, max_pooling=True)[source]

Bases: gym.core.Wrapper

__init__(env, skip, max_pooling=True)[source]

Initialize self. See help(type(self)) for accurate signature.

step(action)[source]

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters:action (object) – an action provided by the agent
Returns:
  • observation (object): agent’s observation of the current environment;
  • reward (float): amount of reward returned after the previous action;
  • done (bool): whether the episode has ended, in which case further step() calls will return undefined results;
  • info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning).
reset(**kwargs)[source]

Resets the state of the environment and returns an initial observation.

Returns:the initial observation.
Return type:observation (object)
close()

Override close in your subclass to perform any necessary cleanup.

Environments will automatically close() themselves when garbage collected or when the program exits.

render(mode='human', **kwargs)

Renders the environment.

The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:

  • human: render to the current display or terminal and return nothing. Usually for human consumption.
  • rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
  • ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).

Note

Make sure that your class’s metadata ‘render.modes’ key includes the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
Parameters:mode (str) – the mode to render with

Example:

class MyEnv(Env):
    metadata = {'render.modes': ['human', 'rgb_array']}

    def render(self, mode='human'):
        if mode == 'rgb_array':
            return np.array(...)  # return RGB frame suitable for video
        elif mode == 'human':
            ...  # pop up a window and render
        else:
            super(MyEnv, self).render(mode=mode)  # just raise an exception
seed(seed=None)

Sets the seed for this env’s random number generator(s).

Note

Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.

Returns:the list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
Return type:list<bigint>
unwrapped

Completely unwrap this env.

Returns:The base non-wrapped gym.Env instance
Return type:gym.Env
class mushroom.environments.atari.LazyFrames(frames, history_length)[source]

Bases: object

From OpenAI Baselines. https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py

__init__(frames, history_length)[source]

Initialize self. See help(type(self)) for accurate signature.

class mushroom.environments.atari.Atari(name, width=84, height=84, ends_at_life=False, max_pooling=True, history_length=4, max_no_op_actions=30)[source]

Bases: mushroom.environments.environment.Environment

The Atari environment as presented in: “Human-level control through deep reinforcement learning”. Mnih V. et al.. 2015.

__init__(name, width=84, height=84, ends_at_life=False, max_pooling=True, history_length=4, max_no_op_actions=30)[source]

Constructor.

Parameters:
  • name (str) – id name of the Atari game in Gym;
  • width (int, 84) – width of the screen;
  • height (int, 84) – height of the screen;
  • ends_at_life (bool, False) – whether the episode ends when a life is lost or not;
  • max_pooling (bool, True) – whether to do max-pooling or average-pooling of the last two frames when using NoFrameskip;
  • history_length (int, 4) – number of frames to form a state;
  • max_no_op_actions (int, 30) – maximum number of no-op action to execute at the beginning of an episode.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

Returns:An object containing the info of the environment.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
set_episode_end(ends_at_life)[source]

Setter.

Parameters:ends_at_life (bool) – whether the episode ends when a life is lost or not.
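
A minimal usage sketch of the Atari environment; the game id 'BreakoutDeterministic-v4' and the fixed action below are assumptions for illustration:

import numpy as np

from mushroom.environments.atari import Atari

mdp = Atari('BreakoutDeterministic-v4', ends_at_life=True)  # assumed game id

state = mdp.reset()
for _ in range(10):
    action = np.array([1])  # a fixed action index, purely illustrative
    state, reward, absorbing, info = mdp.step(action)
    if absorbing:
        state = mdp.reset()
mdp.stop()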

Car on hill

class mushroom.environments.car_on_hill.CarOnHill(horizon=100, gamma=0.95)[source]

Bases: mushroom.environments.environment.Environment

The Car On Hill environment as presented in: “Tree-Based Batch Mode Reinforcement Learning”. Ernst D. et al.. 2005.

__init__(horizon=100, gamma=0.95)[source]

Constructor.

reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

Returns:An object containing the info of the environment.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Finite MDP

class mushroom.environments.finite_mdp.FiniteMDP(p, rew, mu=None, gamma=0.9, horizon=inf)[source]

Bases: mushroom.environments.environment.Environment

Finite Markov Decision Process.

__init__(p, rew, mu=None, gamma=0.9, horizon=inf)[source]

Constructor.

Parameters:
  • p (np.ndarray) – transition probability matrix;
  • rew (np.ndarray) – reward matrix;
  • mu (np.ndarray, None) – initial state probability distribution;
  • gamma (float, 0.9) – discount factor;
  • horizon (int, np.inf) – the horizon.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

Returns:An object containing the info of the environment.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering
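
A minimal sketch of a hand-built two-state, two-action MDP; the (state, action, next_state) indexing of p and rew is an assumption for illustration:

import numpy as np

from mushroom.environments.finite_mdp import FiniteMDP

n_states, n_actions = 2, 2
p = np.zeros((n_states, n_actions, n_states))
p[0, 0, 0] = 1.  # action 0 keeps the agent in state 0
p[0, 1, 1] = 1.  # action 1 moves the agent to state 1
p[1, :, 1] = 1.  # both actions keep the agent in state 1

rew = np.zeros((n_states, n_actions, n_states))
rew[0, 1, 1] = 1.  # reward for reaching state 1

mdp = FiniteMDP(p, rew, mu=np.array([1., 0.]), gamma=0.9, horizon=100)
state = mdp.reset()
state, reward, absorbing, _ = mdp.step(np.array([1]))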

Grid World

class mushroom.environments.grid_world.AbstractGridWorld(mdp_info, height, width, start, goal)[source]

Bases: mushroom.environments.environment.Environment

Abstract class to build a grid world.

__init__(mdp_info, height, width, start, goal)[source]

Constructor.

Parameters:
  • height (int) – height of the grid;
  • width (int) – width of the grid;
  • start (tuple) – x-y coordinates of the start;
  • goal (tuple) – x-y coordinates of the goal.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

Returns:An object containing the info of the environment.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

class mushroom.environments.grid_world.GridWorld(height, width, goal, start=(0, 0))[source]

Bases: mushroom.environments.grid_world.AbstractGridWorld

Standard grid world.

__init__(height, width, goal, start=(0, 0))[source]

Constructor.

Parameters:
  • height (int) – height of the grid;
  • width (int) – width of the grid;
  • start (tuple) – x-y coordinates of the start;
  • goal (tuple) – x-y coordinates of the goal.
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

Returns:An object containing the info of the environment.
reset(state=None)

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
step(action)

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering
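
A minimal usage sketch of GridWorld; reading the number of actions from mdp.info.action_space.n and the x-y convention of the coordinates are assumptions for illustration:

import numpy as np

from mushroom.environments.grid_world import GridWorld

mdp = GridWorld(height=3, width=3, goal=(2, 2), start=(0, 0))

state = mdp.reset()
for _ in range(20):
    action = np.array([np.random.randint(mdp.info.action_space.n)])
    state, reward, absorbing, _ = mdp.step(action)
    if absorbing:
        break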

class mushroom.environments.grid_world.GridWorldVanHasselt(height=3, width=3, goal=(0, 2), start=(2, 0))[source]

Bases: mushroom.environments.grid_world.AbstractGridWorld

A variant of the grid world as presented in: “Double Q-Learning”. Hasselt H. V.. 2010.

__init__(height=3, width=3, goal=(0, 2), start=(2, 0))[source]

Constructor.

Parameters:
  • height (int) – height of the grid;
  • width (int) – width of the grid;
  • start (tuple) – x-y coordinates of the start;
  • goal (tuple) – x-y coordinates of the goal.
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

Returns:An object containing the info of the environment.
reset(state=None)

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
step(action)

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Gym

class mushroom.environments.gym_env.Gym(name, horizon, gamma)[source]

Bases: mushroom.environments.environment.Environment

Interface for OpenAI Gym environments. It makes it possible to use every Gym environment just providing the id, except for the Atari games that are managed in a separate class.

__init__(name, horizon, gamma)[source]

Constructor.

Parameters:
  • name (str) – gym id of the environment;
  • horizon (int) – the horizon;
  • gamma (float) – the discount factor.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

Returns:An object containing the info of the environment.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
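
A minimal usage sketch of the Gym wrapper; the environment id 'Pendulum-v0' is an assumption and depends on the installed Gym version:

import numpy as np

from mushroom.environments.gym_env import Gym

mdp = Gym('Pendulum-v0', horizon=200, gamma=0.99)  # assumed environment id

state = mdp.reset()
state, reward, absorbing, _ = mdp.step(np.array([0.]))  # zero torque
mdp.stop()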

Inverted pendulum

class mushroom.environments.inverted_pendulum.InvertedPendulum(random_start=False, m=1.0, l=1.0, g=9.8, mu=0.01, max_u=5.0, horizon=5000, gamma=0.99)[source]

Bases: mushroom.environments.environment.Environment

The Inverted Pendulum environment (continuous version) as presented in: “Reinforcement Learning In Continuous Time and Space”. Doya K.. 2000. “Off-Policy Actor-Critic”. Degris T. et al.. 2012. “Deterministic Policy Gradient Algorithms”. Silver D. et al. 2014.

__init__(random_start=False, m=1.0, l=1.0, g=9.8, mu=0.01, max_u=5.0, horizon=5000, gamma=0.99)[source]

Constructor.

Parameters:
  • random_start (bool, False) – whether to start from a random position or from the horizontal one;
  • m (float, 1.0) – mass of the pendulum;
  • l (float, 1.0) – length of the pendulum;
  • g (float, 9.8) – gravity acceleration constant;
  • mu (float, 1e-2) – friction constant of the pendulum;
  • max_u (float, 5.0) – maximum allowed input torque;
  • horizon (int, 5000) – horizon of the problem;
  • gamma (float, 0.99) – discount factor.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

Returns:An object containing the info of the environment.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
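
A minimal usage sketch of the continuous inverted pendulum; the constant torque value is purely illustrative and is kept within max_u:

import numpy as np

from mushroom.environments.inverted_pendulum import InvertedPendulum

mdp = InvertedPendulum(random_start=True, max_u=5., horizon=5000, gamma=0.99)

state = mdp.reset()
for _ in range(100):
    action = np.array([2.5])  # constant torque, half of max_u
    state, reward, absorbing, _ = mdp.step(action)
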
class mushroom.environments.inverted_pendulum.InvertedPendulumDiscrete(m=2.0, M=8.0, l=0.5, g=9.8, mu=0.01, max_u=50.0, noise_u=10.0, horizon=3000, gamma=0.95)[source]

Bases: mushroom.environments.environment.Environment

The Inverted Pendulum environment as presented in: “Least-Squares Policy Iteration”. Lagoudakis M. G. and Parr R.. 2003.

__init__(m=2.0, M=8.0, l=0.5, g=9.8, mu=0.01, max_u=50.0, noise_u=10.0, horizon=3000, gamma=0.95)[source]

Constructor.

Parameters:
  • m (float, 2.0) – mass of the pendulum;
  • M (float, 8.0) – mass of the cart;
  • l (float, 0.5) – length of the pendulum;
  • g (float, 9.8) – gravity acceleration constant;
  • mu (float, 1e-2) – friction constant of the pendulum;
  • max_u (float, 50.) – maximum allowed input torque;
  • noise_u (float, 10.) – maximum noise on the action;
  • horizon (int, 3000) – horizon of the problem;
  • gamma (float, 0.95) – discount factor.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

Returns:An object containing the info of the environment.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.

LQR

class mushroom.environments.lqr.LQR(A, B, Q, R, random_init=False, gamma=0.9, horizon=50)[source]

Bases: mushroom.environments.environment.Environment

This class implements a Linear-Quadratic Regulator. This task aims to minimize the undesired deviations from nominal values of some controller settings in control problems. The system equations in this task are:

\[x_{t+1} = Ax_t + Bu_t\]

where x is the state and u is the control signal.

The reward function is given by:

\[r_t = -\left( x_t^TQx_t + u_t^TRu_t \right)\]

“Policy gradient approaches for multi-objective sequential decision making”. Parisi S., Pirotta M., Smacchia N., Bascetta L., Restelli M.. 2014

__init__(A, B, Q, R, random_init=False, gamma=0.9, horizon=50)[source]

Constructor.

Parameters:
  • A (np.ndarray) – the state dynamics matrix;
  • B (np.ndarray) – the action dynamics matrix;
  • Q (np.ndarray) – reward weight matrix for state;
  • R (np.ndarray) – reward weight matrix for action;
  • random_init (bool, False) – start from a random state;
  • gamma (float, 0.9) – discount factor;
  • horizon (int, 50) – horizon of the mdp.
static generate(dimensions, eps=0.1, index=0, random_init=False, gamma=0.9, horizon=50)[source]

Factory method that generates an lqr with identity dynamics and symmetric reward matrices.

Parameters:
  • dimensions (int) – number of state-action dimensions;
  • eps (double, 0.1) – reward matrix weights specifier;
  • index (int, 0) – selector for the principal state;
  • random_init (bool, False) – start from a random state;
  • gamma (float, 0.9) – discount factor;
  • horizon (int, 50) – horizon of the mdp.
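
A minimal usage sketch of the factory method; with the system and reward defined above, the one-step reward for state x and action u is -(x^T Q x + u^T R u):

import numpy as np

from mushroom.environments.lqr import LQR

mdp = LQR.generate(dimensions=2, eps=0.1, gamma=0.9, horizon=50)

x = mdp.reset()
u = np.zeros(2)  # null control input
x_next, reward, absorbing, _ = mdp.step(u)
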
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

Returns:An object containing the info of the environment.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Mujoco

Segway

class mushroom.environments.segway.Segway(random_start=False)[source]

Bases: mushroom.environments.environment.Environment

The Segway environment (continuous version) as presented in: “Deep Learning for Actor-Critic Reinforcement Learning”. Xueli Jia. 2015.

__init__(random_start=False)[source]

Constructor.

Parameters:random_start (bool, False) – whether to start from a random position or from the horizontal one.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

Returns:An object containing the info of the environment.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Ship steering

class mushroom.environments.ship_steering.ShipSteering(small=True, n_steps_action=3)[source]

Bases: mushroom.environments.environment.Environment

The Ship Steering environment as presented in: “Hierarchical Policy Gradient Algorithms”. Ghavamzadeh M. and Mahadevan S.. 2013.

__init__(small=True, n_steps_action=3)[source]

Constructor.

Parameters:
  • small (bool, True) – whether to use a small state space or not.
  • n_steps_action (int, 3) – number of integration intervals for each step of the mdp.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

Returns:An object containing the info of the environment.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.

Generators

Grid world

mushroom.environments.generators.grid_world.generate_grid_world(grid, prob, pos_rew, neg_rew, gamma=0.9, horizon=100)[source]

This Grid World generator requires a .txt file to specify the shape of the grid world and the cells. There are five types of cells: ‘S’ is the starting position where the agent is; ‘G’ is the goal state; ‘.’ is a normal cell; ‘*’ is a hole, when the agent steps on a hole, it receives a negative reward and the episode ends; ‘#’ is a wall, when the agent is supposed to step on a wall, it actually remains in its current state. The initial states distribution is uniform among all the initial states provided.

The grid is expected to be rectangular.

Parameters:
  • grid (str) – the path of the file containing the grid structure;
  • prob (float) – probability of success of an action;
  • pos_rew (float) – reward obtained in goal states;
  • neg_rew (float) – reward obtained in “hole” states;
  • gamma (float, 0.9) – discount factor;
  • horizon (int, 100) – the horizon.
Returns:

A FiniteMDP object built with the provided parameters.
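
A minimal usage sketch; the grid file name and its content are assumptions that follow the cell conventions described above:

from mushroom.environments.generators.grid_world import generate_grid_world

# Content of grid.txt (assumed example):
# S..
# .*#
# ..G
mdp = generate_grid_world('grid.txt', prob=0.9, pos_rew=10., neg_rew=-10.,
                          gamma=0.9, horizon=100)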

mushroom.environments.generators.grid_world.parse_grid(grid)[source]

Parse the grid file.

Parameters:grid (str) – the path of the file containing the grid structure;
Returns:A list containing the grid structure.
mushroom.environments.generators.grid_world.compute_probabilities(grid_map, cell_list, prob)[source]

Compute the transition probability matrix.

Parameters:
  • grid_map (list) – list containing the grid structure;
  • cell_list (list) – list of non-wall cells;
  • prob (float) – probability of success of an action.
Returns:

The transition probability matrix.

mushroom.environments.generators.grid_world.compute_reward(grid_map, cell_list, pos_rew, neg_rew)[source]

Compute the reward matrix.

Parameters:
  • grid_map (list) – list containing the grid structure;
  • cell_list (list) – list of non-wall cells;
  • pos_rew (float) – reward obtained in goal states;
  • neg_rew (float) – reward obtained in “hole” states;
Returns:

The reward matrix.

mushroom.environments.generators.grid_world.compute_mu(grid_map, cell_list)[source]

Compute the initial states distribution.

Parameters:
  • grid_map (list) – list containing the grid structure;
  • cell_list (list) – list of non-wall cells.
Returns:

The initial states distribution.

Simple chain

mushroom.environments.generators.simple_chain.generate_simple_chain(state_n, goal_states, prob, rew, mu=None, gamma=0.9, horizon=100)[source]

Simple chain generator.

Parameters:
  • state_n (int) – number of states;
  • goal_states (list) – list of goal states;
  • prob (float) – probability of success of an action;
  • rew (float) – reward obtained in goal states;
  • mu (np.ndarray, None) – initial state probability distribution;
  • gamma (float, 0.9) – discount factor;
  • horizon (int, 100) – the horizon.
Returns:

A FiniteMDP object built with the provided parameters.
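
A minimal usage sketch generating a five-state chain whose last state is the goal:

from mushroom.environments.generators.simple_chain import generate_simple_chain

mdp = generate_simple_chain(state_n=5, goal_states=[4], prob=0.8, rew=1.,
                            gamma=0.9, horizon=100)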

mushroom.environments.generators.simple_chain.compute_probabilities(state_n, prob)[source]

Compute the transition probability matrix.

Parameters:
  • state_n (int) – number of states;
  • prob (float) – probability of success of an action.
Returns:

The transition probability matrix.

mushroom.environments.generators.simple_chain.compute_reward(state_n, goal_states, rew)[source]

Compute the reward matrix.

Parameters:
  • state_n (int) – number of states;
  • goal_states (list) – list of goal states;
  • rew (float) – reward obtained in goal states.
Returns:

The reward matrix.

Taxi

mushroom.environments.generators.taxi.generate_taxi(grid, prob=0.9, rew=(0, 1, 3, 15), gamma=0.99, horizon=inf)[source]

This Taxi generator requires a .txt file to specify the shape of the grid world and the cells. There are five types of cells: ‘S’ is the starting position where the agent is; ‘G’ is the goal state; ‘.’ is a normal cell; ‘F’ is a passenger: when the agent steps on a passenger cell, it picks the passenger up; ‘#’ is a wall, when the agent is supposed to step on a wall, it actually remains in its current state. The initial states distribution is uniform among all the initial states provided. The episode terminates when the agent reaches the goal state. The reward is always 0, except for the goal state where it depends on the number of collected passengers. Each action has a certain probability of success and, if it fails, the agent moves in a direction perpendicular to the intended one.

The grid is expected to be rectangular.

This problem is inspired from: “Bayesian Q-Learning”. Dearden R. et al.. 1998.

Parameters:
  • grid (str) – the path of the file containing the grid structure;
  • prob (float, 0.9) – probability of success of an action;
  • rew (tuple, (0, 1, 3, 15)) – rewards obtained in goal states;
  • gamma (float, 0.99) – discount factor;
  • horizon (int, np.inf) – the horizon.
Returns:

A FiniteMDP object built with the provided parameters.
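
A minimal usage sketch; the grid file name and its content are assumptions that follow the cell conventions described above:

from mushroom.environments.generators.taxi import generate_taxi

# Content of taxi.txt (assumed example):
# S..F
# .#..
# F..G
mdp = generate_taxi('taxi.txt', prob=0.9, rew=(0, 1, 3, 15), gamma=0.99)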

mushroom.environments.generators.taxi.parse_grid(grid)[source]

Parse the grid file.

Parameters:grid (str) – the path of the file containing the grid structure.
Returns:A list containing the grid structure.
mushroom.environments.generators.taxi.compute_probabilities(grid_map, cell_list, passenger_list, prob)[source]

Compute the transition probability matrix.

Parameters:
  • grid_map (list) – list containing the grid structure;
  • cell_list (list) – list of non-wall cells;
  • passenger_list (list) – list of passenger cells;
  • prob (float) – probability of success of an action.
Returns:

The transition probability matrix.

mushroom.environments.generators.taxi.compute_reward(grid_map, cell_list, passenger_list, rew)[source]

Compute the reward matrix.

Parameters:
  • grid_map (list) – list containing the grid structure;
  • cell_list (list) – list of non-wall cells;
  • passenger_list (list) – list of passenger cells;
  • rew (tuple) – rewards obtained in goal states.
Returns:

The reward matrix.

mushroom.environments.generators.taxi.compute_mu(grid_map, cell_list, passenger_list)[source]

Compute the initial states distribution.

Parameters:
  • grid_map (list) – list containing the grid structure;
  • cell_list (list) – list of non-wall cells;
  • passenger_list (list) – list of passenger cells.
Returns:

The initial states distribution.