MushroomRL

Introduction

What is MushroomRL

MushroomRL is a Reinforcement Learning (RL) library developed to be a simple yet powerful way to run RL and deep RL experiments. The idea behind MushroomRL is to offer the majority of RL algorithms through a common interface, so that they can be run with minimal effort. Moreover, it is designed so that new algorithms and other components can be added transparently, without editing other parts of the code. MushroomRL is compatible with environment libraries such as OpenAI Gym, the DeepMind Control Suite, PyBullet, and MuJoCo, and with the PyTorch library for tensor computation.

With MushroomRL you can:

  • solve RL problems by writing a single small script;
  • add custom algorithms, policies, and so on, transparently;
  • use all RL environments offered by well-known libraries and build customized environments as well;
  • exploit regression models offered by third-party libraries (e.g., scikit-learn) or build a customized one with PyTorch;
  • seamlessly run experiments on CPU or GPU.

Basic run example

Solve a discrete MDP in a few lines. First, create the MDP:

from mushroom_rl.environments import GridWorld

mdp = GridWorld(width=3, height=3, goal=(2, 2), start=(0, 0))

Then, create an epsilon-greedy policy:

from mushroom_rl.policy import EpsGreedy
from mushroom_rl.utils.parameters import Parameter

epsilon = Parameter(value=1.)
policy = EpsGreedy(epsilon=epsilon)

Finally, create the agent:

from mushroom_rl.algorithms.value import QLearning

learning_rate = Parameter(value=.6)
agent = QLearning(mdp.info, policy, learning_rate)

Learn:

from mushroom_rl.core import Core

core = Core(agent, mdp)
core.learn(n_steps=10000, n_steps_per_fit=1)

Print final Q-table:

import numpy as np

shape = agent.Q.shape
q = np.zeros(shape)
for i in range(shape[0]):
    for j in range(shape[1]):
        state = np.array([i])
        action = np.array([j])
        q[i, j] = agent.Q.predict(state, action)
print(q)

Results in:

[[  6.561   7.29    6.561   7.29 ]
 [  7.29    8.1     6.561   8.1  ]
 [  8.1     9.      7.29    8.1  ]
 [  6.561   8.1     7.29    8.1  ]
 [  7.29    9.      7.29    9.   ]
 [  8.1    10.      8.1     9.   ]
 [  7.29    8.1     8.1     9.   ]
 [  8.1     9.      8.1    10.   ]
 [  0.      0.      0.      0.   ]]

where each row corresponds to a state of the MDP and stores the Q-values of each of its actions.
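
To check the learned behaviour, the agent can also be evaluated with a greedy policy. A minimal sketch, assuming the set_epsilon setter of EpsGreedy and the compute_J utility of mushroom_rl.utils.dataset:

from mushroom_rl.utils.dataset import compute_J

# Act greedily during evaluation by removing exploration
agent.policy.set_epsilon(Parameter(value=0.))
dataset = core.evaluate(n_episodes=10)

# Discounted return of each evaluation episode
print(compute_J(dataset, gamma=mdp.info.gamma))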

Download and installation

MushroomRL can be downloaded from the GitHub repository. Installation can be done by running

pip3 install mushroom_rl

To compile the documentation:

cd mushroom_rl/docs
make html

or to compile the pdf version:

cd mushroom_rl/docs
make latexpdf

To launch the MushroomRL test suite:

pytest

Installation troubleshooting

Common problems with the installation of MushroomRL arise when some of its dependencies are broken or not installed. In general, we recommend installing MushroomRL with the option all, which installs all the Python dependencies. The installation time mostly depends on the time needed to install the dependencies: a basic installation takes approximately 1 minute with a fast internet connection, while installing all the dependencies takes approximately 5 minutes. A slower internet connection may increase the installation time significantly.
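
For instance, assuming the all extra of the PyPI package, all the optional dependencies can be installed by running

pip3 install mushroom_rl[all]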

If installing all the dependencies, ensure that the swig library is installed, as it is used by some Gym environments and the installation may fail otherwise. For Atari, you might need to install the ROMs separately, otherwise the creation of Atari environments may fail. OpenCV should be installed too. For MuJoCo, ensure that the path of your MuJoCo folder is included in the environment variable LD_LIBRARY_PATH and that mujoco_py is correctly installed. Installing MushroomRL in a Conda environment is generally safe. However, we are aware that when installing with the option plots, some errors may arise due to incompatibility issues between pyqtgraph and Conda; we recommend not using Conda when installing with the plots option. Finally, ensure that the C/C++ compilers and Cython are working as expected.

To check if the installation has been successful, you can try to run the basic example above.

MushroomRL is well-tested on Linux. If you are using another OS, you may run into issues that we are not yet aware of. In that case, please do not hesitate to send us an email at mushroom4rl@gmail.com.

MushroomRL vs other libraries

MushroomRL offers the majority of classical and deep RL algorithms, while keeping a modular and flexible architecture. It is compatible with PyTorch and most machine learning and RL libraries.

The comparison with Stable Baselines, RLLib, Keras RL, Chainer RL, and Tensorforce covers the following features:

  • Classic RL algorithms
  • Deep RL algorithms
  • Updated documentation
  • Modular
  • Easy to extend
  • PEP8 compliant
  • Compatible with RL benchmarks
  • Benchmarking suite
  • MuJoCo integration
  • PyBullet integration
  • Torch integration
  • TensorFlow integration
  • Chainer integration
  • Parallel environments

API Documentation

Agent-Environment Interface

The three basic interfaces of mushroom_rl are the Agent, the Environment, and the Core.

  • The Agent is the basic interface for any Reinforcement Learning algorithm.
  • The Environment is the basic interface for every problem/task that the agent should solve.
  • The Core is a class used to control the interaction between an agent and an environment.

To implement serialization of MushroomRL data to disk (load/save functionality), we also provide the Serializable interface. Finally, logging functionality is provided by the Logger class.

Agent

MushroomRL provides the implementations of several algorithms belonging to all categories of RL:

  • value-based;
  • policy-search;
  • actor-critic.

One can easily implement customized algorithms following the structure of the already available ones, by extending the following interface:

class Agent(mdp_info, policy, features=None)[source]

Bases: mushroom_rl.core.serialization.Serializable

This class implements the functions to manage the agent (e.g. move the agent following its policy).

__init__(mdp_info, policy, features=None)[source]

Constructor.

Parameters:
  • mdp_info (MDPInfo) – information about the MDP;
  • policy (Policy) – the policy followed by the agent;
  • features (object, None) – features to extract from the state.
fit(dataset)[source]

Fit step.

Parameters:dataset (list) – the dataset.
draw_action(state)[source]

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()[source]

Called by the agent when a new episode starts.

stop()[source]

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.

set_logger(logger)[source]

Setter that can be used to pass a logger to the algorithm

Parameters:logger (Logger) – the logger to be used by the algorithm.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
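
As an illustration of the interface above, the following is a minimal sketch of a custom agent. The CountingAgent class is purely hypothetical and only counts the samples it receives; a real algorithm would update its policy or value function inside fit:

from mushroom_rl.core import Agent

class CountingAgent(Agent):
    def __init__(self, mdp_info, policy):
        super().__init__(mdp_info, policy)

        # Hypothetical attribute, saved as a primitive value by the serialization machinery
        self._n_samples = 0
        self._add_save_attr(_n_samples='primitive')

    def fit(self, dataset):
        # A real agent would fit its policy or value function on the dataset here
        self._n_samples += len(dataset)

An agent defined in this way can be passed to a Core and run exactly like the built-in algorithms.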

Environment

MushroomRL provides several implementations of well-known benchmarks with both continuous and discrete action spaces.

To implement a new environment, it is mandatory to use the following interface:

class MDPInfo(observation_space, action_space, gamma, horizon)[source]

Bases: mushroom_rl.core.serialization.Serializable

This class is used to store the information of the environment.

__init__(observation_space, action_space, gamma, horizon)[source]

Constructor.

Parameters:
  • observation_space ([Box, Discrete]) – the state space;
  • action_space ([Box, Discrete]) – the action space;
  • gamma (float) – the discount factor;
  • horizon (int) – the horizon.
size

Returns:The sum of the number of discrete states and discrete actions. Only works for discrete spaces.
shape

Returns:The concatenation of the shape tuple of the state and action spaces.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
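
For illustration, the MDPInfo of a small discrete problem could be built as follows (assuming the Discrete space class located in mushroom_rl.utils.spaces):

from mushroom_rl.core import MDPInfo
from mushroom_rl.utils.spaces import Discrete

# 9 discrete states, 4 discrete actions, discount factor 0.9, horizon of 100 steps
info = MDPInfo(observation_space=Discrete(9), action_space=Discrete(4),
               gamma=0.9, horizon=100)
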
class Environment(mdp_info)[source]

Bases: object

Basic interface used by any mushroom environment.

classmethod register()[source]

Register an environment in the environment list.

static list_registered()[source]

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)[source]

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available. Otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – name of the environment;
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

__init__(mdp_info)[source]

Constructor.

Parameters:mdp_info (MDPInfo) – an object containing the info of the environment.
seed(seed)[source]

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()[source]

Method used to stop an MDP. Useful when dealing with real-world environments or simulators, or when using OpenAI Gym rendering.

info

Returns:An object containing the info of the environment.
static _bound(x, min_value, max_value)[source]

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.
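
The following is a minimal sketch of a custom environment implementing the interface above. The two-state chain and all its names are purely illustrative, and the Discrete space is assumed to be available in mushroom_rl.utils.spaces:

import numpy as np

from mushroom_rl.core import Environment, MDPInfo
from mushroom_rl.utils.spaces import Discrete

class TwoStateChain(Environment):
    def __init__(self, gamma=0.9, horizon=50):
        # Two discrete states and two discrete actions
        mdp_info = MDPInfo(observation_space=Discrete(2),
                           action_space=Discrete(2),
                           gamma=gamma, horizon=horizon)
        super().__init__(mdp_info)

        self._state = None

    def reset(self, state=None):
        self._state = np.array([0]) if state is None else state

        return self._state

    def step(self, action):
        # Action 1 moves to the absorbing goal state and gives a reward of 1
        reward = 0.
        absorbing = False
        if action[0] == 1:
            self._state = np.array([1])
            reward = 1.
            absorbing = True

        return self._state, reward, absorbing, {}

Optionally, TwoStateChain.register() can be called to make the environment available through Environment.make.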

Core

class Core(agent, mdp, callbacks_fit=None, callback_step=None, preprocessors=None)[source]

Bases: object

Implements the functions to run a generic algorithm.

__init__(agent, mdp, callbacks_fit=None, callback_step=None, preprocessors=None)[source]

Constructor.

Parameters:
  • agent (Agent) – the agent moving according to a policy;
  • mdp (Environment) – the environment in which the agent moves;
  • callbacks_fit (list) – list of callbacks to execute at the end of each fit;
  • callback_step (Callback) – callback to execute after each step;
  • preprocessors (list) – list of state preprocessors to be applied to state variables before feeding them to the agent.
learn(n_steps=None, n_episodes=None, n_steps_per_fit=None, n_episodes_per_fit=None, render=False, quiet=False)[source]

This function moves the agent in the environment and fits the policy using the collected samples. The agent can be moved for a given number of steps or a given number of episodes and, independently from this choice, the policy can be fitted after a given number of steps or a given number of episodes. By default, the environment is reset.

Parameters:
  • n_steps (int, None) – number of steps to move the agent;
  • n_episodes (int, None) – number of episodes to move the agent;
  • n_steps_per_fit (int, None) – number of steps between each fit of the policy;
  • n_episodes_per_fit (int, None) – number of episodes between each fit of the policy;
  • render (bool, False) – whether to render the environment or not;
  • quiet (bool, False) – whether to show the progress bar or not.
evaluate(initial_states=None, n_steps=None, n_episodes=None, render=False, quiet=False)[source]

This function moves the agent in the environment using its policy. The agent is moved for a provided number of steps, episodes, or from a set of initial states for the whole episode. By default, the environment is reset.

Parameters:
  • initial_states (np.ndarray, None) – the starting states of each episode;
  • n_steps (int, None) – number of steps to move the agent;
  • n_episodes (int, None) – number of episodes to move the agent;
  • render (bool, False) – whether to render the environment or not;
  • quiet (bool, False) – whether to show the progress bar or not.
_step(render)[source]

Single step.

Parameters:render (bool) – whether to render or not.
Returns:A tuple containing the previous state, the action sampled by the agent, the reward obtained, the reached state, the absorbing flag of the reached state and the last step flag.
reset(initial_states=None)[source]

Reset the state of the agent.

_preprocess(state)[source]

Method to apply state preprocessors.

Parameters:state (np.ndarray) – the state to be preprocessed.
Returns:The preprocessed state.
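
Putting the pieces together, a typical usage of Core alternates learning and evaluation. The snippet below is a sketch that reuses the agent and mdp of the basic example and assumes the CollectDataset callback shipped in mushroom_rl.utils.callbacks:

from mushroom_rl.core import Core
from mushroom_rl.utils.callbacks import CollectDataset

# Collect the transitions seen during learning through a fit callback
collect = CollectDataset()
core = Core(agent, mdp, callbacks_fit=[collect])

core.learn(n_episodes=100, n_episodes_per_fit=1)
print(len(collect.get()))  # number of transitions collected while learning

dataset = core.evaluate(n_episodes=5, render=False)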

Serialization

class Serializable[source]

Bases: object

Interface to implement serialization of a MushroomRL object. This provides load and save functionality to store the object in a zip file. It is possible to save the state of the agent with different levels of detail, depending on the full_save flag.

save(path, full_save=False)[source]

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')[source]

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
classmethod load(path)[source]

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
copy()[source]
Returns:A deepcopy of the agent.
_add_save_attr(**attr_dict)[source]

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()[source]

This method can be overwritten to implement logic that is executed after the loading of the agent.

__init__

Initialize self. See help(type(self)) for accurate signature.
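
As an example, the following hypothetical class uses the Serializable interface to persist a primitive counter and a numpy array; the file name is arbitrary, since the object is stored as a zip file:

import numpy as np

from mushroom_rl.core import Serializable

class RunningStats(Serializable):
    def __init__(self):
        self._count = 0
        self._values = np.zeros(10)

        # Declare how each attribute has to be serialized
        self._add_save_attr(
            _count='primitive',
            _values='numpy'
        )

stats = RunningStats()
stats.save('stats.msh')
restored = RunningStats.load('stats.msh')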

Logger

class Logger(log_name='', results_dir='./logs', log_console=False, use_timestamp=False, append=False, seed=None, **kwargs)[source]

Bases: mushroom_rl.core.logger.data_logger.DataLogger, mushroom_rl.core.logger.console_logger.ConsoleLogger

This class implements the logging functionality. It can be used to automatically create a log directory, save numpy data arrays and the current agent.

__init__(log_name='', results_dir='./logs', log_console=False, use_timestamp=False, append=False, seed=None, **kwargs)[source]

Constructor.

Parameters:
  • log_name (string, '') – name of the current experiment directory; if not specified, the current timestamp is used;
  • results_dir (string, './logs') – name of the base logging directory. If set to None, no directory is created;
  • log_console (bool, False) – whether to log or not the console output;
  • use_timestamp (bool, False) – If true, adds the current timestamp to the folder name;
  • append (bool, False) – If true, the logger will append the new data logged to the one already existing in the directory;
  • seed (int, None) – seed for the current run. It can be optionally specified to add a seed suffix for each data file logged;
  • **kwargs – other parameters for ConsoleLogger class.
critical(msg)

Log a message with CRITICAL level

debug(msg)

Log a message with DEBUG level

epoch_info(epoch, **kwargs)

Log the epoch info with the format: Epoch <epoch number> | <label 1>: <data 1> <label 2>: <data 2> …

Parameters:
  • epoch (int) – epoch number;
  • **kwargs – the labels and the data to be displayed.
error(msg)

Log a message with ERROR level

exception(msg)

Log a message with ERROR level. To be called only from an exception handler

info(msg)

Log a message with INFO level

log_agent(agent, epoch=None, full_save=False)

Log agent into the log folder.

Parameters:
  • agent (Agent) – The agent to be saved;
  • epoch (int, None) – optional epoch number to be added to the agent file currently saved;
  • full_save (bool, False) – whether to save the full data from the agent or not.
log_best_agent(agent, J, full_save=False)

Log the best agent so far into the log folder. The agent is logged only if the current performance is better than the performance of the previously stored agent.

Parameters:
  • agent (Agent) – The agent to be saved;
  • J (float) – The performance metric of the current agent;
  • full_save (bool, False) – whether to save the full data from the agent or not.
log_numpy(**kwargs)

Log scalars into numpy arrays.

Parameters:**kwargs – set of named scalar values to be saved. The argument name will be used to identify the given quantity and as base file name.
path

Property to return the path to the current logging directory

strong_line()

Log a line of #

warning(msg)

Log a message with WARNING level

weak_line()

Log a line of -
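
A minimal usage sketch of the Logger, assuming an agent built as in the basic example:

from mushroom_rl.core import Logger

logger = Logger(log_name='tutorial', results_dir='./logs', seed=0)

logger.info('Experiment started')
logger.epoch_info(1, J=0.5, steps=1000)  # printed as: Epoch 1 | J: 0.5 steps: 1000
logger.log_numpy(J=0.5)                  # stored as a numpy array named after the keyword
logger.log_agent(agent, epoch=1)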

class ConsoleLogger(log_name, log_dir=None, suffix='', log_file_name=None, console_log_level=10, file_log_level=10)[source]

Bases: object

This class implements the console logging functionality. It can be used to log text into the console and optionally save a log file.

__init__(log_name, log_dir=None, suffix='', log_file_name=None, console_log_level=10, file_log_level=10)[source]

Constructor.

Parameters:
  • log_name (str, None) – Name of the current logger.
  • log_dir (Path, None) – path of the logging directory. If None, the console output is not logged into a file;
  • suffix (string, '') – optional string to add a suffix to the logger id and to the data file logged;
  • log_file_name (str, None) – optional specifier for log file name, id is used by default;
  • console_log_level (int, logging.DEBUG) – logging level for console;
  • file_log_level (int, logging.DEBUG) – logging level for file.
debug(msg)[source]

Log a message with DEBUG level

info(msg)[source]

Log a message with INFO level

warning(msg)[source]

Log a message with WARNING level

error(msg)[source]

Log a message with ERROR level

critical(msg)[source]

Log a message with CRITICAL level

exception(msg)[source]

Log a message with ERROR level. To be called only from an exception handler

strong_line()[source]

Log a line of #

weak_line()[source]

Log a line of -

epoch_info(epoch, **kwargs)[source]

Log the epoch info with the format: Epoch <epoch number> | <label 1>: <data 1> <label 2>: <data 2> …

Parameters:
  • epoch (int) – epoch number;
  • **kwargs – the labels and the data to be displayed.
class DataLogger(results_dir, suffix='', append=False)[source]

Bases: object

This class implements the data logging functionality. It can be used to automatically create a log directory, save numpy data arrays and the current agent.

__init__(results_dir, suffix='', append=False)[source]

Constructor.

Parameters:
  • results_dir (Path) – path of the logging directory;
  • suffix (string) – optional string to add a suffix to each data file logged;
  • append (bool, False) – If true, the logger will append the new data logged to the one already existing in the directory.
log_numpy(**kwargs)[source]

Log scalars into numpy arrays.

Parameters:**kwargs – set of named scalar values to be saved. The argument name will be used to identify the given quantity and as base file name.
log_agent(agent, epoch=None, full_save=False)[source]

Log agent into the log folder.

Parameters:
  • agent (Agent) – The agent to be saved;
  • epoch (int, None) – optional epoch number to be added to the agent file currently saved;
  • full_save (bool, False) – whether to save the full data from the agent or not.
log_best_agent(agent, J, full_save=False)[source]

Log the best agent so far into the log folder. The agent is logged only if the current performance is better than the performance of the previously stored agent.

Parameters:
  • agent (Agent) – The agent to be saved;
  • J (float) – The performance metric of the current agent;
  • full_save (bool, False) – whether to save the full data from the agent or not.
path

Property to return the path to the current logging directory
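
Combining the Core and the Logger gives the epoch-based training loop used by many MushroomRL examples. A sketch, assuming the core, mdp, agent and logger objects created in the previous snippets and the compute_J utility of mushroom_rl.utils.dataset:

import numpy as np

from mushroom_rl.utils.dataset import compute_J

for epoch in range(10):
    core.learn(n_steps=1000, n_steps_per_fit=1)
    dataset = core.evaluate(n_episodes=5, quiet=True)

    # Average discounted return of the evaluation episodes
    J = np.mean(compute_J(dataset, mdp.info.gamma))
    logger.epoch_info(epoch, J=J)
    logger.log_best_agent(agent, J)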

Actor-Critic

Classical Actor-Critic Methods

class COPDAC_Q(mdp_info, policy, mu, alpha_theta, alpha_omega, alpha_v, value_function_features=None, policy_features=None)[source]

Bases: mushroom_rl.core.agent.Agent

Compatible off-policy deterministic actor-critic algorithm. “Deterministic Policy Gradient Algorithms”. Silver D. et al., 2014.

__init__(mdp_info, policy, mu, alpha_theta, alpha_omega, alpha_v, value_function_features=None, policy_features=None)[source]

Constructor.

Parameters:
  • mu (Regressor) – regressor that describes the deterministic policy to be learned, i.e., the deterministic mapping between state and action;
  • alpha_theta ([float, Parameter]) – learning rate for policy update;
  • alpha_omega ([float, Parameter]) – learning rate for the advantage function;
  • alpha_v ([float, Parameter]) – learning rate for the value function;
  • value_function_features (Features, None) – features used by the value function approximator;
  • policy_features (Features, None) – features used by the policy.
fit(dataset)[source]

Fit step.

Parameters:dataset (list) – the dataset.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.

class StochasticAC(mdp_info, policy, alpha_theta, alpha_v, lambda_par=0.9, value_function_features=None, policy_features=None)[source]

Bases: mushroom_rl.core.agent.Agent

Stochastic actor-critic in the episodic setting, as presented in: “Model-Free Reinforcement Learning with Continuous Action in Practice”. Degris T. et al., 2012.

__init__(mdp_info, policy, alpha_theta, alpha_v, lambda_par=0.9, value_function_features=None, policy_features=None)[source]

Constructor.

Parameters:
  • alpha_theta ([float, Parameter]) – learning rate for policy update;
  • alpha_v ([float, Parameter]) – learning rate for the value function;
  • lambda_par ([float, Parameter], 0.9) – trace decay parameter;
  • value_function_features (Features, None) – features used by the value function approximator;
  • policy_features (Features, None) – features used by the policy.
episode_start()[source]

Called by the agent when a new episode starts.

fit(dataset)[source]

Fit step.

Parameters:dataset (list) – the dataset.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.

class StochasticAC_AVG(mdp_info, policy, alpha_theta, alpha_v, alpha_r, lambda_par=0.9, value_function_features=None, policy_features=None)[source]

Bases: mushroom_rl.algorithms.actor_critic.classic_actor_critic.stochastic_ac.StochasticAC

Stochastic actor-critic in the average reward setting, as presented in: “Model-Free Reinforcement Learning with Continuous Action in Practice”. Degris T. et al., 2012.

__init__(mdp_info, policy, alpha_theta, alpha_v, alpha_r, lambda_par=0.9, value_function_features=None, policy_features=None)[source]

Constructor.

Parameters:alpha_r (Parameter) – learning rate for the reward trace.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.

Deep Actor-Critic Methods

class DeepAC(mdp_info, policy, actor_optimizer, parameters)[source]

Bases: mushroom_rl.core.agent.Agent

Base class for algorithms that use the reparametrization trick, such as SAC, DDPG and TD3.

__init__(mdp_info, policy, actor_optimizer, parameters)[source]

Constructor.

Parameters:
  • actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
  • parameters (list) – policy parameters to be optimized.
fit(dataset)[source]

Fit step.

Parameters:dataset (list) – the dataset.
_optimize_actor_parameters(loss)[source]

Method used to update actor parameters to maximize a given loss.

Parameters:loss (torch.tensor) – the loss computed by the algorithm.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()[source]

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.

class A2C(mdp_info, policy, actor_optimizer, critic_params, ent_coeff, max_grad_norm=None, critic_fit_params=None)[source]

Bases: mushroom_rl.algorithms.actor_critic.deep_actor_critic.deep_actor_critic.DeepAC

Advantage Actor Critic algorithm (A2C). Synchronous version of the A3C algorithm. “Asynchronous Methods for Deep Reinforcement Learning”. Mnih V. et al., 2016.

__init__(mdp_info, policy, actor_optimizer, critic_params, ent_coeff, max_grad_norm=None, critic_fit_params=None)[source]

Constructor.

Parameters:
  • policy (TorchPolicy) – torch policy to be learned by the algorithm;
  • actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
  • critic_params (dict) – parameters of the critic approximator to build;
  • ent_coeff ([float, Parameter], 0) – coefficient for the entropy penalty;
  • max_grad_norm (float, None) – maximum norm for gradient clipping. If None, no clipping will be performed, unless specified otherwise in actor_optimizer;
  • critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
fit(dataset)[source]

Fit step.

Parameters:dataset (list) – the dataset.
_post_load()[source]

This method can be overwritten to implement logic that is executed after the loading of the agent.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_optimize_actor_parameters(loss)

Method used to update actor parameters to maximize a given loss.

Parameters:loss (torch.tensor) – the loss computed by the algorithm.
copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.

class DDPG(mdp_info, policy_class, policy_params, actor_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, tau, policy_delay=1, critic_fit_params=None, actor_predict_params=None, critic_predict_params=None)[source]

Bases: mushroom_rl.algorithms.actor_critic.deep_actor_critic.deep_actor_critic.DeepAC

Deep Deterministic Policy Gradient algorithm. “Continuous Control with Deep Reinforcement Learning”. Lillicrap T. P. et al., 2016.

__init__(mdp_info, policy_class, policy_params, actor_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, tau, policy_delay=1, critic_fit_params=None, actor_predict_params=None, critic_predict_params=None)[source]

Constructor.

Parameters:
  • policy_class (Policy) – class of the policy;
  • policy_params (dict) – parameters of the policy to build;
  • actor_params (dict) – parameters of the actor approximator to build;
  • actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
  • critic_params (dict) – parameters of the critic approximator to build;
  • batch_size ([int, Parameter]) – the number of samples in a batch;
  • initial_replay_size (int) – the number of samples to collect before starting the learning;
  • max_replay_size (int) – the maximum number of samples in the replay memory;
  • tau ([float, Parameter]) – value of coefficient for soft updates;
  • policy_delay ([int, Parameter], 1) – the number of updates of the critic after which an actor update is implemented;
  • critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator;
  • actor_predict_params (dict, None) – parameters for the prediction with the actor approximator;
  • critic_predict_params (dict, None) – parameters for the prediction with the critic approximator.
fit(dataset)[source]

Fit step.

Parameters:dataset (list) – the dataset.
_next_q(next_state, absorbing)[source]
Parameters:
  • next_state (np.ndarray) – the states where next action has to be evaluated;
  • absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns:

Action-values returned by the critic for next_state and the action returned by the actor.

_post_load()[source]

This method can be overwritten to implement logic that is executed after the loading of the agent.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_optimize_actor_parameters(loss)

Method used to update actor parameters to maximize a given loss.

Parameters:loss (torch.tensor) – the loss computed by the algorithm.
copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.

class TD3(mdp_info, policy_class, policy_params, actor_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, tau, policy_delay=2, noise_std=0.2, noise_clip=0.5, critic_fit_params=None)[source]

Bases: mushroom_rl.algorithms.actor_critic.deep_actor_critic.ddpg.DDPG

Twin Delayed DDPG algorithm. “Addressing Function Approximation Error in Actor-Critic Methods”. Fujimoto S. et al., 2018.

__init__(mdp_info, policy_class, policy_params, actor_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, tau, policy_delay=2, noise_std=0.2, noise_clip=0.5, critic_fit_params=None)[source]

Constructor.

Parameters:
  • policy_class (Policy) – class of the policy;
  • policy_params (dict) – parameters of the policy to build;
  • actor_params (dict) – parameters of the actor approximator to build;
  • actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
  • critic_params (dict) – parameters of the critic approximator to build;
  • batch_size ([int, Parameter]) – the number of samples in a batch;
  • initial_replay_size (int) – the number of samples to collect before starting the learning;
  • max_replay_size (int) – the maximum number of samples in the replay memory;
  • tau ([float, Parameter]) – value of coefficient for soft updates;
  • policy_delay ([int, Parameter], 2) – the number of updates of the critic after which an actor update is implemented;
  • noise_std ([float, Parameter], 0.2) – standard deviation of the noise used for policy smoothing;
  • noise_clip ([float, Parameter], 0.5) – maximum absolute value for policy smoothing noise;
  • critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
_next_q(next_state, absorbing)[source]
Parameters:
  • next_state (np.ndarray) – the states where next action has to be evaluated;
  • absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns:

Action-values returned by the critic for next_state and the action returned by the actor.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_optimize_actor_parameters(loss)

Method used to update actor parameters to maximize a given loss.

Parameters:loss (torch.tensor) – the loss computed by the algorithm.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.

class SAC(mdp_info, actor_mu_params, actor_sigma_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, warmup_transitions, tau, lr_alpha, log_std_min=-20, log_std_max=2, target_entropy=None, critic_fit_params=None)[source]

Bases: mushroom_rl.algorithms.actor_critic.deep_actor_critic.deep_actor_critic.DeepAC

Soft Actor-Critic algorithm. “Soft Actor-Critic Algorithms and Applications”. Haarnoja T. et al., 2019.

__init__(mdp_info, actor_mu_params, actor_sigma_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, warmup_transitions, tau, lr_alpha, log_std_min=-20, log_std_max=2, target_entropy=None, critic_fit_params=None)[source]

Constructor.

Parameters:
  • actor_mu_params (dict) – parameters of the actor mean approximator to build;
  • actor_sigma_params (dict) – parameters of the actor sigma approximator to build;
  • actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
  • critic_params (dict) – parameters of the critic approximator to build;
  • batch_size ([int, Parameter]) – the number of samples in a batch;
  • initial_replay_size (int) – the number of samples to collect before starting the learning;
  • max_replay_size (int) – the maximum number of samples in the replay memory;
  • warmup_transitions ([int, Parameter]) – number of samples to accumulate in the replay memory to start the policy fitting;
  • tau ([float, Parameter]) – value of coefficient for soft updates;
  • lr_alpha ([float, Parameter]) – Learning rate for the entropy coefficient;
  • log_std_min ([float, Parameter]) – Min value for the policy log std;
  • log_std_max ([float, Parameter]) – Max value for the policy log std;
  • target_entropy (float, None) – target entropy for the policy; if None, a default value is computed;
  • critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
fit(dataset)[source]

Fit step.

Parameters:dataset (list) – the dataset.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_next_q(next_state, absorbing)[source]
Parameters:
  • next_state (np.ndarray) – the states where next action has to be evaluated;
  • absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns:

Action-values returned by the critic for next_state and the action returned by the actor.

_optimize_actor_parameters(loss)

Method used to update actor parameters to maximize a given loss.

Parameters:loss (torch.tensor) – the loss computed by the algorithm.
copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.

_post_load()[source]

This method can be overwritten to implement logic that is executed after the loading of the agent.

class TRPO(mdp_info, policy, critic_params, ent_coeff=0.0, max_kl=0.001, lam=1.0, n_epochs_line_search=10, n_epochs_cg=10, cg_damping=0.01, cg_residual_tol=1e-10, critic_fit_params=None)[source]

Bases: mushroom_rl.core.agent.Agent

Trust Region Policy Optimization algorithm. “Trust Region Policy Optimization”. Schulman J. et al., 2015.

__init__(mdp_info, policy, critic_params, ent_coeff=0.0, max_kl=0.001, lam=1.0, n_epochs_line_search=10, n_epochs_cg=10, cg_damping=0.01, cg_residual_tol=1e-10, critic_fit_params=None)[source]

Constructor.

Parameters:
  • policy (TorchPolicy) – torch policy to be learned by the algorithm;
  • critic_params (dict) – parameters of the critic approximator to build;
  • ent_coeff ([float, Parameter], 0) – coefficient for the entropy penalty;
  • max_kl ([float, Parameter], 0.001) – maximum KL allowed for every policy update;
  • lam (float, 1.) – lambda coefficient used by generalized advantage estimation;
  • n_epochs_line_search ([int, Parameter], 10) – maximum number of iterations of the line search algorithm;
  • n_epochs_cg ([int, Parameter], 10) – maximum number of iterations of the conjugate gradient algorithm;
  • cg_damping ([float, Parameter], 1e-2) – damping factor for the conjugate gradient algorithm;
  • cg_residual_tol ([float, Parameter], 1e-10) – conjugate gradient residual tolerance;
  • critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
fit(dataset)[source]

Fit step.

Parameters:dataset (list) – the dataset.
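To make the constructor arguments concrete, here is a hedged end-to-end sketch; the Pendulum-v1 task, the network architecture and every hyperparameter are illustrative choices, not recommended settings.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from mushroom_rl.algorithms.actor_critic import TRPO
from mushroom_rl.core import Core
from mushroom_rl.environments import Gym
from mushroom_rl.policy import GaussianTorchPolicy


class Network(nn.Module):
    # Small MLP used (as a class) by both the policy and the critic.
    def __init__(self, input_shape, output_shape, n_features=32, **kwargs):
        super().__init__()
        self._h1 = nn.Linear(input_shape[-1], n_features)
        self._h2 = nn.Linear(n_features, output_shape[0])

    def forward(self, state, **kwargs):
        return self._h2(torch.relu(self._h1(state.float())))


mdp = Gym('Pendulum-v1', horizon=200, gamma=.99)

policy = GaussianTorchPolicy(Network,
                             mdp.info.observation_space.shape,
                             mdp.info.action_space.shape,
                             std_0=1.)

critic_params = dict(network=Network,
                     optimizer={'class': optim.Adam, 'params': {'lr': 3e-4}},
                     loss=F.mse_loss,
                     batch_size=64,
                     input_shape=mdp.info.observation_space.shape,
                     output_shape=(1,))

agent = TRPO(mdp.info, policy, critic_params, max_kl=1e-2, lam=.95)

core = Core(agent, mdp)
core.learn(n_steps=3000, n_steps_per_fit=3000)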
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class PPO(mdp_info, policy, actor_optimizer, critic_params, n_epochs_policy, batch_size, eps_ppo, lam, ent_coeff=0.0, critic_fit_params=None)[source]

Bases: mushroom_rl.core.agent.Agent

Proximal Policy Optimization algorithm. “Proximal Policy Optimization Algorithms”. Schulman J. et al., 2017.

__init__(mdp_info, policy, actor_optimizer, critic_params, n_epochs_policy, batch_size, eps_ppo, lam, ent_coeff=0.0, critic_fit_params=None)[source]

Constructor.

Parameters:
  • policy (TorchPolicy) – torch policy to be learned by the algorithm;
  • actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
  • critic_params (dict) – parameters of the critic approximator to build;
  • n_epochs_policy ([int, Parameter]) – number of policy updates for every dataset;
  • batch_size ([int, Parameter]) – size of minibatches for every optimization step;
  • eps_ppo ([float, Parameter]) – value for probability ratio clipping;
  • lam ([float, Parameter], 1.) – lambda coefficient used by generalized advantage estimation;
  • ent_coeff ([float, Parameter], 0.) – coefficient for the entropy regularization term;
  • critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
fit(dataset)[source]

Fit step.

Parameters:dataset (list) – the dataset.
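Since the constructor mirrors the TRPO one, a hedged sketch can reuse the mdp, policy and critic_params objects built in the TRPO example above; the clipping value, epoch count and optimizer settings are again illustrative.

import torch.optim as optim

from mushroom_rl.algorithms.actor_critic import PPO
from mushroom_rl.core import Core

# mdp, policy and critic_params are the objects from the TRPO sketch above.
agent = PPO(mdp.info, policy,
            actor_optimizer={'class': optim.Adam, 'params': {'lr': 3e-4}},
            critic_params=critic_params,
            n_epochs_policy=4, batch_size=64, eps_ppo=.2, lam=.95)

core = Core(agent, mdp)
core.learn(n_steps=3000, n_steps_per_fit=3000)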
_post_load()[source]

This method can be overwritten to implement logic that is executed after the loading of the agent.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

Value-Based

TD

class SARSA(mdp_info, policy, learning_rate)[source]

Bases: mushroom_rl.algorithms.value.td.td.TD

SARSA algorithm.

__init__(mdp_info, policy, learning_rate)[source]

Constructor.

Parameters:
  • approximator (object) – the approximator to use to fit the Q-function;
  • learning_rate (Parameter) – the learning rate.
_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:
  • state (np.ndarray) – state;
  • action (np.ndarray) – action;
  • reward (np.ndarray) – reward;
  • next_state (np.ndarray) – next state;
  • absorbing (np.ndarray) – absorbing flag.
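The update performed by this method corresponds to the standard tabular SARSA rule; the snippet below is a schematic rendition of that rule, not the library's internal code.

import numpy as np

# Schematic tabular SARSA update: the next action a_next is drawn from the
# same policy that is being learned (on-policy bootstrap).
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, absorbing):
    target = r if absorbing else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

Q = np.zeros((9, 4))                  # e.g. a 3x3 grid with 4 actions
sarsa_update(Q, s=0, a=1, r=0., s_next=3, a_next=2,
             alpha=.6, gamma=.9, absorbing=False)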
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _parse(dataset)

Utility to parse the dataset that is supposed to contain only a sample.

Parameters:dataset (list) – the current episode step.
Returns:A tuple containing state, action, reward, next state, absorbing and last flag.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class SARSALambda(mdp_info, policy, learning_rate, lambda_coeff, trace='replacing')[source]

Bases: mushroom_rl.algorithms.value.td.td.TD

The SARSA(lambda) algorithm for finite MDPs.

__init__(mdp_info, policy, learning_rate, lambda_coeff, trace='replacing')[source]

Constructor.

Parameters:
  • lambda_coeff ([float, Parameter]) – eligibility trace coefficient;
  • trace (str, 'replacing') – type of eligibility trace to use.
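To clarify the roles of lambda_coeff and the trace type, the following schematic step shows a replacing-trace update; it illustrates the textbook rule, not the library's implementation.

import numpy as np

# Schematic SARSA(lambda) step with an eligibility trace.
def sarsa_lambda_update(Q, e, s, a, r, s_next, a_next,
                        alpha, gamma, lambda_coeff, trace='replacing'):
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    if trace == 'replacing':
        e[s, a] = 1.                  # replace the trace of the visited pair
    else:                             # e.g. an accumulating trace
        e[s, a] += 1.
    Q += alpha * delta * e            # every pair is updated through its trace
    e *= gamma * lambda_coeff         # decay all traces

Q, e = np.zeros((9, 4)), np.zeros((9, 4))
sarsa_lambda_update(Q, e, s=0, a=1, r=1., s_next=3, a_next=2,
                    alpha=.1, gamma=.9, lambda_coeff=.9)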
_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:
  • state (np.ndarray) – state;
  • action (np.ndarray) – action;
  • reward (np.ndarray) – reward;
  • next_state (np.ndarray) – next state;
  • absorbing (np.ndarray) – absorbing flag.
episode_start()[source]

Called by the agent when a new episode starts.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _parse(dataset)

Utility to parse the dataset that is supposed to contain only a sample.

Parameters:dataset (list) – the current episode step.
Returns:A tuple containing state, action, reward, next state, absorbing and last flag.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class ExpectedSARSA(mdp_info, policy, learning_rate)[source]

Bases: mushroom_rl.algorithms.value.td.td.TD

Expected SARSA algorithm. “A theoretical and empirical analysis of Expected Sarsa”. Seijen H. V. et al., 2009.

__init__(mdp_info, policy, learning_rate)[source]

Constructor.

Parameters:
  • approximator (object) – the approximator to use to fit the Q-function;
  • learning_rate (Parameter) – the learning rate.
_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:
  • state (np.ndarray) – state;
  • action (np.ndarray) – action;
  • reward (np.ndarray) – reward;
  • next_state (np.ndarray) – next state;
  • absorbing (np.ndarray) – absorbing flag.
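The distinguishing feature of Expected SARSA is its bootstrap target, which averages the next Q-values under the policy instead of using a sampled next action; the snippet below illustrates that target with hypothetical epsilon-greedy probabilities.

import numpy as np

# Schematic Expected SARSA target (illustrative only).
def expected_sarsa_target(Q, s_next, r, gamma, pi_next):
    # pi_next[a] is the probability the policy assigns to action a in s_next.
    return r + gamma * np.dot(pi_next, Q[s_next])

Q = np.zeros((9, 4))
pi_next = np.array([.85, .05, .05, .05])   # hypothetical epsilon-greedy probs
target = expected_sarsa_target(Q, s_next=3, r=1., gamma=.9, pi_next=pi_next)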
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _parse(dataset)

Utility to parse the dataset that is supposed to contain only a sample.

Parameters:dataset (list) – the current episode step.
Returns:A tuple containing state, action, reward, next state, absorbing and last flag.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class QLearning(mdp_info, policy, learning_rate)[source]

Bases: mushroom_rl.algorithms.value.td.td.TD

Q-Learning algorithm. “Learning from Delayed Rewards”. Watkins C.J.C.H., 1989.

__init__(mdp_info, policy, learning_rate)[source]

Constructor.

Parameters:
  • approximator (object) – the approximator to use to fit the Q-function;
  • learning_rate (Parameter) – the learning rate.
_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:
  • state (np.ndarray) – state;
  • action (np.ndarray) – action;
  • reward (np.ndarray) – reward;
  • next_state (np.ndarray) – next state;
  • absorbing (np.ndarray) – absorbing flag.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _parse(dataset)

Utility to parse the dataset that is supposed to contain only a sample.

Parameters:dataset (list) – the current episode step.
Returns:A tuple containing state, action, reward, next state, absorbing and last flag.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class QLambda(mdp_info, policy, learning_rate, lambda_coeff, trace='replacing')[source]

Bases: mushroom_rl.algorithms.value.td.td.TD

Q(Lambda) algorithm. “Learning from Delayed Rewards”. Watkins C.J.C.H., 1989.

__init__(mdp_info, policy, learning_rate, lambda_coeff, trace='replacing')[source]

Constructor.

Parameters:
  • lambda_coeff ([float, Parameter]) – eligibility trace coefficient;
  • trace (str, 'replacing') – type of eligibility trace to use.
_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:
  • state (np.ndarray) – state;
  • action (np.ndarray) – action;
  • reward (np.ndarray) – reward;
  • next_state (np.ndarray) – next state;
  • absorbing (np.ndarray) – absorbing flag.
episode_start()[source]

Called by the agent when a new episode starts.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _parse(dataset)

Utility to parse the dataset that is supposed to contain only a sample.

Parameters:dataset (list) – the current episode step.
Returns:A tuple containing state, action, reward, next state, absorbing and last flag.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class DoubleQLearning(mdp_info, policy, learning_rate)[source]

Bases: mushroom_rl.algorithms.value.td.td.TD

Double Q-Learning algorithm. “Double Q-Learning”. Hasselt H. V., 2010.

__init__(mdp_info, policy, learning_rate)[source]

Constructor.

Parameters:
  • approximator (object) – the approximator to use to fit the Q-function;
  • learning_rate (Parameter) – the learning rate.
_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:
  • state (np.ndarray) – state;
  • action (np.ndarray) – action;
  • reward (np.ndarray) – reward;
  • next_state (np.ndarray) – next state;
  • absorbing (np.ndarray) – absorbing flag.
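As a schematic illustration of the double estimator, the step below keeps two Q-tables, selects the greedy action with one and evaluates it with the other; it is a textbook rendition, not the library code.

import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, alpha, gamma, rng):
    if rng.random() < .5:
        a_star = np.argmax(QA[s_next])                 # select with table A...
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
    else:
        a_star = np.argmax(QB[s_next])                 # ...or with table B
        QB[s, a] += alpha * (r + gamma * QA[s_next, a_star] - QB[s, a])

QA, QB = np.zeros((9, 4)), np.zeros((9, 4))
double_q_update(QA, QB, s=0, a=1, r=1., s_next=3, alpha=.1, gamma=.9,
                rng=np.random.default_rng(0))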
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _parse(dataset)

Utility to parse the dataset that is supposed to contain only a sample.

Parameters:dataset (list) – the current episode step.
Returns:A tuple containing state, action, reward, next state, absorbing and last flag.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class SpeedyQLearning(mdp_info, policy, learning_rate)[source]

Bases: mushroom_rl.algorithms.value.td.td.TD

Speedy Q-Learning algorithm. “Speedy Q-Learning”. Ghavamzadeh et al., 2011.

__init__(mdp_info, policy, learning_rate)[source]

Constructor.

Parameters:
  • approximator (object) – the approximator to use to fit the Q-function;
  • learning_rate (Parameter) – the learning rate.
_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:
  • state (np.ndarray) – state;
  • action (np.ndarray) – action;
  • reward (np.ndarray) – reward;
  • next_state (np.ndarray) – next state;
  • absorbing (np.ndarray) – absorbing flag.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _parse(dataset)

Utility to parse the dataset that is supposed to contain only a sample.

Parameters:dataset (list) – the current episode step.
Returns:A tuple containing state, action, reward, next state, absorbing and last flag.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class RLearning(mdp_info, policy, learning_rate, beta)[source]

Bases: mushroom_rl.algorithms.value.td.td.TD

R-Learning algorithm. “A Reinforcement Learning Method for Maximizing Undiscounted Rewards”. Schwartz A., 1993.

__init__(mdp_info, policy, learning_rate, beta)[source]

Constructor.

Parameters:beta ([float, Parameter]) – beta coefficient.
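For orientation, the textbook form of the R-learning step is sketched below: beta controls how quickly the average-reward estimate rho is tracked. This illustrates the published rule, not the library's internal code.

import numpy as np

def r_learning_update(Q, rho, s, a, r, s_next, alpha, beta):
    delta = r - rho + np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * delta
    if Q[s, a] == np.max(Q[s]):       # greedy action taken: refresh rho
        rho += beta * (r + np.max(Q[s_next]) - np.max(Q[s]) - rho)
    return rho

Q, rho = np.zeros((9, 4)), 0.
rho = r_learning_update(Q, rho, s=0, a=1, r=1., s_next=3, alpha=.1, beta=.01)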
_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:
  • state (np.ndarray) – state;
  • action (np.ndarray) – action;
  • reward (np.ndarray) – reward;
  • next_state (np.ndarray) – next state;
  • absorbing (np.ndarray) – absorbing flag.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _parse(dataset)

Utility to parse the dataset that is supposed to contain only a sample.

Parameters:dataset (list) – the current episode step.
Returns:A tuple containing state, action, reward, next state, absorbing and last flag.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class WeightedQLearning(mdp_info, policy, learning_rate, sampling=True, precision=1000)[source]

Bases: mushroom_rl.algorithms.value.td.td.TD

Weighted Q-Learning algorithm. “Estimating the Maximum Expected Value through Gaussian Approximation”. D’Eramo C. et al., 2016.

__init__(mdp_info, policy, learning_rate, sampling=True, precision=1000)[source]

Constructor.

Parameters:
  • sampling (bool, True) – use the approximated version to speed up the computation;
  • precision (int, 1000) – number of samples to use in the approximated version.
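To give an intuition for the sampling and precision arguments, the snippet below sketches the weighted maximum estimator: each action value is modelled as a Gaussian and weighted by its (sampled) probability of being the maximum. The Gaussian parameters here are made up for illustration.

import numpy as np

def weighted_max_estimate(means, sigmas, precision=1000, rng=None):
    rng = rng or np.random.default_rng()
    samples = rng.normal(means, sigmas, size=(precision, len(means)))
    w = np.bincount(np.argmax(samples, axis=1),
                    minlength=len(means)) / precision   # P(a is the max)
    return np.dot(w, means)

means = np.array([1., .8, .5])
sigmas = np.array([.3, .2, .5])
print(weighted_max_estimate(means, sigmas, precision=1000))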
_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:
  • state (np.ndarray) – state;
  • action (np.ndarray) – action;
  • reward (np.ndarray) – reward;
  • next_state (np.ndarray) – next state;
  • absorbing (np.ndarray) – absorbing flag.
_next_q(next_state)[source]
Parameters:next_state (np.ndarray) – the state where the next action has to be evaluated.
Returns:The weighted estimator value in next_state.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _parse(dataset)

Utility to parse the dataset that is supposed to contain only a sample.

Parameters:dataset (list) – the current episode step.
Returns:A tuple containing state, action, reward, next state, absorbing and last flag.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class MaxminQLearning(mdp_info, policy, learning_rate, n_tables)[source]

Bases: mushroom_rl.algorithms.value.td.td.TD

Maxmin Q-Learning algorithm without replay memory. “Maxmin Q-learning: Controlling the Estimation Bias of Q-learning”. Lan Q. et al., 2019.

__init__(mdp_info, policy, learning_rate, n_tables)[source]

Constructor.

Parameters:n_tables (int) – number of tables in the ensemble.
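The role of n_tables can be illustrated with the maxmin bootstrap target: the ensemble is reduced with an elementwise minimum before taking the greedy value. The sketch below is schematic; in the algorithm only one randomly chosen table is updated with this target at each step.

import numpy as np

def maxmin_target(tables, s_next, r, gamma):
    q_min = np.min([Q[s_next] for Q in tables], axis=0)   # pessimistic ensemble
    return r + gamma * np.max(q_min)

tables = [np.zeros((9, 4)) for _ in range(3)]              # n_tables = 3
target = maxmin_target(tables, s_next=3, r=1., gamma=.9)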
_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:
  • state (np.ndarray) – state;
  • action (np.ndarray) – action;
  • reward (np.ndarray) – reward;
  • next_state (np.ndarray) – next state;
  • absorbing (np.ndarray) – absorbing flag.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _parse(dataset)

Utility to parse the dataset that is supposed to contain only a sample.

Parameters:dataset (list) – the current episode step.
Returns:A tuple containing state, action, reward, next state, absorbing and last flag.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class RQLearning(mdp_info, policy, learning_rate, off_policy=False, beta=None, delta=None)[source]

Bases: mushroom_rl.algorithms.value.td.td.TD

RQ-Learning algorithm. “Exploiting Structure and Uncertainty of Bellman Updates in Markov Decision Processes”. Tateo D. et al., 2017.

__init__(mdp_info, policy, learning_rate, off_policy=False, beta=None, delta=None)[source]

Constructor.

Parameters:
  • off_policy (bool, False) – whether to use the off policy setting or the online one;
  • beta ([float, Parameter], None) – beta coefficient;
  • delta ([float, Parameter], None) – delta coefficient.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _parse(dataset)

Utility to parse the dataset that is supposed to contain only a sample.

Parameters:dataset (list) – the current episode step.
Returns:A tuple containing state, action, reward, next state, absorbing and last flag.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:
  • state (np.ndarray) – state;
  • action (np.ndarray) – action;
  • reward (np.ndarray) – reward;
  • next_state (np.ndarray) – next state;
  • absorbing (np.ndarray) – absorbing flag.
copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

_next_q(next_state)[source]
Parameters:next_state (np.ndarray) – the state where the next action has to be evaluated.
Returns:The weighted estimator value in ‘next_state’.
class SARSALambdaContinuous(mdp_info, policy, approximator, learning_rate, lambda_coeff, features, approximator_params=None)[source]

Bases: mushroom_rl.algorithms.value.td.td.TD

Continuous version of the SARSA(lambda) algorithm.

__init__(mdp_info, policy, approximator, learning_rate, lambda_coeff, features, approximator_params=None)[source]

Constructor.

Parameters:lambda_coeff ([float, Parameter]) – eligibility trace coefficient.
_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:
  • state (np.ndarray) – state;
  • action (np.ndarray) – action;
  • reward (np.ndarray) – reward;
  • next_state (np.ndarray) – next state;
  • absorbing (np.ndarray) – absorbing flag.
episode_start()[source]

Called by the agent when a new episode starts.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _parse(dataset)

Utility to parse the dataset that is supposed to contain only a sample.

Parameters:dataset (list) – the current episode step.
Returns:A tuple containing state, action, reward, next state, absorbing and last flag.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class TrueOnlineSARSALambda(mdp_info, policy, learning_rate, lambda_coeff, features, approximator_params=None)[source]

Bases: mushroom_rl.algorithms.value.td.td.TD

True Online SARSA(lambda) with linear function approximation. “True Online TD(lambda)”. Seijen H. V. et al., 2014.

__init__(mdp_info, policy, learning_rate, lambda_coeff, features, approximator_params=None)[source]

Constructor.

Parameters:lambda_coeff ([float, Parameter]) – eligibility trace coefficient.
_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:
  • state (np.ndarray) – state;
  • action (np.ndarray) – action;
  • reward (np.ndarray) – reward;
  • next_state (np.ndarray) – next state;
  • absorbing (np.ndarray) – absorbing flag.
episode_start()[source]

Called by the agent when a new episode starts.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _parse(dataset)

Utility to parse the dataset that is supposed to contain only a sample.

Parameters:dataset (list) – the current episode step.
Returns:A tuple containing state, action, reward, next state, absorbing and last flag.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

Batch TD

class FQI(mdp_info, policy, approximator, n_iterations, approximator_params=None, fit_params=None, quiet=False)[source]

Bases: mushroom_rl.algorithms.value.batch_td.batch_td.BatchTD

Fitted Q-Iteration algorithm. “Tree-Based Batch Mode Reinforcement Learning”. Ernst D. et al., 2005.

__init__(mdp_info, policy, approximator, n_iterations, approximator_params=None, fit_params=None, quiet=False)[source]

Constructor.

Parameters:
  • n_iterations ([int, Parameter]) – number of iterations to perform for training;
  • quiet (bool, False) – whether to show the progress bar or not.
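A hedged end-to-end sketch of a batch FQI run is given below; the CarOnHill task, the extra-trees approximator and all hyperparameters are illustrative assumptions rather than prescribed settings.

from sklearn.ensemble import ExtraTreesRegressor

from mushroom_rl.algorithms.value import FQI
from mushroom_rl.core import Core
from mushroom_rl.environments import CarOnHill
from mushroom_rl.policy import EpsGreedy
from mushroom_rl.utils.parameters import Parameter

mdp = CarOnHill()
policy = EpsGreedy(epsilon=Parameter(value=1.))   # random data-collection policy

approximator_params = dict(input_shape=mdp.info.observation_space.shape,
                           n_actions=mdp.info.action_space.n,
                           n_estimators=50, min_samples_split=5)

agent = FQI(mdp.info, policy, ExtraTreesRegressor, n_iterations=20,
            approximator_params=approximator_params)

core = Core(agent, mdp)
core.learn(n_episodes=100, n_episodes_per_fit=100)   # collect, then fit offline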
fit(x)[source]

Fit step.

Parameters:dataset (list) – the dataset.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm.

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class DoubleFQI(mdp_info, policy, approximator, n_iterations, approximator_params=None, fit_params=None, quiet=False)[source]

Bases: mushroom_rl.algorithms.value.batch_td.fqi.FQI

Double Fitted Q-Iteration algorithm. “Estimating the Maximum Expected Value in Continuous Reinforcement Learning Problems”. D’Eramo C. et al., 2017.

__init__(mdp_info, policy, approximator, n_iterations, approximator_params=None, fit_params=None, quiet=False)[source]

Constructor.

Parameters:
  • n_iterations ([int, Parameter]) – number of iterations to perform for training;
  • quiet (bool, False) – whether to show the progress bar or not.
fit(x)[source]

Fit step.

Parameters:dataset (list) – the dataset.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class BoostedFQI(mdp_info, policy, approximator, n_iterations, approximator_params=None, fit_params=None, quiet=False)[source]

Bases: mushroom_rl.algorithms.value.batch_td.fqi.FQI

Boosted Fitted Q-Iteration algorithm. “Boosted Fitted Q-Iteration”. Tosatto S. et al., 2017.

__init__(mdp_info, policy, approximator, n_iterations, approximator_params=None, fit_params=None, quiet=False)[source]

Constructor.

Parameters:
  • n_iterations ([int, Parameter]) – number of iterations to perform for training;
  • quiet (bool, False) – whether to show the progress bar or not.
fit(x)[source]

Fit step.

Parameters:x (list) – the dataset.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class LSPI(mdp_info, policy, approximator_params=None, epsilon=0.01, fit_params=None, features=None)[source]

Bases: mushroom_rl.algorithms.value.batch_td.batch_td.BatchTD

Least-Squares Policy Iteration algorithm. “Least-Squares Policy Iteration”. Lagoudakis M. G. and Parr R., 2003.

__init__(mdp_info, policy, approximator_params=None, epsilon=0.01, fit_params=None, features=None)[source]

Constructor.

Parameters:epsilon ([float, Parameter], 1e-2) – termination coefficient.
fit(dataset)[source]

Fit step.

Parameters:dataset (list) – the dataset.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger)

Setter that can be used to pass a logger to the algorithm

Parameters:logger (Logger) – the logger to be used by the algorithm.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

DQN

class AbstractDQN(mdp_info, policy, approximator, approximator_params, batch_size, target_update_frequency, replay_memory=None, initial_replay_size=500, max_replay_size=5000, fit_params=None, predict_params=None, clip_reward=False)[source]

Bases: mushroom_rl.core.agent.Agent

__init__(mdp_info, policy, approximator, approximator_params, batch_size, target_update_frequency, replay_memory=None, initial_replay_size=500, max_replay_size=5000, fit_params=None, predict_params=None, clip_reward=False)[source]

Constructor.

Parameters:
  • approximator (object) – the approximator to use to fit the Q-function;
  • approximator_params (dict) – parameters of the approximator to build;
  • batch_size ([int, Parameter]) – the number of samples in a batch;
  • target_update_frequency (int) – the number of samples collected between each update of the target network;
  • replay_memory ([ReplayMemory, PrioritizedReplayMemory], None) – the object of the replay memory to use; if None, a default replay memory is created;
  • initial_replay_size (int) – the number of samples to collect before starting the learning;
  • max_replay_size (int) – the maximum number of samples in the replay memory;
  • fit_params (dict, None) – parameters of the fitting algorithm of the approximator;
  • predict_params (dict, None) – parameters for the prediction with the approximator;
  • clip_reward (bool, False) – whether to clip the reward or not.
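
AbstractDQN is not meant to be instantiated directly; concrete subclasses such as DQN take the same arguments. The following sketch is indicative only: Network stands for a user-defined torch.nn.Module, mdp for an existing environment, and all hyperparameter values are illustrative.

import torch.optim as optim
import torch.nn.functional as F

from mushroom_rl.algorithms.value import DQN
from mushroom_rl.approximators.parametric import TorchApproximator
from mushroom_rl.policy import EpsGreedy
from mushroom_rl.utils.parameters import Parameter

# 'mdp' and 'Network' are assumed to be defined elsewhere.
policy = EpsGreedy(epsilon=Parameter(value=.1))
approximator_params = dict(network=Network,
                           input_shape=mdp.info.observation_space.shape,
                           output_shape=(mdp.info.action_space.n,),
                           n_actions=mdp.info.action_space.n,
                           optimizer={'class': optim.Adam,
                                      'params': {'lr': 1e-3}},
                           loss=F.smooth_l1_loss)
agent = DQN(mdp.info, policy, TorchApproximator,
            approximator_params=approximator_params,
            batch_size=32, target_update_frequency=250,
            initial_replay_size=500, max_replay_size=5000)
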
fit(dataset)[source]

Fit step.

Parameters:dataset (list) – the dataset.
draw_action(state)[source]

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
_update_target()[source]

Update the target network.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_next_q(next_state, absorbing)[source]
Parameters:
  • next_state (np.ndarray) – the states where next action has to be evaluated;
  • absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns:

Maximum action-value for each state in next_state.

copy()
Returns:A deepcopy of the agent.
episode_start()

Called by the agent when a new episode starts.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

_post_load()[source]

This method can be overwritten to implement logic that is executed after the loading of the agent.

set_logger(logger, loss_filename='loss_Q')[source]

Setter that can be used to pass a logger to the algorithm

Parameters:
  • logger (Logger) – the logger to be used by the algorithm;
  • loss_filename (str, 'loss_Q') – optional string to specify the loss filename.
class DQN(mdp_info, policy, approximator, approximator_params, batch_size, target_update_frequency, replay_memory=None, initial_replay_size=500, max_replay_size=5000, fit_params=None, predict_params=None, clip_reward=False)[source]

Bases: mushroom_rl.algorithms.value.dqn.abstract_dqn.AbstractDQN

Deep Q-Network algorithm. “Human-Level Control Through Deep Reinforcement Learning”. Mnih V. et al., 2015.

_next_q(next_state, absorbing)[source]
Parameters:
  • next_state (np.ndarray) – the states where next action has to be evaluated;
  • absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns:

Maximum action-value for each state in next_state.

__init__(mdp_info, policy, approximator, approximator_params, batch_size, target_update_frequency, replay_memory=None, initial_replay_size=500, max_replay_size=5000, fit_params=None, predict_params=None, clip_reward=False)

Constructor.

Parameters:
  • approximator (object) – the approximator to use to fit the Q-function;
  • approximator_params (dict) – parameters of the approximator to build;
  • batch_size ([int, Parameter]) – the number of samples in a batch;
  • target_update_frequency (int) – the number of samples collected between each update of the target network;
  • replay_memory ([ReplayMemory, PrioritizedReplayMemory], None) – the object of the replay memory to use; if None, a default replay memory is created;
  • initial_replay_size (int) – the number of samples to collect before starting the learning;
  • max_replay_size (int) – the maximum number of samples in the replay memory;
  • fit_params (dict, None) – parameters of the fitting algorithm of the approximator;
  • predict_params (dict, None) – parameters for the prediction with the approximator;
  • clip_reward (bool, False) – whether to clip the reward or not.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

_update_target()

Update the target network.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger, loss_filename='loss_Q')

Setter that can be used to pass a logger to the algorithm

Parameters:
  • logger (Logger) – the logger to be used by the algorithm;
  • loss_filename (str, 'loss_Q') – optional string to specify the loss filename.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class DoubleDQN(mdp_info, policy, approximator, approximator_params, batch_size, target_update_frequency, replay_memory=None, initial_replay_size=500, max_replay_size=5000, fit_params=None, predict_params=None, clip_reward=False)[source]

Bases: mushroom_rl.algorithms.value.dqn.dqn.DQN

Double DQN algorithm. “Deep Reinforcement Learning with Double Q-Learning”. Hasselt H. V. et al., 2016.

_next_q(next_state, absorbing)[source]
Parameters:
  • next_state (np.ndarray) – the states where next action has to be evaluated;
  • absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns:

Maximum action-value for each state in next_state.

__init__(mdp_info, policy, approximator, approximator_params, batch_size, target_update_frequency, replay_memory=None, initial_replay_size=500, max_replay_size=5000, fit_params=None, predict_params=None, clip_reward=False)

Constructor.

Parameters:
  • approximator (object) – the approximator to use to fit the Q-function;
  • approximator_params (dict) – parameters of the approximator to build;
  • batch_size ([int, Parameter]) – the number of samples in a batch;
  • target_update_frequency (int) – the number of samples collected between each update of the target network;
  • replay_memory ([ReplayMemory, PrioritizedReplayMemory], None) – the object of the replay memory to use; if None, a default replay memory is created;
  • initial_replay_size (int) – the number of samples to collect before starting the learning;
  • max_replay_size (int) – the maximum number of samples in the replay memory;
  • fit_params (dict, None) – parameters of the fitting algorithm of the approximator;
  • predict_params (dict, None) – parameters for the prediction with the approximator;
  • clip_reward (bool, False) – whether to clip the reward or not.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

_update_target()

Update the target network.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger, loss_filename='loss_Q')

Setter that can be used to pass a logger to the algorithm

Parameters:
  • logger (Logger) – the logger to be used by the algorithm;
  • loss_filename (str, 'loss_Q') – optional string to specify the loss filename.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class AveragedDQN(mdp_info, policy, approximator, n_approximators, **params)[source]

Bases: mushroom_rl.algorithms.value.dqn.abstract_dqn.AbstractDQN

Averaged-DQN algorithm. “Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning”. Anschel O. et al., 2017.

__init__(mdp_info, policy, approximator, n_approximators, **params)[source]

Constructor.

Parameters:n_approximators (int) – the number of target approximators to store.
_update_target()[source]

Update the target network.

_next_q(next_state, absorbing)[source]
Parameters:
  • next_state (np.ndarray) – the states where next action has to be evaluated;
  • absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns:

Maximum action-value for each state in next_state.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger, loss_filename='loss_Q')

Setter that can be used to pass a logger to the algorithm

Parameters:
  • logger (Logger) – the logger to be used by the algorithm;
  • loss_filename (str, 'loss_Q') – optional string to specify the loss filename.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class MaxminDQN(mdp_info, policy, approximator, n_approximators, **params)[source]

Bases: mushroom_rl.algorithms.value.dqn.dqn.DQN

MaxminDQN algorithm. “Maxmin Q-learning: Controlling the Estimation Bias of Q-learning”. Lan Q. et al., 2020.

__init__(mdp_info, policy, approximator, n_approximators, **params)[source]

Constructor.

Parameters:n_approximators (int) – the number of approximators in the ensemble.
fit(dataset)[source]

Fit step.

Parameters:dataset (list) – the dataset.
_update_target()[source]

Update the target network.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_next_q(next_state, absorbing)
Parameters:
  • next_state (np.ndarray) – the states where next action has to be evaluated;
  • absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns:

Maximum action-value for each state in next_state.

_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger, loss_filename='loss_Q')

Setter that can be used to pass a logger to the algorithm

Parameters:
  • logger (Logger) – the logger to be used by the algorithm;
  • loss_filename (str, 'loss_Q') – optional string to specify the loss filename.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class DuelingDQN(mdp_info, policy, approximator_params, avg_advantage=True, **params)[source]

Bases: mushroom_rl.algorithms.value.dqn.dqn.DQN

Dueling DQN algorithm. “Dueling Network Architectures for Deep Reinforcement Learning”. Wang Z. et al., 2016.

__init__(mdp_info, policy, approximator_params, avg_advantage=True, **params)[source]

Constructor.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_next_q(next_state, absorbing)
Parameters:
  • next_state (np.ndarray) – the states where next action has to be evaluated;
  • absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns:

Maximum action-value for each state in next_state.

_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

_update_target()

Update the target network.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger, loss_filename='loss_Q')

Setter that can be used to pass a logger to the algorithm

Parameters:
  • logger (Logger) – the logger to be used by the algorithm;
  • loss_filename (str, 'loss_Q') – optional string to specify the loss filename.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class CategoricalDQN(mdp_info, policy, approximator_params, n_atoms, v_min, v_max, **params)[source]

Bases: mushroom_rl.algorithms.value.dqn.abstract_dqn.AbstractDQN

Categorical DQN algorithm. “A Distributional Perspective on Reinforcement Learning”. Bellemare M. et al., 2017.

__init__(mdp_info, policy, approximator_params, n_atoms, v_min, v_max, **params)[source]

Constructor.

Parameters:
  • n_atoms (int) – number of atoms;
  • v_min (float) – minimum value of value-function;
  • v_max (float) – maximum value of value-function.
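
The atoms define a fixed, evenly spaced support for the return distribution, in the spirit of the C51 construction. The snippet below only illustrates how n_atoms, v_min and v_max relate; it is not the internal implementation of the class.

import numpy as np

n_atoms, v_min, v_max = 51, -10., 10.
z = np.linspace(v_min, v_max, n_atoms)      # atom locations
delta_z = (v_max - v_min) / (n_atoms - 1)   # spacing between adjacent atoms
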
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_next_q(next_state, absorbing)
Parameters:
  • next_state (np.ndarray) – the states where next action has to be evaluated;
  • absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns:

Maximum action-value for each state in next_state.

_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

_update_target()

Update the target network.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)[source]

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger, loss_filename='loss_Q')

Setter that can be used to pass a logger to the algorithm

Parameters:
  • logger (Logger) – the logger to be used by the algorithm;
  • loss_filename (str, 'loss_Q') – optional string to specify the loss filename.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class NoisyDQN(mdp_info, policy, approximator_params, **params)[source]

Bases: mushroom_rl.algorithms.value.dqn.dqn.DQN

Noisy DQN algorithm. “Noisy networks for exploration”. Fortunato M. et al., 2018.

__init__(mdp_info, policy, approximator_params, **params)[source]

Constructor.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_next_q(next_state, absorbing)
Parameters:
  • next_state (np.ndarray) – the states where next action has to be evaluated;
  • absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns:

Maximum action-value for each state in next_state.

_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

_update_target()

Update the target network.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger, loss_filename='loss_Q')

Setter that can be used to pass a logger to the algorithm

Parameters:
  • logger (Logger) – the logger to be used by the algorithm;
  • loss_filename (str, 'loss_Q') – optional string to specify the loss filename.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

class Rainbow(mdp_info, policy, approximator_params, n_atoms, v_min, v_max, n_steps_return, alpha_coeff, beta, sigma_coeff=0.5, **params)[source]

Bases: mushroom_rl.algorithms.value.dqn.abstract_dqn.AbstractDQN

Rainbow algorithm. “Rainbow: Combining Improvements in Deep Reinforcement Learning”. Hessel M. et al., 2018.

__init__(mdp_info, policy, approximator_params, n_atoms, v_min, v_max, n_steps_return, alpha_coeff, beta, sigma_coeff=0.5, **params)[source]

Constructor.

Parameters:
  • n_atoms (int) – number of atoms;
  • v_min (float) – minimum value of value-function;
  • v_max (float) – maximum value of value-function;
  • n_steps_return (int) – the number of steps to consider to compute the n-return;
  • alpha_coeff (float) – prioritization exponent for prioritized experience replay;
  • beta (Parameter) – importance sampling coefficient for prioritized experience replay;
  • sigma_coeff (float, .5) – sigma0 coefficient for noise initialization in noisy layers.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_next_q(next_state, absorbing)
Parameters:
  • next_state (np.ndarray) – the states where next action has to be evaluated;
  • absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns:

Maximum action-value for each state in next_state.

_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

_update_target()

Update the target network.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action to be executed.
episode_start()

Called by the agent when a new episode starts.

fit(dataset)[source]

Fit step.

Parameters:dataset (list) – the dataset.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger, loss_filename='loss_Q')

Setter that can be used to pass a logger to the algorithm

Parameters:
  • logger (Logger) – the logger to be used by the algorithm;
  • loss_filename (str, 'loss_Q') – optional string to specify the loss filename.
stop()

Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

Approximators

MushroomRL exposes the high-level class Regressor that can manage any type of function regressor. This class is a wrapper for any kind of function approximator, e.g. a scikit-learn approximator or a pytorch neural network.

Regressor

class Regressor(approximator, input_shape, output_shape=None, n_actions=None, n_models=None, **params)[source]

Bases: mushroom_rl.core.serialization.Serializable

This class implements the functionality to manage a function approximator. It selects the appropriate kind of regressor to build according to the parameters provided by the user; this makes it the only class to use for any kind of regression task. The implementation to build is inferred from the provided value of the parameter n_actions. If n_actions is provided, the user wants an approximator of the Q-function: if n_actions is equal to the output_shape, a QRegressor is created, else (output_shape should be (1,)) an ActionRegressor is created. If n_actions is not provided, a GenericRegressor is created. An Ensemble model can be used on top of any of the previous implementations, simply by providing an n_models parameter greater than 1.

__init__(approximator, input_shape, output_shape=None, n_actions=None, n_models=None, **params)[source]

Constructor.

Parameters:
  • approximator (class) – the approximator class to use to create the model;
  • input_shape (tuple) – the shape of the input of the model;
  • output_shape (tuple, None) – the shape of the output of the model;
  • n_actions (int, None) – number of actions considered to create a QRegressor or an ActionRegressor;
  • n_models (int, 1) – number of models to create;
  • **params – other parameters to create each model.
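
The selection logic described above can be summarized with a small sketch, using LinearApproximator only as a convenient example:

from mushroom_rl.approximators import Regressor
from mushroom_rl.approximators.parametric import LinearApproximator

# n_actions equal to the output shape -> a QRegressor with one output per action.
q = Regressor(LinearApproximator, input_shape=(4,), output_shape=(3,),
              n_actions=3)

# n_actions with output_shape=(1,) -> one ActionRegressor per action.
q_sep = Regressor(LinearApproximator, input_shape=(4,), output_shape=(1,),
                  n_actions=3)

# No n_actions -> a GenericRegressor for plain supervised regression.
f = Regressor(LinearApproximator, input_shape=(4,), output_shape=(1,))
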
__call__(*z, **predict_params)[source]

Call self as a function.

fit(*z, **fit_params)[source]

Fit the model.

Parameters:
  • *z – list of input of the model;
  • **fit_params – parameters to use to fit the model.
predict(*z, **predict_params)[source]

Predict the output of the model given an input.

Parameters:
  • *z – list of input of the model;
  • **predict_params – parameters to use to predict with the model.
Returns:

The model prediction.

model

Returns:The model object.
reset()[source]

Reset the model parameters.

input_shape

Returns:The shape of the input of the model.
output_shape

Returns:The shape of the output of the model.
weights_size

Returns:The shape of the weights of the model.
get_weights()[source]
Returns:The weights of the model.
set_weights(w)[source]
Parameters:w (list) – list of weights to be set in the model.
diff(*z)[source]
Parameters:*z – the input of the model.
Returns:The derivative of the model.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_logger(logger, loss_filename=None)[source]

Setter that can be used to pass a logger to the regressor.

Parameters:
  • logger (Logger) – the logger to be used by the regressor;
  • loss_filename (str, None) – optional string to specify the loss filename.

Approximator

Linear
class LinearApproximator(weights=None, input_shape=None, output_shape=(1, ), **kwargs)[source]

Bases: mushroom_rl.core.serialization.Serializable

This class implements a linear approximator.

__init__(weights=None, input_shape=None, output_shape=(1, ), **kwargs)[source]

Constructor.

Parameters:
  • weights (np.ndarray) – array of weights to initialize the weights of the approximator;
  • input_shape (np.ndarray, None) – the shape of the input of the model;
  • output_shape (np.ndarray, (1,)) – the shape of the output of the model;
  • **kwargs – other params of the approximator.
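
A minimal standalone sketch with synthetic data (in practice the approximator is usually wrapped in a Regressor):

import numpy as np

from mushroom_rl.approximators.parametric import LinearApproximator

x = np.random.rand(100, 2)
y = x @ np.array([2., -1.]) + .01 * np.random.randn(100)

approximator = LinearApproximator(input_shape=(2,), output_shape=(1,))
approximator.fit(x, y)
y_hat = approximator.predict(x)
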
fit(x, y, **fit_params)[source]

Fit the model.

Parameters:
  • x (np.ndarray) – input;
  • y (np.ndarray) – target;
  • **fit_params – other parameters used by the fit method of the regressor.
predict(x, **predict_params)[source]

Predict.

Parameters:
  • x (np.ndarray) – input;
  • **predict_params – other parameters used by the predict method of the regressor.
Returns:

The predictions of the model.

weights_size

Returns:The size of the array of weights.
get_weights()[source]

Getter.

Returns:The set of weights of the approximator.
set_weights(w)[source]

Setter.

Parameters:w (np.ndarray) – the set of weights to set.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
diff(state, action=None)[source]

Compute the derivative of the output w.r.t. state, and action if provided.

Parameters:
  • state (np.ndarray) – the state;
  • action (np.ndarray, None) – the action.
Returns:

The derivative of the output w.r.t. state, and action if provided.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
Torch Approximator
class TorchApproximator(input_shape, output_shape, network, optimizer=None, loss=None, batch_size=0, n_fit_targets=1, use_cuda=False, reinitialize=False, dropout=False, quiet=True, **params)[source]

Bases: mushroom_rl.core.serialization.Serializable

Class to interface a pytorch model with the mushroom Regressor interface. This class implements all that is needed to use a generic pytorch model and to train it using a specified optimizer and objective function. This class also supports minibatches.

__init__(input_shape, output_shape, network, optimizer=None, loss=None, batch_size=0, n_fit_targets=1, use_cuda=False, reinitialize=False, dropout=False, quiet=True, **params)[source]

Constructor.

Parameters:
  • input_shape (tuple) – shape of the input of the network;
  • output_shape (tuple) – shape of the output of the network;
  • network (torch.nn.Module) – the network class to use;
  • optimizer (dict) – the optimizer used for every fit step;
  • loss (torch.nn.functional) – the loss function to optimize in the fit method;
  • batch_size (int, 0) – the size of each minibatch. If 0, the whole dataset is fed to the optimizer at each epoch;
  • n_fit_targets (int, 1) – the number of fit targets used by the fit method of the network;
  • use_cuda (bool, False) – if True, runs the network on the GPU;
  • reinitialize (bool, False) – if True, the approximator is reinitialized at every fit call. To perform the initialization, the weights_init method must be defined properly for the selected model network;
  • dropout (bool, False) – if True, dropout is applied only during train;
  • quiet (bool, True) – if False, shows two progress bars, one for epochs and one for the minibatches;
  • **params – dictionary of parameters needed to construct the network.
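
A hedged end-to-end sketch of a torch network wrapped by TorchApproximator: the network class, its n_features parameter and the data are all illustrative, and extra constructor arguments such as n_features are forwarded to the network through **params.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from mushroom_rl.approximators.parametric import TorchApproximator


class Network(nn.Module):
    def __init__(self, input_shape, output_shape, n_features, **kwargs):
        super().__init__()
        self._h1 = nn.Linear(input_shape[-1], n_features)
        self._h2 = nn.Linear(n_features, output_shape[0])

    def forward(self, x, **kwargs):
        return self._h2(torch.relu(self._h1(x.float())))


approximator = TorchApproximator(input_shape=(4,), output_shape=(2,),
                                 network=Network, n_features=32,
                                 optimizer={'class': optim.Adam,
                                            'params': {'lr': 1e-3}},
                                 loss=F.mse_loss, batch_size=64)

x = np.random.rand(256, 4).astype(np.float32)
y = np.random.rand(256, 2).astype(np.float32)
approximator.fit(x, y, n_epochs=10)
y_hat = approximator.predict(x)
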
predict(*args, output_tensor=False, **kwargs)[source]

Predict.

Parameters:
  • *args – input;
  • output_tensor (bool, False) – whether to return the output as tensor or not;
  • **kwargs – other parameters used by the predict method of the regressor.
Returns:

The predictions of the model.

fit(*args, n_epochs=None, weights=None, epsilon=None, patience=1, validation_split=1.0, **kwargs)[source]

Fit the model.

Parameters:
  • *args – input, where the last n_fit_targets elements are considered as the target, while the others are considered as input;
  • n_epochs (int, None) – the number of training epochs;
  • weights (np.ndarray, None) – the weights of each sample in the computation of the loss;
  • epsilon (float, None) – the coefficient used for early stopping;
  • patience (float, 1.) – the number of epochs to wait before stopping the learning if it is not improving;
  • validation_split (float, 1.) – the fraction of the dataset to use as training set;
  • **kwargs – other parameters used by the fit method of the regressor.
set_weights(weights)[source]

Setter.

Parameters:weights (np.ndarray) – the set of weights to set.
get_weights()[source]

Getter.

Returns:The set of weights of the approximator.
weights_size

Returns:The size of the array of weights.
diff(*args, **kwargs)[source]

Compute the derivative of the output w.r.t. state, and action if provided.

Parameters:
  • state (np.ndarray) – the state;
  • action (np.ndarray, None) – the action.
Returns:

The derivative of the output w.r.t. state, and action if provided.

_post_load()[source]

This method can be overwritten to implement logic that is executed after the loading of the agent.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after loading. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
copy()
Returns:A deepcopy of the agent.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.

Distributions

class Distribution[source]

Bases: mushroom_rl.core.serialization.Serializable

Interface for Distributions to represent a generic probability distribution. Probability distributions are often used by black-box optimization algorithms in order to perform exploration in the parameter space. In the literature, they are also known as high-level policies.

sample()[source]

Draw a sample from the distribution.

Returns:A random vector sampled from the distribution.
log_pdf(theta)[source]

Compute the logarithm of the probability density function in the specified point

Parameters:theta (np.ndarray) – the point where the log pdf is calculated
Returns:The value of the log pdf in the specified point.
__call__(theta)[source]

Compute the probability density function in the specified point

Parameters:theta (np.ndarray) – the point where the pdf is calculated
Returns:The value of the pdf in the specified point.
entropy()[source]

Compute the entropy of the distribution.

Returns:The value of the entropy of the distribution.
mle(theta, weights=None)[source]

Compute the (weighted) maximum likelihood estimate of the points, and update the distribution accordingly.

Parameters:
  • theta (np.ndarray) – a set of points, every row is a sample
  • weights (np.ndarray, None) – a vector of weights. If specified the weighted maximum likelihood estimate is computed instead of the plain maximum likelihood. The number of elements of this vector must be equal to the number of rows of the theta matrix.
diff_log(theta)[source]

Compute the derivative of the logarithm of the probability density function in the specified point.

Parameters:theta (np.ndarray) – the point where the gradient of the log pdf is computed.
Returns:

The gradient of the log pdf in the specified point.

diff(theta)[source]

Compute the derivative of the probability density function, in the specified point. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

\[\nabla_{\rho}p(\theta)=p(\theta)\nabla_{\rho}\log p(\theta)\]
Parameters:theta (np.ndarray) – the point where the gradient of the pdf is calculated.
Returns:

The gradient of the pdf in the specified point.

get_parameters()[source]

Getter.

Returns:The current distribution parameters.
set_parameters(rho)[source]

Setter.

Parameters:rho (np.ndarray) – the vector of the new parameters to be used by the distribution
parameters_size

Property.

Returns:The size of the distribution parameters.
__init__

Initialize self. See help(type(self)) for accurate signature.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent’s save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.

Gaussian

class GaussianDistribution(mu, sigma)[source]

Bases: mushroom_rl.distributions.distribution.Distribution

Gaussian distribution with fixed covariance matrix. The parameters vector represents only the mean.

__init__(mu, sigma)[source]

Constructor.

Parameters:
  • mu (np.ndarray) – initial mean of the distribution;
  • sigma (np.ndarray) – covariance matrix of the distribution.
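
A minimal usage sketch, assuming GaussianDistribution is importable from mushroom_rl.distributions:

import numpy as np
from mushroom_rl.distributions import GaussianDistribution

mu = np.zeros(2)
sigma = 1e-1 * np.eye(2)          # fixed covariance, not part of the parameters
dist = GaussianDistribution(mu, sigma)

theta = dist.sample()             # draw a parameter vector
print(dist(theta))                # pdf at theta
print(dist.log_pdf(theta))        # log-pdf at theta
print(dist.parameters_size)       # only the mean is learned
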
sample()[source]

Draw a sample from the distribution.

Returns:A random vector sampled from the distribution.
log_pdf(theta)[source]

Compute the logarithm of the probability density function in the specified point

Parameters:theta (np.ndarray) – the point where the log pdf is calculated
Returns:The value of the log pdf in the specified point.
__call__(theta)[source]

Compute the probability density function in the specified point

Parameters:theta (np.ndarray) – the point where the pdf is calculated
Returns:The value of the pdf in the specified point.
entropy()[source]

Compute the entropy of the distribution.

Returns:The value of the entropy of the distribution.
mle(theta, weights=None)[source]

Compute the (weighted) maximum likelihood estimate of the points, and update the distribution accordingly.

Parameters:
  • theta (np.ndarray) – a set of points, every row is a sample
  • weights (np.ndarray, None) – a vector of weights. If specified the weighted maximum likelihood estimate is computed instead of the plain maximum likelihood. The number of elements of this vector must be equal to the number of rows of the theta matrix.
diff_log(theta)[source]

Compute the derivative of the logarithm of the probability density function in the specified point.

Parameters:theta (np.ndarray) – the point where the gradient of the log pdf is computed.
Returns:

The gradient of the log pdf in the specified point.

get_parameters()[source]

Getter.

Returns:The current distribution parameters.
set_parameters(rho)[source]

Setter.

Parameters:rho (np.ndarray) – the vector of the new parameters to be used by the distribution
parameters_size

Property.

Returns:The size of the distribution parameters.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
diff(theta)

Compute the derivative of the probability density function at the specified point. Normally, it is computed using the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

\[\nabla_{\rho}p(\theta)=p(\theta)\nabla_{\rho}\log p(\theta)\]
Parameters:theta (np.ndarray) – the point where the gradient of the pdf is calculated.
Returns:

The gradient of the pdf in the specified point.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent’s save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
class GaussianDiagonalDistribution(mu, std)[source]

Bases: mushroom_rl.distributions.distribution.Distribution

Gaussian distribution with diagonal covariance matrix. The parameters vector represents the mean and the standard deviation for each dimension.

__init__(mu, std)[source]

Constructor.

Parameters:
  • mu (np.ndarray) – initial mean of the distribution;
  • std (np.ndarray) – initial vector of standard deviations for each variable of the distribution.
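
A minimal sketch of a weighted maximum likelihood update, assuming GaussianDiagonalDistribution is importable from mushroom_rl.distributions; the weights below are synthetic stand-ins for, e.g., rescaled episodic returns.

import numpy as np
from mushroom_rl.distributions import GaussianDiagonalDistribution

mu = np.zeros(3)
std = 0.5 * np.ones(3)
dist = GaussianDiagonalDistribution(mu, std)

# Weighted maximum likelihood update from a batch of sampled parameters.
thetas = np.array([dist.sample() for _ in range(100)])
weights = np.random.uniform(size=len(thetas))
dist.mle(thetas, weights)

print(dist.parameters_size)   # mean and standard deviation for each dimension
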
sample()[source]

Draw a sample from the distribution.

Returns:A random vector sampled from the distribution.
log_pdf(theta)[source]

Compute the logarithm of the probability density function in the specified point

Parameters:theta (np.ndarray) – the point where the log pdf is calculated
Returns:The value of the log pdf in the specified point.
__call__(theta)[source]

Compute the probability density function in the specified point

Parameters:theta (np.ndarray) – the point where the pdf is calculated
Returns:The value of the pdf in the specified point.
entropy()[source]

Compute the entropy of the distribution.

Returns:The value of the entropy of the distribution.
mle(theta, weights=None)[source]

Compute the (weighted) maximum likelihood estimate of the points, and update the distribution accordingly.

Parameters:
  • theta (np.ndarray) – a set of points, every row is a sample
  • weights (np.ndarray, None) – a vector of weights. If specified the weighted maximum likelihood estimate is computed instead of the plain maximum likelihood. The number of elements of this vector must be equal to the number of rows of the theta matrix.
diff_log(theta)[source]

Compute the derivative of the logarithm of the probability density function in the specified point.

Parameters:theta (np.ndarray) – the point where the gradient of the log pdf is computed.
Returns:

The gradient of the log pdf in the specified point.

get_parameters()[source]

Getter.

Returns:The current distribution parameters.
set_parameters(rho)[source]

Setter.

Parameters:rho (np.ndarray) – the vector of the new parameters to be used by the distribution
parameters_size

Property.

Returns:The size of the distribution parameters.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
diff(theta)

Compute the derivative of the probability density function at the specified point. Normally, it is computed using the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

\[\nabla_{\rho}p(\theta)=p(\theta)\nabla_{\rho}\log p(\theta)\]
Parameters:theta (np.ndarray) – the point where the gradient of the pdf is calculated.
Returns:

The gradient of the pdf in the specified point.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent’s save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
class GaussianCholeskyDistribution(mu, sigma)[source]

Bases: mushroom_rl.distributions.distribution.Distribution

Gaussian distribution with full covariance matrix. The parameters vector represents the mean and the Cholesky decomposition of the covariance matrix. This parametrization enforces the covariance matrix to be positive definite.

__init__(mu, sigma)[source]

Constructor.

Parameters:
  • mu (np.ndarray) – initial mean of the distribution;
  • sigma (np.ndarray) – initial covariance matrix of the distribution.
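
A minimal sketch of reading and writing the flat parameter vector (mean plus Cholesky factor), assuming GaussianCholeskyDistribution is importable from mushroom_rl.distributions:

import numpy as np
from mushroom_rl.distributions import GaussianCholeskyDistribution

mu = np.zeros(2)
sigma = np.diag([1., 2.])
dist = GaussianCholeskyDistribution(mu, sigma)

rho = dist.get_parameters()       # mean followed by the Cholesky factor entries
print(rho.shape, dist.parameters_size)

dist.set_parameters(rho)          # write the parameters back, e.g. after a
                                  # gradient step computed with diff_log()
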
sample()[source]

Draw a sample from the distribution.

Returns:A random vector sampled from the distribution.
log_pdf(theta)[source]

Compute the logarithm of the probability density function in the specified point

Parameters:theta (np.ndarray) – the point where the log pdf is calculated
Returns:The value of the log pdf in the specified point.
__call__(theta)[source]

Compute the probability density function in the specified point

Parameters:theta (np.ndarray) – the point where the pdf is calculated
Returns:The value of the pdf in the specified point.
entropy()[source]

Compute the entropy of the distribution.

Returns:The value of the entropy of the distribution.
mle(theta, weights=None)[source]

Compute the (weighted) maximum likelihood estimate of the points, and update the distribution accordingly.

Parameters:
  • theta (np.ndarray) – a set of points, every row is a sample
  • weights (np.ndarray, None) – a vector of weights. If specified the weighted maximum likelihood estimate is computed instead of the plain maximum likelihood. The number of elements of this vector must be equal to the number of rows of the theta matrix.
diff_log(theta)[source]

Compute the derivative of the logarithm of the probability density function in the specified point.

Parameters:theta (np.ndarray) – the point where the gradient of the log pdf is computed.
Returns:

The gradient of the log pdf in the specified point.

get_parameters()[source]

Getter.

Returns:The current distribution parameters.
set_parameters(rho)[source]

Setter.

Parameters:rho (np.ndarray) – the vector of the new parameters to be used by the distribution
parameters_size

Property.

Returns:The size of the distribution parameters.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
diff(theta)

Compute the derivative of the probability density function at the specified point. Normally, it is computed using the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

\[\nabla_{\rho}p(\theta)=p(\theta)\nabla_{\rho}\log p(\theta)\]
Parameters:theta (np.ndarray) – the point where the gradient of the pdf is calculated.
Returns:

The gradient of the pdf in the specified point.

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent’s save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.

Environments

In mushroom_rl we distinguish between two different types of environment classes:

  • proper environments
  • generators

While environments directly implement the Environment interface, generators are sets of methods used to generate finite Markov chains that represent a specific environment, e.g., grid worlds.
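
As a quick orientation, the sketch below shows the registration mechanism shared by the environments documented in this section (see list_registered() and make() in each class). The 'CarOnHill' name assumes environments are registered under their class name.

from mushroom_rl.core import Environment

print(Environment.list_registered())     # names of the registered environments

# Keyword arguments are forwarded to the environment generate() method, if
# available, or to its constructor otherwise.
mdp = Environment.make('CarOnHill', horizon=100, gamma=0.95)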

Environments

Atari
class MaxAndSkip(env, skip, max_pooling=True)[source]

Bases: gym.core.Wrapper

__init__(env, skip, max_pooling=True)[source]

Initialize self. See help(type(self)) for accurate signature.

step(action)[source]

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters:action (object) – an action provided by the agent
Returns:
  • observation (object) – agent’s observation of the current environment;
  • reward (float) – amount of reward returned after the previous action;
  • done (bool) – whether the episode has ended, in which case further step() calls will return undefined results;
  • info (dict) – auxiliary diagnostic information (helpful for debugging, logging, and sometimes learning).
reset(**kwargs)[source]

Resets the environment to an initial state and returns an initial observation.

This method should also reset the environment’s random number generator(s) if seed is an integer or if the environment has not yet initialized a random number generator. If the environment already has a random number generator and reset is called with seed=None, the RNG should not be reset. Moreover, reset should (in the typical use case) be called with an integer seed right after initialization and then never again.

Returns:
  • observation (object) – the initial observation;
  • info (dict, optional) – a dictionary containing extra information; only returned if return_info is set to True.
close()

Override close in your subclass to perform any necessary cleanup.

Environments will automatically close() themselves when garbage collected or when the program exits.

metadata

A dictionary containing the metadata of the environment, e.g. the supported render modes.
np_random

Initializes the np_random field if not done already.

render(mode='human', **kwargs)

Renders the environment.

The set of supported modes varies per environment. (And some third-party environments may not support rendering at all.) By convention, if mode is:

  • human: render to the current display or terminal and return nothing. Usually for human consumption.
  • rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
  • ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).

Note

Make sure that your class’s metadata ‘render_modes’ key includes the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
Parameters:mode (str) – the mode to render with

Example:

class MyEnv(Env):
    metadata = {'render_modes': ['human', 'rgb_array']}

    def render(self, mode='human'):
        if mode == 'rgb_array':
            return np.array(...)  # return RGB frame suitable for video
        elif mode == 'human':
            ...  # pop up a window and render
        else:
            super(MyEnv, self).render(mode=mode)  # just raise an exception
reward_range

A (min, max) tuple describing the range of possible rewards for the environment.

seed(seed=None)

Sets the seed for this env’s random number generator(s).

Note

Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.

Returns:The list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
Return type:list<bigint>
unwrapped

Completely unwrap this env.

Returns:The base non-wrapped gym.Env instance
Return type:gym.Env
class Atari(name, width=84, height=84, ends_at_life=False, max_pooling=True, history_length=4, max_no_op_actions=30)[source]

Bases: mushroom_rl.core.environment.Environment

The Atari environment as presented in: “Human-level control through deep reinforcement learning”. Mnih et al.. 2015.

__init__(name, width=84, height=84, ends_at_life=False, max_pooling=True, history_length=4, max_no_op_actions=30)[source]

Constructor.

Parameters:
  • name (str) – id name of the Atari game in Gym;
  • width (int, 84) – width of the screen;
  • height (int, 84) – height of the screen;
  • ends_at_life (bool, False) – whether the episode ends when a life is lost or not;
  • max_pooling (bool, True) – whether to do max-pooling or average-pooling of the last two frames when using NoFrameskip;
  • history_length (int, 4) – number of frames to form a state;
  • max_no_op_actions (int, 30) – maximum number of no-op actions to execute at the beginning of an episode.
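
A minimal construction sketch; 'BreakoutNoFrameskip-v4' is a standard Gym Atari id and assumes the Atari ROMs are installed (see the installation troubleshooting notes).

import numpy as np
from mushroom_rl.environments import Atari

mdp = Atari('BreakoutNoFrameskip-v4', width=84, height=84, ends_at_life=True)

state = mdp.reset()
action = np.array([0])                                 # no-op
next_state, reward, absorbing, info = mdp.step(action)
mdp.stop()
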
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available. Otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

set_episode_end(ends_at_life)[source]

Setter.

Parameters:ends_at_life (bool) – whether the episode ends when a life is lost or not.
Car on hill
class CarOnHill(horizon=100, gamma=0.95)[source]

Bases: mushroom_rl.core.environment.Environment

The Car On Hill environment as presented in: “Tree-Based Batch Mode Reinforcement Learning”. Ernst D. et al.. 2005.

__init__(horizon=100, gamma=0.95)[source]

Constructor.

Parameters:
  • horizon (int, 100) – horizon of the problem;
  • gamma (float, 0.95) – discount factor.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available. Otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

DeepMind Control Suite
class DMControl(domain_name, task_name, horizon=None, gamma=0.99, task_kwargs=None, dt=0.01, width_screen=480, height_screen=480, camera_id=0)[source]

Bases: mushroom_rl.core.environment.Environment

Interface for dm_control suite Mujoco environments. It makes it possible to use every dm_control suite Mujoco environment by just providing the necessary information.

__init__(domain_name, task_name, horizon=None, gamma=0.99, task_kwargs=None, dt=0.01, width_screen=480, height_screen=480, camera_id=0)[source]

Constructor.

Parameters:
  • domain_name (str) – name of the environment;
  • task_name (str) – name of the task of the environment;
  • horizon (int) – the horizon;
  • gamma (float) – the discount factor;
  • task_kwargs (dict, None) – parameters of the task;
  • dt (float, 0.01) – duration of a control step;
  • width_screen (int, 480) – width of the screen;
  • height_screen (int, 480) – height of the screen;
  • camera_id (int, 0) – position of camera to render the environment;
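
A minimal construction sketch, assuming dm_control is installed and DMControl is importable from mushroom_rl.environments; 'walker'/'walk' is a standard dm_control suite domain/task pair.

from mushroom_rl.environments import DMControl

mdp = DMControl('walker', 'walk', horizon=1000, gamma=0.99)
state = mdp.reset()
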
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available. Otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
Finite MDP
class FiniteMDP(p, rew, mu=None, gamma=0.9, horizon=inf)[source]

Bases: mushroom_rl.core.environment.Environment

Finite Markov Decision Process.

__init__(p, rew, mu=None, gamma=0.9, horizon=inf)[source]

Constructor.

Parameters:
  • p (np.ndarray) – transition probability matrix;
  • rew (np.ndarray) – reward matrix;
  • mu (np.ndarray, None) – initial state probability distribution;
  • gamma (float, 0.9) – discount factor;
  • horizon (int, np.inf) – the horizon.
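
A minimal sketch of a two-state, two-action chain; the p[state, action, next_state] indexing convention used below is an assumption for illustration.

import numpy as np
from mushroom_rl.environments import FiniteMDP

p = np.array([[[.9, .1], [.1, .9]],
              [[.9, .1], [.1, .9]]])    # transition probabilities
rew = np.zeros((2, 2, 2))
rew[:, :, 1] = 1.                       # reward for reaching state 1
mdp = FiniteMDP(p, rew, gamma=.9, horizon=100)
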
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available. Otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Grid World
class AbstractGridWorld(mdp_info, height, width, start, goal)[source]

Bases: mushroom_rl.core.environment.Environment

Abstract class to build a grid world.

__init__(mdp_info, height, width, start, goal)[source]

Constructor.

Parameters:
  • height (int) – height of the grid;
  • width (int) – width of the grid;
  • start (tuple) – x-y coordinates of the start;
  • goal (tuple) – x-y coordinates of the goal.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available. Otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

class GridWorld(height, width, goal, start=(0, 0))[source]

Bases: mushroom_rl.environments.grid_world.AbstractGridWorld

Standard grid world.

__init__(height, width, goal, start=(0, 0))[source]

Constructor.

Parameters:
  • height (int) – height of the grid;
  • width (int) – width of the grid;
  • start (tuple) – x-y coordinates of the start;
  • goal (tuple) – x-y coordinates of the goal.
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available. Otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.

reset(state=None)

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
step(action)

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

class GridWorldVanHasselt(height=3, width=3, goal=(0, 2), start=(2, 0))[source]

Bases: mushroom_rl.environments.grid_world.AbstractGridWorld

A variant of the grid world as presented in: “Double Q-Learning”. Hasselt H. V.. 2010.

__init__(height=3, width=3, goal=(0, 2), start=(2, 0))[source]

Constructor.

Parameters:
  • height (int) – height of the grid;
  • width (int) – width of the grid;
  • start (tuple) – x-y coordinates of the start;
  • goal (tuple) – x-y coordinates of the goal.
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available. Otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.

reset(state=None)

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
step(action)

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Gym
class Gym(name, horizon=None, gamma=0.99, wrappers=None, wrappers_args=None, **env_args)[source]

Bases: mushroom_rl.core.environment.Environment

Interface for OpenAI Gym environments. It makes it possible to use every Gym environment by just providing its id, except for the Atari games that are managed in a separate class.

__init__(name, horizon=None, gamma=0.99, wrappers=None, wrappers_args=None, **env_args)[source]

Constructor.

Parameters:
  • name (str) – gym id of the environment;
  • horizon (int) – the horizon. If None, use the one from Gym;
  • gamma (float, 0.99) – the discount factor;
  • wrappers – list of wrappers to apply over the environment. It is possible to pass arguments to the wrappers by providing a tuple with two elements: the gym wrapper class and a dictionary containing the parameters needed by the wrapper constructor;
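
A minimal construction sketch; the environment id ('Pendulum-v1' below) depends on the installed Gym version, and the commented wrapper line only illustrates the tuple format described above.

from mushroom_rl.environments import Gym

mdp = Gym('Pendulum-v1', horizon=200, gamma=.99)

# Wrappers may also be passed, optionally with their constructor arguments:
# mdp = Gym('Pendulum-v1', wrappers=[(gym.wrappers.TimeLimit,
#                                     dict(max_episode_steps=200))])
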
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available. Otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.

reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Inverted pendulum
class InvertedPendulum(random_start=False, m=1.0, l=1.0, g=9.8, mu=0.01, max_u=5.0, horizon=5000, gamma=0.99)[source]

Bases: mushroom_rl.core.environment.Environment

The Inverted Pendulum environment (continuous version) as presented in: “Reinforcement Learning In Continuous Time and Space”. Doya K.. 2000. “Off-Policy Actor-Critic”. Degris T. et al.. 2012. “Deterministic Policy Gradient Algorithms”. Silver D. et al. 2014.

__init__(random_start=False, m=1.0, l=1.0, g=9.8, mu=0.01, max_u=5.0, horizon=5000, gamma=0.99)[source]

Constructor.

Parameters:
  • random_start (bool, False) – whether to start from a random position or from the horizontal one;
  • m (float, 1.0) – mass of the pendulum;
  • l (float, 1.0) – length of the pendulum;
  • g (float, 9.8) – gravity acceleration constant;
  • mu (float, 1e-2) – friction constant of the pendulum;
  • max_u (float, 5.0) – maximum allowed input torque;
  • horizon (int, 5000) – horizon of the problem;
  • gamma (float, 0.99) – discount factor.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available. Otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
Cart Pole
class CartPole(m=2.0, M=8.0, l=0.5, g=9.8, mu=0.01, max_u=50.0, noise_u=10.0, horizon=3000, gamma=0.95)[source]

Bases: mushroom_rl.core.environment.Environment

The Inverted Pendulum on a Cart environment as presented in: “Least-Squares Policy Iteration”. Lagoudakis M. G. and Parr R.. 2003.

__init__(m=2.0, M=8.0, l=0.5, g=9.8, mu=0.01, max_u=50.0, noise_u=10.0, horizon=3000, gamma=0.95)[source]

Constructor.

Parameters:
  • m (float, 2.0) – mass of the pendulum;
  • M (float, 8.0) – mass of the cart;
  • l (float, 0.5) – length of the pendulum;
  • g (float, 9.8) – gravity acceleration constant;
  • max_u (float, 50.) – maximum allowed input torque;
  • noise_u (float, 10.) – maximum noise on the action;
  • horizon (int, 3000) – horizon of the problem;
  • gamma (float, 0.95) – discount factor.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available. Otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
LQR
class LQR(A, B, Q, R, max_pos=inf, max_action=inf, random_init=False, episodic=False, gamma=0.9, horizon=50, initial_state=None)[source]

Bases: mushroom_rl.core.environment.Environment

This class implements a Linear-Quadratic Regulator. This task aims to minimize the undesired deviations from nominal values of some controller settings in control problems. The system equations in this task are:

\[x_{t+1} = Ax_t + Bu_t\]

where x is the state and u is the control signal.

The reward function is given by:

\[r_t = -\left( x_t^TQx_t + u_t^TRu_t \right)\]

“Policy gradient approaches for multi-objective sequential decision making”. Parisi S., Pirotta M., Smacchia N., Bascetta L., Restelli M.. 2014

__init__(A, B, Q, R, max_pos=inf, max_action=inf, random_init=False, episodic=False, gamma=0.9, horizon=50, initial_state=None)[source]

Constructor.

Parameters:
  • A (np.ndarray) – the state dynamics matrix;
  • B (np.ndarray) – the action dynamics matrix;
  • Q (np.ndarray) – reward weight matrix for state;
  • R (np.ndarray) – reward weight matrix for action;
  • max_pos (float, np.inf) – maximum value of the state;
  • max_action (float, np.inf) – maximum value of the action;
  • random_init (bool, False) – start from a random state;
  • episodic (bool, False) – end the episode when the state goes over the threshold;
  • gamma (float, 0.9) – discount factor;
  • horizon (int, 50) – horizon of the mdp.
static generate(dimensions=None, s_dim=None, a_dim=None, max_pos=inf, max_action=inf, eps=0.1, index=0, scale=1.0, random_init=False, episodic=False, gamma=0.9, horizon=50, initial_state=None)[source]

Factory method that generates an lqr with identity dynamics and symmetric reward matrices.

Parameters:
  • dimensions (int) – number of state-action dimensions;
  • s_dim (int) – number of state dimensions;
  • a_dim (int) – number of action dimensions;
  • max_pos (float, np.inf) – maximum value of the state;
  • max_action (float, np.inf) – maximum value of the action;
  • eps (double, 0.1) – reward matrix weights specifier;
  • index (int, 0) – selector for the principal state;
  • scale (float, 1.0) – scaling factor for the reward function;
  • random_init (bool, False) – start from a random state;
  • episodic (bool, False) – end the episode when the state goes over the threshold;
  • gamma (float, 0.9) – discount factor;
  • horizon (int, 50) – horizon of the mdp.
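
A minimal sketch using the factory method above to build a 2-dimensional LQR with identity dynamics:

from mushroom_rl.environments import LQR

mdp = LQR.generate(dimensions=2, max_pos=10., max_action=5.,
                   gamma=.9, horizon=50)
print(mdp.info.gamma, mdp.info.horizon)
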
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available. Otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
stop()

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

Mujoco
class ObservationType[source]

Bases: enum.Enum

An enum indicating the type of data that should be added to the observation of the environment, can be Joint-/Body-/Site- positions and velocities.

class MuJoCo(file_name, actuation_spec, observation_spec, gamma, horizon, n_substeps=1, n_intermediate_steps=1, additional_data_spec=None, collision_groups=None)[source]

Bases: mushroom_rl.core.environment.Environment

Class to create a Mushroom environment using the MuJoCo simulator.

__init__(file_name, actuation_spec, observation_spec, gamma, horizon, n_substeps=1, n_intermediate_steps=1, additional_data_spec=None, collision_groups=None)[source]

Constructor.

Parameters:
  • file_name (string) – The path to the XML file with which the environment should be created;
  • actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent. Can be left empty when all actuators should be used;
  • observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). An entry in the list is given by: (name, type);
  • gamma (float) – The discounting factor of the environment;
  • horizon (int) – The maximum horizon for the environment;
  • n_substeps (int) – The number of substeps to use by the MuJoCo simulator. An action given by the agent will be applied for n_substeps before the agent receives the next observation and can act accordingly;
  • n_intermediate_steps (int) – The number of steps between every action taken by the agent. Similar to n_substeps but allows the user to modify, control and access intermediate states.
  • additional_data_spec (list) – A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, name, type) key is a string for later referencing in the “read_data” and “write_data” methods. The name is the name of the object in the XML specification and the type is the ObservationType;
  • collision_groups (list) – A list containing groups of geoms for which collisions should be checked during simulation via check_collision. The entries are given as: (key, geom_names), where key is a string for later referencing in the “check_collision” method, and geom_names is a list of geom names in the XML specification.
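
A rough subclassing sketch under stated assumptions: the import path, the XML file name, the joint name, and the ObservationType member names (JOINT_POS, JOINT_VEL) are illustrative and may differ in the installed version.

from mushroom_rl.environments.mujoco import MuJoCo, ObservationType


class MyRobot(MuJoCo):
    def __init__(self):
        actuation_spec = ['motor_joint']                   # hypothetical joint name
        observation_spec = [('motor_joint', ObservationType.JOINT_POS),
                            ('motor_joint', ObservationType.JOINT_VEL)]
        super().__init__('my_robot.xml', actuation_spec, observation_spec,
                         gamma=0.99, horizon=500, n_substeps=5)

    def _reward(self, state, action, next_state):
        return 0.                                          # task-specific reward

    def _is_absorbing(self, state):
        return False
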
seed(seed)[source]

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()[source]

Method used to stop an mdp. Useful when dealing with real world environments, simulators, or when using openai-gym rendering

_preprocess_action(action)[source]

Compute a transformation of the action provided to the environment.

Parameters:action (np.ndarray) – numpy array with the actions provided to the environment.
Returns:The action to be used for the current step
_step_init(state, action)[source]

Allows information to be initialized at the start of a step.

_compute_action(action)[source]

Compute a transformation of the action at every intermediate step. Useful to add control signals simulated directly in python.

Parameters:action (np.ndarray) – numpy array with the actions provided at every step.
Returns:The action to be set in the actual mujoco simulation.
_simulation_pre_step()[source]

Allows information to be accessed and changed at every intermediate step, before taking a step in the MuJoCo simulation. Can be useful to apply an external force/torque to the specified bodies, e.g. to apply a force over x to the torso:

force = [200, 0, 0]
torque = [0, 0, 0]
self.sim.data.xfrc_applied[self.sim.model._body_name2id["torso"], :] = force + torque
_simulation_post_step()[source]

Allows information to be accessed at every intermediate step, after taking a step in the MuJoCo simulation. Can be useful to average forces over all intermediate steps.
_step_finalize()[source]

Allows information to be accessed at the end of a step.

_read_data(name)[source]

Read data from the MuJoCo data structure.

Parameters:name (string) – A name referring to an entry contained in the additional_data_spec list handed to the constructor.
Returns:The desired data as a one-dimensional numpy array.
_write_data(name, value)[source]

Write data to the MuJoCo data structure.

Parameters:
  • name (string) – A name referring to an entry contained in the additional_data_spec list handed to the constructor;
  • value (ndarray) – The data that should be written.
_check_collision(group1, group2)[source]

Check for collision between the specified groups.

Parameters:
  • group1 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor;
  • group2 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor.
Returns:

A flag indicating whether a collision occurred between the given groups or not.

_get_collision_force(group1, group2)[source]

Returns the collision force and torques between the specified groups.

Parameters:
  • group1 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor;
  • group2 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor.
Returns:

A 6D vector specifying the collision forces/torques [3D force + 3D torque] between the given groups. Vector of 0’s in case there was no collision. See http://mujoco.org/book/programming.html#siContact

_reward(state, action, next_state)[source]

Compute the reward based on the given transition.

Parameters:
  • state (np.array) – the current state of the system;
  • action (np.array) – the action that is applied in the current state;
  • next_state (np.array) – the state reached after applying the given action.
Returns:

The reward as a floating point scalar value.

_is_absorbing(state)[source]

Check whether the given state is an absorbing state or not.

Parameters:state (np.array) – the state of the system.
Returns:A boolean flag indicating whether this state is absorbing or not.
_setup()[source]

A function that allows executing setup code after an environment reset.

_load_simulation(file_name, n_substeps)[source]

Load the MuJoCo model. Can be overridden to provide custom load functions.

Parameters:
  • file_name – the path to the XML file with which the environment should be created;
  • n_substeps – the number of substeps to use by the MuJoCo simulator.
Returns:The loaded MuJoCo model.
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.
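The following is a minimal subclassing sketch showing how the hooks above fit together. It is illustrative only: the XML file my_robot.xml, the actuator/body/geom names and the ObservationType members used here are hypothetical, and it assumes that the base constructor also takes the XML file name, the actuation and observation specifications, and the MDP discount factor and horizon before the keyword arguments listed above (import paths may differ across versions):

from mushroom_rl.environments.mujoco import MuJoCo, ObservationType


class MyRobotEnv(MuJoCo):
    def __init__(self, gamma=0.99, horizon=500):
        # Hypothetical actuator/body/geom names: adapt them to your XML model.
        actuation_spec = ["torso_motor"]
        observation_spec = [("torso_joint", ObservationType.JOINT_POS)]
        additional_data_spec = [("torso_pos", "torso", ObservationType.BODY_POS)]
        collision_groups = [("robot", ["torso_geom"]),
                            ("floor", ["floor_geom"])]

        super().__init__("my_robot.xml", actuation_spec, observation_spec,
                         gamma, horizon, n_substeps=5,
                         additional_data_spec=additional_data_spec,
                         collision_groups=collision_groups)

    def _reward(self, state, action, next_state):
        # Reward the forward (x) position of the torso, read via additional_data_spec.
        return self._read_data("torso_pos")[0]

    def _is_absorbing(self, state):
        # Terminate the episode when the robot touches the floor.
        return self._check_collision("robot", "floor")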

Puddle World
class PuddleWorld(start=None, goal=None, goal_threshold=0.1, noise_step=0.025, noise_reward=0, reward_goal=0.0, thrust=0.05, puddle_center=None, puddle_width=None, gamma=0.99, horizon=5000)[source]

Bases: mushroom_rl.core.environment.Environment

Puddle world as presented in: “Off-Policy Actor-Critic”. Degris T. et al.. 2012.

__init__(start=None, goal=None, goal_threshold=0.1, noise_step=0.025, noise_reward=0, reward_goal=0.0, thrust=0.05, puddle_center=None, puddle_width=None, gamma=0.99, horizon=5000)[source]

Constructor.

Parameters:
  • start (np.array, None) – starting position of the agent;
  • goal (np.array, None) – goal position;
  • goal_threshold (float, .1) – distance threshold of the agent from the goal to consider it reached;
  • noise_step (float, .025) – noise in actions;
  • noise_reward (float, 0) – standard deviation of gaussian noise in reward;
  • reward_goal (float, 0) – reward obtained reaching goal state;
  • thrust (float, .05) – distance walked during each action;
  • puddle_center (np.array, None) – center of the puddle;
  • puddle_width (np.array, None) – width of the puddle;
  • gamma (float, .99) – discount factor;
  • horizon (int, 5000) – horizon of the problem.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()[source]

Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
Segway
class Segway(random_start=False)[source]

Bases: mushroom_rl.core.environment.Environment

The Segway environment (continuous version) as presented in: “Deep Learning for Actor-Critic Reinforcement Learning”. Xueli Jia. 2015.

__init__(random_start=False)[source]

Constructor.

Parameters:random_start (bool, False) – whether to start from a random position or from the horizontal one.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.
stop()

Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

Ship steering
class ShipSteering(small=True, n_steps_action=3)[source]

Bases: mushroom_rl.core.environment.Environment

The Ship Steering environment as presented in: “Hierarchical Policy Gradient Algorithms”. Ghavamzadeh M. and Mahadevan S.. 2013.

__init__(small=True, n_steps_action=3)[source]

Constructor.

Parameters:
  • small (bool, True) – whether to use a small state space or not;
  • n_steps_action (int, 3) – number of integration intervals for each step of the MDP.
reset(state=None)[source]

Reset the current state.

Parameters:state (np.ndarray, None) – the state to set to the current state.
Returns:The current state.
step(action)[source]

Move the agent from its current state according to the action.

Parameters:action (np.ndarray) – the action to execute.
Returns:The state reached by the agent executing action in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).
stop()[source]

Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

static _bound(x, min_value, max_value)

Method used to bound state and action variables.

Parameters:
  • x – the variable to bound;
  • min_value – the minimum value;
  • max_value – the maximum value;
Returns:

The bounded variable.

info

An object containing the info of the environment.

static list_registered()

List registered environments.

Returns:The list of the registered environments.
static make(env_name, *args, **kwargs)

Generate an environment given an environment name and parameters. The environment is created using the generate method, if available; otherwise, the constructor is used. The generate method has a simpler interface than the constructor, making it easier to generate a standard version of the environment. If the environment name contains a ‘.’ separator, the string is split: the first element is used to select the environment and the other elements are passed as positional parameters.

Parameters:
  • env_name (str) – Name of the environment,
  • *args – positional arguments to be provided to the environment generator;
  • **kwargs – keyword arguments to be provided to the environment generator.
Returns:

An instance of the constructed environment.

classmethod register()

Register an environment in the environment list.

seed(seed)

Set the seed of the environment.

Parameters:seed (float) – the value of the seed.

Generators

Grid world
generate_grid_world(grid, prob, pos_rew, neg_rew, gamma=0.9, horizon=100)[source]

This Grid World generator requires a .txt file specifying the shape of the grid world and its cells. There are five types of cells: ‘S’ is the starting position where the agent is; ‘G’ is the goal state; ‘.’ is a normal cell; ‘*’ is a hole: when the agent steps on a hole, it receives a negative reward and the episode ends; ‘#’ is a wall: when the agent tries to step onto a wall, it remains in its current state. The initial states distribution is uniform among all the provided initial states.

The grid is expected to be rectangular.

Parameters:
  • grid (str) – the path of the file containing the grid structure;
  • prob (float) – probability of success of an action;
  • pos_rew (float) – reward obtained in goal states;
  • neg_rew (float) – reward obtained in “hole” states;
  • gamma (float, .9) – discount factor;
  • horizon (int, 100) – the horizon.
Returns:

A FiniteMDP object built with the provided parameters.
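As a quick illustration of the file format described above, the following sketch writes a small grid to disk and builds the corresponding FiniteMDP (the import path below is where the generator usually lives, but may differ across versions):

from pathlib import Path
from mushroom_rl.environments.generators.grid_world import generate_grid_world

# 'S' start, 'G' goal, '*' hole, '#' wall, '.' free cell.
Path('grid.txt').write_text(
    'S.*\n'
    '.#.\n'
    '..G\n'
)

mdp = generate_grid_world('grid.txt', prob=.9, pos_rew=10., neg_rew=-10.,
                          gamma=.9, horizon=100)
print(mdp.info.observation_space.n, mdp.info.action_space.n)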

parse_grid(grid)[source]

Parse the grid file:

Parameters:grid (str) – the path of the file containing the grid structure;
Returns:A list containing the grid structure.
compute_probabilities(grid_map, cell_list, prob)[source]

Compute the transition probability matrix.

Parameters:
  • grid_map (list) – list containing the grid structure;
  • cell_list (list) – list of non-wall cells;
  • prob (float) – probability of success of an action.
Returns:

The transition probability matrix;

compute_reward(grid_map, cell_list, pos_rew, neg_rew)[source]

Compute the reward matrix.

Parameters:
  • grid_map (list) – list containing the grid structure;
  • cell_list (list) – list of non-wall cells;
  • pos_rew (float) – reward obtained in goal states;
  • neg_rew (float) – reward obtained in “hole” states;
Returns:

The reward matrix.

compute_mu(grid_map, cell_list)[source]

Compute the initial states distribution.

Parameters:
  • grid_map (list) – list containing the grid structure;
  • cell_list (list) – list of non-wall cells.
Returns:

The initial states distribution.

Simple chain
generate_simple_chain(state_n, goal_states, prob, rew, mu=None, gamma=0.9, horizon=100)[source]

Simple chain generator.

Parameters:
  • state_n (int) – number of states;
  • goal_states (list) – list of goal states;
  • prob (float) – probability of success of an action;
  • rew (float) – reward obtained in goal states;
  • mu (np.ndarray) – initial state probability distribution;
  • gamma (float, .9) – discount factor;
  • horizon (int, 100) – the horizon.
Returns:

A FiniteMDP object built with the provided parameters.
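For example, a five-state chain whose last state is the goal can be generated as follows (import path as usually exposed by the library, adapt if needed):

from mushroom_rl.environments.generators.simple_chain import generate_simple_chain

# Five states, goal in the last one, actions succeed with probability .8.
mdp = generate_simple_chain(state_n=5, goal_states=[4], prob=.8, rew=1.,
                            gamma=.9, horizon=100)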

compute_probabilities(state_n, prob)[source]

Compute the transition probability matrix.

Parameters:
  • state_n (int) – number of states;
  • prob (float) – probability of success of an action.
Returns:

The transition probability matrix;

compute_reward(state_n, goal_states, rew)[source]

Compute the reward matrix.

Parameters:
  • state_n (int) – number of states;
  • goal_states (list) – list of goal states;
  • rew (float) – reward obtained in goal states.
Returns:

The reward matrix.

Taxi
generate_taxi(grid, prob=0.9, rew=(0, 1, 3, 15), gamma=0.99, horizon=inf)[source]

This Taxi generator requires a .txt file specifying the shape of the grid world and its cells. There are five types of cells: ‘S’ is the starting position where the agent is; ‘G’ is the goal state; ‘.’ is a normal cell; ‘F’ is a passenger: when the agent steps on a passenger cell, it picks the passenger up; ‘#’ is a wall: when the agent tries to step onto a wall, it remains in its current state. The initial states distribution is uniform among all the provided initial states. The episode terminates when the agent reaches the goal state. The reward is always 0, except for the goal state, where it depends on the number of collected passengers. Each action has a certain probability of success and, if it fails, the agent moves in a direction perpendicular to the intended one.

The grid is expected to be rectangular.

This problem is inspired from: “Bayesian Q-Learning”. Dearden R. et al.. 1998.

Parameters:
  • grid (str) – the path of the file containing the grid structure;
  • prob (float, .9) – probability of success of an action;
  • rew (tuple, (0, 1, 3, 15)) – rewards obtained in goal states;
  • gamma (float, .99) – discount factor;
  • horizon (int, np.inf) – the horizon.
Returns:

A FiniteMDP object built with the provided parameters.
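A small sketch of the grid format with two passengers (‘F’) follows; the import path is the usual location of the generator and may differ across versions:

from pathlib import Path
from mushroom_rl.environments.generators.taxi import generate_taxi

Path('taxi.txt').write_text(
    'S..F\n'
    '.#..\n'
    'F..G\n'
)

mdp = generate_taxi('taxi.txt', prob=.9, rew=(0, 1, 3, 15))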

parse_grid(grid)[source]

Parse the grid file:

Parameters:grid (str) – the path of the file containing the grid structure.
Returns:A list containing the grid structure.
compute_probabilities(grid_map, cell_list, passenger_list, prob)[source]

Compute the transition probability matrix.

Parameters:
  • grid_map (list) – list containing the grid structure;
  • cell_list (list) – list of non-wall cells;
  • passenger_list (list) – list of passenger cells;
  • prob (float) – probability of success of an action.
Returns:

The transition probability matrix;

compute_reward(grid_map, cell_list, passenger_list, rew)[source]

Compute the reward matrix.

Parameters:
  • grid_map (list) – list containing the grid structure;
  • cell_list (list) – list of non-wall cells;
  • passenger_list (list) – list of passenger cells;
  • rew (tuple) – rewards obtained in goal states.
Returns:

The reward matrix.

compute_mu(grid_map, cell_list, passenger_list)[source]

Compute the initial states distribution.

Parameters:
  • grid_map (list) – list containing the grid structure;
  • cell_list (list) – list of non-wall cells;
  • passenger_list (list) – list of passenger cells.
Returns:

The initial states distribution.

Features

The features in MushroomRL are 1-D arrays computed by applying a specified function to a raw input, e.g. polynomial features of the state of an MDP. MushroomRL supports three types of features:

  • basis functions;
  • tensor basis functions;
  • tiles.

The tensor basis functions are a PyTorch implementation of the standard basis functions. They are less straightforward than the standard ones, but they are faster to compute as they can exploit parallel computing, e.g. GPU-acceleration and multi-core systems.

All the types of features are exposed by a single factory method Features that builds the one requested by the user.

Features(basis_list=None, tilings=None, tensor_list=None, n_outputs=None, function=None)[source]

Factory method to build the requested type of features. The types are mutually exclusive.

Possible features are tilings (tilings), basis functions (basis_list), tensor basis (tensor_list), and functional mappings (n_outputs and function).

The difference between basis_list and tensor_list is that the former is a list of Python classes, each one evaluating a single element of the feature vector, while the latter is a list of PyTorch modules that can be used to build a PyTorch network. Using tensor_list is a faster way to compute features than basis_list and is suggested when the computation of the requested features is slow (see the Gaussian radial basis function implementation as an example). A functional mapping applies a function to the input, computing an n_outputs-dimensional vector, where the mapping is expressed by function. If function is not provided, the identity is used.

Parameters:
  • basis_list (list, None) – list of basis functions;
  • tilings ([object, list], None) – single object or list of tilings;
  • tensor_list (list, None) – list of dictionaries containing the instructions to build the requested tensors;
  • n_outputs (int, None) – dimensionality of the feature mapping;
  • function (object, None) – a callable function to be used as feature mapping. Only needed when using a functional mapping.
Returns:

The class implementing the requested type of features.
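For instance, a feature map built from a grid of Gaussian radial basis functions (documented below) can be obtained as follows; the import paths are the usual ones and may differ across versions:

import numpy as np

from mushroom_rl.features import Features
from mushroom_rl.features.basis import GaussianRBF

# 3 x 3 grid of Gaussian RBFs covering a two-dimensional input space.
basis = GaussianRBF.generate([3, 3], np.array([0., 0.]), np.array([1., 1.]))
phi = Features(basis_list=basis)

print(phi(np.array([.2, .7])).shape)  # one feature per basis function: (9,)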

get_action_features(phi_state, action, n_actions)[source]

Compute an array of size len(phi_state) * n_actions filled with zeros, except for elements from len(phi_state) * action to len(phi_state) * (action + 1) that are filled with phi_state. This is used to compute state-action features.

Parameters:
  • phi_state (np.ndarray) – the feature of the state;
  • action (np.ndarray) – the action whose features have to be computed;
  • n_actions (int) – the number of actions.
Returns:

The state-action features.
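A small numeric example of the block structure described above, assuming get_action_features is exported from mushroom_rl.features alongside Features:

import numpy as np

from mushroom_rl.features import get_action_features

phi_state = np.array([.2, .8])
action = np.array([1])

phi_sa = get_action_features(phi_state, action, n_actions=3)
# Per the description above, phi_state fills the block of the selected action:
# [0.  0.  0.2 0.8 0.  0. ]
print(phi_sa)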

The factory method returns a class that extends the abstract class FeatureImplementation.

The documentation for every feature type can be found here:

Basis

Fourier
class FourierBasis(low, delta, c, dimensions=None)[source]

Bases: object

Class implementing Fourier basis functions. The value of the feature is computed using the formula:

\[\cos\left(\pi \sum_i \dfrac{c_i (X_i - m_i)}{\Delta_i}\right)\]

where X is the input, m is the vector of the minimum input values (one for each dimension), Delta is the vector of the maximum differences between the input values, i.e. Delta = high - low, and c is the vector of weights of the basis function.

__init__(low, delta, c, dimensions=None)[source]

Constructor.

Parameters:
  • low (np.ndarray) – vector of minimum values of the input variables;
  • delta (np.ndarray) – vector of the maximum difference between two values of the input variables, i.e. delta = high - low;
  • c (np.ndarray) – vector of weights for the state variables;
  • dimensions (list, None) – list of the dimensions of the input to be considered by the feature.
__call__(x)[source]

Call self as a function.

static generate(low, high, n, dimensions=None)[source]

Factory method to build a set of fourier basis.

Parameters:
  • low (np.ndarray) – vector of minimum values of the input variables;
  • high (np.ndarray) – vector of maximum values of the input variables;
  • n (int) – number of harmonics to consider for each state variable
  • dimensions (list, None) – list of the dimensions of the input to be considered by the features.
Returns:

The list of the generated fourier basis functions.
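For example, a set of Fourier basis functions covering a two-dimensional input can be generated and evaluated as follows (import paths assumed as usual):

import numpy as np

from mushroom_rl.features import Features
from mushroom_rl.features.basis import FourierBasis

low = np.array([0., -1.])
high = np.array([1., 1.])

# All Fourier basis functions with up to 2 harmonics per input variable.
basis = FourierBasis.generate(low, high, 2)
phi = Features(basis_list=basis)

print(phi(np.array([.5, 0.])))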

Gaussian RBF
class GaussianRBF(mean, scale, dimensions=None)[source]

Bases: object

Class implementing Gaussian radial basis functions. The value of the feature is computed using the formula:

\[\sum \dfrac{(X_i - \mu_i)^2}{\sigma_i}\]

where X is the input, mu is the mean vector and sigma is the scale parameter vector.

__init__(mean, scale, dimensions=None)[source]

Constructor.

Parameters:
  • mean (np.ndarray) – the mean vector of the feature;
  • scale (np.ndarray) – the scale vector of the feature;
  • dimensions (list, None) – list of the dimensions of the input to be considered by the feature. The number of dimensions must match the dimensionality of mean and scale.
__call__(x)[source]

Call self as a function.

static generate(n_centers, low, high, dimensions=None)[source]

Factory method to build uniformly spaced gaussian radial basis functions with a 25% overlap.

Parameters:
  • n_centers (list) – list of the number of radial basis functions to be used for each dimension.
  • low (np.ndarray) – lowest value for each dimension;
  • high (np.ndarray) – highest value for each dimension;
  • dimensions (list, None) – list of the dimensions of the input to be considered by the feature. The number of dimensions must match the number of elements in n_centers and low.
Returns:

The list of the generated radial basis functions.

Polynomial
class PolynomialBasis(dimensions=None, degrees=None)[source]

Bases: object

Class implementing polynomial basis functions. The value of the feature is computed using the formula:

\[\prod X_i^{d_i}\]

where X is the input and d is the vector of the exponents of the polynomial.

__init__(dimensions=None, degrees=None)[source]

Constructor. If both parameters are None, the constant feature is built.

Parameters:
  • dimensions (list, None) – list of the dimensions of the input to be considered by the feature;
  • degrees (list, None) – list of the degrees of each dimension to be considered by the feature. It must match the number of elements of dimensions.
__call__(x)[source]

Call self as a function.

static _compute_exponents(order, n_variables)[source]

Find the exponents of a multivariate polynomial expression of order order and n_variables number of variables.

Parameters:
  • order (int) – the maximum order of the polynomial;
  • n_variables (int) – the number of elements of the input vector.
Yields:

The current exponent of the polynomial.

static generate(max_degree, input_size)[source]

Factory method to build a polynomial of order max_degree based on the first input_size dimensions of the input.

Parameters:
  • max_degree (int) – maximum degree of the polynomial;
  • input_size (int) – size of the input.
Returns:

The list of the generated polynomial basis functions.

Tensors

Gaussian tensor
class GaussianRBFTensor(mu, scale, dim, use_cuda)[source]

Bases: sphinx.ext.autodoc.importer._MockObject

PyTorch module implementing a Gaussian radial basis function.

__init__(mu, scale, dim, use_cuda)[source]

Constructor.

Parameters:
  • mu (np.ndarray) – centers of the gaussian RBFs;
  • scale (np.ndarray) – scales for the RBFs;
  • dim (np.ndarray) – list of dimension to be considered for the computation of the features;
  • use_cuda (bool) – whether to use cuda for the computation or not.
static generate(n_centers, low, high, dimensions=None, use_cuda=False)[source]

Factory method that generates the list of dictionaries to build the tensors representing a set of uniformly spaced Gaussian radial basis functions with a 25% overlap.

Parameters:
  • n_centers (list) – list of the number of radial basis functions to be used for each dimension;
  • low (np.ndarray) – lowest value for each dimension;
  • high (np.ndarray) – highest value for each dimension;
  • dimensions (list, None) – list of the dimensions of the input to be considered by the feature. The number of dimensions must match the number of elements in n_centers and low;
  • use_cuda (bool) – whether to use cuda for the computation or not.
Returns:

The list of dictionaries as described above.
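The tensor counterpart of the basis-function example above is built by passing the generated dictionaries to the Features factory through tensor_list (import path assumed, GPU optional):

import numpy as np

from mushroom_rl.features import Features
from mushroom_rl.features.tensors import GaussianRBFTensor

low = np.array([0., 0.])
high = np.array([1., 1.])

# 5 x 5 grid of Gaussian RBF tensors, computed on CPU (set use_cuda=True for GPU).
tensor_list = GaussianRBFTensor.generate([5, 5], low, high, use_cuda=False)
phi = Features(tensor_list=tensor_list)

print(phi(np.array([.3, .6])).shape)  # expected: one feature per RBF, i.e. (25,)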

Tiles

Rectangular Tiles
class Tiles(x_range, n_tiles, state_components=None)[source]

Bases: object

Class implementing rectangular tiling. For each point in the state space, this class can be used to compute the index of the corresponding tile.

__init__(x_range, n_tiles, state_components=None)[source]

Constructor.

Parameters:
  • x_range (list) – list of two-elements lists specifying the range of each state variable;
  • n_tiles (list) – list of the number of tiles to be used for each dimension.
  • state_components (list, None) – list of the dimensions of the input to be considered by the tiling. The number of elements must match the number of elements in x_range and n_tiles.
__call__(x)[source]

Call self as a function.

static generate(n_tilings, n_tiles, low, high, uniform=False)[source]

Factory method to build n_tilings tilings of n_tiles tiles with a range between low and high for each dimension.

Parameters:
  • n_tilings (int) – number of tilings, or -1 to compute the number automatically;
  • n_tiles (list) – number of tiles for each tiling, for each dimension;
  • low (np.ndarray) – lowest value for each dimension;
  • high (np.ndarray) – highest value for each dimension;
  • uniform (bool, False) – if True, the displacement for each tiling will be w/n_tilings, where w is the tile width; otherwise, the displacement will be k*w/n_tilings, with k = 2i + 1, where i is the dimension index.
Returns:

The list of the generated tiles.
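A classic tile-coding setup with the Features factory might look as follows (import paths assumed):

import numpy as np

from mushroom_rl.features import Features
from mushroom_rl.features.tiles import Tiles

low = np.array([-1., -1.])
high = np.array([1., 1.])

# 10 displaced tilings of 10 x 10 tiles each.
tilings = Tiles.generate(10, [10, 10], low, high)
phi = Features(tilings=tilings)

# Sparse binary features: one active tile per tiling, 10 * 100 features in total.
print(phi(np.array([.1, -.4])).size)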

Voronoi Tiles
class VoronoiTiles(prototypes)[source]

Bases: object

Class implementing voronoi tiling. For each point in the state space, this class can be used to compute the index of the corresponding tile.

__init__(prototypes)[source]

Constructor.

Parameters:prototypes (list) – list of prototypes to compute the partition.
__call__(x)[source]

Call self as a function.

static generate(n_tilings, n_prototypes, low=None, high=None, mu=None, sigma=None)[source]

Factory method to build n_tilings tilings of n_prototypes prototypes. Prototypes are randomly sampled: if low and high are provided, prototypes are sampled uniformly between low and high; otherwise, mu and sigma must be specified and prototypes are sampled from the corresponding Gaussian.

Parameters:
  • n_tilings (int) – number of tilings, or -1 to compute the number automatically;
  • n_prototypes (list) – number of prototypes for each tiling;
  • low (np.ndarray, None) – lowest value for each dimension, needed for uniform sampling;
  • high (np.ndarray, None) – highest value for each dimension, needed for uniform sampling.
  • mu (np.ndarray, None) – mean value for each dimension, needed for Gaussian sampling.
  • sigma (np.ndarray, None) – variance along each dimension, needed for Gaussian sampling.
Returns:

The list of the generated tiles.

Policy

class Policy[source]

Bases: mushroom_rl.core.serialization.Serializable

Interface representing a generic policy. A policy is a probability distribution that gives the probability of taking an action given a specified state. A policy is used by MushroomRL agents to interact with the environment. A minimal custom-policy sketch is given after this class reference.

__call__(*args)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
draw_action(state)[source]

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
reset()[source]

Useful when the policy needs a special initialization at the beginning of an episode.

__init__

Initialize self. See help(type(self)) for accurate signature.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
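The sketch below (referenced above) shows a minimal custom policy for a discrete action space; it is only an illustration of the interface, not a policy shipped with the library:

import numpy as np

from mushroom_rl.policy import Policy


class UniformDiscretePolicy(Policy):
    """Toy policy drawing uniformly among n_actions discrete actions."""

    def __init__(self, n_actions):
        super().__init__()
        self._n_actions = n_actions
        # Register the attribute for serialization (see _add_save_attr above).
        self._add_save_attr(_n_actions='primitive')

    def __call__(self, *args):
        # Probabilities of all actions (state only) or of the given action.
        p = np.ones(self._n_actions) / self._n_actions

        return p if len(args) == 1 else p[args[1][0]]

    def draw_action(self, state):
        return np.array([np.random.randint(self._n_actions)])

    def reset(self):
        pass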
class ParametricPolicy[source]

Bases: mushroom_rl.policy.policy.Policy

Interface for a generic parametric policy. A parametric policy is a policy that depends on a set of parameters, called the policy weights. If the policy is differentiable, the derivative of the probability for a specified state-action pair can be provided.

diff_log(state, action)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

\[\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the gradient is computed
  • action (np.ndarray) – the action where the gradient is computed
Returns:

The gradient of the logarithm of the pdf w.r.t. the policy weights

diff(state, action)[source]

Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the derivative is computed
  • action (np.ndarray) – the action where the derivative is computed
Returns:

The derivative w.r.t. the policy weights

set_weights(weights)[source]

Setter.

Parameters:weights (np.ndarray) – the vector of the new weights to be used by the policy.
get_weights()[source]

Getter.

Returns:The current policy weights.
weights_size

Property.

Returns:The size of the policy weights.
__call__(*args)

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
__init__

Initialize self. See help(type(self)) for accurate signature.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.

Deterministic policy

class DeterministicPolicy(mu)[source]

Bases: mushroom_rl.policy.policy.ParametricPolicy

Simple parametric policy representing a deterministic policy. As deterministic policies are degenerate probability functions where all the probability mass is on the deterministic action, they are not differentiable, even if the mean value approximator is differentiable.

__init__(mu)[source]

Constructor.

Parameters:mu (Regressor) – the regressor representing the action to select in each state.
get_regressor()[source]

Getter.

Returns:The regressor that is used to map state to actions.
__call__(state, action)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
draw_action(state)[source]

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
set_weights(weights)[source]

Setter.

Parameters:weights (np.ndarray) – the vector of the new weights to be used by the policy.
get_weights()[source]

Getter.

Returns:The current policy weights.
weights_size

Property.

Returns:The size of the policy weights.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
diff(state, action)

Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the derivative is computed
  • action (np.ndarray) – the action where the derivative is computed
Returns:

The derivative w.r.t. the policy weights

diff_log(state, action)

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

\[\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the gradient is computed
  • action (np.ndarray) – the action where the gradient is computed
Returns:

The gradient of the logarithm of the pdf w.r.t. the policy weights

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
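A short usage sketch, assuming a linear mean regressor built with MushroomRL's Regressor wrapping a LinearApproximator:

import numpy as np

from mushroom_rl.approximators import Regressor
from mushroom_rl.approximators.parametric import LinearApproximator
from mushroom_rl.policy import DeterministicPolicy

# Linear deterministic mapping from a 3-dimensional state to a 1-dimensional action.
mu = Regressor(LinearApproximator, input_shape=(3,), output_shape=(1,))
pi = DeterministicPolicy(mu)

pi.set_weights(np.array([.5, -.2, .1]))
print(pi.draw_action(np.array([1., 0., 2.])))  # expected: [0.7]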

Gaussian policy

class AbstractGaussianPolicy[source]

Bases: mushroom_rl.policy.policy.ParametricPolicy

Abstract class of Gaussian policies.

__call__(state, action)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
draw_action(state)[source]

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
__init__

Initialize self. See help(type(self)) for accurate signature.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
diff(state, action)

Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the derivative is computed
  • action (np.ndarray) – the action where the derivative is computed
Returns:

The derivative w.r.t. the policy weights

diff_log(state, action)

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

\[\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the gradient is computed
  • action (np.ndarray) – the action where the gradient is computed
Returns:

The gradient of the logarithm of the pdf w.r.t. the policy weights

get_weights()

Getter.

Returns:The current policy weights.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_weights(weights)

Setter.

Parameters:weights (np.ndarray) – the vector of the new weights to be used by the policy.
weights_size

Property.

Returns:The size of the policy weights.
class GaussianPolicy(mu, sigma)[source]

Bases: mushroom_rl.policy.gaussian_policy.AbstractGaussianPolicy

Gaussian policy. This is a differentiable policy for continuous action spaces. The policy samples an action in every state following a gaussian distribution, where the mean is computed in the state and the covariance matrix is fixed.

__init__(mu, sigma)[source]

Constructor.

Parameters:
  • mu (Regressor) – the regressor representing the mean w.r.t. the state;
  • sigma (np.ndarray) – a square positive definite matrix representing the covariance matrix. The size of this matrix must be n x n, where n is the action dimensionality.
set_sigma(sigma)[source]

Setter.

Parameters:sigma (np.ndarray) – the new covariance matrix. Must be a square positive definite matrix.
diff_log(state, action)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

\[\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the gradient is computed
  • action (np.ndarray) – the action where the gradient is computed
Returns:

The gradient of the logarithm of the pdf w.r.t. the policy weights

set_weights(weights)[source]

Setter.

Parameters:weights (np.ndarray) – the vector of the new weights to be used by the policy.
get_weights()[source]

Getter.

Returns:The current policy weights.
weights_size

Property.

Returns:The size of the policy weights.
__call__(state, action)

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
diff(state, action)

Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the derivative is computed
  • action (np.ndarray) – the action where the derivative is computed
Returns:

The derivative w.r.t. the policy weights

draw_action(state)

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
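For policy-gradient agents, diff_log is the quantity of interest. A small sketch with a linear mean and a fixed 1 x 1 covariance (Regressor and LinearApproximator as in the previous sketch):

import numpy as np

from mushroom_rl.approximators import Regressor
from mushroom_rl.approximators.parametric import LinearApproximator
from mushroom_rl.policy import GaussianPolicy

mu = Regressor(LinearApproximator, input_shape=(2,), output_shape=(1,))
pi = GaussianPolicy(mu, sigma=np.array([[.25]]))  # fixed covariance matrix

state = np.array([.3, -.1])
action = pi.draw_action(state)

# Gradient of log p(s, a) w.r.t. the two policy weights.
print(pi.diff_log(state, action), pi.weights_size)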
class DiagonalGaussianPolicy(mu, std)[source]

Bases: mushroom_rl.policy.gaussian_policy.AbstractGaussianPolicy

Gaussian policy with learnable standard deviation. The covariance matrix is constrained to be a diagonal matrix, where the diagonal is the squared standard deviation vector. This is a differentiable policy for continuous action spaces. This policy is similar to the gaussian policy, but the weights also include the standard deviation.

__init__(mu, std)[source]

Constructor.

Parameters:
  • mu (Regressor) – the regressor representing the mean w.r.t. the state;
  • std (np.ndarray) – a vector of standard deviations. The length of this vector must be equal to the action dimensionality.
set_std(std)[source]

Setter.

Parameters:std (np.ndarray) – the new vector of standard deviations. Its length must be equal to the action dimensionality.
diff_log(state, action)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

\[\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the gradient is computed
  • action (np.ndarray) – the action where the gradient is computed
Returns:

The gradient of the logarithm of the pdf w.r.t. the policy weights

set_weights(weights)[source]

Setter.

Parameters:weights (np.ndarray) – the vector of the new weights to be used by the policy.
get_weights()[source]

Getter.

Returns:The current policy weights.
weights_size

Property.

Returns:The size of the policy weights.
__call__(state, action)

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
diff(state, action)

Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the derivative is computed
  • action (np.ndarray) – the action where the derivative is computed
Returns:

The derivative w.r.t. the policy weights

draw_action(state)

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
class StateStdGaussianPolicy(mu, std, eps=1e-06)[source]

Bases: mushroom_rl.policy.gaussian_policy.AbstractGaussianPolicy

Gaussian policy with learnable standard deviation. The Covariance matrix is constrained to be a diagonal matrix, where the diagonal is the squared standard deviation, which is computed for each state. This is a differentiable policy for continuous action spaces. This policy is similar to the diagonal gaussian policy, but a parametric regressor is used to compute the standard deviation, so the standard deviation depends on the current state.

__init__(mu, std, eps=1e-06)[source]

Constructor.

Parameters:
  • mu (Regressor) – the regressor representing the mean w.r.t. the state;
  • std (Regressor) – the regressor representing the standard deviations w.r.t. the state. The output dimensionality of the regressor must be equal to the action dimensionality;
  • eps (float, 1e-6) – a positive constant added to the variance to ensure that it is always greater than zero.
diff_log(state, action)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

\[\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the gradient is computed
  • action (np.ndarray) – the action where the gradient is computed
Returns:

The gradient of the logarithm of the pdf w.r.t. the policy weights

set_weights(weights)[source]

Setter.

Parameters:weights (np.ndarray) – the vector of the new weights to be used by the policy.
get_weights()[source]

Getter.

Returns:The current policy weights.
weights_size

Property.

Returns:The size of the policy weights.
__call__(state, action)

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
diff(state, action)

Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the derivative is computed
  • action (np.ndarray) – the action where the derivative is computed
Returns:

The derivative w.r.t. the policy weights

draw_action(state)

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
class StateLogStdGaussianPolicy(mu, log_std)[source]

Bases: mushroom_rl.policy.gaussian_policy.AbstractGaussianPolicy

Gaussian policy with learnable standard deviation. The covariance matrix is constrained to be diagonal; the diagonal is computed by an exponential transformation of the logarithm of the standard deviation computed in each state. This is a differentiable policy for continuous action spaces. This policy is similar to the StateStdGaussianPolicy, but here the regressor represents the logarithm of the standard deviation.

__init__(mu, log_std)[source]

Constructor.

Parameters:
  • mu (Regressor) – the regressor representing the mean w.r.t. the state;
  • log_std (Regressor) – a regressor representing the logarithm of the standard deviation w.r.t. the state. The output dimensionality of the regressor must be equal to the action dimensionality.
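A minimal construction sketch follows; the state and action dimensionalities and the use of LinearApproximator for both regressors are illustrative assumptions:

import numpy as np

from mushroom_rl.approximators import Regressor
from mushroom_rl.approximators.parametric import LinearApproximator
from mushroom_rl.policy import StateLogStdGaussianPolicy

# Illustrative dimensions: 3-dimensional states, 2-dimensional actions
mu = Regressor(LinearApproximator, input_shape=(3,), output_shape=(2,))
log_std = Regressor(LinearApproximator, input_shape=(3,), output_shape=(2,))
pi = StateLogStdGaussianPolicy(mu, log_std)

action = pi.draw_action(np.zeros(3))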
diff_log(state, action)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

\[\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the gradient is computed
  • action (np.ndarray) – the action where the gradient is computed
Returns:

The gradient of the logarithm of the pdf w.r.t. the policy weights

set_weights(weights)[source]

Setter.

Parameters:weights (np.ndarray) – the vector of the new weights to be used by the policy.
get_weights()[source]

Getter.

Returns:The current policy weights.
weights_size

Property.

Returns:The size of the policy weights.
__call__(state, action)

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is appended to the method name, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
diff(state, action)

Compute the derivative of the probability density function in the specified state and action pair. Normally it is computed using the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the derivative is computed
  • action (np.ndarray) – the action where the derivative is computed
Returns:

The derivative w.r.t. the policy weights

draw_action(state)

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.

Noise policy

class OrnsteinUhlenbeckPolicy(mu, sigma, theta, dt, x0=None)[source]

Bases: mushroom_rl.policy.policy.ParametricPolicy

Ornstein-Uhlenbeck process as implemented in: https://github.com/openai/baselines/blob/master/baselines/ddpg/noise.py.

This policy is commonly used in the Deep Deterministic Policy Gradient algorithm.

__init__(mu, sigma, theta, dt, x0=None)[source]

Constructor.

Parameters:
  • mu (Regressor) – the regressor representing the mean w.r.t. the state;
  • sigma (np.ndarray) – average magnitude of the random fluctuations per square-root time;
  • theta (float) – rate of mean reversion;
  • dt (float) – time interval;
  • x0 (np.ndarray, None) – initial values of noise.
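In practice, the mean regressor mu is usually the actor approximator created internally by a DDPG-like agent, so the policy is typically passed to the agent as a class together with its remaining parameters. A minimal sketch, with illustrative noise values:

import numpy as np

from mushroom_rl.policy import OrnsteinUhlenbeckPolicy

policy_class = OrnsteinUhlenbeckPolicy
policy_params = dict(sigma=np.ones(1) * .2, theta=.15, dt=1e-2)

# e.g. agent = DDPG(mdp.info, policy_class, policy_params, ...)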
__call__(state, action)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
draw_action(state)[source]

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
set_weights(weights)[source]

Setter.

Parameters:weights (np.ndarray) – the vector of the new weights to be used by the policy.
get_weights()[source]

Getter.

Returns:The current policy weights.
weights_size

Property.

Returns:The size of the policy weights.
reset()[source]

Useful when the policy needs a special initialization at the beginning of an episode.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is appended to the method name, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
diff(state, action)

Compute the derivative of the probability density function in the specified state and action pair. Normally it is computed using the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the derivative is computed
  • action (np.ndarray) – the action where the derivative is computed
Returns:

The derivative w.r.t. the policy weights

diff_log(state, action)

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

\[\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the gradient is computed
  • action (np.ndarray) – the action where the gradient is computed
Returns:

The gradient of the logarithm of the pdf w.r.t. the policy weights

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
class ClippedGaussianPolicy(mu, sigma, low, high)[source]

Bases: mushroom_rl.policy.policy.ParametricPolicy

Clipped Gaussian policy, as used in:

“Addressing Function Approximation Error in Actor-Critic Methods”. Fujimoto S. et al., 2018.

This is a non-differentiable policy for continuous action spaces. The policy samples an action in every state following a Gaussian distribution, where the mean is computed in the state and the covariance matrix is fixed. The action is then clipped using the given action range. This policy is not a truncated Gaussian: it simply clips the action if its value exceeds the boundaries, hence the non-differentiability.

__init__(mu, sigma, low, high)[source]

Constructor.

Parameters:
  • mu (Regressor) – the regressor representing the mean w.r.t. the state;
  • sigma (np.ndarray) – a square positive definite matrix representing the covariance matrix. The size of this matrix must be n x n, where n is the action dimensionality;
  • low (np.ndarray) – a vector containing the minimum action for each component;
  • high (np.ndarray) – a vector containing the maximum action for each component.
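A minimal construction sketch; the dimensions, the covariance value, and the use of LinearApproximator are illustrative assumptions:

import numpy as np

from mushroom_rl.approximators import Regressor
from mushroom_rl.approximators.parametric import LinearApproximator
from mushroom_rl.policy import ClippedGaussianPolicy

# Illustrative: 4-dimensional states, 2-dimensional actions bounded in [-1, 1]
mu = Regressor(LinearApproximator, input_shape=(4,), output_shape=(2,))
sigma = 1e-2 * np.eye(2)
pi = ClippedGaussianPolicy(mu, sigma, low=-np.ones(2), high=np.ones(2))

action = pi.draw_action(np.zeros(4))  # sampled from the Gaussian, then clipped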
__call__(state, action)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
draw_action(state)[source]

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
set_weights(weights)[source]

Setter.

Parameters:weights (np.ndarray) – the vector of the new weights to be used by the policy.
get_weights()[source]

Getter.

Returns:The current policy weights.
weights_size

Property.

Returns:The size of the policy weights.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is appended to the method name, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
diff(state, action)

Compute the derivative of the probability density function in the specified state and action pair. Normally it is computed using the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the derivative is computed
  • action (np.ndarray) – the action where the derivative is computed
Returns:

The derivative w.r.t. the policy weights

diff_log(state, action)

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

\[\nabla_{\theta}\log p(s,a)\]
Parameters:
  • state (np.ndarray) – the state where the gradient is computed
  • action (np.ndarray) – the action where the gradient is computed
Returns:

The gradient of the logarithm of the pdf w.r.t. the policy weights

classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.

TD policy

class TDPolicy[source]

Bases: mushroom_rl.policy.policy.Policy

__init__()[source]

Constructor.

set_q(approximator)[source]
Parameters:approximator (object) – the approximator to use.
get_q()[source]
Returns:The approximator used by the policy.
__call__(*args)

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is appended to the method name, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
class EpsGreedy(epsilon)[source]

Bases: mushroom_rl.policy.td_policy.TDPolicy

Epsilon greedy policy.

__init__(epsilon)[source]

Constructor.

Parameters:epsilon ([float, Parameter]) – the exploration coefficient. It indicates the probability of performing a random action in the current step.
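The exploration coefficient can also be a decaying Parameter. A minimal sketch, where the linear decay schedule and its values are illustrative:

from mushroom_rl.policy import EpsGreedy
from mushroom_rl.utils.parameters import Parameter, LinearParameter

# Epsilon decaying linearly from 1.0 to 0.1 over 10000 updates
epsilon = LinearParameter(value=1., threshold_value=.1, n=10000)
pi = EpsGreedy(epsilon=epsilon)

# Switch to a fully greedy policy, e.g. for evaluation
pi.set_epsilon(Parameter(0.))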
__call__(*args)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
draw_action(state)[source]

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
set_epsilon(epsilon)[source]

Setter.

Parameters:epsilon ([float, Parameter]) – the exploration coefficient. It indicates the probability of performing a random action in the current step.
update(*idx)[source]

Update the value of the epsilon parameter at the provided index (e.g. in case of different values of epsilon for each visited state according to the number of visits).

Parameters:*idx (list) – index of the parameter to be updated.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is appended to the method name, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
get_q()
Returns:The approximator used by the policy.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_q(approximator)
Parameters:approximator (object) – the approximator to use.
class Boltzmann(beta)[source]

Bases: mushroom_rl.policy.td_policy.TDPolicy

Boltzmann softmax policy.

__init__(beta)[source]

Constructor.

Parameters:beta ([float, Parameter]) – the inverse of the temperature distribution. As the temperature approaches infinity, the policy becomes more and more random. As the temperature approaches 0.0, the policy becomes more and more greedy.
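A minimal usage sketch; the table sizes and the beta value are illustrative, and the zero-initialized Table standing in for the Q approximator is only for demonstration:

import numpy as np

from mushroom_rl.policy import Boltzmann
from mushroom_rl.utils.parameters import Parameter
from mushroom_rl.utils.table import Table

beta = Parameter(value=1.)
pi = Boltzmann(beta)
pi.set_q(Table((5, 3)))   # illustrative Q-table: 5 states, 3 actions

action = pi.draw_action(np.array([0]))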
__call__(*args)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
draw_action(state)[source]

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
set_beta(beta)[source]

Setter.

Parameters:beta ((float, Parameter)) – the inverse of the temperature distribution.
update(*idx)[source]

Update the value of the beta parameter at the provided index (e.g. in case of different values of beta for each visited state according to the number of visits).

Parameters:*idx (list) – index of the parameter to be updated.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is appended to the method name, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
get_q()
Returns:The approximator used by the policy.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_q(approximator)
Parameters:approximator (object) – the approximator to use.
class Mellowmax(omega, beta_min=-10.0, beta_max=10.0)[source]

Bases: mushroom_rl.policy.td_policy.Boltzmann

Mellowmax policy. “An Alternative Softmax Operator for Reinforcement Learning”. Asadi K. and Littman M.L., 2017.

class MellowmaxParameter(outer, omega, beta_min, beta_max)[source]

Bases: mushroom_rl.utils.parameters.Parameter

__init__(outer, omega, beta_min, beta_max)[source]

Constructor.

Parameters:
  • outer (Mellowmax) – the Mellowmax policy using this parameter to compute beta;
  • omega (Parameter) – the omega parameter of the policy from which beta of the Boltzmann policy is computed;
  • beta_min (float) – one end of the bracketing interval for minimization with Brent’s method;
  • beta_max (float) – the other end of the bracketing interval for minimization with Brent’s method.
__call__(state)[source]

Update and return the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The updated parameter in the provided index.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is appended to the method name, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_compute(*idx, **kwargs)
Returns:The value of the parameter in the provided index.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
get_value(*idx, **kwargs)

Return the current value of the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The current value of the parameter in the provided index.
initial_value

Returns:The initial value of the parameters.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
shape

Returns:The shape of the table of parameters.
update(*idx, **kwargs)

Update the number of visits of the parameter in the provided index.

Parameters:*idx (list) – index of the parameter whose number of visits has to be updated.
__init__(omega, beta_min=-10.0, beta_max=10.0)[source]

Constructor.

Parameters:
  • omega (Parameter) – the omega parameter of the policy from which beta of the Boltzmann policy is computed;
  • beta_min (float, -10.) – one end of the bracketing interval for minimization with Brent’s method;
  • beta_max (float, 10.) – the other end of the bracketing interval for minimization with Brent’s method.
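A minimal construction sketch, with an illustrative omega value; beta is computed internally from omega via Brent's method:

from mushroom_rl.policy import Mellowmax
from mushroom_rl.utils.parameters import Parameter

omega = Parameter(value=1.)
pi = Mellowmax(omega)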
set_beta(beta)[source]

Setter.

Parameters:beta ((float, Parameter)) – the inverse of the temperature distribution.
update(*idx)[source]

Update the value of the beta parameter at the provided index (e.g. in case of different values of beta for each visited state according to the number of visits).

Parameters:*idx (list) – index of the parameter to be updated.
__call__(*args)

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is appended to the method name, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
draw_action(state)

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
get_q()
Returns:The approximator used by the policy.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
set_q(approximator)
Parameters:approximator (object) – the approximator to use.

Torch policy

class TorchPolicy(use_cuda)[source]

Bases: mushroom_rl.policy.policy.Policy

Interface for a generic PyTorch policy. A PyTorch policy is a policy implemented as a neural network using PyTorch. Methods whose names end with ‘_t’ take tensors as input and, when required, return tensors as output.

__init__(use_cuda)[source]

Constructor.

Parameters:use_cuda (bool) – whether to use cuda or not.
__call__(state, action)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
draw_action(state)[source]

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
distribution(state)[source]

Compute the policy distribution in the given states.

Parameters:state (np.ndarray) – the set of states where the distribution is computed.
Returns:The torch distribution for the provided states.
entropy(state=None)[source]

Compute the entropy of the policy.

Parameters:state (np.ndarray, None) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None.
Returns:The value of the entropy of the policy.
draw_action_t(state)[source]

Draw an action given a tensor.

Parameters:state (torch.Tensor) – set of states.
Returns:The tensor of the actions to perform in each state.
log_prob_t(state, action)[source]

Compute the logarithm of the probability of taking action in state.

Parameters:
  • state (torch.Tensor) – set of states.
  • action (torch.Tensor) – set of actions.
Returns:

The tensor of log-probability.

entropy_t(state)[source]

Compute the entropy of the policy.

Parameters:state (torch.Tensor) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None.
Returns:The tensor value of the entropy of the policy.
distribution_t(state)[source]

Compute the policy distribution in the given states.

Parameters:state (torch.Tensor) – the set of states where the distribution is computed.
Returns:The torch distribution for the provided states.
set_weights(weights)[source]

Setter.

Parameters:weights (np.ndarray) – the vector of the new weights to be used by the policy.
get_weights()[source]

Getter.

Returns:The current policy weights.
parameters()[source]

Returns the trainable policy parameters, as expected by torch optimizers.

Returns:List of parameters to be optimized.
reset()[source]

Useful when the policy needs a special initialization at the beginning of an episode.

use_cuda

True if the policy is using cuda_tensors.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skip the attribute, but ensure that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library named. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
class GaussianTorchPolicy(network, input_shape, output_shape, std_0=1.0, use_cuda=False, **params)[source]

Bases: mushroom_rl.policy.torch_policy.TorchPolicy

Torch policy implementing a Gaussian policy with trainable standard deviation. The standard deviation is not state-dependent.

__init__(network, input_shape, output_shape, std_0=1.0, use_cuda=False, **params)[source]

Constructor.

Parameters:
  • network (object) – the network class used to implement the mean regressor;
  • input_shape (tuple) – the shape of the state space;
  • output_shape (tuple) – the shape of the action space;
  • std_0 (float, 1.) – initial standard deviation;
  • params (dict) – parameters used by the network constructor.
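A minimal sketch of a mean network and the corresponding policy; the network architecture and the dimensions are illustrative assumptions, while the (input_shape, output_shape, **kwargs) constructor convention follows the one used by MushroomRL torch approximators:

import numpy as np
import torch
import torch.nn as nn

from mushroom_rl.policy import GaussianTorchPolicy

class Network(nn.Module):
    def __init__(self, input_shape, output_shape, **kwargs):
        super().__init__()
        self._h = nn.Linear(input_shape[0], 32)
        self._out = nn.Linear(32, output_shape[0])

    def forward(self, state, **kwargs):
        # state: batch of states; output: mean of the Gaussian for each state
        return self._out(torch.relu(self._h(state.float())))

pi = GaussianTorchPolicy(Network, input_shape=(3,), output_shape=(2,), std_0=1.)

action = pi.draw_action(np.zeros(3))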
draw_action_t(state)[source]

Draw an action given a tensor.

Parameters:state (torch.Tensor) – set of states.
Returns:The tensor of the actions to perform in each state.
log_prob_t(state, action)[source]

Compute the logarithm of the probability of taking action in state.

Parameters:
  • state (torch.Tensor) – set of states.
  • action (torch.Tensor) – set of actions.
Returns:

The tensor of log-probability.

entropy_t(state=None)[source]

Compute the entropy of the policy.

Parameters:state (torch.Tensor) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None.
Returns:The tensor value of the entropy of the policy.
distribution_t(state)[source]

Compute the policy distribution in the given states.

Parameters:state (torch.Tensor) – the set of states where the distribution is computed.
Returns:The torch distribution for the provided states.
set_weights(weights)[source]

Setter.

Parameters:weights (np.ndarray) – the vector of the new weights to be used by the policy.
get_weights()[source]

Getter.

Returns:The current policy weights.
parameters()[source]

Returns the trainable policy parameters, as expected by torch optimizers.

Returns:List of parameters to be optimized.
__call__(state, action)

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is appended to the method name, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
distribution(state)

Compute the policy distribution in the given states.

Parameters:state (np.ndarray) – the set of states where the distribution is computed.
Returns:The torch distribution for the provided states.
draw_action(state)

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
entropy(state=None)

Compute the entropy of the policy.

Parameters:state (np.ndarray, None) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None.
Returns:The value of the entropy of the policy.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
use_cuda

True if the policy is using cuda_tensors.

class BoltzmannTorchPolicy(network, input_shape, output_shape, beta, use_cuda=False, **params)[source]

Bases: mushroom_rl.policy.torch_policy.TorchPolicy

Torch policy implementing a Boltzmann policy.

__init__(network, input_shape, output_shape, beta, use_cuda=False, **params)[source]

Constructor.

Parameters:
  • network (object) – the network class used to implement the logits of the Boltzmann distribution;
  • input_shape (tuple) – the shape of the state space;
  • output_shape (tuple) – the shape of the action space;
  • beta ((float, Parameter)) – the inverse of the temperature distribution. As the temperature approaches infinity, the policy becomes more and more random. As the temperature approaches 0.0, the policy becomes more and more greedy.
  • params (dict) – parameters used by the network constructor.
draw_action_t(state)[source]

Draw an action given a tensor.

Parameters:state (torch.Tensor) – set of states.
Returns:The tensor of the actions to perform in each state.
log_prob_t(state, action)[source]

Compute the logarithm of the probability of taking action in state.

Parameters:
  • state (torch.Tensor) – set of states.
  • action (torch.Tensor) – set of actions.
Returns:

The tensor of log-probability.

entropy_t(state)[source]

Compute the entropy of the policy.

Parameters:state (torch.Tensor) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None.
Returns:The tensor value of the entropy of the policy.
distribution_t(state)[source]

Compute the policy distribution in the given states.

Parameters:state (torch.Tensor) – the set of states where the distribution is computed.
Returns:The torch distribution for the provided states.
set_weights(weights)[source]

Setter.

Parameters:weights (np.ndarray) – the vector of the new weights to be used by the policy.
get_weights()[source]

Getter.

Returns:The current policy weights.
parameters()[source]

Returns the trainable policy parameters, as expected by torch optimizers.

Returns:List of parameters to be optimized.
__call__(state, action)

Compute the probability of taking action in a certain state following the policy.

Parameters:*args (list) – list containing a state or a state and an action.
Returns:The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is appended to the method name, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
distribution(state)

Compute the policy distribution in the given states.

Parameters:state (np.ndarray) – the set of states where the distribution is computed.
Returns:The torch distribution for the provided states.
draw_action(state)

Sample an action in state using the policy.

Parameters:state (np.ndarray) – the state where the agent is.
Returns:The action sampled from the policy.
entropy(state=None)

Compute the entropy of the policy.

Parameters:state (np.ndarray, None) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None.
Returns:The value of the entropy of the policy.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agent's save location.
Returns:The loaded agent.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
use_cuda

True if the policy is using cuda_tensors.

Solvers

Dynamic programming

value_iteration(prob, reward, gamma, eps)[source]

Value iteration algorithm to solve a dynamic programming problem.

Parameters:
  • prob (np.ndarray) – transition probability matrix;
  • reward (np.ndarray) – reward matrix;
  • gamma (float) – discount factor;
  • eps (float) – accuracy threshold.
Returns:

The optimal value of each state.

policy_iteration(prob, reward, gamma)[source]

Policy iteration algorithm to solve a dynamic programming problem.

Parameters:
  • prob (np.ndarray) – transition probability matrix;
  • reward (np.ndarray) – reward matrix;
  • gamma (float) – discount factor.
Returns:

The optimal value of each state and the optimal policy.
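A minimal example on a hand-made MDP; the (n_states, n_actions, n_states) layout of the transition and reward tensors is assumed to follow the one used by FiniteMDP:

import numpy as np

from mushroom_rl.solvers.dynamic_programming import value_iteration, policy_iteration

# Illustrative 2-state, 2-action MDP: p[s, a, s'] and r[s, a, s']
prob = np.array([[[1., 0.], [0., 1.]],
                 [[1., 0.], [0., 1.]]])
reward = np.zeros((2, 2, 2))
reward[0, 1, 1] = 1.   # moving from state 0 to state 1 is rewarded

value = value_iteration(prob, reward, gamma=.9, eps=1e-6)
value_pi, policy = policy_iteration(prob, reward, gamma=.9)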

Car-On-Hill brute-force solver

step(mdp, state, action)[source]

Perform a step in the tree.

Parameters:
  • mdp (CarOnHill) – the Car-On-Hill environment;
  • state (np.array) – the state;
  • action (np.array) – the action.
Returns:

The resulting transition executing action in state.

bfs(mdp, frontier, k, max_k)[source]

Perform Breadth-First tree search.

Parameters:
  • mdp (CarOnHill) – the Car-On-Hill environment;
  • frontier (list) – the states at the frontier of the BFS;
  • k (int) – the current depth of the tree;
  • max_k (int) – maximum depth to consider.
Returns:

A tuple containing a flag for the algorithm ending, and the updated depth of the tree.

solve_car_on_hill(mdp, states, actions, gamma, max_k=50)[source]

Solver of the Car-On-Hill environment.

Parameters:
  • mdp (CarOnHill) – the Car-On-Hill environment;
  • states (np.ndarray) – the states;
  • actions (np.ndarray) – the actions;
  • gamma (float) – the discount factor;
  • max_k (int, 50) – maximum depth to consider.
Returns:

The Q-value for each state-action tuple.
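A minimal usage sketch; the sample states and actions below are illustrative, and the brute-force search can be expensive when many state-action pairs are evaluated:

import numpy as np

from mushroom_rl.environments import CarOnHill
from mushroom_rl.solvers.car_on_hill import solve_car_on_hill

mdp = CarOnHill()
states = np.array([[-.5, 0.], [0., 0.]])   # (position, velocity) pairs
actions = np.array([[0], [1]])             # one action per state
q = solve_car_on_hill(mdp, states, actions, gamma=mdp.info.gamma)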

LQR solver

compute_lqr_feedback_gain(lqr, max_iterations=100)[source]

Computes the optimal gain matrix K.

Parameters:
  • lqr (LQR) – LQR environment;
  • max_iterations (int) – max iterations for convergence.
Returns:

Feedback gain matrix K.

compute_lqr_P(lqr, K)[source]

Computes the P matrix for a given gain matrix K.

Parameters:
  • lqr (LQR) – LQR environment;
  • K (np.ndarray) – controller matrix.
Returns:

The P matrix of the value function.

compute_lqr_V(s, lqr, K)[source]

Computes the value function at a state s, with the given gain matrix K.

Parameters:
  • s (np.ndarray) – state;
  • lqr (LQR) – LQR environment;
  • K (np.ndarray) – controller matrix.
Returns:

The value function at s.

compute_lqr_V_gaussian_policy(s, lqr, K, Sigma)[source]

Computes the value function at a state s, with the given gain matrix K and covariance Sigma.

Parameters:
  • s (np.ndarray) – state;
  • lqr (LQR) – LQR environment;
  • K (np.ndarray) – controller matrix;
  • Sigma (np.ndarray) – covariance matrix.
Returns:

The value function at s.

compute_lqr_Q(s, a, lqr, K)[source]

Computes the state-action value function Q at a state-action pair (s, a), with the given gain matrix K.

Parameters:
  • s (np.ndarray) – state;
  • a (np.ndarray) – action;
  • lqr (LQR) – LQR environment;
  • K (np.ndarray) – controller matrix.
Returns:

The Q function at s, a.

compute_lqr_Q_gaussian_policy(s, a, lqr, K, Sigma)[source]

Computes the state-action value function Q at a state-action pair (s, a), with the given gain matrix K and covariance Sigma.

Parameters:
  • s (np.ndarray) – state;
  • a (np.ndarray) – action;
  • lqr (LQR) – LQR environment;
  • K (np.ndarray) – controller matrix;
  • Sigma (np.ndarray) – covariance matrix.
Returns:

The Q function at (s, a).

compute_lqr_V_gaussian_policy_gradient_K(s, lqr, K, Sigma)[source]

Computes the gradient of the objective function J (equal to the value function V) at state s, w.r.t. the controller matrix K, with the current policy parameters K and Sigma. J(s, K, Sigma) = ValueFunction(s, K, Sigma).

Parameters:
  • s (np.ndarray) – state;
  • lqr (LQR) – LQR environment;
  • K (np.ndarray) – controller matrix;
  • Sigma (np.ndarray) – covariance matrix.
Returns:

The gradient of J w.r.t. K.

compute_lqr_Q_gaussian_policy_gradient_K(s, a, lqr, K, Sigma)[source]

Computes the gradient of the state-action Value function at state-action pair (s, a), w.r.t. the controller matrix K, with the current policy parameters K and Sigma.

Parameters:
  • s (np.ndarray) – state;
  • a (np.ndarray) – action;
  • lqr (LQR) – LQR environment;
  • K (np.ndarray) – controller matrix;
  • Sigma (np.ndarray) – covariance matrix.
Returns:

The gradient of Q w.r.t. K.
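A minimal sketch of the LQR solver functions; the LQR.generate helper used to build the environment and the problem size are assumptions made for illustration:

from mushroom_rl.environments import LQR
from mushroom_rl.solvers.lqr import compute_lqr_feedback_gain, compute_lqr_P

lqr = LQR.generate(dimensions=2)     # illustrative 2-dimensional LQR problem
K = compute_lqr_feedback_gain(lqr)   # optimal feedback gain
P = compute_lqr_P(lqr, K)            # value-function matrix for that gain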

Utils

Angles

normalize_angle_positive(angle)[source]

Wrap the angle between 0 and 2 * pi.

Parameters:angle (float) – angle to wrap.
Returns:The wrapped angle.
normalize_angle(angle)[source]

Wrap the angle between -pi and pi.

Parameters:angle (float) – angle to wrap.
Returns:The wrapped angle.
shortest_angular_distance(from_angle, to_angle)[source]

Compute the shortest distance between two angles.

Parameters:
  • from_angle (float) – starting angle;
  • to_angle (float) – final angle.
Returns:

The shortest distance between from_angle and to_angle.

quat_to_euler(quat)[source]

Convert a quaternion to Euler angles.

Parameters:quat (np.ndarray) – the quaternion to be converted, in the format [w, x, y, z].
Returns:The Euler angles [x, y, z] representation of the quaternion.
euler_to_quat(euler)[source]

Convert Euler angles into a quaternion.

Parameters:euler (np.ndarray) – the Euler angles to be converted.
Returns:The quaternion in the format [w, x, y, z].
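A few usage examples of the angle utilities; the commented outputs are what the documented wrapping conventions imply:

import numpy as np

from mushroom_rl.utils.angles import normalize_angle, normalize_angle_positive
from mushroom_rl.utils.angles import shortest_angular_distance, euler_to_quat, quat_to_euler

normalize_angle(3 * np.pi / 2)            # -> -pi / 2 (wrapped between -pi and pi)
normalize_angle_positive(-np.pi / 2)      # -> 3 * pi / 2 (wrapped between 0 and 2 * pi)
shortest_angular_distance(.1, 2 * np.pi)  # -> -0.1 (shortest signed rotation)

quat = euler_to_quat(np.array([0., 0., np.pi / 2]))  # [w, x, y, z]
euler = quat_to_euler(quat)                          # back to [x, y, z]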

Callbacks

class Callback[source]

Bases: object

Interface for all basic callbacks. Implements a list in which it is possible to store data, and methods to query and clean the content stored by the callback.

__init__()[source]

Constructor.

__call__(dataset)[source]

Add samples to the samples list.

Parameters:dataset (list) – the samples to collect.
get()[source]
Returns:The current collected data as a list.
clean()[source]

Delete the currently stored data list.

class CollectDataset[source]

Bases: mushroom_rl.utils.callbacks.callback.Callback

This callback can be used to collect samples during the learning of the agent.

__call__(dataset)[source]

Add samples to the samples list.

Parameters:dataset (list) – the samples to collect.
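A minimal sketch of the callback interface; the single hand-made transition below stands in for the samples produced during learning:

import numpy as np

from mushroom_rl.utils.callbacks import CollectDataset

collect = CollectDataset()

# A dataset is a list of (state, action, reward, next_state, absorbing, last) steps
step = (np.array([0]), np.array([1]), 1., np.array([1]), False, True)
collect([step])

samples = collect.get()   # -> [step]
collect.clean()

In a typical experiment, the callback is instead passed to the Core, so that it is invoked automatically on the samples collected during learning.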
class CollectQ(approximator)[source]

Bases: mushroom_rl.utils.callbacks.callback.Callback

This callback can be used to collect the action values in all states at the current time step.

__init__(approximator)[source]

Constructor.

Parameters:approximator ([Table, EnsembleTable]) – the approximator to use to predict the action values.
__call__(dataset)[source]

Add samples to the samples list.

Parameters:dataset (list) – the samples to collect.
class CollectMaxQ(approximator, state)[source]

Bases: mushroom_rl.utils.callbacks.callback.Callback

This callback can be used to collect the maximum action value in a given state at each call.

__init__(approximator, state)[source]

Constructor.

Parameters:
  • approximator ([Table, EnsembleTable]) – the approximator to use;
  • state (np.ndarray) – the state to consider.
__call__(dataset)[source]

Add samples to the samples list.

Parameters:dataset (list) – the samples to collect.
class CollectParameters(parameter, *idx)[source]

Bases: mushroom_rl.utils.callbacks.callback.Callback

This callback can be used to collect the values of a parameter (e.g. learning rate) during a run of the agent.

__init__(parameter, *idx)[source]

Constructor.

Parameters:
  • parameter (Parameter) – the parameter whose values have to be collected;
  • *idx (list) – index of the parameter when the parameter is tabular.
__call__(dataset)[source]

Add samples to the samples list.

Parameters:dataset (list) – the samples to collect.

Dataset

parse_dataset(dataset, features=None)[source]

Split the dataset into its different components and return them.

Parameters:
  • dataset (list) – the dataset to parse;
  • features (object, None) – features to apply to the states.
Returns:

The np.ndarray of state, action, reward, next_state, absorbing flag and last step flag. Features are applied to state and next_state, when provided.

arrays_as_dataset(states, actions, rewards, next_states, absorbings, lasts)[source]

Creates a dataset of transitions from the provided arrays.

Parameters:
  • states (np.ndarray) – array of states;
  • actions (np.ndarray) – array of actions;
  • rewards (np.ndarray) – array of rewards;
  • next_states (np.ndarray) – array of next_states;
  • absorbings (np.ndarray) – array of absorbing flags;
  • lasts (np.ndarray) – array of last flags.
Returns:

The list of transitions.

episodes_length(dataset)[source]

Compute the length of each episode in the dataset.

Parameters:dataset (list) – the dataset to consider.
Returns:A list with the length of each episode in the dataset.
select_first_episodes(dataset, n_episodes, parse=False)[source]

Return the first n_episodes episodes in the provided dataset.

Parameters:
  • dataset (list) – the dataset to consider;
  • n_episodes (int) – the number of episodes to pick from the dataset;
  • parse (bool, False) – whether to parse the dataset to return.
Returns:

A subset of the dataset containing the first n_episodes episodes.

select_random_samples(dataset, n_samples, parse=False)[source]

Return the desired number of randomly picked samples from the provided dataset.

Parameters:
  • dataset (list) – the dataset to consider;
  • n_samples (int) – the number of samples to pick from the dataset;
  • parse (bool, False) – whether to parse the dataset to return.
Returns:

A subset of the dataset containing randomly picked n_samples samples.

compute_J(dataset, gamma=1.0)[source]

Compute the cumulative discounted reward of each episode in the dataset.

Parameters:
  • dataset (list) – the dataset to consider;
  • gamma (float, 1.) – discount factor.
Returns:

The cumulative discounted reward of each episode in the dataset.

compute_metrics(dataset, gamma=1.0)[source]

Compute the metrics of each complete episode in the dataset.

Parameters:
  • dataset (list) – the dataset to consider;
  • gamma (float, 1.) – the discount factor.
Returns:

The minimum score reached in an episode, the maximum score reached in an episode, the mean score reached, the number of completed games.

If no episode has been completed, it returns 0 for all values.
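A minimal example on a hand-made dataset; the two transitions below form a single illustrative episode:

import numpy as np

from mushroom_rl.utils.dataset import parse_dataset, compute_J, episodes_length

dataset = [
    (np.array([0]), np.array([1]), 1., np.array([1]), False, False),
    (np.array([1]), np.array([0]), 2., np.array([2]), False, True)
]

states, actions, rewards, next_states, absorbing, last = parse_dataset(dataset)
print(compute_J(dataset, gamma=.9))   # -> [2.8], i.e. 1. + .9 * 2.
print(episodes_length(dataset))       # -> [2]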

Eligibility trace

EligibilityTrace(shape, name='replacing')[source]

Factory method to create an eligibility trace of the provided type.

Parameters:
  • shape (list) – shape of the eligibility trace table;
  • name (str, 'replacing') – type of the eligibility trace.
Returns:

The eligibility trace table of the provided shape and type.

class ReplacingTrace(shape, initial_value=0.0, dtype=None)[source]

Bases: mushroom_rl.utils.table.Table

Replacing trace.

reset()[source]
update(state, action)[source]
__init__(shape, initial_value=0.0, dtype=None)

Constructor.

Parameters:
  • shape (tuple) – the shape of the tabular regressor.
  • initial_value (float, 0.) – the initial value for each entry of the tabular regressor.
  • dtype ([int, float], None) – the dtype of the table array.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the corresponding library. If a “!” character is appended to the method name, the field is saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _append_folder(folder, name)
static _get_serialization_method(class_name)
static _load_json(zip_file, name)
classmethod _load_list(zip_file, folder, length)
static _load_mushroom(zip_file, name)
static _load_numpy(zip_file, name)
static _load_pickle(zip_file, name)
static _load_torch(zip_file, name)
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

static _save_json(zip_file, name, obj, folder, **_)
static _save_mushroom(zip_file, name, obj, folder, full_save)
static _save_numpy(zip_file, name, obj, folder, **_)
static _save_pickle(zip_file, name, obj, folder, **_)
static _save_torch(zip_file, name, obj, folder, **_)
copy()
Returns:A deepcopy of the agent.
fit(x, y)
Parameters:
  • x (int) – index of the table to be filled;
  • y (float) – value to fill in the table.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
classmethod load_zip(zip_file, folder='')
n_actions

The number of actions considered by the table.

Type:Returns
predict(*z)

Predict the output of the table given an input.

Parameters:*z (list) – list of inputs of the model. If the table is a Q-table, this list may contain states or states and actions, depending on whether the call requires to predict all q-values or only the q-value corresponding to the provided action.
Returns:

The table prediction.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
shape

The shape of the table.

Type:Returns
class AccumulatingTrace(shape, initial_value=0.0, dtype=None)[source]

Bases: mushroom_rl.utils.table.Table

Accumulating trace.

reset()[source]
update(state, action)[source]
__init__(shape, initial_value=0.0, dtype=None)

Constructor.

Parameters:
  • shape (tuple) – the shape of the tabular regressor.
  • initial_value (float, 0.) – the initial value for each entry of the tabular regressor.
  • dtype ([int, float], None) – the dtype of the table array.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
static _append_folder(folder, name)
static _get_serialization_method(class_name)
static _load_json(zip_file, name)
classmethod _load_list(zip_file, folder, length)
static _load_mushroom(zip_file, name)
static _load_numpy(zip_file, name)
static _load_pickle(zip_file, name)
static _load_torch(zip_file, name)
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

static _save_json(zip_file, name, obj, folder, **_)
static _save_mushroom(zip_file, name, obj, folder, full_save)
static _save_numpy(zip_file, name, obj, folder, **_)
static _save_pickle(zip_file, name, obj, folder, **_)
static _save_torch(zip_file, name, obj, folder, **_)
copy()
Returns:A deepcopy of the agent.
fit(x, y)
Parameters:
  • x (int) – index of the table to be filled;
  • y (float) – value to fill in the table.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
classmethod load_zip(zip_file, folder='')
n_actions

The number of actions considered by the table.

Type:Returns
predict(*z)

Predict the output of the table given an input.

Parameters:*z (list) – list of inputs of the model. If the table is a Q-table, this list may contain states or states and actions, depending on whether the call requires to predict all q-values or only the q-value corresponding to the provided action.
Returns:

The table prediction.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
shape

The shape of the table.

Type:Returns

Features

uniform_grid(n_centers, low, high)[source]

This function is used to create the parameters of uniformly spaced radial basis functions with 25% overlap. It creates a uniformly spaced grid of n_centers[i] points along each dimension i, in the range [low[i], high[i]]. It also returns a vector containing the appropriate scales of the radial basis functions.

Parameters:
  • n_centers (list) – number of centers of each dimension;
  • low (np.ndarray) – lowest value for each dimension;
  • high (np.ndarray) – highest value for each dimension.
Returns:

The uniformly spaced grid and the scale vector.
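
For instance, a small sketch (the import path mushroom_rl.utils.features is an assumption):

import numpy as np
from mushroom_rl.utils.features import uniform_grid

# 3 x 2 grid of RBF centers over the box [0, 1] x [-1, 1]
centers, scales = uniform_grid([3, 2], np.array([0., -1.]), np.array([1., 1.]))
print(centers, scales)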

Folder

mk_dir_recursive(dir_path)[source]

Create a directory and, if needed, the full directory tree. Unlike os.mkdir, this function does not raise an exception when the directory already exists.

Parameters:dir_path (str) – the path of the directory to create.

Create a symlink deleting the previous one, if it already exists.

Parameters:
  • src (str) – source;
  • dst (str) – destination.

Frames

class LazyFrames(frames, history_length)[source]

Bases: object

From OpenAI Baselines. https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py

This class provides a solution to optimize the use of memory when concatenating different frames, e.g. Atari frames in DQN. The frames are individually stored in a list and, when numpy arrays containing them are created, the reference to each frame is used instead of a copy.

__init__(frames, history_length)[source]

Initialize self. See help(type(self)) for accurate signature.

preprocess_frame(obs, img_size)[source]

Convert a frame from rgb to grayscale and resize it.

Parameters:
  • obs (np.ndarray) – array representing an rgb frame;
  • img_size (tuple) – target size for images.
Returns:

The transformed frame as an 8-bit integer array.
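
A minimal sketch, assuming the import path mushroom_rl.utils.frames and a (height, width) target size:

import numpy as np
from mushroom_rl.utils.frames import preprocess_frame

# Fake 210 x 160 RGB observation, e.g. an Atari frame
rgb_frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
gray = preprocess_frame(rgb_frame, (84, 84))   # grayscale, resized, 8-bit integers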

Minibatches

minibatch_number(size, batch_size)[source]

Function to retrieve the number of batches, given a batch size.

Parameters:
  • size (int) – size of the dataset;
  • batch_size (int) – size of the batches.
Returns:

The number of minibatches in the dataset.

minibatch_generator(batch_size, *dataset)[source]

Generator that creates a minibatch from the full dataset.

Parameters:
  • batch_size (int) – the maximum size of each minibatch;
  • dataset – the dataset to be split.
Returns:

The current minibatch.
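
A small usage sketch, assuming the import path mushroom_rl.utils.minibatches:

import numpy as np
from mushroom_rl.utils.minibatches import minibatch_number, minibatch_generator

states = np.arange(10).reshape(10, 1)
actions = np.arange(10)

print(minibatch_number(len(states), batch_size=4))   # number of minibatches covering the dataset
for s_batch, a_batch in minibatch_generator(4, states, actions):
    print(s_batch.shape, a_batch.shape)              # at most 4 samples per minibatch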

Numerical gradient

numerical_diff_policy(policy, state, action, eps=1e-06)[source]

Compute the gradient of a policy in (state, action) numerically.

Parameters:
  • policy (Policy) – the policy whose gradient has to be returned;
  • state (np.ndarray) – the state;
  • action (np.ndarray) – the action;
  • eps (float, 1e-6) – the value of the perturbation.
Returns:

The gradient of the provided policy in (state, action) computed numerically.

numerical_diff_dist(dist, theta, eps=1e-06)[source]

Compute the gradient of a distribution in theta numerically.

Parameters:
  • dist (Distribution) – the distribution whose gradient has to be returned;
  • theta (np.ndarray) – the parametrization where to compute the gradient;
  • eps (float, 1e-6) – the value of the perturbation.
Returns:

The gradient of the provided distribution in theta computed numerically.

numerical_diff_function(function, params, eps=1e-06)[source]

Compute the gradient of a function in theta numerically.

Parameters:
  • function – a function whose gradient has to be returned;
  • params – the parameter vector w.r.t. which the gradient is computed;
  • eps (float, 1e-6) – the value of the perturbation.
Returns:

The numerical gradient of the function computed w.r.t. parameters params.
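
For example, a finite-difference check on a simple quadratic function (the import path mushroom_rl.utils.numerical_gradient is an assumption):

import numpy as np
from mushroom_rl.utils.numerical_gradient import numerical_diff_function

f = lambda w: np.sum(w ** 2)           # quadratic function, analytic gradient is 2 * w
w0 = np.array([1., -2., 3.])
grad = numerical_diff_function(f, w0)  # expected to be close to [2., -4., 6.]
print(grad)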

Parameters

class Parameter(value, min_value=None, max_value=None, size=(1, ))[source]

Bases: mushroom_rl.core.serialization.Serializable

This class implements functions to manage parameters, such as the learning rate. It also allows having a single parameter for each state or state-action tuple.

__init__(value, min_value=None, max_value=None, size=(1, ))[source]

Constructor.

Parameters:
  • value (float) – initial value of the parameter;
  • min_value (float, None) – minimum value that the parameter can reach when decreasing;
  • max_value (float, None) – maximum value that the parameter can reach when increasing;
  • size (tuple, (1,)) – shape of the matrix of parameters; this shape can be used to have a single parameter for each state or state-action tuple.
__call__(*idx, **kwargs)[source]

Update and return the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The updated parameter in the provided index.
get_value(*idx, **kwargs)[source]

Return the current value of the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The current value of the parameter in the provided index.
_compute(*idx, **kwargs)[source]
Returns:The value of the parameter in the provided index.
update(*idx, **kwargs)[source]

Updates the number of visits of the parameter in the provided index.

Parameters:*idx (list) – index of the parameter whose number of visits has to be updated.
shape

The shape of the table of parameters.

Type:Returns
initial_value

The initial value of the parameters.

Type:Returns
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
class LinearParameter(value, threshold_value, n, size=(1, ))[source]

Bases: mushroom_rl.utils.parameters.Parameter

This class implements a linearly changing parameter according to the number of times it has been used.

__init__(value, threshold_value, n, size=(1, ))[source]

Constructor.

Parameters:
  • value (float) – initial value of the parameter;
  • threshold_value (float) – the value reached by the parameter after n updates;
  • n (int) – the number of updates needed for the parameter to reach threshold_value;
  • size (tuple, (1,)) – shape of the matrix of parameters; this shape can be used to have a single parameter for each state or state-action tuple.
_compute(*idx, **kwargs)[source]

Returns: The value of the parameter in the provided index.

__call__(*idx, **kwargs)

Update and return the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The updated parameter in the provided index.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
get_value(*idx, **kwargs)

Return the current value of the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The current value of the parameter in the provided index.
initial_value

The initial value of the parameters.

Type:Returns
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
shape

The shape of the table of parameters.

Type:Returns
update(*idx, **kwargs)

Updates the number of visits of the parameter in the provided index.

Parameters:*idx (list) – index of the parameter whose number of visits has to be updated.
class ExponentialParameter(value, exp=1.0, min_value=None, max_value=None, size=(1, ))[source]

Bases: mushroom_rl.utils.parameters.Parameter

This class implements an exponentially changing parameter according to the number of times it has been used.

__init__(value, exp=1.0, min_value=None, max_value=None, size=(1, ))[source]

Constructor.

Parameters:
  • value (float) – initial value of the parameter;
  • min_value (float, None) – minimum value that the parameter can reach when decreasing;
  • max_value (float, None) – maximum value that the parameter can reach when increasing;
  • size (tuple, (1,)) – shape of the matrix of parameters; this shape can be used to have a single parameter for each state or state-action tuple.
_compute(*idx, **kwargs)[source]

Returns: The value of the parameter in the provided index.

__call__(*idx, **kwargs)

Update and return the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The updated parameter in the provided index.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
get_value(*idx, **kwargs)

Return the current value of the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The current value of the parameter in the provided index.
initial_value

The initial value of the parameters.

Type:Returns
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
shape

The shape of the table of parameters.

Type:Returns
update(*idx, **kwargs)

Updates the number of visits of the parameter in the provided index.

Parameters:*idx (list) – index of the parameter whose number of visits has to be updated.
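
The sketch below shows how decaying parameters can be created and queried; the decay schedules described in the comments are assumptions based on the constructor arguments above:

from mushroom_rl.utils.parameters import LinearParameter, ExponentialParameter

eps = LinearParameter(value=1., threshold_value=.1, n=100)  # assumed: from 1.0 towards 0.1 in 100 updates
alpha = ExponentialParameter(value=1., exp=.5)              # assumed: decays with the number of visits

for _ in range(5):
    print(eps(), alpha())   # __call__ updates the visit count and returns the new value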

Replay memory

class ReplayMemory(initial_size, max_size)[source]

Bases: mushroom_rl.core.serialization.Serializable

This class implements functions to manage a replay memory like the one used in “Human-Level Control Through Deep Reinforcement Learning” by Mnih V. et al.

__init__(initial_size, max_size)[source]

Constructor.

Parameters:
  • initial_size (int) – initial number of elements in the replay memory;
  • max_size (int) – maximum number of elements that the replay memory can contain.
add(dataset, n_steps_return=1, gamma=1.0)[source]

Add elements to the replay memory.

Parameters:
  • dataset (list) – list of elements to add to the replay memory;
  • n_steps_return (int, 1) – number of steps to consider for computing n-step return;
  • gamma (float, 1.) – discount factor for n-step return.
get(n_samples)[source]

Returns the provided number of samples from the replay memory.

Parameters:n_samples (int) – the number of samples to return.

Returns:The requested number of samples.
reset()[source]

Reset the replay memory.

initialized

Whether the replay memory has reached the number of elements that allows it to be used.

Type:Returns
size

The number of elements contained in the replay memory.

Type:Returns
_post_load()[source]

This method can be overwritten to implement logic that is executed after the loading of the agent.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
copy()
Returns:A deepcopy of the agent.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
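
A minimal sketch of filling and sampling the memory; the import path mushroom_rl.utils.replay_memory and the transition format (state, action, reward, next_state, absorbing, last) are assumptions:

import numpy as np
from mushroom_rl.utils.replay_memory import ReplayMemory

memory = ReplayMemory(initial_size=5, max_size=100)

# Dummy transitions in the assumed dataset format: (s, a, r, s', absorbing, last)
dataset = [(np.array([0.]), np.array([0]), 1., np.array([1.]), False, False)
           for _ in range(10)]
memory.add(dataset)

if memory.initialized:
    s, a, r, ss, absorbing, last = memory.get(5)   # assumed to return the six fields as arrays
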
class SumTree(max_size)[source]

Bases: object

This class implements a sum tree data structure. This is used, for instance, by PrioritizedReplayMemory.

__init__(max_size)[source]

Constructor.

Parameters:max_size (int) – maximum size of the tree.
add(dataset, priority, n_steps_return, gamma)[source]

Add elements to the tree.

Parameters:
  • dataset (list) – list of elements to add to the tree;
  • priority (np.ndarray) – priority of each sample in the dataset;
  • n_steps_return (int) – number of steps to consider for computing n-step return;
  • gamma (float) – discount factor for n-step return.
get(s)[source]

Returns the sample corresponding to the provided value.

Parameters:s (float) – the value of the samples to return.
Returns:The requested sample.
update(idx, priorities)[source]

Update the priority of the sample at the provided index in the dataset.

Parameters:
  • idx (np.ndarray) – indexes of the transitions in the dataset;
  • priorities (np.ndarray) – priorities of the transitions.
size

The current size of the tree.

Type:Returns
max_p

The maximum priority among the ones in the tree.

Type:Returns
total_p

The sum of the priorities in the tree, i.e. the value of the root node.

Type:Returns
class PrioritizedReplayMemory(initial_size, max_size, alpha, beta, epsilon=0.01)[source]

Bases: mushroom_rl.core.serialization.Serializable

This class implements functions to manage a prioritized replay memory like the one used in “Prioritized Experience Replay” by Schaul et al., 2015.

__init__(initial_size, max_size, alpha, beta, epsilon=0.01)[source]

Constructor.

Parameters:
  • initial_size (int) – initial number of elements in the replay memory;
  • max_size (int) – maximum number of elements that the replay memory can contain;
  • alpha (float) – prioritization coefficient;
  • beta ([float, Parameter]) – importance sampling coefficient;
  • epsilon (float, .01) – small value to avoid zero probabilities.
add(dataset, p, n_steps_return=1, gamma=1.0)[source]

Add elements to the replay memory.

Parameters:
  • dataset (list) – list of elements to add to the replay memory;
  • p (np.ndarray) – priority of each sample in the dataset.
  • n_steps_return (int, 1) – number of steps to consider for computing n-step return;
  • gamma (float, 1.) – discount factor for n-step return.
get(n_samples)[source]

Returns the provided number of samples from the replay memory.

Parameters:n_samples (int) – the number of samples to return.
Returns:The requested number of samples.
update(error, idx)[source]

Update the priority of the sample at the provided index in the dataset.

Parameters:
  • error (np.ndarray) – errors to consider to compute the priorities;
  • idx (np.ndarray) – indexes of the transitions in the dataset.
initialized

Whether the replay memory has reached the number of elements that allows it to be used.

Type:Returns
max_priority

The maximum value of priority inside the replay memory.

Type:Returns
_post_load()[source]

This method can be overwritten to implement logic that is executed after the loading of the agent.

_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
copy()
Returns:A deepcopy of the agent.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.

Spaces

class Box(low, high, shape=None)[source]

Bases: mushroom_rl.core.serialization.Serializable

This class implements functions to manage continuous states and action spaces. It is similar to the Box class in gym.spaces.box.

__init__(low, high, shape=None)[source]

Constructor.

Parameters:
  • low ([float, np.ndarray]) – the minimum value of each dimension of the space. If a scalar value is provided, this value is considered as the minimum one for each dimension. If a np.ndarray is provided, each i-th element is considered the minimum value of the i-th dimension;
  • high ([float, np.ndarray]) – the maximum value of each dimension of the space. If a scalar value is provided, this value is considered as the maximum one for each dimension. If a np.ndarray is provided, each i-th element is considered the maximum value of the i-th dimension;
  • shape (np.ndarray, None) – the dimension of the space. Must match the shape of low and high, if they are np.ndarray.
low

The minimum value of each dimension of the space.

Type:Returns
high

The maximum value of each dimension of the space.

Type:Returns
shape

The dimensions of the space.

Type:Returns
class Discrete(n)[source]

Bases: mushroom_rl.core.serialization.Serializable

This class implements functions to manage discrete states and action spaces. It is similar to the Discrete class in gym.spaces.discrete.

__init__(n)[source]

Constructor.

Parameters:n (int) – the number of values of the space.
size

The number of elements of the space.

Type:Returns
shape

The shape of the space that is always (1,).

Type:Returns
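
A short sketch creating the two spaces (assuming they can be imported from mushroom_rl.utils.spaces):

import numpy as np
from mushroom_rl.utils.spaces import Box, Discrete

observation_space = Box(low=np.array([-1., 0.]), high=np.array([1., 10.]))
action_space = Discrete(4)

print(observation_space.low, observation_space.high, observation_space.shape)
print(action_space.size, action_space.shape)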

Table

class Table(shape, initial_value=0.0, dtype=None)[source]

Bases: mushroom_rl.core.serialization.Serializable

Table regressor. Used for discrete state and action spaces.

__init__(shape, initial_value=0.0, dtype=None)[source]

Constructor.

Parameters:
  • shape (tuple) – the shape of the tabular regressor.
  • initial_value (float, 0.) – the initial value for each entry of the tabular regressor.
  • dtype ([int, float], None) – the dtype of the table array.
fit(x, y)[source]
Parameters:
  • x (int) – index of the table to be filled;
  • y (float) – value to fill in the table.
predict(*z)[source]

Predict the output of the table given an input.

Parameters:*z (list) – list of inputs of the model. If the table is a Q-table, this list may contain states or states and actions, depending on whether the call requires to predict all q-values or only the q-value corresponding to the provided action.
Returns:

The table prediction.

n_actions

The number of actions considered by the table.

Type:Returns
shape

The shape of the table.

Type:Returns
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
class EnsembleTable(n_models, shape, **params)[source]

Bases: mushroom_rl.approximators._implementations.ensemble.Ensemble

This class implements functions to manage table ensembles.

__init__(n_models, shape, **params)[source]

Constructor.

Parameters:
  • n_models (int) – number of models in the ensemble;
  • shape (np.ndarray) – shape of each table in the ensemble.
  • **params – parameters dictionary to create each regressor.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
fit(*z, idx=None, **fit_params)

Fit the idx-th model of the ensemble if idx is provided, every model otherwise.

Parameters:
  • *z – a list containing the inputs to use to predict with each regressor of the ensemble;
  • idx (int, None) – index of the model to fit;
  • **fit_params – other params.
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
model

The list of the models in the ensemble.

Type:Returns
predict(*z, idx=None, prediction=None, compute_variance=False, **predict_params)

Predict.

Parameters:
  • *z – a list containing the inputs to use to predict with each regressor of the ensemble;
  • idx (int, None) – index of the model to use for prediction;
  • prediction (str, None) – the type of prediction to make. When provided, it overrides the prediction class attribute;
  • compute_variance (bool, False) – whether to compute the variance of the prediction or not;
  • **predict_params – other parameters used by the predict method of the regressor.
Returns:

The predictions of the model.

reset()

Reset the model parameters.

save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.

Torch

set_weights(parameters, weights, use_cuda)[source]

Function used to set the value of a set of torch parameters given a vector of values.

Parameters:
  • parameters (list) – list of parameters to be considered;
  • weights (numpy.ndarray) – array of the new values for the parameters;
  • use_cuda (bool) – whether the parameters are cuda tensors or not;
get_weights(parameters)[source]

Function used to get the value of a set of torch parameters as a single vector of values.

Parameters:parameters (list) – list of parameters to be considered.
Returns:A numpy vector containing the values of all the parameters.
zero_grad(parameters)[source]

Function used to set to zero the value of the gradient of a set of torch parameters.

Parameters:parameters (list) – list of parameters to be considered.
get_gradient(params)[source]

Function used to get the value of the gradient of a set of torch parameters.

Parameters:params (list) – list of parameters to be considered.
to_float_tensor(x, use_cuda=False)[source]

Function used to convert a numpy array to a float torch tensor.

Parameters:
  • x (np.ndarray) – numpy array to be converted as torch tensor;
  • use_cuda (bool) – whether to build a cuda tensor or not.
Returns:

A float tensor built from the values contained in the input array.

to_int_tensor(x, use_cuda=False)[source]

Function used to convert a numpy array to an int torch tensor.

Parameters:
  • x (np.ndarray) – numpy array to be converted as torch tensor;
  • use_cuda (bool) – whether to build a cuda tensor or not.
Returns:

An int tensor built from the values contained in the input array.
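
The following sketch shows the round trip between a network's parameters and a flat numpy vector; the import path mushroom_rl.utils.torch is an assumption:

import numpy as np
import torch
from mushroom_rl.utils.torch import get_weights, set_weights, zero_grad, to_float_tensor

net = torch.nn.Linear(4, 2)
params = list(net.parameters())

w = get_weights(params)                                # flat numpy vector with all the weights
set_weights(params, np.zeros_like(w), use_cuda=False)  # overwrite every weight with zero
zero_grad(params)                                      # reset the gradients

x = to_float_tensor(np.ones((3, 4)))                   # numpy array -> float torch tensor
print(net(x))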

Value Functions

compute_advantage_montecarlo(V, s, ss, r, absorbing, gamma)[source]

Function to estimate the advantage and new value function target over a dataset. The value function is estimated using rollouts (Monte Carlo estimation).

Parameters:
  • V (Regressor) – the current value function regressor;
  • s (numpy.ndarray) – the set of states in which we want to evaluate the advantage;
  • ss (numpy.ndarray) – the set of next states in which we want to evaluate the advantage;
  • r (numpy.ndarray) – the reward obtained in each transition from state s to state ss;
  • absorbing (numpy.ndarray) – an array of boolean flags indicating if the reached state is absorbing;
  • gamma (float) – the discount factor of the considered problem.
Returns:

The new estimate for the value function of the next state and the advantage function.

compute_advantage(V, s, ss, r, absorbing, gamma)[source]

Function to estimate the advantage and new value function target over a dataset. The value function is estimated using bootstrapping.

Parameters:
  • V (Regressor) – the current value function regressor;
  • s (numpy.ndarray) – the set of states in which we want to evaluate the advantage;
  • ss (numpy.ndarray) – the set of next states in which we want to evaluate the advantage;
  • r (numpy.ndarray) – the reward obtained in each transition from state s to state ss;
  • absorbing (numpy.ndarray) – an array of boolean flags indicating if the reached state is absorbing;
  • gamma (float) – the discount factor of the considered problem.
Returns:

The new estimate for the value function of the next state and the advantage function.

compute_gae(V, s, ss, r, absorbing, last, gamma, lam)[source]

Function to compute Generalized Advantage Estimation (GAE) and new value function target over a dataset.

“High-Dimensional Continuous Control Using Generalized Advantage Estimation”. Schulman J. et al.. 2016.

Parameters:
  • V (Regressor) – the current value function regressor;
  • s (numpy.ndarray) – the set of states in which we want to evaluate the advantage;
  • ss (numpy.ndarray) – the set of next states in which we want to evaluate the advantage;
  • r (numpy.ndarray) – the reward obtained in each transition from state s to state ss;
  • absorbing (numpy.ndarray) – an array of boolean flags indicating if the reached state is absorbing;
  • last (numpy.ndarray) – an array of boolean flags indicating if the reached state is the last of the trajectory;
  • gamma (float) – the discount factor of the considered problem;
  • lam (float) – the value of the lambda coefficient used by the GAE algorithm.
Returns:

The new estimate for the value function of the next state and the estimated generalized advantage.

Variance parameters

class VarianceParameter(value, exponential=False, min_value=None, tol=1.0, size=(1, ))[source]

Bases: mushroom_rl.utils.parameters.Parameter

Abstract class to implement variance-dependent parameters. A target parameter is expected.

__init__(value, exponential=False, min_value=None, tol=1.0, size=(1, ))[source]

Constructor.

Parameters:tol (float) – value of the variance of the target variable such that the parameter value is 0.5.
_compute(*idx, **kwargs)[source]

Returns: The value of the parameter in the provided index.

update(*idx, **kwargs)[source]

Updates the value of the parameter in the provided index.

Parameters:
  • *idx (list) – index of the parameter whose number of visits has to be updated.
  • target (float) – Value of the target variable;
  • factor (float) – Multiplicative factor for the parameter value, useful when the parameter depends on another parameter value.
__call__(*idx, **kwargs)

Update and return the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The updated parameter in the provided index.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
get_value(*idx, **kwargs)

Return the current value of the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The current value of the parameter in the provided index.
initial_value

The initial value of the parameters.

Type:Returns
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
shape

The shape of the table of parameters.

Type:Returns
class VarianceIncreasingParameter(value, exponential=False, min_value=None, tol=1.0, size=(1, ))[source]

Bases: mushroom_rl.utils.variance_parameters.VarianceParameter

Class implementing a parameter that increases with the target variance.

__call__(*idx, **kwargs)

Update and return the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The updated parameter in the provided index.
__init__(value, exponential=False, min_value=None, tol=1.0, size=(1, ))

Constructor.

Parameters:tol (float) – value of the variance of the target variable such that the parameter value is 0.5.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_compute(*idx, **kwargs)

Returns: The value of the parameter in the provided index.

_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
get_value(*idx, **kwargs)

Return the current value of the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The current value of the parameter in the provided index.
initial_value

The initial value of the parameters.

Type:Returns
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
shape

The shape of the table of parameters.

Type:Returns
update(*idx, **kwargs)

Updates the value of the parameter in the provided index.

Parameters:
  • *idx (list) – index of the parameter whose number of visits has to be updated.
  • target (float) – Value of the target variable;
  • factor (float) – Multiplicative factor for the parameter value, useful when the parameter depends on another parameter value.
class VarianceDecreasingParameter(value, exponential=False, min_value=None, tol=1.0, size=(1, ))[source]

Bases: mushroom_rl.utils.variance_parameters.VarianceParameter

Class implementing a parameter that decreases with the target variance.

__call__(*idx, **kwargs)

Update and return the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The updated parameter in the provided index.
__init__(value, exponential=False, min_value=None, tol=1.0, size=(1, ))

Constructor.

Parameters:tol (float) – value of the variance of the target variable such that the parameter value is 0.5.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_compute(*idx, **kwargs)

Returns: The value of the parameter in the provided index.

_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
get_value(*idx, **kwargs)

Return the current value of the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The current value of the parameter in the provided index.
initial_value

The initial value of the parameters.

Type:Returns
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
shape

The shape of the table of parameters.

Type:Returns
update(*idx, **kwargs)

Updates the value of the parameter in the provided index.

Parameters:
  • *idx (list) – index of the parameter whose number of visits has to be updated.
  • target (float) – Value of the target variable;
  • factor (float) – Multiplicative factor for the parameter value, useful when the parameter depends on another parameter value.
class WindowedVarianceParameter(value, exponential=False, min_value=None, tol=1.0, window=100, size=(1, ))[source]

Bases: mushroom_rl.utils.parameters.Parameter

Abstract class to implement variance-dependent parameters. A target parameter is expected. Differently from the VarianceParameter class, the variance is computed in a window interval.

__init__(value, exponential=False, min_value=None, tol=1.0, window=100, size=(1, ))[source]

Constructor.

Parameters:
  • tol (float) – value of the variance of the target variable such that the parameter value is 0.5.
  • window (int) – length of the window used to compute the variance of the target variable.
_compute(*idx, **kwargs)[source]

Returns: The value of the parameter in the provided index.

update(*idx, **kwargs)[source]

Updates the value of the parameter in the provided index.

Parameters:
  • *idx (list) – index of the parameter whose number of visits has to be updated.
  • target (float) – Value of the target variable;
  • factor (float) – Multiplicative factor for the parameter value, useful when the parameter depends on another parameter value.
__call__(*idx, **kwargs)

Update and return the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The updated parameter in the provided index.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
get_value(*idx, **kwargs)

Return the current value of the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The current value of the parameter in the provided index.
initial_value

The initial value of the parameters.

Type:Returns
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
shape

The shape of the table of parameters.

Type:Returns
class WindowedVarianceIncreasingParameter(value, exponential=False, min_value=None, tol=1.0, window=100, size=(1, ))[source]

Bases: mushroom_rl.utils.variance_parameters.WindowedVarianceParameter

Class implementing a parameter that increases with the target variance, where the variance is computed in a fixed-length window.

__call__(*idx, **kwargs)

Update and return the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The updated parameter in the provided index.
__init__(value, exponential=False, min_value=None, tol=1.0, window=100, size=(1, ))

Constructor.

Parameters:
  • tol (float) – value of the variance of the target variable such that the parameter value is 0.5.
  • window (int) – length of the window used to compute the variance of the target variable.
_add_save_attr(**attr_dict)

Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library they are named after. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

Parameters:**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
_compute(*idx, **kwargs)

Returns: The value of the parameter in the provided index.

_post_load()

This method can be overwritten to implement logic that is executed after the loading of the agent.

copy()
Returns:A deepcopy of the agent.
get_value(*idx, **kwargs)

Return the current value of the parameter in the provided index.

Parameters:*idx (list) – index of the parameter to return.
Returns:The current value of the parameter in the provided index.
initial_value

The initial value of the parameters.

Type:Returns
classmethod load(path)

Load and deserialize the agent from the given location on disk.

Parameters:path (Path, string) – Relative or absolute path to the agents save location.
Returns:The loaded agent.
save(path, full_save=False)

Serialize and save the object to the given path on disk.

Parameters:
  • path (Path, str) – Relative or absolute path to the object save location;
  • full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.
save_zip(zip_file, full_save, folder='')

Serialize and save the agent to the given path on disk.

Parameters:
  • zip_file (ZipFile) – ZipFile where the object needs to be saved;
  • full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
  • folder (string, '') – subfolder to be used by the save method.
shape

The shape of the table of parameters.

Type:Returns
update(*idx, **kwargs)

Updates the value of the parameter in the provided index.

Parameters:
  • *idx (list) – index of the parameter whose number of visits has to be updated.
  • target (float) – Value of the target variable;
  • factor (float) – Multiplicative factor for the parameter value, useful when the parameter depends on another parameter value.

Viewer

class ImageViewer(size, dt)[source]

Bases: object

Interface to pygame for visualizing plain images.

__init__(size, dt)[source]

Constructor.

Parameters:
  • size ([list, tuple]) – size of the displayed image;
  • dt (float) – duration of a control step.
display(img)[source]

Display given frame.

Parameters:img – image to display.
class Viewer(env_width, env_height, width=500, height=500, background=(0, 0, 0))[source]

Bases: object

Interface to pygame for visualizing mushroom native environments.

__init__(env_width, env_height, width=500, height=500, background=(0, 0, 0))[source]

Constructor.

Parameters:
  • env_width (int) – The x dimension limit of the desired environment;
  • env_height (int) – The y dimension limit of the desired environment;
  • width (int, 500) – width of the environment window;
  • height (int, 500) – height of the environment window;
  • background (tuple, (0, 0, 0)) – background color of the screen.
screen

Property.

Returns:The screen created by this viewer.
size

Property.

Returns:The size of the screen.
line(start, end, color=(255, 255, 255), width=1)[source]

Draw a line on the screen.

Parameters:
  • start (np.ndarray) – starting point of the line;
  • end (np.ndarray) – end point of the line;
  • color (tuple (255, 255, 255)) – color of the line;
  • width (int, 1) – width of the line.
square(center, angle, edge, color=(255, 255, 255), width=0)[source]

Draw a square on the screen and apply a roto-translation to it.

Parameters:
  • center (np.ndarray) – the center of the polygon;
  • angle (float) – the rotation to apply to the polygon;
  • edge (float) – length of an edge;
  • color (tuple, (255, 255, 255)) – the color of the polygon;
  • width (int, 0) – the width of the polygon line, 0 to fill the polygon.
polygon(center, angle, points, color=(255, 255, 255), width=0)[source]

Draw a polygon on the screen and apply a roto-translation to it.

Parameters:
  • center (np.ndarray) – the center of the polygon;
  • angle (float) – the rotation to apply to the polygon;
  • points (list) – the points of the polygon w.r.t. the center;
  • color (tuple, (255, 255, 255)) – the color of the polygon;
  • width (int, 0) – the width of the polygon line, 0 to fill the polygon.
circle(center, radius, color=(255, 255, 255), width=0)[source]

Draw a circle on the screen.

Parameters:
  • center (np.ndarray) – the center of the circle;
  • radius (float) – the radius of the circle;
  • color (tuple, (255, 255, 255)) – the color of the circle;
  • width (int, 0) – the width of the circle line, 0 to fill the circle.
arrow_head(center, scale, angle, color=(255, 255, 255))[source]

Draw an arrow head.

Parameters:
  • center (np.ndarray) – the position of the arrow head;
  • scale (float) – scale of the arrow, corresponds to the length;
  • angle (float) – the angle of rotation of the arrow head;
  • color (tuple, (255, 255, 255)) – the color of the arrow.
force_arrow(center, direction, force, max_force, max_length, color=(255, 255, 255), width=1)[source]

Draw a force arrow, i.e. an arrow representing a force. The length of the arrow is directly proportional to the force value.

Parameters:
  • center (np.ndarray) – the point where the force is applied;
  • direction (np.ndarray) – the direction of the force;
  • force (float) – the applied force value;
  • max_force (float) – the maximum force value;
  • max_length (float) – the length to use for the maximum force;
  • color (tuple, (255, 255, 255)) – the color of the arrow;
  • width (int, 1) – the width of the force arrow.
torque_arrow(center, torque, max_torque, max_radius, color=(255, 255, 255), width=1)[source]

Draw a torque arrow, i.e. a circular arrow representing a torque. The radius of the arrow is directly proportional to the torque value.

Parameters:
  • center (np.ndarray) – the point where the torque is applied;
  • torque (float) – the applied torque value;
  • max_torque (float) – the maximum torque value;
  • max_radius (float) – the radius to use for the maximum torque;
  • color (tuple, (255, 255, 255)) – the color of the arrow;
  • width (int, 1) – the width of the torque arrow.
background_image(img)[source]

Use the given image as background for the window, rescaling it appropriately.

Parameters:img – the image to be used.
function(x_s, x_e, f, n_points=100, width=1, color=(255, 255, 255))[source]

Draw the graph of a function in the image.

Parameters:
  • x_s (float) – starting x coordinate;
  • x_e (float) – final x coordinate;
  • f (function) – the function that maps x coordinates into y coordinates;
  • n_points (int, 100) – the number of segments used to approximate the function to draw;
  • width (int, 1) – the width of the line drawn;
  • color (tuple, (255,255,255)) – the color of the line.
display(s)[source]

Display current frame and initialize the next frame to the background color.

Parameters:s – time to wait in visualization.
close()[source]

Close the viewer, destroy the window.
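
As a minimal usage sketch based only on the methods documented above (coordinates are arbitrary illustration values), the viewer can be used to draw a few primitives and display them:

import numpy as np
from mushroom_rl.utils.viewer import Viewer

# 5x5 environment rendered in the default 500x500 pixel window
viewer = Viewer(5, 5)

for _ in range(10):
    # Draw a line and a circle in environment coordinates
    viewer.line(np.array([0.5, 0.5]), np.array([4.5, 4.5]))
    viewer.circle(np.array([2.5, 2.5]), 0.6, color=(0, 255, 0))

    # Show the frame for 0.1 seconds, then reset it to the background color
    viewer.display(0.1)

viewer.close()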

Tutorials

How to make a simple experiment

The main purpose of MushroomRL is to simplify the scripting of RL experiments. A standard example of a script to run an experiment in MushroomRL consists of:

  • an initial part where the settings of the experiment are specified;
  • a middle part where the experiment is run;
  • a final part where operations like evaluation, plotting, and saving can be done.

An RL experiment consists of:

  • an MDP;
  • an agent;
  • a core.

An MDP is the problem to be solved by the agent. It contains the function to move the agent in the environment according to the provided action. The MDP can be simply created with:

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

from mushroom_rl.algorithms.value import FQI
from mushroom_rl.core import Core
from mushroom_rl.environments import CarOnHill
from mushroom_rl.policy import EpsGreedy
from mushroom_rl.utils.dataset import compute_J
from mushroom_rl.utils.parameters import Parameter

mdp = CarOnHill()

A MushroomRL agent is the algorithm that is run to learn in the MDP. It consists of a policy approximator and of the methods to improve the policy during learning. It also contains the features to extract in the case of MDPs with continuous state and action spaces. An agent can be defined this way:

# Policy
epsilon = Parameter(value=1.)
pi = EpsGreedy(epsilon=epsilon)

# Approximator
approximator_params = dict(input_shape=mdp.info.observation_space.shape,
                           n_actions=mdp.info.action_space.n,
                           n_estimators=50,
                           min_samples_split=5,
                           min_samples_leaf=2)
approximator = ExtraTreesRegressor

# Agent
agent = FQI(mdp.info, pi, approximator, n_iterations=20,
            approximator_params=approximator_params)

This piece of code creates the policy followed by the agent (in this case, \(\varepsilon\)-greedy) with \(\varepsilon = 1\). Then, the policy approximator is created by specifying its parameters and its class (in this case, the ExtraTreesRegressor class of scikit-learn is used). Eventually, the agent is created by calling the algorithm class and providing the approximator and the policy, together with the parameters used by the algorithm.

To run the experiment, the core module has to be used. This module requires the agent and the MDP object and contains the function to learn in the MDP and evaluate the learned policy. It can be created with:

core = Core(agent, mdp)

Once the core has been created, the agent can be trained collecting a dataset and fitting the policy:

core.learn(n_episodes=1000, n_episodes_per_fit=1000)

In this case, the agent’s policy is fitted only once, after 1000 episodes have been collected. This is a common practice in batch RL algorithms such as FQI where, initially, samples are randomly collected and then the policy is fitted using the whole dataset of collected samples.
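
For comparison, online algorithms fit after every collected step rather than on a single large batch. This schedule is not appropriate for FQI itself, but the same interface is used later in these tutorials for online algorithms:

# Online-style schedule: one fit after every collected step
core.learn(n_steps=10000, n_steps_per_fit=1)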

Eventually, some operations to evaluate the learned policy can be done. This way the user can, for instance, compute the performance of the agent through the collected rewards during an evaluation run. Fixing \(\varepsilon = 0\), the greedy policy is applied starting from the provided initial states, then the average cumulative discounted reward is returned.

pi.set_epsilon(Parameter(0.))
initial_state = np.array([[-.5, 0.]])
dataset = core.evaluate(initial_states=initial_state)

print(compute_J(dataset, gamma=mdp.info.gamma))

How to make an advanced experiment

Continuous MDPs are a challenging class of problems to solve in RL. In these problems, a tabular regressor is not enough to approximate the Q-function, since there is an infinite number of states and actions. The solution is to use a function approximator (e.g. a neural network) fed with the raw values of states and actions. When a linear approximator is used, it is convenient to enlarge the input space with a space of non-linear features extracted from the raw values. This way, the linear approximator is often able to solve the MDP, despite its simplicity. Many RL algorithms rely on a linear approximator to solve an MDP; therefore, the use of features is very important. This tutorial shows how to solve a continuous MDP in MushroomRL using an algorithm that requires a linear approximator.

Initially, the MDP and the policy are created:

import numpy as np

from mushroom_rl.algorithms.value import SARSALambdaContinuous
from mushroom_rl.approximators.parametric import LinearApproximator
from mushroom_rl.core import Core
from mushroom_rl.features import Features
from mushroom_rl.features.tiles import Tiles
from mushroom_rl.policy import EpsGreedy
from mushroom_rl.utils.callbacks import CollectDataset
from mushroom_rl.utils.parameters import Parameter
from mushroom_rl.environments import Gym

# MDP
mdp = Gym(name='MountainCar-v0', horizon=np.inf, gamma=1.)

# Policy
epsilon = Parameter(value=0.)
pi = EpsGreedy(epsilon=epsilon)

This is an environment created with the MushroomRL interface to the OpenAI Gym library. Each environment offered by OpenAI Gym can be created this way by simply providing the corresponding id in the name parameter, except for the Atari games, which are managed by a separate class. After the creation of the MDP, the tiles features are created:

n_tilings = 10
tilings = Tiles.generate(n_tilings, [10, 10],
                         mdp.info.observation_space.low,
                         mdp.info.observation_space.high)
features = Features(tilings=tilings)

approximator_params = dict(input_shape=(features.size,),
                           output_shape=(mdp.info.action_space.n,),
                           n_actions=mdp.info.action_space.n)

In this example, we use sparse coding by means of tiles features. The generate method generates n_tilings evenly spaced grids of 10x10 tiles (the way the tilings are created is explained in “Reinforcement Learning: An Introduction”, Sutton & Barto, 1998). Eventually, the tilings are passed to the Features factory method that returns the feature class.

MushroomRL offers other types of features, such as radial basis functions and polynomial features. The former also have a faster implementation written in Tensorflow that can be used transparently.
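
As a hedged sketch of how radial basis features could be built in place of the tiles above (the GaussianRBF class and its generate signature are assumptions here; check the features documentation for the exact API):

# Assumed API: GaussianRBF.generate(n_centers_per_dimension, low, high)
from mushroom_rl.features.basis import GaussianRBF

basis = GaussianRBF.generate([10, 10],
                             mdp.info.observation_space.low,
                             mdp.info.observation_space.high)
features = Features(basis_list=basis)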

Then, the agent is created as usual, but this time passing the features to it. It is important to notice that the learning rate is divided by the number of tilings for the correctness of the update (see “Reinforcement Learning: An Introduction”, Sutton & Barto, 1998 for details). After that, the learning is run as usual:

learning_rate = Parameter(.1 / n_tilings)

agent = SARSALambdaContinuous(mdp.info, pi, LinearApproximator,
                              approximator_params=approximator_params,
                              learning_rate=learning_rate,
                              lambda_coeff=.9, features=features)

# Algorithm
collect_dataset = CollectDataset()
callbacks = [collect_dataset]
core = Core(agent, mdp, callbacks_fit=callbacks)

# Train
core.learn(n_episodes=100, n_steps_per_fit=1)

To visualize the learned policy, the rendering method of OpenAI Gym is used. To activate rendering in the environments that support it, it is necessary to set render=True.

core.evaluate(n_episodes=1, render=True)

How to create a regressor

MushroomRL offers a high-level interface to build function regressors. Indeed, it transparently manages regressors for generic functions and Q-function regressors. The user should not care about the low-level implementation of these regressors and should only use the Regressor interface. This interface creates a Q-function regressor or a GenericRegressor depending on whether the n_actions parameter is provided to the constructor or not.

Usage of the Regressor interface

When the action space of RL problems is finite and the adopted approach is value-based,
we want to compute the Q-function of each action. In MushroomRL, this is possible using:
  • a Q-function regressor with a different approximator for each action (ActionRegressor);
  • a single Q-function regressor with a different output for each action (QRegressor).

The QRegressor is suggested when the number of discrete actions is high, for memory reasons.

The user can create a QRegressor or an ActionRegressor by setting the output_shape parameter of the Regressor interface: if it is set to (1,), an ActionRegressor is created; if it is set to the number of discrete actions, a QRegressor is created.

Example

Initially, the MDP, the policy and the features are created:

import numpy as np

from mushroom_rl.algorithms.value import SARSALambdaContinuous
from mushroom_rl.approximators.parametric import LinearApproximator
from mushroom_rl.core import Core
from mushroom_rl.environments import *
from mushroom_rl.features import Features
from mushroom_rl.features.tiles import Tiles
from mushroom_rl.policy import EpsGreedy
from mushroom_rl.utils.callbacks import CollectDataset
from mushroom_rl.utils.parameters import Parameter


# MDP
mdp = Gym(name='MountainCar-v0', horizon=np.inf, gamma=1.)

# Policy
epsilon = Parameter(value=0.)
pi = EpsGreedy(epsilon=epsilon)

# Q-function approximator
n_tilings = 10
tilings = Tiles.generate(n_tilings, [10, 10],
                         mdp.info.observation_space.low,
                         mdp.info.observation_space.high)
features = Features(tilings=tilings)

# Agent
learning_rate = Parameter(.1 / n_tilings)

The following snippet sets the output shape of the regressor to the number of actions, creating a QRegressor:

approximator_params = dict(input_shape=(features.size,),
                           output_shape=(mdp.info.action_space.n,),
                           n_actions=mdp.info.action_space.n)

If you prefer to use an ActionRegressor, simply set the output_shape parameter to (1,):

approximator_params = dict(input_shape=(features.size,),
                           output_shape=(1,),
                           n_actions=mdp.info.action_space.n)

Then, the rest of the code fits the approximator and runs the evaluation rendering the behaviour of the agent:

agent = SARSALambdaContinuous(mdp.info, pi, LinearApproximator,
                              approximator_params=approximator_params,
                              learning_rate=learning_rate,
                              lambda_coeff=.9, features=features)

# Algorithm
collect_dataset = CollectDataset()
callbacks = [collect_dataset]
core = Core(agent, mdp, callbacks_fit=callbacks)

# Train
core.learn(n_episodes=100, n_steps_per_fit=1)

# Evaluate
core.evaluate(n_episodes=1, render=True)

Generic regressor

Whenever the n_actions parameter is not provided, the Regressor interface creates a GenericRegressor. This regressor can be used for general purposes and is more flexible. It is commonly used in policy search algorithms.

Example

Create a dataset of points distributed on a line with random Gaussian noise:

import numpy as np
from matplotlib import pyplot as plt

from mushroom_rl.approximators import Regressor
from mushroom_rl.approximators.parametric import LinearApproximator


x = np.arange(10).reshape(-1, 1)

intercept = 10
noise = np.random.randn(10, 1) * 1
y = 2 * x + intercept + noise

To fit the intercept, polynomial features of degree 1 are created by hand:

phi = np.concatenate((np.ones(10).reshape(-1, 1), x), axis=1)

The regressor is then created and fit (note that n_actions is not provided):

regressor = Regressor(LinearApproximator,
                      input_shape=(2,),
                      output_shape=(1,))

regressor.fit(phi, y)

Eventually, the approximated function of the regressor is plotted together with the target points. Moreover, the weights of the linear approximator and its gradient at the point x=5 are printed.

print('Weights: ' + str(regressor.get_weights()))
print('Gradient: ' + str(regressor.diff(np.array([[5.]]))))

plt.scatter(x, y)
plt.plot(x, regressor.predict(phi))
plt.show()

How to make a deep RL experiment

The usual script to run a deep RL experiment does not significantly differ from the one for a shallow RL experiment. This tutorial shows how to solve Atari games in MushroomRL using DQN, and how to solve MuJoCo tasks using DDPG. This tutorial will not explain some technicalities that are already described in the previous tutorials, and will only briefly explain how to run deep RL experiments. Be sure to read the previous tutorials before starting this one.

Solving Atari with DQN

This script runs the experiment to solve the Atari Breakout game as described in the DQN paper (“Human-level control through deep reinforcement learning”, Mnih V. et al., 2015). We start by creating the neural network to learn the action-value function:

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from mushroom_rl.algorithms.value import DQN
from mushroom_rl.approximators.parametric import TorchApproximator
from mushroom_rl.core import Core
from mushroom_rl.environments import Atari
from mushroom_rl.policy import EpsGreedy
from mushroom_rl.utils.dataset import compute_metrics
from mushroom_rl.utils.parameters import LinearParameter, Parameter


class Network(nn.Module):
    n_features = 512

    def __init__(self, input_shape, output_shape, **kwargs):
        super().__init__()

        n_input = input_shape[0]
        n_output = output_shape[0]

        self._h1 = nn.Conv2d(n_input, 32, kernel_size=8, stride=4)
        self._h2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self._h3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self._h4 = nn.Linear(3136, self.n_features)
        self._h5 = nn.Linear(self.n_features, n_output)

        nn.init.xavier_uniform_(self._h1.weight,
                                gain=nn.init.calculate_gain('relu'))
        nn.init.xavier_uniform_(self._h2.weight,
                                gain=nn.init.calculate_gain('relu'))
        nn.init.xavier_uniform_(self._h3.weight,
                                gain=nn.init.calculate_gain('relu'))
        nn.init.xavier_uniform_(self._h4.weight,
                                gain=nn.init.calculate_gain('relu'))
        nn.init.xavier_uniform_(self._h5.weight,
                                gain=nn.init.calculate_gain('linear'))

    def forward(self, state, action=None):
        h = F.relu(self._h1(state.float() / 255.))
        h = F.relu(self._h2(h))
        h = F.relu(self._h3(h))
        h = F.relu(self._h4(h.view(-1, 3136)))
        q = self._h5(h)

        if action is None:
            return q
        else:
            q_acted = torch.squeeze(q.gather(1, action.long()))

            return q_acted

Note that the forward function may return all the action-values of the state, or only the one of the provided action. This network will be used later in the script. Now, we define useful functions, set some hyperparameters, and create the mdp and the policy pi:

def print_epoch(epoch):
    print('################################################################')
    print('Epoch: ', epoch)
    print('----------------------------------------------------------------')


def get_stats(dataset):
    score = compute_metrics(dataset)
    print(('min_reward: %f, max_reward: %f, mean_reward: %f,'
          ' games_completed: %d' % score))

    return score


scores = list()

optimizer = dict()
optimizer['class'] = optim.Adam
optimizer['params'] = dict(lr=.00025)

# Settings
width = 84
height = 84
history_length = 4
train_frequency = 4
evaluation_frequency = 250000
target_update_frequency = 10000
initial_replay_size = 50000
max_replay_size = 500000
test_samples = 125000
max_steps = 50000000

# MDP
mdp = Atari('BreakoutDeterministic-v4', width, height, ends_at_life=True,
            history_length=history_length, max_no_op_actions=30)

# Policy
epsilon = LinearParameter(value=1.,
                          threshold_value=.1,
                          n=1000000)
epsilon_test = Parameter(value=.05)
epsilon_random = Parameter(value=1)
pi = EpsGreedy(epsilon=epsilon_random)

Differently from the original paper, which uses RMSProp, we use Adam as the optimizer.

Then, the approximator:

# Approximator
input_shape = (history_length, height, width)
approximator_params = dict(
    network=Network,
    input_shape=input_shape,
    output_shape=(mdp.info.action_space.n,),
    n_actions=mdp.info.action_space.n,
    n_features=Network.n_features,
    optimizer=optimizer,
    loss=F.smooth_l1_loss
)

approximator = TorchApproximator

Finally, the agent and the core:

# Agent
algorithm_params = dict(
    batch_size=32,
    target_update_frequency=target_update_frequency // train_frequency,
    replay_memory=None,
    initial_replay_size=initial_replay_size,
    max_replay_size=max_replay_size
)

agent = DQN(mdp.info, pi, approximator,
            approximator_params=approximator_params,
            **algorithm_params)

# Algorithm
core = Core(agent, mdp)

Eventually, the learning loop is performed. As done in the literature, learning and evaluation steps are alternated:

# RUN

# Fill replay memory with random dataset
print_epoch(0)
core.learn(n_steps=initial_replay_size,
           n_steps_per_fit=initial_replay_size)

# Evaluate initial policy
pi.set_epsilon(epsilon_test)
mdp.set_episode_end(False)
dataset = core.evaluate(n_steps=test_samples)
scores.append(get_stats(dataset))

for n_epoch in range(1, max_steps // evaluation_frequency + 1):
    print_epoch(n_epoch)
    print('- Learning:')
    # learning step
    pi.set_epsilon(epsilon)
    mdp.set_episode_end(True)
    core.learn(n_steps=evaluation_frequency,
               n_steps_per_fit=train_frequency)

    print('- Evaluation:')
    # evaluation step
    pi.set_epsilon(epsilon_test)
    mdp.set_episode_end(False)
    dataset = core.evaluate(n_steps=test_samples)
    scores.append(get_stats(dataset))

Solving MuJoCo with DDPG

This script runs the experiment to solve the Walker-Stand task from the DeepMind Control Suite, which is based on the MuJoCo simulator. As with DQN, we start by creating the neural networks. For DDPG, we need an actor and a critic network:

import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from mushroom_rl.algorithms.actor_critic import DDPG
from mushroom_rl.core import Core
from mushroom_rl.environments.dm_control_env import DMControl
from mushroom_rl.policy import OrnsteinUhlenbeckPolicy
from mushroom_rl.utils.dataset import compute_J


class CriticNetwork(nn.Module):
    def __init__(self, input_shape, output_shape, n_features, **kwargs):
        super().__init__()

        n_input = input_shape[-1]
        n_output = output_shape[0]

        self._h1 = nn.Linear(n_input, n_features)
        self._h2 = nn.Linear(n_features, n_features)
        self._h3 = nn.Linear(n_features, n_output)

        nn.init.xavier_uniform_(self._h1.weight,
                                gain=nn.init.calculate_gain('relu'))
        nn.init.xavier_uniform_(self._h2.weight,
                                gain=nn.init.calculate_gain('relu'))
        nn.init.xavier_uniform_(self._h3.weight,
                                gain=nn.init.calculate_gain('linear'))

    def forward(self, state, action):
        state_action = torch.cat((state.float(), action.float()), dim=1)
        features1 = F.relu(self._h1(state_action))
        features2 = F.relu(self._h2(features1))
        q = self._h3(features2)

        return torch.squeeze(q)


class ActorNetwork(nn.Module):
    def __init__(self, input_shape, output_shape, n_features, **kwargs):
        super(ActorNetwork, self).__init__()

        n_input = input_shape[-1]
        n_output = output_shape[0]

        self._h1 = nn.Linear(n_input, n_features)
        self._h2 = nn.Linear(n_features, n_features)
        self._h3 = nn.Linear(n_features, n_output)

        nn.init.xavier_uniform_(self._h1.weight,
                                gain=nn.init.calculate_gain('relu'))
        nn.init.xavier_uniform_(self._h2.weight,
                                gain=nn.init.calculate_gain('relu'))
        nn.init.xavier_uniform_(self._h3.weight,
                                gain=nn.init.calculate_gain('linear'))

    def forward(self, state):
        features1 = F.relu(self._h1(torch.squeeze(state, 1).float()))
        features2 = F.relu(self._h2(features1))
        a = self._h3(features2)

        return a

We create the mdp, the policy, and set some hyperparameters:

# MDP
horizon = 500
gamma = 0.99
gamma_eval = 1.
mdp = DMControl('walker', 'stand', horizon, gamma)

# Policy
policy_class = OrnsteinUhlenbeckPolicy
policy_params = dict(sigma=np.ones(1) * .2, theta=.15, dt=1e-2)

# Settings
initial_replay_size = 500
max_replay_size = 5000
batch_size = 200
n_features = 80
tau = .001

Note that the policy is not instantiated in the script, since in DDPG the instantiation is done inside the algorithm constructor.

We create the actor and the critic approximators:

# Approximator
actor_input_shape = mdp.info.observation_space.shape
actor_params = dict(network=ActorNetwork,
                    n_features=n_features,
                    input_shape=actor_input_shape,
                    output_shape=mdp.info.action_space.shape)

actor_optimizer = {'class': optim.Adam,
                   'params': {'lr': 1e-5}}

critic_input_shape = (actor_input_shape[0] + mdp.info.action_space.shape[0],)
critic_params = dict(network=CriticNetwork,
                     optimizer={'class': optim.Adam,
                                'params': {'lr': 1e-3}},
                     loss=F.mse_loss,
                     n_features=n_features,
                     input_shape=critic_input_shape,
                     output_shape=(1,))

Finally, we create the agent and the core:

# Agent
agent = DDPG(mdp.info, policy_class, policy_params,
             actor_params, actor_optimizer, critic_params,
             batch_size, initial_replay_size, max_replay_size,
             tau)

# Algorithm
core = Core(agent, mdp)

As in DQN, we alternate learning and evaluation steps:


# Fill the replay memory with random samples
core.learn(n_steps=initial_replay_size, n_steps_per_fit=initial_replay_size)

# RUN
n_epochs = 40
n_steps = 1000
n_steps_test = 2000

dataset = core.evaluate(n_steps=n_steps_test, render=False)
J = compute_J(dataset, gamma_eval)
print('Epoch: 0')
print('J: ', np.mean(J))

for n in range(n_epochs):
    print('Epoch: ', n+1)
    core.learn(n_steps=n_steps, n_steps_per_fit=1)
    dataset = core.evaluate(n_steps=n_steps_test, render=False)
    J = compute_J(dataset, gamma_eval)
    print('J: ', np.mean(J))

How to use the Logger

Here we explain in detail the usage of the MushroomRL Logger class. This class can be used as a standardized console logger and can also log NumPy arrays or a MushroomRL agent on disk, using the appropriate logging folder.

Constructing the Logger

To initialize the logger we can simply choose a log directory and an experiment name:

from mushroom_rl.core import Logger

# Create a logger object, creating a log folder
logger = Logger('tutorial', results_dir='/tmp/logs',
                log_console=True)

This will create an experiment folder named ‘tutorial’ inside the base folder ‘/tmp/logs’. The logger creates all the necessary directories if they do not exist. If results_dir is not specified, the logger will create a ‘./logs’ base directory. By setting log_console to True, the logger will store the console output in a ‘.log’ text file inside the experiment folder, with the same name as the experiment. If the file already exists, the logger will append the new logged lines.

If you do not want the logger to create any directory, e.g. to only use the logger for console output, you can set the results_dir parameter to None:

# Create a logger object, without creating the log folder
logger_no_folder = Logger('tutorial_no_folder', results_dir=None)

Logging message on the console

The most basic functionality of the Logger is to output text messages on the standard output. Our logger uses the standard Python logger and provides a similar set of functionalities:

# Write a line of hashtags, to be used as a separator
logger.strong_line()

# Print a debug message
logger.debug('This is a debug message')

# Print an info message
logger.info('This is an info message')

# Print a warning
logger.warning('This is a warning message')

# Print an error
logger.error('This is an error message')

# Print a critical error message
logger.critical('This is a critical error')

# Print a line of dashes, to be used as a (weak) separator
logger.weak_line()

We can also log exceptions to the terminal. Using this method, instead of a raw print, you can correctly manage the exception output without breaking any tqdm progress bar (see below), and the exception text will be saved in the console log file (if console logging is active).

# Exception logging
try:
    raise RuntimeError('A runtime exception occurred')
except RuntimeError as e:
    logger.error('Exception caught, here\'s the stack trace:')
    logger.exception(e)

logger.weak_line()

Logging a Reinforcement Learning experiment

Our Logger includes some functionalities to log RL experiment data easily. To demonstrate this, we will set up a simple RL experiment, using Q-Learning in the simple chain environment.

# Logging learning process
from mushroom_rl.core import Core
from mushroom_rl.environments.generators import generate_simple_chain
from mushroom_rl.policy import EpsGreedy
from mushroom_rl.algorithms.value import QLearning
from mushroom_rl.utils.parameters import Parameter
from mushroom_rl.utils.dataset import compute_J
from tqdm import trange
from time import sleep
import numpy as np


# Setup simple learning environment
mdp = generate_simple_chain(state_n=5, goal_states=[2], prob=.8, rew=1, gamma=.9)
epsilon = Parameter(value=.15)
pi = EpsGreedy(epsilon=epsilon)
agent = QLearning(mdp.info, pi, learning_rate=Parameter(value=.2))
core = Core(agent, mdp)
epochs = 10

We skip the details of this RL experiment, as they are not relevant to the current tutorial. You can have a deeper look at RL experiments with MushroomRL in other tutorials.

It is important to notice that we use the tqdm progress bar, as our logger is integrated with this package and can print log messages while the progress bar is running, without disrupting the progress bar or the terminal.

We first print the performance of the agent before learning, using the epoch_info method:

# Initial policy Evaluation
logger.info('Experiment started')
logger.strong_line()

dataset = core.evaluate(n_steps=100)
J = np.mean(compute_J(dataset, mdp.info.gamma))  # Discounted returns
R = np.mean(compute_J(dataset))  # Undiscounted returns

logger.epoch_info(0, J=J, R=R, any_label='any value')

Notice that this method can print any label passed as a keyword argument, so it is not restricted to J, R, or other predefined metrics.
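
For instance, a hedged illustration with made-up metric names:

# Any keyword argument becomes a labelled value in the epoch log line
logger.epoch_info(1, J=J, R=R, value_loss=0.25, entropy=1.3)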

We now consider the learning loop:

for i in trange(epochs):
    # Here some learning
    core.learn(n_steps=100, n_steps_per_fit=1)
    sleep(0.5)
    dataset = core.evaluate(n_steps=100)
    sleep(0.5)
    J = np.mean(compute_J(dataset, mdp.info.gamma))  # Discounted returns
    R = np.mean(compute_J(dataset))  # Undiscounted returns

    # Here logging epoch results to the console
    logger.epoch_info(i+1, J=J, R=R)

    # Logging the data in J.npy and R.npy
    logger.log_numpy(J=J, R=R)

    # Logging the best agent according to the best J
    logger.log_best_agent(agent, J)

Here we make use of both the epoch_info method to log the data in the console output and the methods log_numpy and log_best_agent to log the learning progress.

The log_numpy method can take an arbitrary value (a primitive or a NumPy array) and log it into a single NumPy array (or matrix). Again, a set of arbitrary keywords can be used to save data into different files. If the seed parameter of the constructor of the Logger class is specified, the filename will include a postfix with the seed. This is useful when multiple runs of the same experiment are executed.
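
A short sketch of the seeded variant, relying only on the seed parameter mentioned above (the exact postfix format may differ):

# With a seed, the stored arrays get a seed postfix, e.g. J-0.npy
seeded_logger = Logger('tutorial_seeded', results_dir='/tmp/logs', seed=0)
seeded_logger.log_numpy(J=J, R=R)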

The log_best_agent method saves the current agent into the ‘agent-best.msh’ file. However, the current agent is stored on disk only if it improves upon the previously logged one.
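
Later, the best agent can be restored from the experiment folder with the standard loading interface (a sketch; the path follows from the folder layout described above):

from mushroom_rl.core import Agent

best_agent = Agent.load('/tmp/logs/tutorial/agent-best.msh')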

We conclude the learning experiment by logging the final agent and the last dataset:

# Logging the last agent
logger.log_agent(agent)

# Log the last dataset
logger.log_dataset(dataset)

logger.info('Experiment terminated')

Advanced Logger topics

The logger can also be used to continue learning from a previously existing run, without overwriting the stored results. This can be done by setting the append flag in the logger’s constructor.

# Loggers can also continue from previous logs results
del logger  # Delete previous logger
new_logger = Logger('tutorial', results_dir='/tmp/logs',
                    log_console=True, append=True)

# add infinite at the end of J.npy
new_logger.log_numpy(J=np.inf)
new_logger.info('Tutorial ends here')

Finally, another functionality of the logger is to activate some specific text output from some algorithms. This can be done by calling the agent’s set_logger method:

agent.set_logger(logger)

Currently, only the PPO and TRPO algorithms provide additional output, describing some learning metrics after every fit.

How to use the Environment interface

Here we explain in detail the usage of the MushroomRL Environment interface. First, we explain how to use the registration interface. The registration enables the construction of environments from string specification. Then we construct a toy environment to show how it is possible to add new MushroomRL environments.

Old-school environment creation

In MushroomRL, environments are simply class objects that extend the environment interface. To create an environment, you can simply call its constructor. You can build the Segway environment as follows:

from mushroom_rl.environments import Segway

env = Segway()

Some environments may have a constructor which is too low level, and you may want to generate a vanilla version of it using as few parameters as possible. An example is the Linear Quadratic Regulator (LQR) environment, which requires a set of matrices to define the linear dynamics and the quadratic cost function. To provide an easier interface, the generate class method is exposed. To generate a simple 3-dimensional LQR problem, with identity transition and action matrices, and a trivial quadratic cost function, you can use:

from mushroom_rl.environments import LQR

env = LQR.generate(dimensions=3)

See the documentation of LQR.generate for all the available parameters and their effects.

Environment registration

From version 1.7.0, it is possible to register MushroomRL environments and build the environment by specifying only the name.

You can list the registered environments as follows:

from mushroom_rl.core import Environment

env_list = Environment.list_registered()
print(env_list)

Every registered environment can be built using its name. For example, to create the ShipSteering environment, you can use:

env = Environment.make('ShipSteering')

To build environments, you may need to pass additional parameters. An example of this is the Gym environment, which wraps most OpenAI Gym environments, except the Atari ones, which are handled by the Atari environment to implement proper preprocessing.

If you want to build the Pendulum-v0 Gym environment, you need to pass the environment name as a parameter:

env = Environment.make('Gym', 'Pendulum-v0')

However, for environments that are interfaces to other libraries, such as Gym, Atari, or DMControl, a notation with a dot separator is supported. For example, to create the pendulum you can also use:

env = Environment.make('Gym.Pendulum-v0')

Or, to create the hopper environment with the hop task from the DeepMind Control Suite, you can use:

env = Environment.make('DMControl.hopper.hop')

If an environment implements the generate method, it will be used to build the environment instead of the constructor. As the generate method is a higher-level interface w.r.t. the constructor, it requires fewer parameters.

To generate the 3-dimensional LQR problem mentioned in the previous section you can use:

env = Environment.generate('LQR', dimensions=3)

Finally, you can register new environments. Suppose that you have created the environment class MyNewEnv, which extends the base Environment class. You can register the environment as follows:

MyNewEnv.register()

You can put this line of code after the class declaration, or in the __init__.py file of your library. If you do so, the environment is registered the first time you import the file. Notice that this registration is not saved on disk; thus, you need to register the environment every time the Python interpreter is executed.

Creating a new environment

We show you an example of how to construct a MushroomRL environment. We create a simple room environment, with discrete actions, continuous state space, and mildly stochastic dynamics. The objective is to move the agent from any point of the room towards the goal point. The agent receives a penalty at every step, equal to its distance from the goal. When the agent reaches the goal, the episode ends. The agent can move in the room using one of 4 discrete actions: North, South, West, East.

First of all, we import all the required classes: NumPy for working with arrays, the Environment interface, and the MDPInfo structure, which contains the basic information about the environment.

Given that we are implementing a simple visualization function, we also import the Viewer class, a Pygame wrapper that can be used to easily render RL environments.

import numpy as np

from mushroom_rl.core import Environment, MDPInfo
from mushroom_rl.utils.spaces import Box, Discrete

from mushroom_rl.utils.viewer import Viewer

Now, we can create the environment class.

We first extend the environment class and create the constructor:

class RoomToyEnv(Environment):
    def __init__(self, size=5., goal=[2.5, 2.5], goal_radius=0.6):

        # Save important environment information
        self._size = size
        self._goal = np.array(goal)
        self._goal_radius = goal_radius

        # Create the action space.
        action_space = Discrete(4)  # 4 actions: N, S, W, E

        # Create the observation space. It's a 2D box of dimension (size x size).
        # You can also specify low and high array, if every component has different limits
        shape = (2,)
        observation_space = Box(0, size, shape)

        # Create the MDPInfo structure, needed by the environment interface
        mdp_info = MDPInfo(observation_space, action_space, gamma=0.99, horizon=100)

        super().__init__(mdp_info)

        # Create a state class variable to store the current state
        self._state = None

        # Create the viewer
        self._viewer = Viewer(size, size)

It’s important to notice that the superclass constructor needs the information stored in the MDPInfo structure. This structure contains the action and observation space, the discount factor gamma, and the horizon. The horizon is used to cut the trajectories when they are too long: when the horizon is reached, the episode is terminated, even though the state might not be absorbing. The absorbing state flag is explicitly set in the environment step function. Also, notice that the Environment superclass has no notion of the environment state, so we need to store it ourselves. That’s why we create the self._state variable and initialize it to None. Other environment information, such as the goal position and radius, is stored in class variables.

Now we implement the reset function. This function is called at the beginning of every episode. It’s possible to force the initial state. For this reason, we have to manage two scenarios: when the initial state is given and when it is set to None. If the initial state is not given, we sample randomly among the valid states.

    def reset(self, state=None):

        if state is None:
            # Generate randomly a state inside the state space, but not inside the goal
            self._state = np.random.rand(2) * self._size

            # Check if it's inside the goal radius and repeat the sample if necessary
            while np.linalg.norm(self._state - self._goal) < self._goal_radius:
                self._state = np.random.rand(2) * self._size
        else:
            # If an initial state is provided, set it and return, after checking it's valid.
            assert np.all(state < self._size) and np.all(state > 0)
            assert np.linalg.norm(state - self._goal) > self._goal_radius
            self._state = state

        # Return the current state
        return self._state

Now it’s time to implement the step function, which specifies the transition function of the environment, computes the reward, and signals absorbing states, i.e. states where every action keeps the agent in the same state with 0 reward. When reaching an absorbing state we cut the trajectory, as its value function is always 0 and no further exploration is needed.

    def step(self, action):
        # convert the action in a N, S, W, E movement
        movement = np.zeros(2)
        if action == 0:
            movement[1] += 0.1
        elif action == 1:
            movement[1] -= 0.1
        elif action == 2:
            movement[0] -= 0.1
        elif action == 3:
            movement[0] += 0.1
        else:
            raise ValueError('The environment has only 4 actions')

        # Apply the movement with some noise:
        self._state += movement + np.random.randn(2)*0.05

        # Clip the state space inside the boundaries.
        low = self.info.observation_space.low
        high = self.info.observation_space.high

        self._state = Environment._bound(self._state, low, high)

        # Compute distance from goal
        goal_distance = np.linalg.norm(self._state - self._goal)

        # Compute the reward as distance penalty from goal
        reward = -goal_distance

        # Set the absorbing flag if goal is reached
        absorbing = goal_distance < self._goal_radius

        # Return all the information + empty dictionary (used to pass additional information)
        return self._state, reward, absorbing, {}

Finally, we implement the render function using our Viewer class. This class wraps Pygame to provide an easy visualization tool for 2D Reinforcement Learning algorithms. The viewer class has many functionalities, but here we simply draw two circles representing the agent and the goal area:

    def render(self):
        # Draw a red circle for the agent
        self._viewer.circle(self._state, 0.1, color=(255, 0, 0))

        # Draw a green circle for the goal
        self._viewer.circle(self._goal, self._goal_radius, color=(0, 255, 0))

        # Display the image for 0.1 seconds
        self._viewer.display(0.1)

For more information about the viewer, refer to the class documentation.

To conclude our environment, it’s also possible to register it as specified in the previous section of this tutorial:

# Register the class
RoomToyEnv.register()

Learning in the toy environment

Now that we have created our environment, we try to solve it using Reinforcement Learning. The following code uses the True Online SARSA-Lambda algorithm, exploiting a tiles approximator.

We first import all necessary classes and utilities, then we construct the environment (we set the seed for reproducibility).

if __name__ == '__main__':
    from mushroom_rl.core import Core
    from mushroom_rl.algorithms.value import TrueOnlineSARSALambda
    from mushroom_rl.policy import EpsGreedy
    from mushroom_rl.features import Features
    from mushroom_rl.features.tiles import Tiles
    from mushroom_rl.utils.parameters import Parameter
    from mushroom_rl.utils.dataset import compute_J

    # Set the seed
    np.random.seed(1)

    # Create the toy environment with default parameters
    env = Environment.make('RoomToyEnv')

We now proceed to create the agent: an epsilon-greedy policy with a linear approximator using tiles features, similar to the one used in the Mountain Car experiment from Sutton & Barto's book.

    # Using an epsilon-greedy policy
    epsilon = Parameter(value=0.1)
    pi = EpsGreedy(epsilon=epsilon)

    # Creating a simple agent using linear approximator with tiles
    n_tilings = 5
    tilings = Tiles.generate(n_tilings, [10, 10],
                             env.info.observation_space.low,
                             env.info.observation_space.high)
    features = Features(tilings=tilings)

    learning_rate = Parameter(.1 / n_tilings)

    approximator_params = dict(input_shape=(features.size,),
                               output_shape=(env.info.action_space.n,),
                               n_actions=env.info.action_space.n)

    agent = TrueOnlineSARSALambda(env.info, pi,
                                  approximator_params=approximator_params,
                                  features=features,
                                  learning_rate=learning_rate,
                                  lambda_coeff=.9)

Finally, using the Core class, we set up an RL experiment. We first evaluate the initial policy for three episodes on the environment. Then, we learn the task using the algorithm built above for 20000 steps. In the end, we evaluate the learned policy for 3 more episodes.

    # Reinforcement learning experiment
    core = Core(agent, env)

    # Visualize initial policy for 3 episodes
    dataset = core.evaluate(n_episodes=3, render=True)

    # Print the average objective value before learning
    J = np.mean(compute_J(dataset, env.info.gamma))
    print(f'Objective function before learning: {J}')

    # Train
    core.learn(n_steps=20000, n_steps_per_fit=1, render=False)

    # Visualize results for 3 episodes
    dataset = core.evaluate(n_episodes=3, render=True)

    # Print the average objective value after learning
    J = np.mean(compute_J(dataset, env.info.gamma))
    print(f'Objective function after learning: {J}')



How to use the Serializable interface

In this tutorial, we explain in detail the Serializable interface. We first explain how to use classes implementing the Serializable interface, and then we provide a small example of how to implement the Serializable interface on a custom class to serialize the object properly on disk.

The MushroomRL save format (extension .msh) is nothing more than a zip file containing some information (stored in the config file) needed to load the object. This information can be accessed easily, and you can try to recover the data by hand from corrupted files.
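
Since a .msh file is a regular zip archive, its content can be inspected with the Python standard library; a minimal sketch (the file name is just an example, created later in this tutorial):

from zipfile import ZipFile

# List the files stored inside a MushroomRL save file
with ZipFile('parameter.msh') as zf:
    print(zf.namelist())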

Note that it is always possible to serialize Python objects with the pickle library. However, the MushroomRL serialization interface uses a native format, is easy to use, and is more robust to code changes, as it doesn’t serialize the entire class, but only the data. Furthermore, it is possible to avoid the serialization of some class variables, such as shared objects or big arrays, e.g. replay memories.

Save and load from disk

Many MushroomRL objects implement the serialization interface. All the algorithms, policies, approximators, and parameters implemented in MushroomRL use the Serializable interface.

As an example, we save a MushroomRL Parameter on disk. We create the parameter and then we serialize it to disk using the save method of the serializable class:

from mushroom_rl.utils.parameters import Parameter

parameter = Parameter(1.0)
print('Initial parameter value: ', parameter())
parameter.save('parameter.msh')

This code creates a parameter.msh file in the working directory.

You can also specify a directory:

from pathlib import Path
base_dir = Path('tmp')
file_name = base_dir / 'parameter.msh'
parameter.save(file_name)

This creates a tmp folder (if it doesn’t exist) in the working directory and saves the parameter.msh file inside it.

Now, we can set another value for our parameter variable:

parameter = Parameter(0.5)
print('Modified parameter value: ', parameter())

Finally, we load the previously stored parameter to go back to the previous state using the load method:

parameter = Parameter.load('parameter.msh')
print('Loaded parameter value: ', parameter())

You can also call load directly from the Serializable class:

from mushroom_rl.core import Serializable
parameter = Serializable.load('parameter.msh')
print('Loaded parameter value (Serializable): ', parameter())

The same approach can be used to save an agent, a policy, or an approximator.
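
For instance, a policy can be saved and restored in exactly the same way (a short sketch reusing classes shown in the other tutorials):

from mushroom_rl.policy import EpsGreedy
from mushroom_rl.utils.parameters import Parameter

pi = EpsGreedy(epsilon=Parameter(value=.1))
pi.save('policy.msh')

pi = EpsGreedy.load('policy.msh')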

Full Save

The save method has an optional full_save flag, which by default is set to False. In the previous parameter example, this flag has no effect. However, when saving a Reinforcement Learning algorithm or other complex objects, setting this flag to True forces the agent to save data structures that are normally excluded from a save file, such as the replay memory in DQN.

This implementation choice avoids large save files for agents with huge data structures, and avoids storing duplicated information (such as the Q function of an epsilon-greedy policy, when saving the algorithm). The full_save flag, instead, enforces a complete serialization of the agent, retaining all the information.
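
A hedged sketch, assuming an already constructed agent object with large internal buffers (e.g. a DQN agent):

# Standard save: replay memory and other large buffers are excluded
agent.save('dqn_agent.msh')

# Full save: every internal data structure is serialized as well
agent.save('dqn_agent_full.msh', full_save=True)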

Implementing the Serializable interface

We give a simple example of how to implement the Serializable interface in MushroomRL for a custom class. We use almost all of the implemented data persistence types.

We start the example by importing the serializable interface, the torch library, the NumPy library, and the MushroomRL Parameter class.

from mushroom_rl.core import Serializable

import torch
import numpy as np
from mushroom_rl.utils.parameters import Parameter

While it is required to import the Serializable interface, the other three imports are only required by this example, as they are used to create variables of such type.

Now we define a class implementing the Serializable interface. In this case, we call it TestClass. The constructor can be divided into three parts: first, we build a set of variables of different types; then, we call the superclass constructor, i.e. the constructor of Serializable; finally, we specify which variables we want to be saved in the MushroomRL file by passing keywords to the self._add_save_attr method.

class TestClass(Serializable):
    def __init__(self, value):
        # Create some different types of variables

        self._primitive_variable = value  # Primitive python variable
        self._numpy_vector = np.array([1, 2, 3]*value)  # Numpy array
        self._dictionary = dict(some='random', keywords=2, fill='the dictionary')  # A dictionary

        # Building a torch object
        data_array = np.ones(3)*value
        data_tensor = torch.from_numpy(data_array)
        self._torch_object = torch.nn.Parameter(data_tensor)

        # Some variables that implement the Serializable interface
        self._mushroom_parameter = Parameter(2.0*value)
        self._list_of_objects = [Parameter(i) for i in range(value)]  # This is a list!

        # A variable that is not important e.g. a buffer
        self.not_important = np.zeros(10000)

        # A variable that contains a reference to another variable
        self._list_reference = [self._dictionary]

        # Superclass constructor
        super().__init__()

        # Here we specify how to save each component
        self._add_save_attr(
            _primitive_variable='primitive',
            _numpy_vector='numpy',
            _dictionary='pickle',
            _torch_object='torch',
            _mushroom_parameter='mushroom',
            # List of mushroom objects can also be saved with the 'mushroom' mode
            _list_of_objects='mushroom',
            # The '!' is to specify that we save the variable only if full_save is True
            not_important='numpy!',
        )

Some remarks about the self._add_save_attr method: the keyword name must be the name of the variable we want to store in the file, while the associated value is the method we want to use to store that variable.

The available methods are:

  • primitive, to store any primitive type. This includes lists and dictionaries of primitive values.
  • mushroom, to store any type implementing the Serializable interface. Also, lists of serializable objects are supported.
  • numpy, to store NumPy arrays.
  • torch, to store any torch object.
  • pickle, to store any Python object that cannot be stored with the above methods.
  • json, can be used if you need a textual output version, that is easy to read.

Another important aspect to remember is that any method name can end with a ‘!’, to specify that the field must be serialized if and only if the full_save flag is set to True.

To conclude the implementation of the Serializable interface, we might also want to implement the self._post_load method. This method is executed after all the data specified in self._add_save_attr has been loaded into the class. It can be useful to set the variables not saved in the file to a default value.

    def _post_load(self):
        if self.not_important is None:
            self.not_important = np.zeros(10000)

        self._list_reference = [self._dictionary]

In this scenario, we have to set the self.not_important variable to its default value, but only if it’s None, i.e. if it has not been loaded because the file didn’t contain it. Also, we set the self._list_reference variable to maintain its original semantics, i.e. to contain a reference to the content of the self._dictionary variable.

To test the implementation, we write a function that prints the content of the class in an easy-to-read way:

def print_variables(obj):
    for label, var in vars(obj).items():
        if label != '_save_attributes':
            if isinstance(var, Parameter):
                print(f'{label}: Parameter({var()})')
            elif isinstance(var, list) and isinstance(var[0], Parameter):
                new_list = [f'Parameter({item()})' for item in var]
                print(f'{label}:  {new_list}')
            else:
                print(label, ': ', var)

Finally, we test the save functionality with the following code:

if __name__ == '__main__':
    # Create test object and print its variables
    test_object = TestClass(1)
    print('###########################################################################################################')
    print('The test object contains the following:')
    print('-----------------------------------------------------------------------------------------------------------')
    print_variables(test_object)

    # Changing the buffer
    test_object.not_important[0] = 1

    # Save the object on disk
    test_object.save('test.msh')

    # Create another test object
    test_object = TestClass(2)
    print('###########################################################################################################')
    print('After overwriting the test object:')
    print('-----------------------------------------------------------------------------------------------------------')
    print_variables(test_object)

    # Changing the buffer again
    test_object.not_important[0] = 1

    # Save the other test object, this time remembering the buffer
    test_object.save('test_full.msh', full_save=True)

    # Load first test object and print its variables
    print('###########################################################################################################')
    test_object = TestClass.load('test.msh')
    print('Loading previous test object:')
    print('-----------------------------------------------------------------------------------------------------------')
    print_variables(test_object)

    # Load second test object and print its variables
    print('###########################################################################################################')
    test_object = TestClass.load('test_full.msh')
    print('Loading full-save test object:')
    print('-----------------------------------------------------------------------------------------------------------')
    print_variables(test_object)

We can see that the content of self.not_important is stored only if the full_save flag is set to true.

The last remark is that the Serializable interface also works in the presence of inheritance. If you extend a serializable class, you only need to register the new attributes defined by the child class.
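
A minimal sketch of this, reusing the TestClass defined above (the extra attribute is purely illustrative):

class ChildClass(TestClass):
    def __init__(self, value, extra):
        # New attribute defined only by the child class
        self._extra_vector = np.ones(3) * extra

        # The parent constructor registers its own attributes
        super().__init__(value)

        # Register only the attributes added by the child class
        self._add_save_attr(_extra_vector='numpy')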