MushroomRL¶
Reinforcement Learning Python library¶
MushroomRL is a Reinforcement Learning (RL) library that aims to be a simple yet powerful way to run RL and deep RL experiments. The idea behind MushroomRL is to offer most RL algorithms behind a common interface, so that they can be run with minimal effort. Moreover, it is designed so that new algorithms and other components can generally be added transparently, without editing other parts of the code. MushroomRL makes extensive use of the environments provided by the OpenAI Gym, DeepMind Control Suite and MuJoCo libraries, and of the PyTorch library for tensor computation.
With MushroomRL you can:
 solve RL problems simply by writing a single small script;
 add custom algorithms and other components transparently;
 use all RL environments offered by well-known libraries and build customized environments as well;
 exploit regression models offered by Scikit-Learn or build a customized one with PyTorch;
 run experiments on GPU.
Basic run example¶
Solve a discrete MDP in a few lines. First, create an MDP:
from mushroom_rl.environments import GridWorld
mdp = GridWorld(width=3, height=3, goal=(2, 2), start=(0, 0))
Then, create an epsilon-greedy policy:
from mushroom_rl.policy import EpsGreedy
from mushroom_rl.utils.parameters import Parameter
epsilon = Parameter(value=1.)
policy = EpsGreedy(epsilon=epsilon)
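Under the hood, epsilon-greedy exploration is simple: with probability epsilon pick a uniformly random action, otherwise the greedy one. A minimal NumPy sketch of the rule (`eps_greedy` is an illustrative helper, not the `EpsGreedy` implementation):

```python
import numpy as np

def eps_greedy(q_row, epsilon, rng):
    # With probability epsilon explore uniformly, otherwise exploit.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.5, 0.2, 0.0])
action = eps_greedy(q_row, epsilon=0., rng=rng)  # epsilon=0 -> always greedy, here 1
```

MushroomRL wraps epsilon in a Parameter object so that non-constant schedules can also be plugged in.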
Finally, create the agent:
from mushroom_rl.algorithms.value import QLearning
learning_rate = Parameter(value=.6)
agent = QLearning(policy, mdp.info, learning_rate)
Learn:
from mushroom_rl.core.core import Core
core = Core(agent, mdp)
core.learn(n_steps=10000, n_steps_per_fit=1)
Print the final Q-table:
import numpy as np
shape = agent.approximator.shape
q = np.zeros(shape)
for i in range(shape[0]):
    for j in range(shape[1]):
        state = np.array([i])
        action = np.array([j])
        q[i, j] = agent.approximator.predict(state, action)
print(q)
Results in:
[[ 6.561 7.29 6.561 7.29 ]
[ 7.29 8.1 6.561 8.1 ]
[ 8.1 9. 7.29 8.1 ]
[ 6.561 8.1 7.29 8.1 ]
[ 7.29 9. 7.29 9. ]
[ 8.1 10. 8.1 9. ]
[ 7.29 8.1 8.1 9. ]
[ 8.1 9. 8.1 10. ]
[ 0. 0. 0. 0. ]]
where each row represents a state of the MDP and stores the Q-values of the corresponding actions.
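From this table, the greedy policy follows directly by taking the row-wise argmax. A small NumPy sketch over the first three rows (values copied from the output above; ties resolve to the first index):

```python
import numpy as np

# First three rows of the Q-table printed above.
q = np.array([[6.561, 7.29, 6.561, 7.29],
              [7.29, 8.1, 6.561, 8.1],
              [8.1, 9., 7.29, 8.1]])

greedy_actions = np.argmax(q, axis=1)  # -> array([1, 1, 1])
```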
Download and installation¶
MushroomRL can be downloaded from the GitHub repository. Installation can be done by running:
pip3 install mushroom_rl
To compile the documentation:
cd mushroom_rl/docs
make html
or to compile the pdf version:
cd mushroom_rl/docs
make latexpdf
To launch the MushroomRL test suite:
pytest
Agent-Environment Interface¶
The three basic interfaces of mushroom_rl are Agent, Environment and Core.
 The Agent is the basic interface for any Reinforcement Learning algorithm.
 The Environment is the basic interface for every problem/task that the agent should solve.
 The Core is a class used to control the interaction between an agent and an environment.
To implement serialization of MushroomRL data on disk (load/save functionality), we also provide the Serializable interface.
Agent¶
MushroomRL provides the implementations of several algorithms belonging to all categories of RL:
 value-based;
 policy search;
 actor-critic.
One can easily implement customized algorithms following the structure of the already available ones, by extending the following interface:

class
mushroom_rl.algorithms.agent.
Agent
(mdp_info, policy, features=None)[source]¶ Bases:
mushroom_rl.core.serialization.Serializable
This class implements the functions to manage the agent (e.g. move the agent following its policy).

draw_action
(state)[source]¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

stop
()[source]¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate to enforce consistency.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

Environment¶
MushroomRL provides several implementations of well-known benchmarks with both continuous and discrete action spaces.
To implement a new environment, it is mandatory to use the following interface:

class
mushroom_rl.environments.environment.
MDPInfo
(observation_space, action_space, gamma, horizon)[source]¶ Bases:
object
This class is used to store the information of the environment.

size
¶ The sum of the number of discrete states and discrete actions. Only works for discrete spaces.

shape
¶ The concatenation of the shape tuple of the state and action spaces.


class
mushroom_rl.environments.environment.
Environment
(mdp_info)[source]¶ Bases:
object
Basic interface used by any MushroomRL environment.

__init__
(mdp_info)[source]¶ Constructor.
Parameters: mdp_info (MDPInfo) – an object containing the info of the environment.

seed
(seed)[source]¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

reset
(state=None)[source]¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set to the current state. Returns: The current state.

step
(action)[source]¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

stop
()[source]¶ Method used to stop an MDP. Useful when dealing with real-world environments or simulators, or when using OpenAI Gym rendering.

info
¶ An object containing the info of the environment.
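The reset/step contract above can be sketched with a toy chain environment. The class below only mimics the interface in plain Python (`ChainToy` and all names are illustrative, not part of MushroomRL); a real environment would subclass Environment and pass an MDPInfo to its constructor:

```python
import numpy as np

class ChainToy:
    # Toy n-state chain mimicking the reset/step contract.
    def __init__(self, n_states=5):
        self._n = n_states
        self._state = None

    def reset(self, state=None):
        self._state = np.array([0]) if state is None else state
        return self._state

    def step(self, action):
        # Move right on action 1, left otherwise; the last state is the goal.
        s = self._state[0] + (1 if action[0] == 1 else -1)
        s = int(np.clip(s, 0, self._n - 1))
        self._state = np.array([s])
        absorbing = s == self._n - 1
        reward = 1. if absorbing else 0.
        return self._state, reward, absorbing, {}

env = ChainToy()
state = env.reset()
state, reward, absorbing, info = env.step(np.array([1]))
```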

Core¶

class
mushroom_rl.core.core.
Core
(agent, mdp, callbacks_episode=None, callback_step=None, preprocessors=None)[source]¶ Bases:
object
Implements the functions to run a generic algorithm.

__init__
(agent, mdp, callbacks_episode=None, callback_step=None, preprocessors=None)[source]¶ Constructor.
Parameters:  agent (Agent) – the agent moving according to a policy;
 mdp (Environment) – the environment in which the agent moves;
 callbacks_episode (list) – list of callbacks to execute at the end of each learn iteration;
 callback_step (Callback) – callback to execute after each step;
 preprocessors (list) – list of state preprocessors to be applied to state variables before feeding them to the agent.

learn
(n_steps=None, n_episodes=None, n_steps_per_fit=None, n_episodes_per_fit=None, render=False, quiet=False)[source]¶ This function moves the agent in the environment and fits the policy using the collected samples. The agent can be moved for a given number of steps or a given number of episodes and, independently from this choice, the policy can be fitted after a given number of steps or a given number of episodes. By default, the environment is reset.
Parameters:  n_steps (int, None) – number of steps to move the agent;
 n_episodes (int, None) – number of episodes to move the agent;
 n_steps_per_fit (int, None) – number of steps between each fit of the policy;
 n_episodes_per_fit (int, None) – number of episodes between each fit of the policy;
 render (bool, False) – whether to render the environment or not;
 quiet (bool, False) – whether to show the progress bar or not.
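The relation between the two pairs of arguments can be sketched with a toy loop: the agent is moved for n_steps and a fit is triggered every n_steps_per_fit collected samples (`count_fits` is an illustrative helper, not part of the library):

```python
def count_fits(n_steps, n_steps_per_fit):
    # Count how many fit calls a step-based learn loop issues.
    fits, collected = 0, 0
    for _ in range(n_steps):
        collected += 1
        if collected == n_steps_per_fit:
            fits += 1
            collected = 0
    return fits

count_fits(10000, 1)  # one fit per step, as in the Q-Learning example above
count_fits(10, 2)     # -> 5
```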

evaluate
(initial_states=None, n_steps=None, n_episodes=None, render=False, quiet=False)[source]¶ This function moves the agent in the environment using its policy. The agent is moved for a provided number of steps, episodes, or from a set of initial states for the whole episode. By default, the environment is reset.
Parameters:  initial_states (np.ndarray, None) – the starting states of each episode;
 n_steps (int, None) – number of steps to move the agent;
 n_episodes (int, None) – number of episodes to move the agent;
 render (bool, False) – whether to render the environment or not;
 quiet (bool, False) – whether to show the progress bar or not.

Serialization¶

class
mushroom_rl.core.serialization.
Serializable
[source]¶ Bases:
object
Interface to implement serialization of a MushroomRL object. This provides load and save functionality to store the object in a zip file. It is possible to save the state of the agent with different levels of detail.

save
(path, full_save=False)[source]¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')[source]¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

classmethod
load
(path)[source]¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

_add_save_attr
(**attr_dict)[source]¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.
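The “!” convention can be sketched with a few lines of plain Python: entries whose method name ends in “!” are kept only when full_save is requested (`select_attrs` is illustrative, not the actual implementation):

```python
def select_attrs(attr_dict, full_save):
    # Drop attributes marked with a trailing '!' unless full_save is True.
    selected = {}
    for name, method in attr_dict.items():
        if method.endswith('!') and not full_save:
            continue
        selected[name] = method.rstrip('!')
    return selected

select_attrs({'_policy': 'mushroom', '_replay_memory': 'mushroom!'}, full_save=False)
# -> {'_policy': 'mushroom'}
```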

_post_load
()[source]¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

__init__
¶ Initialize self. See help(type(self)) for accurate signature.

Actor-Critic¶
Classical Actor-Critic Methods¶

class
mushroom_rl.algorithms.actor_critic.classic_actor_critic.
COPDAC_Q
(mdp_info, policy, mu, alpha_theta, alpha_omega, alpha_v, value_function_features=None, policy_features=None)[source]¶ Bases:
mushroom_rl.algorithms.agent.Agent
Compatible off-policy deterministic actor-critic algorithm. “Deterministic Policy Gradient Algorithms”. Silver D. et al., 2014.

__init__
(mdp_info, policy, mu, alpha_theta, alpha_omega, alpha_v, value_function_features=None, policy_features=None)[source]¶ Constructor.
Parameters:  mu (Regressor) – regressor that describes the deterministic policy to be learned, i.e., the deterministic mapping between state and action;
 alpha_theta (Parameter) – learning rate for policy update;
 alpha_omega (Parameter) – learning rate for the advantage function;
 alpha_v (Parameter) – learning rate for the value function;
 value_function_features (Features, None) – features used by the value function approximator;
 policy_features (Features, None) – features used by the policy.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate to enforce consistency.


class
mushroom_rl.algorithms.actor_critic.classic_actor_critic.
StochasticAC
(mdp_info, policy, alpha_theta, alpha_v, lambda_par=0.9, value_function_features=None, policy_features=None)[source]¶ Bases:
mushroom_rl.algorithms.agent.Agent
Stochastic actor-critic in the episodic setting, as presented in: “Model-Free Reinforcement Learning with Continuous Action in Practice”. Degris T. et al., 2012.

__init__
(mdp_info, policy, alpha_theta, alpha_v, lambda_par=0.9, value_function_features=None, policy_features=None)[source]¶ Constructor.
Parameters:  alpha_theta (Parameter) – learning rate for policy update;
 alpha_v (Parameter) – learning rate for the value function;
 lambda_par (float, .9) – trace decay parameter;
 value_function_features (Features, None) – features used by the value function approximator;
 policy_features (Features, None) – features used by the policy.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate to enforce consistency.


class
mushroom_rl.algorithms.actor_critic.classic_actor_critic.
StochasticAC_AVG
(mdp_info, policy, alpha_theta, alpha_v, alpha_r, lambda_par=0.9, value_function_features=None, policy_features=None)[source]¶ Bases:
mushroom_rl.algorithms.actor_critic.classic_actor_critic.stochastic_ac.StochasticAC
Stochastic actor-critic in the average reward setting, as presented in: “Model-Free Reinforcement Learning with Continuous Action in Practice”. Degris T. et al., 2012.

__init__
(mdp_info, policy, alpha_theta, alpha_v, alpha_r, lambda_par=0.9, value_function_features=None, policy_features=None)[source]¶ Constructor.
Parameters: alpha_r (Parameter) – learning rate for the reward trace.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agents save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where te object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate to enforce consistency.

Deep Actor-Critic Methods¶

class
mushroom_rl.algorithms.actor_critic.deep_actor_critic.
DeepAC
(mdp_info, policy, actor_optimizer, parameters)[source]¶ Bases:
mushroom_rl.algorithms.agent.Agent
Base class for algorithms that use the reparametrization trick, such as SAC, DDPG and TD3.

__init__
(mdp_info, policy, actor_optimizer, parameters)[source]¶ Constructor.
Parameters:  actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
 parameters – policy parameters to be optimized.

_optimize_actor_parameters
(loss)[source]¶ Method used to update actor parameters to maximize a given loss.
Parameters: loss (torch.tensor) – the loss computed by the algorithm.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()[source]¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate to enforce consistency.


class
mushroom_rl.algorithms.actor_critic.deep_actor_critic.
A2C
(mdp_info, policy, actor_optimizer, critic_params, ent_coeff, max_grad_norm=None, critic_fit_params=None)[source]¶ Bases:
mushroom_rl.algorithms.actor_critic.deep_actor_critic.deep_actor_critic.DeepAC
Advantage Actor-Critic algorithm (A2C). Synchronous version of the A3C algorithm. “Asynchronous Methods for Deep Reinforcement Learning”. Mnih V. et al., 2016.

__init__
(mdp_info, policy, actor_optimizer, critic_params, ent_coeff, max_grad_norm=None, critic_fit_params=None)[source]¶ Constructor.
Parameters:  policy (TorchPolicy) – torch policy to be learned by the algorithm;
 actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
 critic_params (dict) – parameters of the critic approximator to build;
 ent_coeff (float, 0) – coefficient for the entropy penalty;
 max_grad_norm (float, None) – maximum norm for gradient clipping. If None, no clipping will be performed, unless specified otherwise in actor_optimizer;
 critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
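The role of ent_coeff can be sketched numerically: the actor loss is the usual policy-gradient term minus an entropy bonus, so a larger coefficient rewards more stochastic policies (`a2c_actor_loss` is an illustrative NumPy helper, not the library's internal loss):

```python
import numpy as np

def a2c_actor_loss(log_probs, advantages, entropies, ent_coeff):
    # Policy-gradient loss minus the entropy bonus, averaged over the batch.
    return float(np.mean(-log_probs * advantages - ent_coeff * entropies))

loss = a2c_actor_loss(np.array([-0.5, -1.0]),  # log pi(a|s)
                      np.array([1.0, -0.5]),   # advantage estimates
                      np.array([0.7, 0.7]),    # policy entropies
                      ent_coeff=0.01)
```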

_post_load
()[source]¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_optimize_actor_parameters
(loss)¶ Method used to update actor parameters to maximize a given loss.
Parameters: loss (torch.tensor) – the loss computed by the algorithm.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate to enforce consistency.


class
mushroom_rl.algorithms.actor_critic.deep_actor_critic.
DDPG
(mdp_info, policy_class, policy_params, actor_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, tau, policy_delay=1, critic_fit_params=None)[source]¶ Bases:
mushroom_rl.algorithms.actor_critic.deep_actor_critic.deep_actor_critic.DeepAC
Deep Deterministic Policy Gradient algorithm. “Continuous Control with Deep Reinforcement Learning”. Lillicrap T. P. et al., 2016.

__init__
(mdp_info, policy_class, policy_params, actor_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, tau, policy_delay=1, critic_fit_params=None)[source]¶ Constructor.
Parameters:  policy_class (Policy) – class of the policy;
 policy_params (dict) – parameters of the policy to build;
 actor_params (dict) – parameters of the actor approximator to build;
 actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
 critic_params (dict) – parameters of the critic approximator to build;
 batch_size (int) – the number of samples in a batch;
 initial_replay_size (int) – the number of samples to collect before starting the learning;
 max_replay_size (int) – the maximum number of samples in the replay memory;
 tau (float) – value of coefficient for soft updates;
 policy_delay (int, 1) – the number of updates of the critic after which an actor update is implemented;
 critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
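The tau coefficient performs Polyak averaging: after each fit, the target network weights drift toward the online weights. A NumPy sketch (`soft_update` is illustrative):

```python
import numpy as np

def soft_update(target, online, tau):
    # Polyak averaging: target <- tau * online + (1 - tau) * target.
    return tau * online + (1 - tau) * target

target = np.zeros(3)
online = np.ones(3)
target = soft_update(target, online, tau=0.1)  # -> [0.1, 0.1, 0.1]
```

With a small tau the target network changes slowly, which stabilizes the bootstrapped critic targets.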

_next_q
(next_state, absorbing)[source]¶ Parameters:  next_state (np.ndarray) – the states where the next action has to be evaluated;
 absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns: Action-values returned by the critic for next_state and the action returned by the actor.

_post_load
()[source]¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_optimize_actor_parameters
(loss)¶ Method used to update actor parameters to maximize a given loss.
Parameters: loss (torch.tensor) – the loss computed by the algorithm.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate to enforce consistency.


class
mushroom_rl.algorithms.actor_critic.deep_actor_critic.
TD3
(mdp_info, policy_class, policy_params, actor_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, tau, policy_delay=2, noise_std=0.2, noise_clip=0.5, critic_fit_params=None)[source]¶ Bases:
mushroom_rl.algorithms.actor_critic.deep_actor_critic.ddpg.DDPG
Twin Delayed DDPG algorithm. “Addressing Function Approximation Error in Actor-Critic Methods”. Fujimoto S. et al., 2018.

__init__
(mdp_info, policy_class, policy_params, actor_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, tau, policy_delay=2, noise_std=0.2, noise_clip=0.5, critic_fit_params=None)[source]¶ Constructor.
Parameters:  policy_class (Policy) – class of the policy;
 policy_params (dict) – parameters of the policy to build;
 actor_params (dict) – parameters of the actor approximator to build;
 actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
 critic_params (dict) – parameters of the critic approximator to build;
 batch_size (int) – the number of samples in a batch;
 initial_replay_size (int) – the number of samples to collect before starting the learning;
 max_replay_size (int) – the maximum number of samples in the replay memory;
 tau (float) – value of coefficient for soft updates;
 policy_delay (int, 2) – the number of updates of the critic after which an actor update is implemented;
 noise_std (float, .2) – standard deviation of the noise used for policy smoothing;
 noise_clip (float, .5) – maximum absolute value for policy smoothing noise;
 critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
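noise_std and noise_clip implement TD3's target policy smoothing: clipped Gaussian noise is added to the target action before the critic evaluates it. A NumPy sketch (`smoothed_action` is illustrative):

```python
import numpy as np

def smoothed_action(action, noise_std, noise_clip, rng):
    # Add clipped Gaussian noise to a target action (TD3-style smoothing).
    noise = np.clip(rng.normal(0., noise_std, size=action.shape),
                    -noise_clip, noise_clip)
    return action + noise

rng = np.random.default_rng(0)
a = smoothed_action(np.zeros(2), noise_std=0.2, noise_clip=0.5, rng=rng)
```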

_next_q
(next_state, absorbing)[source]¶ Parameters:  next_state (np.ndarray) – the states where the next action has to be evaluated;
 absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns: Action-values returned by the critic for next_state and the action returned by the actor.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_optimize_actor_parameters
(loss)¶ Method used to update actor parameters to maximize a given loss.
Parameters: loss (torch.tensor) – the loss computed by the algorithm.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.actor_critic.deep_actor_critic.
SAC
(mdp_info, actor_mu_params, actor_sigma_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, warmup_transitions, tau, lr_alpha, target_entropy=None, critic_fit_params=None)[source]¶ Bases:
mushroom_rl.algorithms.actor_critic.deep_actor_critic.deep_actor_critic.DeepAC
Soft Actor-Critic algorithm. “Soft Actor-Critic Algorithms and Applications”. Haarnoja T. et al., 2019.

__init__
(mdp_info, actor_mu_params, actor_sigma_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, warmup_transitions, tau, lr_alpha, target_entropy=None, critic_fit_params=None)[source]¶ Constructor.
Parameters:  actor_mu_params (dict) – parameters of the actor mean approximator to build;
 actor_sigma_params (dict) – parameters of the actor sigma approximator to build;
 actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
 critic_params (dict) – parameters of the critic approximator to build;
 batch_size (int) – the number of samples in a batch;
 initial_replay_size (int) – the number of samples to collect before starting the learning;
 max_replay_size (int) – the maximum number of samples in the replay memory;
 warmup_transitions (int) – number of samples to accumulate in the replay memory to start the policy fitting;
 tau (float) – value of coefficient for soft updates;
 lr_alpha (float) – Learning rate for the entropy coefficient;
 target_entropy (float, None) – target entropy for the policy; if None, a default value is computed;
 critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
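The tau coefficient above controls the soft (Polyak) update of the target networks. As a rough, library-independent sketch (plain Python, not MushroomRL code), the update blends the online weights into the target weights at each step:

```python
# Illustrative sketch of the soft target update controlled by `tau`
# (not the MushroomRL implementation, which operates on torch tensors).
def soft_update(target, online, tau):
    """Blend online weights into target weights: t <- (1 - tau) * t + tau * o."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]

target_weights = [0.0, 0.0]
online_weights = [1.0, 2.0]
tau = 0.005  # a small tau gives a slowly moving target

target_weights = soft_update(target_weights, online_weights, tau)
print(target_weights)  # [0.005, 0.01]
```

A small tau keeps the critic's bootstrap targets nearly stationary, which is what makes the temporal-difference fit of the critic stable.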

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_next_q
(next_state, absorbing)[source]¶ Parameters:  next_state (np.ndarray) – the states where the next action has to be evaluated;
 absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns: Action-values returned by the critic for next_state and the action returned by the actor.

_optimize_actor_parameters
(loss)¶ Method used to update actor parameters to maximize a given loss.
Parameters: loss (torch.tensor) – the loss computed by the algorithm.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.actor_critic.deep_actor_critic.
TRPO
(mdp_info, policy, critic_params, ent_coeff=0.0, max_kl=0.001, lam=1.0, n_epochs_line_search=10, n_epochs_cg=10, cg_damping=0.01, cg_residual_tol=1e-10, quiet=True, critic_fit_params=None)[source]¶ Bases:
mushroom_rl.algorithms.agent.Agent
Trust Region Policy Optimization algorithm. “Trust Region Policy Optimization”. Schulman J. et al., 2015.

__init__
(mdp_info, policy, critic_params, ent_coeff=0.0, max_kl=0.001, lam=1.0, n_epochs_line_search=10, n_epochs_cg=10, cg_damping=0.01, cg_residual_tol=1e-10, quiet=True, critic_fit_params=None)[source]¶ Constructor.
Parameters:  policy (TorchPolicy) – torch policy to be learned by the algorithm
 critic_params (dict) – parameters of the critic approximator to build;
 ent_coeff (float, 0) – coefficient for the entropy penalty;
 max_kl (float, 001) – maximum kl allowed for every policy update;
 float (lam) – lambda coefficient used by generalized advantage estimation;
 n_epochs_line_search (int, 10) – maximum number of iterations of the line search algorithm;
 n_epochs_cg (int, 10) – maximum number of iterations of the conjugate gradient algorithm;
 cg_damping (float, 1e2) – damping factor for the conjugate gradient algorithm;
 cg_residual_tol (float, 1e10) – conjugate gradient residual tolerance;
 quiet (bool, True) – if true, the algorithm will print debug information;
 critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
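TRPO solves its trust-region subproblem with a conjugate gradient method, which is where the n_epochs_cg, cg_damping, and cg_residual_tol parameters above come in. As a hedged, library-independent sketch (plain Python on a toy linear system, not the MushroomRL implementation, which applies this to Fisher-vector products), a conjugate gradient solver uses them like this:

```python
# Illustrative sketch of conjugate gradient, showing the roles of
# `n_epochs_cg` (iteration cap), `cg_damping` (diagonal regularizer),
# and `cg_residual_tol` (early-stop criterion). Not MushroomRL code.
def conjugate_gradient(A, b, n_epochs_cg=10, cg_damping=0.01, cg_residual_tol=1e-10):
    n = len(b)
    x = [0.0] * n

    def mat_vec(v):  # computes (A + cg_damping * I) v
        return [sum(A[i][j] * v[j] for j in range(n)) + cg_damping * v[i]
                for i in range(n)]

    r = list(b)  # residual b - A x, with initial x = 0
    p = list(r)
    rs_old = sum(ri * ri for ri in r)
    for _ in range(n_epochs_cg):
        Ap = mat_vec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < cg_residual_tol:
            break  # residual small enough: stop early
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Solve A x = b for a small symmetric positive-definite system.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b, cg_damping=0.0)
```

The damping term keeps the (approximate) curvature matrix well conditioned, at the cost of slightly biasing the solution.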

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.actor_critic.deep_actor_critic.
PPO
(mdp_info, policy, actor_optimizer, critic_params, n_epochs_policy, batch_size, eps_ppo, lam, quiet=True, critic_fit_params=None)[source]¶ Bases:
mushroom_rl.algorithms.agent.Agent
Proximal Policy Optimization algorithm. “Proximal Policy Optimization Algorithms”. Schulman J. et al., 2017.

__init__
(mdp_info, policy, actor_optimizer, critic_params, n_epochs_policy, batch_size, eps_ppo, lam, quiet=True, critic_fit_params=None)[source]¶ Constructor.
Parameters:  policy (TorchPolicy) – torch policy to be learned by the algorithm
 actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
 critic_params (dict) – parameters of the critic approximator to build;
 n_epochs_policy (int) – number of policy updates for every dataset;
 batch_size (int) – size of minibatches for every optimization step
 eps_ppo (float) – value for probability ratio clipping;
 float (lam) – lambda coefficient used by generalized advantage estimation;
 quiet (bool, True) – if true, the algorithm will print debug information;
 critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
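The eps_ppo parameter above clips the probability ratio between the new and old policy in PPO's surrogate objective. A minimal sketch of that objective for a single sample (plain Python, not the MushroomRL implementation, which works on batched torch tensors):

```python
# Illustrative sketch of PPO's clipped surrogate objective for one sample,
# showing how `eps_ppo` bounds the probability ratio. Not MushroomRL code.
def clipped_surrogate(ratio, advantage, eps_ppo):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = max(min(ratio, 1.0 + eps_ppo), 1.0 - eps_ppo)
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, ratios above 1 + eps_ppo gain nothing extra:
print(clipped_surrogate(1.5, advantage=1.0, eps_ppo=0.2))   # 1.2
# With a negative advantage, the unclipped term dominates the min:
print(clipped_surrogate(1.5, advantage=-1.0, eps_ppo=0.2))  # -1.5
```

The clipping removes the incentive to move the policy far from the one that collected the data, which is what lets PPO reuse each dataset for n_epochs_policy update passes.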

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

Policy search¶
Policy gradient¶

class
mushroom_rl.algorithms.policy_search.policy_gradient.
REINFORCE
(mdp_info, policy, learning_rate, features=None)[source]¶ Bases:
mushroom_rl.algorithms.policy_search.policy_gradient.policy_gradient.PolicyGradient
REINFORCE algorithm. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”, Williams R. J., 1992.

__init__
(mdp_info, policy, learning_rate, features=None)[source]¶ Constructor.
Parameters: learning_rate (Parameter) – the learning rate.

_compute_gradient
(J)[source]¶ Return the gradient computed by the algorithm.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.
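Each entry of J is the cumulative discounted reward of one episode. As a small illustrative helper (plain Python, not the MushroomRL implementation), it can be computed from an episode's reward sequence and the MDP discount factor:

```python
# Illustrative sketch: one entry of J is the discounted return of an episode,
# J = sum_t gamma^t * r_t. Not MushroomRL code.
def discounted_return(rewards, gamma):
    j = 0.0
    for t, r in enumerate(rewards):
        j += (gamma ** t) * r
    return j

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71 (up to rounding)
```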

_step_update
(x, u, r)[source]¶ This function is called, when parsing the dataset, at each episode step.
Parameters:  x (np.ndarray) – the state at the current step;
 u (np.ndarray) – the action at the current step;
 r (np.ndarray) – the reward at the current step.

_episode_end_update
()[source]¶ This function is called, when parsing the dataset, at the end of each episode. The implementation is dependent on the algorithm (e.g. REINFORCE updates some data structures).

_init_update
()[source]¶ This function is called, when parsing the dataset, at the beginning of each episode. The implementation is dependent on the algorithm (e.g. REINFORCE resets some data structure).

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_parse
(sample)¶ Utility to parse the sample.
Parameters: sample (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag. If provided, state
is preprocessed with the features.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

_update_parameters
(J)¶ Update the parameters of the policy.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.policy_search.policy_gradient.
GPOMDP
(mdp_info, policy, learning_rate, features=None)[source]¶ Bases:
mushroom_rl.algorithms.policy_search.policy_gradient.policy_gradient.PolicyGradient
GPOMDP algorithm. “Infinite-Horizon Policy-Gradient Estimation”. Baxter J. and Bartlett P. L., 2001.

__init__
(mdp_info, policy, learning_rate, features=None)[source]¶ Constructor.
Parameters: learning_rate (Parameter) – the learning rate.

_compute_gradient
(J)[source]¶ Return the gradient computed by the algorithm.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

_step_update
(x, u, r)[source]¶ This function is called, when parsing the dataset, at each episode step.
Parameters:  x (np.ndarray) – the state at the current step;
 u (np.ndarray) – the action at the current step;
 r (np.ndarray) – the reward at the current step.

_episode_end_update
()[source]¶ This function is called, when parsing the dataset, at the end of each episode. The implementation is dependent on the algorithm (e.g. REINFORCE updates some data structures).

_init_update
()[source]¶ This function is called, when parsing the dataset, at the beginning of each episode. The implementation is dependent on the algorithm (e.g. REINFORCE resets some data structure).

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_parse
(sample)¶ Utility to parse the sample.
Parameters: sample (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag. If provided, state
is preprocessed with the features.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

_update_parameters
(J)¶ Update the parameters of the policy.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.policy_search.policy_gradient.
eNAC
(mdp_info, policy, learning_rate, features=None, critic_features=None)[source]¶ Bases:
mushroom_rl.algorithms.policy_search.policy_gradient.policy_gradient.PolicyGradient
Episodic Natural Actor Critic algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J., 2013.

__init__
(mdp_info, policy, learning_rate, features=None, critic_features=None)[source]¶ Constructor.
Parameters: critic_features (Features, None) – features used by the critic.

_compute_gradient
(J)[source]¶ Return the gradient computed by the algorithm.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

_step_update
(x, u, r)[source]¶ This function is called, when parsing the dataset, at each episode step.
Parameters:  x (np.ndarray) – the state at the current step;
 u (np.ndarray) – the action at the current step;
 r (np.ndarray) – the reward at the current step.

_episode_end_update
()[source]¶ This function is called, when parsing the dataset, at the end of each episode. The implementation is dependent on the algorithm (e.g. REINFORCE updates some data structures).

_init_update
()[source]¶ This function is called, when parsing the dataset, at the beginning of each episode. The implementation is dependent on the algorithm (e.g. REINFORCE resets some data structure).

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_parse
(sample)¶ Utility to parse the sample.
Parameters: sample (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag. If provided, state
is preprocessed with the features.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

_update_parameters
(J)¶ Update the parameters of the policy.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

Black-box optimization¶

class
mushroom_rl.algorithms.policy_search.black_box_optimization.
RWR
(mdp_info, distribution, policy, beta, features=None)[source]¶ Bases:
mushroom_rl.algorithms.policy_search.black_box_optimization.black_box_optimization.BlackBoxOptimization
Reward-Weighted Regression algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J., 2013.

__init__
(mdp_info, distribution, policy, beta, features=None)[source]¶ Constructor.
Parameters: beta (float) – the temperature for the exponential reward transformation.
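The temperature beta controls how sharply RWR concentrates on high-return episodes. A minimal, library-independent sketch of the exponential transformation (plain Python, not the MushroomRL implementation):

```python
# Illustrative sketch of RWR's exponential reward transformation:
# w_i proportional to exp(beta * J_i). Not MushroomRL code.
import math

def rwr_weights(J, beta):
    m = max(J)  # subtract the max for numerical stability
    w = [math.exp(beta * (j - m)) for j in J]
    s = sum(w)
    return [wi / s for wi in w]

weights = rwr_weights([1.0, 2.0, 3.0], beta=1.0)
# Higher-return episodes receive exponentially larger normalized weights.
```

A larger beta makes the weighting greedier (closer to keeping only the best episode); a small beta approaches uniform weighting.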

_update
(Jep, theta)[source]¶ Function that implements the update routine of the distribution parameters. Every black-box algorithm should implement this function with the proper update.
Parameters:  Jep (np.ndarray) – a vector containing the J of the considered trajectories;
 theta (np.ndarray) – a matrix of policy parameters of the considered trajectories.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.policy_search.black_box_optimization.
PGPE
(mdp_info, distribution, policy, learning_rate, features=None)[source]¶ Bases:
mushroom_rl.algorithms.policy_search.black_box_optimization.black_box_optimization.BlackBoxOptimization
Policy Gradient with Parameter Exploration algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J., 2013.

__init__
(mdp_info, distribution, policy, learning_rate, features=None)[source]¶ Constructor.
Parameters: learning_rate (Parameter) – the learning rate for the gradient step.

_update
(Jep, theta)[source]¶ Function that implements the update routine of the distribution parameters. Every black-box algorithm should implement this function with the proper update.
Parameters:  Jep (np.ndarray) – a vector containing the J of the considered trajectories;
 theta (np.ndarray) – a matrix of policy parameters of the considered trajectories.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.policy_search.black_box_optimization.
REPS
(mdp_info, distribution, policy, eps, features=None)[source]¶ Bases:
mushroom_rl.algorithms.policy_search.black_box_optimization.black_box_optimization.BlackBoxOptimization
Episodic Relative Entropy Policy Search algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J., 2013.

__init__
(mdp_info, distribution, policy, eps, features=None)[source]¶ Constructor.
Parameters: eps (float) – the maximum admissible value for the Kullback-Leibler divergence between the new distribution and the previous one at each update step.
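To make the constraint concrete, here is a small illustrative check on discrete distributions (plain Python, not the MushroomRL implementation, whose search distribution is continuous and whose bound is enforced via a dual optimization):

```python
# Illustrative sketch: the Kullback-Leibler divergence that REPS bounds by
# `eps` when updating its search distribution. Not MushroomRL code.
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

old = [0.25, 0.25, 0.25, 0.25]
new = [0.30, 0.30, 0.20, 0.20]
eps = 0.5
assert kl_divergence(new, old) <= eps  # this update would be admissible
```

Bounding the divergence keeps each update close to the previous distribution, trading learning speed for stability.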

_update
(Jep, theta)[source]¶ Function that implements the update routine of the distribution parameters. Every black-box algorithm should implement this function with the proper update.
Parameters:  Jep (np.ndarray) – a vector containing the J of the considered trajectories;
 theta (np.ndarray) – a matrix of policy parameters of the considered trajectories.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

Value-based¶
TD¶

class
mushroom_rl.algorithms.value.td.
SARSA
(mdp_info, policy, learning_rate)[source]¶ Bases:
mushroom_rl.algorithms.value.td.td.TD
SARSA algorithm.

__init__
(mdp_info, policy, learning_rate)[source]¶ Constructor.
Parameters:  approximator (object) – the approximator to use to fit the Q-function;
 learning_rate (Parameter) – the learning rate.

_update
(state, action, reward, next_state, absorbing)[source]¶ Update the Q-table.
Parameters:  state (np.ndarray) – state;
 action (np.ndarray) – action;
 reward (np.ndarray) – reward;
 next_state (np.ndarray) – next state;
 absorbing (np.ndarray) – absorbing flag.
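The update implemented here is the classic on-policy temporal-difference rule. A minimal tabular sketch (plain Python dict instead of the library's Q-table approximator, not MushroomRL code):

```python
# Illustrative sketch of the tabular SARSA update:
# Q(s, a) <- Q(s, a) + alpha * (r + gamma * Q(s', a') - Q(s, a)),
# where a' is drawn from the current policy (on-policy). Not MushroomRL code.
def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma, absorbing):
    target = r if absorbing else r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])

q = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.5, 1: 1.0}}
sarsa_update(q, s=0, a=1, r=1.0, s_next=1, a_next=1,
             alpha=0.5, gamma=0.9, absorbing=False)
print(q[0][1])  # 0.5 * (1.0 + 0.9 * 1.0 - 0.0) = 0.95
```

When the transition is absorbing, the bootstrap term is dropped and the target is just the reward.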

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

static
_parse
(dataset)¶ Utility to parse the dataset that is supposed to contain only a single sample.
Parameters: dataset (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.td.
SARSALambda
(mdp_info, policy, learning_rate, lambda_coeff, trace='replacing')[source]¶ Bases:
mushroom_rl.algorithms.value.td.td.TD
The SARSA(lambda) algorithm for finite MDPs.

__init__
(mdp_info, policy, learning_rate, lambda_coeff, trace='replacing')[source]¶ Constructor.
Parameters:  lambda_coeff (float) – eligibility trace coefficient;
 trace (str, 'replacing') – type of eligibility trace to use.

_update
(state, action, reward, next_state, absorbing)[source]¶ Update the Q-table.
Parameters:  state (np.ndarray) – state;
 action (np.ndarray) – action;
 reward (np.ndarray) – reward;
 next_state (np.ndarray) – next state;
 absorbing (np.ndarray) – absorbing flag.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

static
_parse
(dataset)¶ Utility to parse a dataset that is supposed to contain a single sample.
Parameters: dataset (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.td.
ExpectedSARSA
(mdp_info, policy, learning_rate)[source]¶ Bases:
mushroom_rl.algorithms.value.td.td.TD
Expected SARSA algorithm. “A theoretical and empirical analysis of Expected Sarsa”. Seijen H. V. et al., 2009.

__init__
(mdp_info, policy, learning_rate)[source]¶ Constructor.
Parameters:  mdp_info (MDPInfo) – information about the MDP;
 policy (Policy) – the policy to follow;
 learning_rate (Parameter) – the learning rate.

_update
(state, action, reward, next_state, absorbing)[source]¶ Update the Q-table.
Parameters:  state (np.ndarray) – state;
 action (np.ndarray) – action;
 reward (np.ndarray) – reward;
 next_state (np.ndarray) – next state;
 absorbing (np.ndarray) – absorbing flag.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

static
_parse
(dataset)¶ Utility to parse a dataset that is supposed to contain a single sample.
Parameters: dataset (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.td.
QLearning
(mdp_info, policy, learning_rate)[source]¶ Bases:
mushroom_rl.algorithms.value.td.td.TD
Q-Learning algorithm. “Learning from Delayed Rewards”. Watkins C.J.C.H., 1989.

__init__
(mdp_info, policy, learning_rate)[source]¶ Constructor.
Parameters:  mdp_info (MDPInfo) – information about the MDP;
 policy (Policy) – the policy to follow;
 learning_rate (Parameter) – the learning rate.

_update
(state, action, reward, next_state, absorbing)[source]¶ Update the Q-table.
Parameters:  state (np.ndarray) – state;
 action (np.ndarray) – action;
 reward (np.ndarray) – reward;
 next_state (np.ndarray) – next state;
 absorbing (np.ndarray) – absorbing flag.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

static
_parse
(dataset)¶ Utility to parse a dataset that is supposed to contain a single sample.
Parameters: dataset (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.td.
DoubleQLearning
(mdp_info, policy, learning_rate)[source]¶ Bases:
mushroom_rl.algorithms.value.td.td.TD
Double Q-Learning algorithm. “Double Q-Learning”. Hasselt H. V., 2010.

__init__
(mdp_info, policy, learning_rate)[source]¶ Constructor.
Parameters:  mdp_info (MDPInfo) – information about the MDP;
 policy (Policy) – the policy to follow;
 learning_rate (Parameter) – the learning rate.

_update
(state, action, reward, next_state, absorbing)[source]¶ Update the Q-table.
Parameters:  state (np.ndarray) – state;
 action (np.ndarray) – action;
 reward (np.ndarray) – reward;
 next_state (np.ndarray) – next state;
 absorbing (np.ndarray) – absorbing flag.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

static
_parse
(dataset)¶ Utility to parse a dataset that is supposed to contain a single sample.
Parameters: dataset (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.td.
SpeedyQLearning
(mdp_info, policy, learning_rate)[source]¶ Bases:
mushroom_rl.algorithms.value.td.td.TD
Speedy Q-Learning algorithm. “Speedy Q-Learning”. Ghavamzadeh et al., 2011.

__init__
(mdp_info, policy, learning_rate)[source]¶ Constructor.
Parameters:  mdp_info (MDPInfo) – information about the MDP;
 policy (Policy) – the policy to follow;
 learning_rate (Parameter) – the learning rate.

_update
(state, action, reward, next_state, absorbing)[source]¶ Update the Q-table.
Parameters:  state (np.ndarray) – state;
 action (np.ndarray) – action;
 reward (np.ndarray) – reward;
 next_state (np.ndarray) – next state;
 absorbing (np.ndarray) – absorbing flag.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

static
_parse
(dataset)¶ Utility to parse a dataset that is supposed to contain a single sample.
Parameters: dataset (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.td.
RLearning
(mdp_info, policy, learning_rate, beta)[source]¶ Bases:
mushroom_rl.algorithms.value.td.td.TD
R-Learning algorithm. “A Reinforcement Learning Method for Maximizing Undiscounted Rewards”. Schwartz A., 1993.

__init__
(mdp_info, policy, learning_rate, beta)[source]¶ Constructor.
Parameters: beta (Parameter) – beta coefficient.

_update
(state, action, reward, next_state, absorbing)[source]¶ Update the Q-table.
Parameters:  state (np.ndarray) – state;
 action (np.ndarray) – action;
 reward (np.ndarray) – reward;
 next_state (np.ndarray) – next state;
 absorbing (np.ndarray) – absorbing flag.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

static
_parse
(dataset)¶ Utility to parse a dataset that is supposed to contain a single sample.
Parameters: dataset (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.td.
WeightedQLearning
(mdp_info, policy, learning_rate, sampling=True, precision=1000)[source]¶ Bases:
mushroom_rl.algorithms.value.td.td.TD
Weighted Q-Learning algorithm. “Estimating the Maximum Expected Value through Gaussian Approximation”. D’Eramo C. et al., 2016.

__init__
(mdp_info, policy, learning_rate, sampling=True, precision=1000)[source]¶ Constructor.
Parameters:  sampling (bool, True) – use the approximated version to speed up the computation;
 precision (int, 1000) – number of samples to use in the approximated version.

_update
(state, action, reward, next_state, absorbing)[source]¶ Update the Q-table.
Parameters:  state (np.ndarray) – state;
 action (np.ndarray) – action;
 reward (np.ndarray) – reward;
 next_state (np.ndarray) – next state;
 absorbing (np.ndarray) – absorbing flag.

_next_q
(next_state)[source]¶ Parameters: next_state (np.ndarray) – the state where the next action has to be evaluated. Returns: The weighted estimator value in next_state.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

static
_parse
(dataset)¶ Utility to parse a dataset that is supposed to contain a single sample.
Parameters: dataset (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.td.
RQLearning
(mdp_info, policy, learning_rate, off_policy=False, beta=None, delta=None)[source]¶ Bases:
mushroom_rl.algorithms.value.td.td.TD
RQ-Learning algorithm. “Exploiting Structure and Uncertainty of Bellman Updates in Markov Decision Processes”. Tateo D. et al., 2017.

__init__
(mdp_info, policy, learning_rate, off_policy=False, beta=None, delta=None)[source]¶ Constructor.
Parameters:  off_policy (bool, False) – whether to use the off-policy setting;
 beta (Parameter, None) – beta coefficient;
 delta (Parameter, None) – delta coefficient.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

static
_parse
(dataset)¶ Utility to parse a dataset that is supposed to contain a single sample.
Parameters: dataset (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

_update
(state, action, reward, next_state, absorbing)[source]¶ Update the Q-table.
Parameters:  state (np.ndarray) – state;
 action (np.ndarray) – action;
 reward (np.ndarray) – reward;
 next_state (np.ndarray) – next state;
 absorbing (np.ndarray) – absorbing flag.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.td.
SARSALambdaContinuous
(mdp_info, policy, approximator, learning_rate, lambda_coeff, features, approximator_params=None)[source]¶ Bases:
mushroom_rl.algorithms.value.td.td.TD
Continuous version of the SARSA(lambda) algorithm.

__init__
(mdp_info, policy, approximator, learning_rate, lambda_coeff, features, approximator_params=None)[source]¶ Constructor.
Parameters: lambda_coeff (float) – eligibility trace coefficient.

_update
(state, action, reward, next_state, absorbing)[source]¶ Update the Q-table.
Parameters:  state (np.ndarray) – state;
 action (np.ndarray) – action;
 reward (np.ndarray) – reward;
 next_state (np.ndarray) – next state;
 absorbing (np.ndarray) – absorbing flag.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

static
_parse
(dataset)¶ Utility to parse a dataset that is supposed to contain a single sample.
Parameters: dataset (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.td.
TrueOnlineSARSALambda
(mdp_info, policy, learning_rate, lambda_coeff, features, approximator_params=None)[source]¶ Bases:
mushroom_rl.algorithms.value.td.td.TD
True Online SARSA(lambda) with linear function approximation. “True Online TD(lambda)”. Seijen H. V. et al., 2014.

__init__
(mdp_info, policy, learning_rate, lambda_coeff, features, approximator_params=None)[source]¶ Constructor.
Parameters: lambda_coeff (float) – eligibility trace coefficient.

_update
(state, action, reward, next_state, absorbing)[source]¶ Update the Q-table.
Parameters:  state (np.ndarray) – state;
 action (np.ndarray) – action;
 reward (np.ndarray) – reward;
 next_state (np.ndarray) – next state;
 absorbing (np.ndarray) – absorbing flag.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

static
_parse
(dataset)¶ Utility to parse a dataset that is supposed to contain a single sample.
Parameters: dataset (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.

Batch TD¶

class
mushroom_rl.algorithms.value.batch_td.
FQI
(mdp_info, policy, approximator, n_iterations, approximator_params=None, fit_params=None, quiet=False, boosted=False)[source]¶ Bases:
mushroom_rl.algorithms.value.batch_td.batch_td.BatchTD
Fitted Q-Iteration algorithm. “Tree-Based Batch Mode Reinforcement Learning”. Ernst D. et al., 2005.

__init__
(mdp_info, policy, approximator, n_iterations, approximator_params=None, fit_params=None, quiet=False, boosted=False)[source]¶ Constructor.
Parameters:  n_iterations (int) – number of iterations to perform for training;
 quiet (bool, False) – whether to suppress the progress bar;
 boosted (bool, False) – whether to use boosted FQI or not.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a Core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.batch_td.
DoubleFQI
(mdp_info, policy, approximator, n_iterations, approximator_params=None, fit_params=None, quiet=False)[source]¶ Bases:
mushroom_rl.algorithms.value.batch_td.fqi.FQI
Double Fitted Q-Iteration algorithm. “Estimating the Maximum Expected Value in Continuous Reinforcement Learning Problems”. D’Eramo C. et al., 2017.

__init__
(mdp_info, policy, approximator, n_iterations, approximator_params=None, fit_params=None, quiet=False)[source]¶ Constructor.
Parameters:  n_iterations (int) – number of iterations to perform for training;
 quiet (bool, False) – whether to show the progress bar or not;
 boosted (bool, False) – whether to use boosted FQI or not.
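The double-estimator idea behind Double FQI can be sketched with plain NumPy (an illustrative fragment with made-up data, not the library implementation): one Q estimate selects the greedy action, while the other evaluates it, mitigating the overestimation bias of the plain max operator.

```python
import numpy as np

# Conceptual sketch of the double-estimator regression target
# (illustrative only): two independent Q estimates are kept; one
# selects the greedy action, the other evaluates it.
rng = np.random.default_rng(0)
q_a = rng.random((5, 3))  # Q estimate A: 5 states, 3 actions
q_b = rng.random((5, 3))  # Q estimate B

rewards = rng.random(5)
gamma = 0.9
next_states = np.arange(5)

# Target for updating estimate A: action chosen by A, value read from B.
greedy_a = np.argmax(q_a[next_states], axis=1)
targets_a = rewards + gamma * q_b[next_states, greedy_a]
```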

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_fit_boosted
(x)¶ Single fit iteration for boosted FQI.
Parameters: x (list) – the dataset.

_post_load
()¶ This method can be overridden to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit loop.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where te object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.batch_td.
LSPI
(mdp_info, policy, approximator_params=None, epsilon=0.01, fit_params=None, features=None)[source]¶ Bases:
mushroom_rl.algorithms.value.batch_td.batch_td.BatchTD
Least-Squares Policy Iteration algorithm. “Least-Squares Policy Iteration”. Lagoudakis M. G. and Parr R., 2003.

__init__
(mdp_info, policy, approximator_params=None, epsilon=0.01, fit_params=None, features=None)[source]¶ Constructor.
Parameters: epsilon (float, 1e-2) – termination coefficient.
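LSPI repeatedly solves a least-squares problem (LSTD-Q) for the Q-function weights until the weight change falls below epsilon. A rough NumPy sketch of one such closed-form solve (illustrative, with made-up features; not the library code):

```python
import numpy as np

# One LSTD-Q solve, the building block of LSPI (illustrative sketch).
rng = np.random.default_rng(1)
n, d, gamma = 100, 4, 0.95

phi = rng.random((n, d))        # features of (s, a)
phi_next = rng.random((n, d))   # features of (s', pi(s'))
r = rng.random(n)               # rewards

A = phi.T @ (phi - gamma * phi_next)
b = phi.T @ r
w = np.linalg.solve(A, b)       # Q(s, a) ~= phi(s, a) . w
```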

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overridden to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where te object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

DQN¶

class
mushroom_rl.algorithms.value.dqn.
DQN
(mdp_info, policy, approximator, approximator_params, batch_size, target_update_frequency, replay_memory=None, initial_replay_size=500, max_replay_size=5000, fit_params=None, n_approximators=1, clip_reward=True)[source]¶ Bases:
mushroom_rl.algorithms.agent.Agent
Deep Q-Network algorithm. “Human-Level Control Through Deep Reinforcement Learning”. Mnih V. et al., 2015.

__init__
(mdp_info, policy, approximator, approximator_params, batch_size, target_update_frequency, replay_memory=None, initial_replay_size=500, max_replay_size=5000, fit_params=None, n_approximators=1, clip_reward=True)[source]¶ Constructor.
Parameters:  approximator (object) – the approximator to use to fit the Q-function;
 approximator_params (dict) – parameters of the approximator to build;
 batch_size (int) – the number of samples in a batch;
 target_update_frequency (int) – the number of samples collected between each update of the target network;
 replay_memory ([ReplayMemory, PrioritizedReplayMemory], None) – the object of the replay memory to use; if None, a default replay memory is created;
 initial_replay_size (int) – the number of samples to collect before starting the learning;
 max_replay_size (int) – the maximum number of samples in the replay memory;
 fit_params (dict, None) – parameters of the fitting algorithm of the approximator;
 n_approximators (int, 1) – the number of approximators to use in AveragedDQN;
 clip_reward (bool, True) – whether to clip the reward or not.
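The regression target these parameters control can be sketched as follows (an illustrative NumPy fragment with made-up values, not the library implementation): rewards are optionally clipped, and the bootstrap term comes from the frozen target network, zeroed on absorbing states.

```python
import numpy as np

# Sketch of the DQN regression target (illustrative only).
rng = np.random.default_rng(2)
gamma = 0.99
rewards = np.array([0.5, 2.0, -3.0])
absorbing = np.array([0.0, 0.0, 1.0])   # third transition ends the episode
q_target_next = rng.random((3, 4))      # target network Q-values for s'

clipped_r = np.clip(rewards, -1.0, 1.0)                          # clip_reward=True
y = clipped_r + gamma * (1 - absorbing) * q_target_next.max(axis=1)
```

The target network is a lagged copy of the online one, refreshed every `target_update_frequency` collected samples.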

_next_q
(next_state, absorbing)[source]¶ Parameters:  next_state (np.ndarray) – the states where the next action has to be evaluated;
 absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns: Maximum action-value for each state in next_state.

draw_action
(state)[source]¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

_post_load
()[source]¶ This method can be overridden to implement logic that is executed after the loading of the agent.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

copy
()¶ Returns: A deepcopy of the agent.

episode_start
()¶ Called by the agent when a new episode starts.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where te object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.dqn.
DoubleDQN
(mdp_info, policy, approximator, approximator_params, batch_size, target_update_frequency, replay_memory=None, initial_replay_size=500, max_replay_size=5000, fit_params=None, n_approximators=1, clip_reward=True)[source]¶ Bases:
mushroom_rl.algorithms.value.dqn.dqn.DQN
Double DQN algorithm. “Deep Reinforcement Learning with Double Q-Learning”. Hasselt H. V. et al., 2016.

_next_q
(next_state, absorbing)[source]¶ Parameters:  next_state (np.ndarray) – the states where the next action has to be evaluated;
 absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns: Maximum action-value for each state in next_state.
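The Double DQN bootstrap decouples action selection from action evaluation (an illustrative NumPy sketch, not the library code): the online network picks the greedy action, while the target network provides its value.

```python
import numpy as np

# Sketch of the Double DQN bootstrap term (illustrative only).
rng = np.random.default_rng(3)
q_online = rng.random((6, 4))   # online network Q-values at s'
q_target = rng.random((6, 4))   # target network Q-values at s'
absorbing = np.zeros(6)

a_star = np.argmax(q_online, axis=1)                 # selection: online net
next_q = q_target[np.arange(6), a_star] * (1 - absorbing)  # evaluation: target net
```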

__init__
(mdp_info, policy, approximator, approximator_params, batch_size, target_update_frequency, replay_memory=None, initial_replay_size=500, max_replay_size=5000, fit_params=None, n_approximators=1, clip_reward=True)¶ Constructor.
Parameters:  approximator (object) – the approximator to use to fit the Q-function;
 approximator_params (dict) – parameters of the approximator to build;
 batch_size (int) – the number of samples in a batch;
 target_update_frequency (int) – the number of samples collected between each update of the target network;
 replay_memory ([ReplayMemory, PrioritizedReplayMemory], None) – the object of the replay memory to use; if None, a default replay memory is created;
 initial_replay_size (int) – the number of samples to collect before starting the learning;
 max_replay_size (int) – the maximum number of samples in the replay memory;
 fit_params (dict, None) – parameters of the fitting algorithm of the approximator;
 n_approximators (int, 1) – the number of approximators to use in AveragedDQN;
 clip_reward (bool, True) – whether to clip the reward or not.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overridden to implement logic that is executed after the loading of the agent.

_update_target
()¶ Update the target network.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where te object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.dqn.
AveragedDQN
(mdp_info, policy, approximator, **params)[source]¶ Bases:
mushroom_rl.algorithms.value.dqn.dqn.DQN
Averaged-DQN algorithm. “Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning”. Anschel O. et al., 2017.

__init__
(mdp_info, policy, approximator, **params)[source]¶ Constructor.
Parameters:  approximator (object) – the approximator to use to fit the Q-function;
 approximator_params (dict) – parameters of the approximator to build;
 batch_size (int) – the number of samples in a batch;
 target_update_frequency (int) – the number of samples collected between each update of the target network;
 replay_memory ([ReplayMemory, PrioritizedReplayMemory], None) – the object of the replay memory to use; if None, a default replay memory is created;
 initial_replay_size (int) – the number of samples to collect before starting the learning;
 max_replay_size (int) – the maximum number of samples in the replay memory;
 fit_params (dict, None) – parameters of the fitting algorithm of the approximator;
 n_approximators (int, 1) – the number of approximators to use in AveragedDQN;
 clip_reward (bool, True) – whether to clip the reward or not.
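The role of `n_approximators` can be sketched in NumPy (illustrative only, not the library code): the Q-values of the last K target networks are averaged before taking the max, which reduces the variance of the bootstrap target.

```python
import numpy as np

# Sketch of the Averaged-DQN bootstrap term (illustrative only).
rng = np.random.default_rng(4)
n_approximators = 3
q_targets = rng.random((n_approximators, 5, 2))  # K nets, 5 states, 2 actions

q_mean = q_targets.mean(axis=0)   # average the K target estimates
next_q = q_mean.max(axis=1)       # then take the greedy value
```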

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overridden to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where te object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class
mushroom_rl.algorithms.value.dqn.
CategoricalDQN
(mdp_info, policy, approximator_params, n_atoms, v_min, v_max, **params)[source]¶ Bases:
mushroom_rl.algorithms.value.dqn.dqn.DQN
Categorical DQN algorithm. “A Distributional Perspective on Reinforcement Learning”. Bellemare M. et al., 2017.

__init__
(mdp_info, policy, approximator_params, n_atoms, v_min, v_max, **params)[source]¶ Constructor.
Parameters:  n_atoms (int) – number of atoms;
 v_min (float) – minimum value of the value function;
 v_max (float) – maximum value of the value function.
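The role of `n_atoms`, `v_min` and `v_max` can be illustrated with the standard categorical projection step from the cited paper (a sketch of the general technique with made-up numbers, not the library code): the Bellman-shifted atoms r + gamma * z are clipped to [v_min, v_max] and their probability mass is split between the two nearest atoms of the fixed support.

```python
import numpy as np

# Sketch of the categorical (C51) projection step (illustrative only).
n_atoms, v_min, v_max = 5, 0.0, 4.0
z = np.linspace(v_min, v_max, n_atoms)       # fixed atom support
delta_z = (v_max - v_min) / (n_atoms - 1)

p = np.full(n_atoms, 1.0 / n_atoms)          # next-state distribution
r, gamma = 1.0, 0.9

tz = np.clip(r + gamma * z, v_min, v_max)    # shifted atoms
b = (tz - v_min) / delta_z                   # fractional atom index
l, u = np.floor(b).astype(int), np.ceil(b).astype(int)

m = np.zeros(n_atoms)                        # projected distribution
on_grid = (l == u)                           # atoms landing exactly on the support
np.add.at(m, l[~on_grid], (p * (u - b))[~on_grid])
np.add.at(m, u[~on_grid], (p * (b - l))[~on_grid])
np.add.at(m, l[on_grid], p[on_grid])
```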

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_next_q
(next_state, absorbing)¶ Parameters:  next_state (np.ndarray) – the states where the next action has to be evaluated;
 absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns: Maximum action-value for each state in next_state.

_post_load
()¶ This method can be overridden to implement logic that is executed after the loading of the agent.

_update_target
()¶ Update the target network.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where te object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

stop
()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

Approximators¶
MushroomRL exposes the high-level class Regressor
that can manage any type of function regressor. This class is a wrapper for any kind of function
approximator, e.g. a scikit-learn approximator or a PyTorch neural network.
Regressor¶

class
mushroom_rl.approximators.regressor.
Regressor
(approximator, input_shape, output_shape=(1, ), n_actions=None, n_models=1, **params)[source]¶ Bases:
mushroom_rl.core.serialization.Serializable
This class implements the functions to manage a function approximator. It selects the appropriate kind of regressor to build according to the parameters provided by the user; this makes it the only class to use for every kind of task that has to be performed. The implementation to use is inferred from the provided value of n_actions. If n_actions is provided, the user wants an approximator of the Q-function: if n_actions is equal to output_shape, a QRegressor is created, else (output_shape should be (1,)) an ActionRegressor is created. Otherwise, a GenericRegressor is created. An Ensemble model can be used for all the previous implementations simply by providing an n_models parameter greater than 1.
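The selection rule described above can be summarized in a few lines (an illustrative helper; `select_regressor_type` is hypothetical and not part of the library API):

```python
def select_regressor_type(n_actions, output_shape):
    """Mirror of the documented inference rule (illustrative only)."""
    if n_actions is not None:
        # The user wants a Q-function approximator.
        if n_actions == output_shape[0]:
            return "QRegressor"
        return "ActionRegressor"   # output_shape should be (1,)
    return "GenericRegressor"
```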
__init__
(approximator, input_shape, output_shape=(1, ), n_actions=None, n_models=1, **params)[source]¶ Constructor.
Parameters:  approximator (class) – the approximator class to use to create the model;
 input_shape (tuple) – the shape of the input of the model;
 output_shape (tuple, (1,)) – the shape of the output of the model;
 n_actions (int, None) – number of actions considered to create a QRegressor or an ActionRegressor;
 n_models (int, 1) – number of models to create;
 **params – other parameters to create each model.

fit
(*z, **fit_params)[source]¶ Fit the model.
Parameters:  *z – list of input of the model;
 **fit_params – parameters to use to fit the model.

predict
(*z, **predict_params)[source]¶ Predict the output of the model given an input.
Parameters:  *z – list of input of the model;
 **predict_params – parameters to use to predict with the model.
Returns: The model prediction.

model
¶ Returns: The model object.

input_shape
¶ Returns: The shape of the input of the model.

output_shape
¶ Returns: The shape of the output of the model.

weights_size
¶ Returns: The shape of the weights of the model.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overridden to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where te object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

Approximator¶
Linear¶

class
mushroom_rl.approximators.parametric.linear.
LinearApproximator
(weights=None, input_shape=None, output_shape=(1, ), **kwargs)[source]¶ Bases:
mushroom_rl.core.serialization.Serializable
This class implements a linear approximator.

__init__
(weights=None, input_shape=None, output_shape=(1, ), **kwargs)[source]¶ Constructor.
Parameters:  weights (np.ndarray) – array of weights to initialize the weights of the approximator;
 input_shape (np.ndarray, None) – the shape of the input of the model;
 output_shape (np.ndarray, (1,)) – the shape of the output of the model;
 **kwargs – other params of the approximator.

fit
(x, y, **fit_params)[source]¶ Fit the model.
Parameters:  x (np.ndarray) – input;
 y (np.ndarray) – target;
 **fit_params – other parameters used by the fit method of the regressor.

predict
(x, **predict_params)[source]¶ Predict.
Parameters:  x (np.ndarray) – input;
 **predict_params – other parameters used by the predict method of the regressor.
Returns: The predictions of the model.
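For intuition, fitting a linear model of the kind this class represents amounts to a least-squares solve, and predicting to a dot product with the features (an illustrative NumPy sketch, not the library code):

```python
import numpy as np

# Least-squares fit of a linear model (illustrative sketch).
rng = np.random.default_rng(0)
x = rng.random((50, 3))                  # input features
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w                           # noiseless linear target

w, *_ = np.linalg.lstsq(x, y, rcond=None)  # fit: solve for the weights
pred = x @ w                               # predict: dot product
```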

weights_size
¶ Returns: The size of the array of weights.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overridden to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

diff
(state, action=None)[source]¶ Compute the derivative of the output w.r.t. state, and action if provided.
Parameters:  state (np.ndarray) – the state;
 action (np.ndarray, None) – the action.
Returns: The derivative of the output w.r.t. state, and action if provided.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where te object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

Torch Approximator¶

class
mushroom_rl.approximators.parametric.torch_approximator.
TorchApproximator
(input_shape, output_shape, network, optimizer=None, loss=None, batch_size=0, n_fit_targets=1, use_cuda=False, reinitialize=False, dropout=False, quiet=True, **params)[source]¶ Bases:
mushroom_rl.core.serialization.Serializable
Class to interface a PyTorch model to the MushroomRL Regressor interface. This class implements all that is needed to use a generic PyTorch model and train it using a specified optimizer and objective function. This class also supports minibatches.

__init__
(input_shape, output_shape, network, optimizer=None, loss=None, batch_size=0, n_fit_targets=1, use_cuda=False, reinitialize=False, dropout=False, quiet=True, **params)[source]¶ Constructor.
Parameters:  input_shape (tuple) – shape of the input of the network;
 output_shape (tuple) – shape of the output of the network;
 network (torch.nn.Module) – the network class to use;
 optimizer (dict) – the optimizer used for every fit step;
 loss (torch.nn.functional) – the loss function to optimize in the fit method;
 batch_size (int, 0) – the size of each minibatch. If 0, the whole dataset is fed to the optimizer at each epoch;
 n_fit_targets (int, 1) – the number of fit targets used by the fit method of the network;
 use_cuda (bool, False) – if True, runs the network on the GPU;
 reinitialize (bool, False) – if True, the approximator is reinitialized at every fit call. To perform the initialization, the weights_init method must be defined properly for the selected model network;
 dropout (bool, False) – if True, dropout is applied only during train;
 quiet (bool, True) – if False, shows two progress bars, one for epochs and one for the minibatches;
 **params – dictionary of parameters needed to construct the network.

predict
(*args, output_tensor=False, **kwargs)[source]¶ Predict.
Parameters:  *args – input;
 output_tensor (bool, False) – whether to return the output as tensor or not;
 **kwargs – other parameters used by the predict method of the regressor.
Returns: The predictions of the model.

fit
(*args, n_epochs=None, weights=None, epsilon=None, patience=1, validation_split=1.0, **kwargs)[source]¶ Fit the model.
Parameters:  *args – input, where the last n_fit_targets elements are considered as the target, while the others are considered as input;
 n_epochs (int, None) – the number of training epochs;
 weights (np.ndarray, None) – the weights of each sample in the computation of the loss;
 epsilon (float, None) – the coefficient used for early stopping;
 patience (float, 1.) – the number of epochs to wait before stopping the learning if it is not improving;
 validation_split (float, 1.) – the percentage of the dataset to use as the training set;
 **kwargs – other parameters used by the fit method of the regressor.
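The epsilon/patience early-stopping semantics can be sketched as follows (`stop_training` is a hypothetical helper illustrating one plausible reading of these parameters, not the library's exact logic): training stops once the loss has improved by less than epsilon for `patience` consecutive epochs.

```python
def stop_training(val_losses, epsilon, patience):
    """Illustrative epsilon/patience early-stopping check (a sketch under
    stated assumptions, not the library implementation)."""
    if len(val_losses) <= patience:
        return False
    # Look at the last `patience` epoch-to-epoch improvements.
    recent = val_losses[-(patience + 1):]
    improvements = [recent[i] - recent[i + 1] for i in range(patience)]
    return all(imp < epsilon for imp in improvements)
```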

weights_size
¶ Returns: The size of the array of weights.

diff
(*args, **kwargs)[source]¶ Compute the derivative of the output w.r.t. state, and action if provided.
Parameters:  state (np.ndarray) – the state;
 action (np.ndarray, None) – the action.
Returns: The derivative of the output w.r.t. state, and action if provided.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overridden to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where te object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

Distributions¶

class
mushroom_rl.distributions.distribution.
Distribution
[source]¶ Bases:
mushroom_rl.core.serialization.Serializable
Interface for Distributions to represent a generic probability distribution. Probability distributions are often used by black-box optimization algorithms in order to perform exploration in parameter space. In the literature, they are also known as high-level policies.

sample
()[source]¶ Draw a sample from the distribution.
Returns: A random vector sampled from the distribution.

log_pdf
(theta)[source]¶ Compute the logarithm of the probability density function at the specified point.
Parameters: theta (np.ndarray) – the point where the log pdf is calculated. Returns: The value of the log pdf in the specified point.

__call__
(theta)[source]¶ Compute the probability density function at the specified point.
Parameters: theta (np.ndarray) – the point where the pdf is calculated. Returns: The value of the pdf in the specified point.

mle
(theta, weights=None)[source]¶ Compute the (weighted) maximum likelihood estimate of the points, and update the distribution accordingly.
Parameters:  theta (np.ndarray) – a set of points, every row is a sample
 weights (np.ndarray, None) – a vector of weights. If specified the weighted maximum likelihood estimate is computed instead of the plain maximum likelihood. The number of elements of this vector must be equal to the number of rows of the theta matrix.

diff_log
(theta)[source]¶ Compute the gradient of the logarithm of the probability density function at the specified point.
Parameters: theta (np.ndarray) – the point where the gradient of the log pdf is calculated.
Returns: The gradient of the log pdf in the specified point.

diff
(theta)[source]¶ Compute the derivative of the probability density function at the specified point. Normally it is computed through the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:
\[\nabla_{\rho}p(\theta)=p(\theta)\nabla_{\rho}\log p(\theta)\]
Parameters: theta (np.ndarray) – the point where the gradient of the pdf is calculated.
Returns: The gradient of the pdf in the specified point.
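The likelihood ratio identity can be checked numerically for a one-dimensional Gaussian whose parameter rho is its mean mu (an illustrative sketch with made-up values):

```python
import numpy as np

# Numerical check of grad_mu p(theta) = p(theta) * grad_mu log p(theta)
# for a 1-D Gaussian (illustrative only).
mu, sigma, theta = 0.5, 1.0, 1.3

def pdf(m):
    return np.exp(-0.5 * ((theta - m) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

grad_log = (theta - mu) / sigma ** 2                   # analytic grad of log p w.r.t. mu
eps = 1e-6
grad_p = (pdf(mu + eps) - pdf(mu - eps)) / (2 * eps)   # finite-difference grad of p
```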

set_parameters
(rho)[source]¶ Setter.
Parameters: rho (np.ndarray) – the vector of the new parameters to be used by the distribution

parameters_size
¶ Property.
Returns: The size of the distribution parameters.

__init__
¶ Initialize self. See help(type(self)) for accurate signature.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overridden to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

Gaussian¶

class
mushroom_rl.distributions.gaussian.
GaussianDistribution
(mu, sigma)[source]¶ Bases:
mushroom_rl.distributions.distribution.Distribution
Gaussian distribution with fixed covariance matrix. The parameters vector represents only the mean.
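As a plain-numpy sketch of this parametrization (illustrative code, not the actual class): sampling uses the fixed covariance, while the distribution's parameter vector is just the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.zeros(2)
sigma = np.array([[1.0, 0.2],
                  [0.2, 2.0]])                  # fixed covariance, never updated

theta = rng.multivariate_normal(mu, sigma)      # what sample() would return
rho = np.array([0.5, -0.5])
mu = rho                                        # what set_parameters(rho) would do
parameters_size = mu.size                       # only the mean is a parameter
```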

__init__
(mu, sigma)[source]¶ Constructor.
Parameters:  mu (np.ndarray) – initial mean of the distribution;
 sigma (np.ndarray) – covariance matrix of the distribution.

sample
()[source]¶ Draw a sample from the distribution.
Returns: A random vector sampled from the distribution.

log_pdf
(theta)[source]¶ Compute the logarithm of the probability density function in the specified point.
Parameters: theta (np.ndarray) – the point where the log pdf is calculated. Returns: The value of the log pdf in the specified point.

__call__
(theta)[source]¶ Compute the probability density function in the specified point.
Parameters: theta (np.ndarray) – the point where the pdf is calculated. Returns: The value of the pdf in the specified point.

mle
(theta, weights=None)[source]¶ Compute the (weighted) maximum likelihood estimate of the points, and update the distribution accordingly.
Parameters:  theta (np.ndarray) – a set of points, every row is a sample
 weights (np.ndarray, None) – a vector of weights. If specified the weighted maximum likelihood estimate is computed instead of the plain maximum likelihood. The number of elements of this vector must be equal to the number of rows of the theta matrix.
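For intuition, here is a plain-numpy sketch of the weighted maximum likelihood estimate of the mean (illustrative values; the actual method also covers the unweighted case when weights is None):

```python
import numpy as np

# Each row of theta is a sample; weights reweight the samples.
theta = np.array([[0.0, 0.0],
                  [2.0, 2.0],
                  [4.0, 4.0]])
weights = np.array([1.0, 1.0, 2.0])

# Weighted maximum likelihood estimate of the mean
mu_hat = weights @ theta / weights.sum()
```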

diff_log
(theta)[source]¶ Compute the gradient of the logarithm of the probability density function in the specified point.
Parameters: theta (np.ndarray) – the point where the gradient of the log pdf is calculated.
Returns: The gradient of the log pdf in the specified point.

set_parameters
(rho)[source]¶ Setter.
Parameters: rho (np.ndarray) – the vector of the new parameters to be used by the distribution

parameters_size
¶ Property.
Returns: The size of the distribution parameters.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overridden to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

diff
(theta)¶ Compute the derivative of the probability density function in the specified point. It is normally computed exploiting the derivative of the logarithm of the probability density function, via the likelihood ratio trick, i.e.:
\[\nabla_{\rho}p(\theta)=p(\theta)\nabla_{\rho}\log p(\theta)\]
Parameters: theta (np.ndarray) – the point where the gradient of the pdf is calculated.
Returns: The gradient of the pdf in the specified point.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.


class
mushroom_rl.distributions.gaussian.
GaussianDiagonalDistribution
(mu, std)[source]¶ Bases:
mushroom_rl.distributions.distribution.Distribution
Gaussian distribution with diagonal covariance matrix. The parameters vector represents the mean and the standard deviation for each dimension.
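A plain-numpy sketch of the diagonal parametrization (not the MushroomRL implementation): the parameter vector stacks the mean and the per-dimension standard deviations, and the log pdf decomposes into a sum of independent 1-D Gaussian log pdfs:

```python
import numpy as np

mu = np.array([0.0, 1.0])
std = np.array([1.0, 2.0])
rho = np.concatenate([mu, std])   # 2 * n parameters for an n-dimensional Gaussian

# Diagonal Gaussian log pdf: sum of independent 1-D Gaussian log pdfs
theta = np.array([0.0, 1.0])
log_pdf = np.sum(-0.5 * np.log(2.0 * np.pi * std ** 2)
                 - 0.5 * ((theta - mu) / std) ** 2)
```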

__init__
(mu, std)[source]¶ Constructor.
Parameters:  mu (np.ndarray) – initial mean of the distribution;
 std (np.ndarray) – initial vector of standard deviations for each variable of the distribution.

sample
()[source]¶ Draw a sample from the distribution.
Returns: A random vector sampled from the distribution.

log_pdf
(theta)[source]¶ Compute the logarithm of the probability density function in the specified point.
Parameters: theta (np.ndarray) – the point where the log pdf is calculated. Returns: The value of the log pdf in the specified point.

__call__
(theta)[source]¶ Compute the probability density function in the specified point.
Parameters: theta (np.ndarray) – the point where the pdf is calculated. Returns: The value of the pdf in the specified point.

mle
(theta, weights=None)[source]¶ Compute the (weighted) maximum likelihood estimate of the points, and update the distribution accordingly.
Parameters:  theta (np.ndarray) – a set of points, every row is a sample
 weights (np.ndarray, None) – a vector of weights. If specified the weighted maximum likelihood estimate is computed instead of the plain maximum likelihood. The number of elements of this vector must be equal to the number of rows of the theta matrix.

diff_log
(theta)[source]¶ Compute the gradient of the logarithm of the probability density function in the specified point.
Parameters: theta (np.ndarray) – the point where the gradient of the log pdf is calculated.
Returns: The gradient of the log pdf in the specified point.

set_parameters
(rho)[source]¶ Setter.
Parameters: rho (np.ndarray) – the vector of the new parameters to be used by the distribution

parameters_size
¶ Property.
Returns: The size of the distribution parameters.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overridden to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

diff
(theta)¶ Compute the derivative of the probability density function in the specified point. It is normally computed exploiting the derivative of the logarithm of the probability density function, via the likelihood ratio trick, i.e.:
\[\nabla_{\rho}p(\theta)=p(\theta)\nabla_{\rho}\log p(\theta)\]
Parameters: theta (np.ndarray) – the point where the gradient of the pdf is calculated.
Returns: The gradient of the pdf in the specified point.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.


class
mushroom_rl.distributions.gaussian.
GaussianCholeskyDistribution
(mu, sigma)[source]¶ Bases:
mushroom_rl.distributions.distribution.Distribution
Gaussian distribution with full covariance matrix. The parameters vector represents the mean and the Cholesky decomposition of the covariance matrix. This parametrization enforces the covariance matrix to be positive definite.
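The positive-definiteness guarantee can be illustrated in plain numpy (illustrative matrix, not MushroomRL code): any lower-triangular factor L with a positive diagonal yields a symmetric, positive definite covariance L @ L.T, so optimizing the factor never produces an invalid covariance:

```python
import numpy as np

# A lower-triangular Cholesky factor with positive diagonal entries
L = np.array([[1.0, 0.0],
              [0.5, 2.0]])
sigma = L @ L.T               # the covariance implied by the parametrization

eigvals = np.linalg.eigvalsh(sigma)   # all strictly positive
```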

__init__
(mu, sigma)[source]¶ Constructor.
Parameters:  mu (np.ndarray) – initial mean of the distribution;
 sigma (np.ndarray) – initial covariance matrix of the distribution.

sample
()[source]¶ Draw a sample from the distribution.
Returns: A random vector sampled from the distribution.

log_pdf
(theta)[source]¶ Compute the logarithm of the probability density function in the specified point.
Parameters: theta (np.ndarray) – the point where the log pdf is calculated. Returns: The value of the log pdf in the specified point.

__call__
(theta)[source]¶ Compute the probability density function in the specified point.
Parameters: theta (np.ndarray) – the point where the pdf is calculated. Returns: The value of the pdf in the specified point.

mle
(theta, weights=None)[source]¶ Compute the (weighted) maximum likelihood estimate of the points, and update the distribution accordingly.
Parameters:  theta (np.ndarray) – a set of points, every row is a sample
 weights (np.ndarray, None) – a vector of weights. If specified the weighted maximum likelihood estimate is computed instead of the plain maximum likelihood. The number of elements of this vector must be equal to the number of rows of the theta matrix.

diff_log
(theta)[source]¶ Compute the gradient of the logarithm of the probability density function in the specified point.
Parameters: theta (np.ndarray) – the point where the gradient of the log pdf is calculated.
Returns: The gradient of the log pdf in the specified point.

set_parameters
(rho)[source]¶ Setter.
Parameters: rho (np.ndarray) – the vector of the new parameters to be used by the distribution

parameters_size
¶ Property.
Returns: The size of the distribution parameters.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overridden to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

diff
(theta)¶ Compute the derivative of the probability density function in the specified point. It is normally computed exploiting the derivative of the logarithm of the probability density function, via the likelihood ratio trick, i.e.:
\[\nabla_{\rho}p(\theta)=p(\theta)\nabla_{\rho}\log p(\theta)\]
Parameters: theta (np.ndarray) – the point where the gradient of the pdf is calculated.
Returns: The gradient of the pdf in the specified point.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

Environments¶
In mushroom_rl we distinguish between two different types of environment classes:
 proper environments
 generators
While environments directly implement the Environment
interface, generators
are a set of methods used to generate finite Markov chains that represent a
specific environment, e.g., grid worlds.
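Concretely, what a generator produces can be sketched in plain numpy for a tiny two-state, two-action chain (made-up values, not an actual MushroomRL generator): a transition tensor p[s, a, s'] whose rows are probability distributions, plus a matching reward tensor:

```python
import numpy as np

# Transition tensor p[s, a, s'] for a two-state, two-action Markov chain
p = np.array([[[0.9, 0.1],     # state 0, action 0
               [0.1, 0.9]],    # state 0, action 1
              [[1.0, 0.0],     # state 1, action 0
               [0.0, 1.0]]])   # state 1, action 1

# Reward tensor rew[s, a, s']: reward 1 for landing in state 1
rew = np.zeros((2, 2, 2))
rew[:, :, 1] = 1.0
```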
Environments¶
Atari¶

class
mushroom_rl.environments.atari.
MaxAndSkip
(env, skip, max_pooling=True)[source]¶ Bases:
gym.core.Wrapper

__init__
(env, skip, max_pooling=True)[source]¶ Initialize self. See help(type(self)) for accurate signature.

step
(action)[source]¶ Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.
Accepts an action and returns a tuple (observation, reward, done, info).
Parameters: action (object) – an action provided by the agent
Returns:
 observation (object): agent’s observation of the current environment;
 reward (float): amount of reward returned after previous action;
 done (bool): whether the episode has ended, in which case further step() calls will return undefined results;
 info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning).
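The core of the wrapper's behavior can be sketched in plain numpy (hypothetical frames and rewards; the real wrapper steps the underlying env): the action is repeated over the skipped frames, rewards are summed, and the returned observation is the pixel-wise maximum of the last two frames:

```python
import numpy as np

def max_and_skip(frames, rewards):
    # Sum the rewards collected over the skipped frames and max-pool the
    # last two frames into a single observation.
    total_reward = float(np.sum(rewards))
    obs = np.maximum(frames[-2], frames[-1])
    return obs, total_reward

frames = [np.array([[0, 5]]), np.array([[3, 1]]), np.array([[2, 4]])]
obs, total_reward = max_and_skip(frames, [1.0, 0.0, 1.0])
```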

reset
(**kwargs)[source]¶ Resets the state of the environment and returns an initial observation.
Returns: the initial observation. Return type: observation (object)

close
()¶ Override close in your subclass to perform any necessary cleanup.
Environments will automatically close() themselves when garbage collected or when the program exits.

render
(mode='human', **kwargs)¶ Renders the environment.
The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:
 human: render to the current display or terminal and return nothing. Usually for human consumption.
 rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
 ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).
Note
Make sure that your class’s metadata ‘render.modes’ key includes the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
Parameters: mode (str) – the mode to render with Example:
class MyEnv(Env):
    metadata = {'render.modes': ['human', 'rgb_array']}

    def render(self, mode='human'):
        if mode == 'rgb_array':
            return np.array(...)  # return RGB frame suitable for video
        elif mode == 'human':
            ...  # pop up a window and render
        else:
            super(MyEnv, self).render(mode=mode)  # just raise an exception

seed
(seed=None)¶ Sets the seed for this env’s random number generator(s).
Note
Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.
Returns: The list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
Return type: list<bigint>

unwrapped
¶ Completely unwrap this env.
Returns: The base non-wrapped gym.Env instance Return type: gym.Env


class
mushroom_rl.environments.atari.
LazyFrames
(frames, history_length)[source]¶ Bases:
object
From OpenAI Baselines. https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py

class
mushroom_rl.environments.atari.
Atari
(name, width=84, height=84, ends_at_life=False, max_pooling=True, history_length=4, max_no_op_actions=30)[source]¶ Bases:
mushroom_rl.environments.environment.Environment
The Atari environment as presented in: “Human-level control through deep reinforcement learning”. Mnih et al., 2015.

__init__
(name, width=84, height=84, ends_at_life=False, max_pooling=True, history_length=4, max_no_op_actions=30)[source]¶ Constructor.
Parameters:  name (str) – id name of the Atari game in Gym;
 width (int, 84) – width of the screen;
 height (int, 84) – height of the screen;
 ends_at_life (bool, False) – whether the episode ends when a life is lost or not;
 max_pooling (bool, True) – whether to do max-pooling or average-pooling of the last two frames when using NoFrameskip;
 history_length (int, 4) – number of frames to form a state;
 max_no_op_actions (int, 30) – maximum number of no-op actions to execute at the beginning of an episode.

reset
(state=None)[source]¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set as the current state. Returns: The current state.

step
(action)[source]¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

stop
()[source]¶ Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.
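The bounding described above amounts to an element-wise clip; a minimal sketch (assuming clip semantics, which match the description of minimum and maximum values):

```python
import numpy as np

def bound(x, min_value, max_value):
    # Clip the variable element-wise into [min_value, max_value]
    return np.clip(x, min_value, max_value)

bounded = bound(np.array([-2.0, 0.5, 7.0]), -1.0, 1.0)
```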

info
¶ Returns: An object containing the info of the environment.

seed
(seed)¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

Car on hill¶

class
mushroom_rl.environments.car_on_hill.
CarOnHill
(horizon=100, gamma=0.95)[source]¶ Bases:
mushroom_rl.environments.environment.Environment
The Car On Hill environment as presented in: “Tree-Based Batch Mode Reinforcement Learning”. Ernst D. et al., 2005.

reset
(state=None)[source]¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set as the current state. Returns: The current state.

step
(action)[source]¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.

info
¶ Returns: An object containing the info of the environment.

seed
(seed)¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

stop
()¶ Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

DeepMind Control Suite¶

class
mushroom_rl.environments.dm_control_env.
DMControl
(domain_name, task_name, horizon, gamma, task_kwargs=None, dt=0.01, width_screen=480, height_screen=480, camera_id=0)[source]¶ Bases:
mushroom_rl.environments.environment.Environment
Interface for dm_control suite Mujoco environments. It makes it possible to use every dm_control suite Mujoco environment by simply providing the necessary information.

__init__
(domain_name, task_name, horizon, gamma, task_kwargs=None, dt=0.01, width_screen=480, height_screen=480, camera_id=0)[source]¶ Constructor.
Parameters:  domain_name (str) – name of the environment;
 task_name (str) – name of the task of the environment;
 horizon (int) – the horizon;
 gamma (float) – the discount factor;
 task_kwargs (dict, None) – parameters of the task;
 dt (float, .01) – duration of a control step;
 width_screen (int, 480) – width of the screen;
 height_screen (int, 480) – height of the screen;
 camera_id (int, 0) – position of the camera used to render the environment.

reset
(state=None)[source]¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set as the current state. Returns: The current state.

step
(action)[source]¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

stop
()[source]¶ Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.

info
¶ Returns: An object containing the info of the environment.

seed
(seed)¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

Finite MDP¶

class
mushroom_rl.environments.finite_mdp.
FiniteMDP
(p, rew, mu=None, gamma=0.9, horizon=inf)[source]¶ Bases:
mushroom_rl.environments.environment.Environment
Finite Markov Decision Process.

__init__
(p, rew, mu=None, gamma=0.9, horizon=inf)[source]¶ Constructor.
Parameters:  p (np.ndarray) – transition probability matrix;
 rew (np.ndarray) – reward matrix;
 mu (np.ndarray, None) – initial state probability distribution;
 gamma (float, .9) – discount factor;
 horizon (int, np.inf) – the horizon.
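One transition of such a finite MDP can be sketched in plain numpy (illustrative tensors, not the actual FiniteMDP internals): the next state is drawn from the distribution p[s, a, :] and the reward is read from rew[s, a, s']:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([[[0.9, 0.1], [0.1, 0.9]],    # p[s, a, s']: transition probabilities
              [[1.0, 0.0], [0.0, 1.0]]])
rew = np.zeros((2, 2, 2))
rew[:, :, 1] = 1.0                          # reward for landing in state 1

s, a = 0, 1
s_next = rng.choice(2, p=p[s, a])           # next state ~ p[s, a, :]
r = rew[s, a, s_next]                       # reward of the sampled transition
```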

reset
(state=None)[source]¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set as the current state. Returns: The current state.

step
(action)[source]¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.

info
¶ Returns: An object containing the info of the environment.

seed
(seed)¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

stop
()¶ Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

Grid World¶

class
mushroom_rl.environments.grid_world.
AbstractGridWorld
(mdp_info, height, width, start, goal)[source]¶ Bases:
mushroom_rl.environments.environment.Environment
Abstract class to build a grid world.

__init__
(mdp_info, height, width, start, goal)[source]¶ Constructor.
Parameters:  height (int) – height of the grid;
 width (int) – width of the grid;
 start (tuple) – x-y coordinates of the start;
 goal (tuple) – x-y coordinates of the goal.

reset
(state=None)[source]¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set as the current state. Returns: The current state.

step
(action)[source]¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.

info
¶ Returns: An object containing the info of the environment.

seed
(seed)¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

stop
()¶ Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.


class
mushroom_rl.environments.grid_world.
GridWorld
(height, width, goal, start=(0, 0))[source]¶ Bases:
mushroom_rl.environments.grid_world.AbstractGridWorld
Standard grid world.

__init__
(height, width, goal, start=(0, 0))[source]¶ Constructor.
Parameters:  height (int) – height of the grid;
 width (int) – width of the grid;
 start (tuple) – x-y coordinates of the start;
 goal (tuple) – x-y coordinates of the goal.

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.

info
¶ Returns: An object containing the info of the environment.

reset
(state=None)¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set as the current state. Returns: The current state.

seed
(seed)¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

step
(action)¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

stop
()¶ Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.


class
mushroom_rl.environments.grid_world.
GridWorldVanHasselt
(height=3, width=3, goal=(0, 2), start=(2, 0))[source]¶ Bases:
mushroom_rl.environments.grid_world.AbstractGridWorld
A variant of the grid world as presented in: “Double Q-Learning”. Hasselt H. V., 2010.

__init__
(height=3, width=3, goal=(0, 2), start=(2, 0))[source]¶ Constructor.
Parameters:  height (int) – height of the grid;
 width (int) – width of the grid;
 start (tuple) – x-y coordinates of the start;
 goal (tuple) – x-y coordinates of the goal.

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.

info
¶ Returns: An object containing the info of the environment.

reset
(state=None)¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set as the current state. Returns: The current state.

seed
(seed)¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

step
(action)¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

stop
()¶ Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

Gym¶

class
mushroom_rl.environments.gym_env.
Gym
(name, horizon, gamma)[source]¶ Bases:
mushroom_rl.environments.environment.Environment
Interface for OpenAI Gym environments. It makes it possible to use every Gym environment by simply providing the id, except for the Atari games, which are managed in a separate class.

__init__
(name, horizon, gamma)[source]¶ Constructor.
Parameters:  name (str) – gym id of the environment;
 horizon (int) – the horizon;
 gamma (float) – the discount factor.

reset
(state=None)[source]¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set as the current state. Returns: The current state.

step
(action)[source]¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.

info
¶ Returns: An object containing the info of the environment.

seed
(seed)¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

Inverted pendulum¶

class
mushroom_rl.environments.inverted_pendulum.
InvertedPendulum
(random_start=False, m=1.0, l=1.0, g=9.8, mu=0.01, max_u=5.0, horizon=5000, gamma=0.99)[source]¶ Bases:
mushroom_rl.environments.environment.Environment
The Inverted Pendulum environment (continuous version) as presented in: “Reinforcement Learning In Continuous Time and Space”. Doya K., 2000; “Off-Policy Actor-Critic”. Degris T. et al., 2012; “Deterministic Policy Gradient Algorithms”. Silver D. et al., 2014.

__init__
(random_start=False, m=1.0, l=1.0, g=9.8, mu=0.01, max_u=5.0, horizon=5000, gamma=0.99)[source]¶ Constructor.
Parameters:  random_start (bool, False) – whether to start from a random position or from the horizontal one;
 m (float, 1.0) – mass of the pendulum;
 l (float, 1.0) – length of the pendulum;
 g (float, 9.8) – gravity acceleration constant;
 mu (float, 1e-2) – friction constant of the pendulum;
 max_u (float, 5.0) – maximum allowed input torque;
 horizon (int, 5000) – horizon of the problem;
 gamma (float, .99) – discount factor.

reset
(state=None)[source]¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set as the current state. Returns: The current state.

step
(action)[source]¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

stop
()[source]¶ Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.

info
¶ Returns: An object containing the info of the environment.

seed
(seed)¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

Cart Pole¶

class
mushroom_rl.environments.cart_pole.
CartPole
(m=2.0, M=8.0, l=0.5, g=9.8, mu=0.01, max_u=50.0, noise_u=10.0, horizon=3000, gamma=0.95)[source]¶ Bases:
mushroom_rl.environments.environment.Environment
The Inverted Pendulum on a Cart environment as presented in: “Least-Squares Policy Iteration”. Lagoudakis M. G. and Parr R., 2003.

__init__
(m=2.0, M=8.0, l=0.5, g=9.8, mu=0.01, max_u=50.0, noise_u=10.0, horizon=3000, gamma=0.95)[source]¶ Constructor.
Parameters:  m (float, 2.0) – mass of the pendulum;
 M (float, 8.0) – mass of the cart;
 l (float, .5) – length of the pendulum;
 g (float, 9.8) – gravity acceleration constant;
 mu (float, 1e-2) – friction constant of the pendulum;
 max_u (float, 50.) – maximum allowed input torque;
 noise_u (float, 10.) – maximum noise on the action;
 horizon (int, 3000) – horizon of the problem;
 gamma (float, .95) – discount factor.

reset
(state=None)[source]¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set as the current state. Returns: The current state.

step
(action)[source]¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

stop
()[source]¶ Method used to stop an MDP. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.

info
¶ An object containing the info of the environment.
Type: Returns

seed
(seed)¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

LQR¶

class
mushroom_rl.environments.lqr.
LQR
(A, B, Q, R, max_pos=inf, max_action=inf, random_init=False, episodic=False, gamma=0.9, horizon=50)[source]¶ Bases:
mushroom_rl.environments.environment.Environment
This class implements a Linear-Quadratic Regulator. This task aims to minimize the undesired deviations from nominal values of some controller settings in control problems. The system equations in this task are:
\[x_{t+1} = Ax_t + Bu_t\]where x is the state and u is the control signal.
The reward function is given by:
\[r_t = -\left( x_t^TQx_t + u_t^TRu_t \right)\]“Policy gradient approaches for multi-objective sequential decision making”. Parisi S., Pirotta M., Smacchia N., Bascetta L., Restelli M.. 2014
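A single LQR transition can be sketched in plain Python (2-D lists stand in for the np.ndarray matrices used by the class; the reward is the negative quadratic cost, consistent with minimizing deviations from the nominal values; this is an illustration, not the library code):

```python
def lqr_step(A, B, Q, R, x, u):
    # x_{t+1} = A x_t + B u_t  (matrix-vector products on plain lists)
    n = len(A)
    x_next = [sum(A[i][j] * x[j] for j in range(n)) +
              sum(B[i][j] * u[j] for j in range(len(u))) for i in range(n)]
    # r_t = -(x^T Q x + u^T R u): quadratic cost on state and control
    xQx = sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))
    uRu = sum(u[i] * R[i][j] * u[j]
              for i in range(len(u)) for j in range(len(u)))
    return x_next, -(xQx + uRu)

x1, r = lqr_step([[1.0]], [[1.0]], [[1.0]], [[0.1]], [2.0], [-1.0])
# x1 = [1.0], r = -(4.0 + 0.1) = -4.1
```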

__init__
(A, B, Q, R, max_pos=inf, max_action=inf, random_init=False, episodic=False, gamma=0.9, horizon=50)[source]¶ Constructor.
Parameters:  A (np.ndarray) – the state dynamics matrix;
 B (np.ndarray) – the action dynamics matrix;
 Q (np.ndarray) – reward weight matrix for state;
 R (np.ndarray) – reward weight matrix for action;
 max_pos (float, np.inf) – maximum value of the state;
 max_action (float, np.inf) – maximum value of the action;
 random_init (bool, False) – start from a random state;
 episodic (bool, False) – end the episode when the state goes over the threshold;
 gamma (float, 0.9) – discount factor;
 horizon (int, 50) – horizon of the mdp.

static
generate
(dimensions, max_pos=inf, max_action=inf, eps=0.1, index=0, scale=1.0, random_init=False, episodic=False, gamma=0.9, horizon=50)[source]¶ Factory method that generates an LQR with identity dynamics and symmetric reward matrices.
Parameters:  dimensions (int) – number of state-action dimensions;
 max_pos (float, np.inf) – maximum value of the state;
 max_action (float, np.inf) – maximum value of the action;
 eps (double, .1) – reward matrix weights specifier;
 index (int, 0) – selector for the principal state;
 scale (float, 1.0) – scaling factor for the reward function;
 random_init (bool, False) – start from a random state;
 episodic (bool, False) – end the episode when the state goes over the threshold;
 gamma (float, .9) – discount factor;
 horizon (int, 50) – horizon of the mdp.

reset
(state=None)[source]¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set to the current state. Returns: The current state.

step
(action)[source]¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.

info
¶ An object containing the info of the environment.
Type: Returns

seed
(seed)¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

stop
()¶ Method used to stop an mdp. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

Mujoco¶

class
mushroom_rl.environments.mujoco.
ObservationType
[source]¶ Bases:
enum.Enum
An enum indicating the type of data that should be added to the observation of the environment; it can be Joint/Body/Site positions and velocities.

class
mushroom_rl.environments.mujoco.
MuJoCo
(file_name, actuation_spec, observation_spec, gamma, horizon, n_substeps=1, n_intermediate_steps=1, additional_data_spec=None, collision_groups=None)[source]¶ Bases:
mushroom_rl.environments.environment.Environment
Class to create a Mushroom environment using the MuJoCo simulator.

__init__
(file_name, actuation_spec, observation_spec, gamma, horizon, n_substeps=1, n_intermediate_steps=1, additional_data_spec=None, collision_groups=None)[source]¶ Constructor.
Parameters:  file_name (string) – The path to the XML file with which the environment should be created;
 actuation_spec (list) – A list specifying the names of the joints which should be controllable by the agent. Can be left empty when all actuators should be used;
 observation_spec (list) – A list containing the names of data that should be made available to the agent as an observation and their type (ObservationType). An entry in the list is given by: (name, type);
 gamma (float) – The discounting factor of the environment;
 horizon (int) – The maximum horizon for the environment;
 n_substeps (int) – The number of substeps to use by the MuJoCo simulator. An action given by the agent will be applied for n_substeps before the agent receives the next observation and can act accordingly;
 n_intermediate_steps (int) – The number of steps between every action taken by the agent. Similar to n_substeps but allows the user to modify, control and access intermediate states.
 additional_data_spec (list) – A list containing the data fields of interest, which should be read from or written to during simulation. The entries are given as the following tuples: (key, name, type) key is a string for later referencing in the “read_data” and “write_data” methods. The name is the name of the object in the XML specification and the type is the ObservationType;
 collision_groups (list) – A list containing groups of geoms for which collisions should be checked during simulation via check_collision. The entries are given as: (key, geom_names), where key is a string for later referencing in the “check_collision” method, and geom_names is a list of geom names in the XML specification.

seed
(seed)[source]¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

reset
(state=None)[source]¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set to the current state. Returns: The current state.

step
(action)[source]¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

stop
()[source]¶ Method used to stop an mdp. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

_preprocess_action
(action)[source]¶ Compute a transformation of the action provided to the environment.
Parameters: action (np.ndarray) – numpy array with the actions provided to the environment. Returns: The action to be used for the current step

_compute_action
(action)[source]¶ Compute a transformation of the action at every intermediate step. Useful to add control signals simulated directly in python.
Parameters: action (np.ndarray) – numpy array with the actions provided at every step. Returns: The action to be set in the actual MuJoCo simulation.

_simulation_pre_step
()[source]¶ Allows information to be accessed and changed at every intermediate step before taking a step in the MuJoCo simulation. Can be useful to apply an external force/torque to the specified bodies, e.g. to apply a force over X to the torso:
force = [200, 0, 0]
torque = [0, 0, 0]
self.sim.data.xfrc_applied[self.sim.model._body_name2id["torso"], :] = force + torque

_simulation_post_step
()[source]¶ Allows information to be accessed at every intermediate step after taking a step in the MuJoCo simulation. Can be useful to average forces over all intermediate steps.

_read_data
(name)[source]¶ Read data from the MuJoCo data structure.
Parameters: name (string) – A name referring to an entry contained in the additional_data_spec list handed to the constructor. Returns: The desired data as a one-dimensional numpy array.

_write_data
(name, value)[source]¶ Write data to the MuJoCo data structure.
Parameters:  name (string) – A name referring to an entry contained in the additional_data_spec list handed to the constructor;
 value (ndarray) – The data that should be written.

_check_collision
(group1, group2)[source]¶ Check for collision between the specified groups.
Parameters:  group1 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor;
 group2 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor.
Returns: A flag indicating whether a collision occurred between the given groups or not.

_get_collision_force
(group1, group2)[source]¶ Returns the collision force and torques between the specified groups.
Parameters:  group1 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor;
 group2 (string) – A name referring to an entry contained in the collision_groups list handed to the constructor.
Returns: A 6D vector specifying the collision forces/torques[3D force + 3D torque] between the given groups. Vector of 0’s in case there was no collision. http://mujoco.org/book/programming.html#siContact

_reward
(state, action, next_state)[source]¶ Compute the reward based on the given transition.
Parameters:  state (np.array) – the current state of the system;
 action (np.array) – the action that is applied in the current state;
 next_state (np.array) – the state reached after applying the given action.
Returns: The reward as a floating point scalar value.

_is_absorbing
(state)[source]¶ Check whether the given state is an absorbing state or not.
Parameters: state (np.array) – the state of the system. Returns: A boolean flag indicating whether this state is absorbing or not.

_load_simulation
(file_name, n_substeps)[source]¶ Load the MuJoCo model. Can be overridden to provide custom load functions.
Parameters: file_name – The path to the XML file with which the environment should be created; Returns: The loaded mujoco model.

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.

info
¶ An object containing the info of the environment.
Type: Returns

Puddle World¶

class
mushroom_rl.environments.puddle_world.
PuddleWorld
(start=None, goal=None, goal_threshold=0.1, noise_step=0.025, noise_reward=0, reward_goal=0.0, thrust=0.05, puddle_center=None, puddle_width=None, gamma=0.99, horizon=5000)[source]¶ Bases:
mushroom_rl.environments.environment.Environment
Puddle world as presented in: “Off-Policy Actor-Critic”. Degris T. et al.. 2012.

__init__
(start=None, goal=None, goal_threshold=0.1, noise_step=0.025, noise_reward=0, reward_goal=0.0, thrust=0.05, puddle_center=None, puddle_width=None, gamma=0.99, horizon=5000)[source]¶ Constructor.
Parameters:  start (np.array, None) – starting position of the agent;
 goal (np.array, None) – goal position;
 goal_threshold (float, .1) – distance threshold of the agent from the goal to consider it reached;
 noise_step (float, .025) – noise in actions;
 noise_reward (float, 0) – standard deviation of gaussian noise in reward;
 reward_goal (float, 0) – reward obtained reaching goal state;
 thrust (float, .05) – distance walked during each action;
 puddle_center (np.array, None) – center of the puddle;
 puddle_width (np.array, None) – width of the puddle;
 gamma (float, .99) – discount factor;
 horizon (int, 5000) – horizon of the problem.

reset
(state=None)[source]¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set to the current state. Returns: The current state.

step
(action)[source]¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

stop
()[source]¶ Method used to stop an mdp. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.

info
¶ An object containing the info of the environment.
Type: Returns

seed
(seed)¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

Segway¶

class
mushroom_rl.environments.segway.
Segway
(random_start=False)[source]¶ Bases:
mushroom_rl.environments.environment.Environment
The Segway environment (continuous version) as presented in: “Deep Learning for Actor-Critic Reinforcement Learning”. Xueli Jia. 2015.

__init__
(random_start=False)[source]¶ Constructor.
Parameters: random_start (bool, False) – whether to start from a random position or from the horizontal one.

reset
(state=None)[source]¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set to the current state. Returns: The current state.

step
(action)[source]¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.

info
¶ An object containing the info of the environment.
Type: Returns

seed
(seed)¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

stop
()¶ Method used to stop an mdp. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

Ship steering¶

class
mushroom_rl.environments.ship_steering.
ShipSteering
(small=True, n_steps_action=3)[source]¶ Bases:
mushroom_rl.environments.environment.Environment
The Ship Steering environment as presented in: “Hierarchical Policy Gradient Algorithms”. Ghavamzadeh M. and Mahadevan S.. 2013.

__init__
(small=True, n_steps_action=3)[source]¶ Constructor.
Parameters:  small (bool, True) – whether to use a small state space or not.
 n_steps_action (int, 3) – number of integration intervals for each step of the mdp.

reset
(state=None)[source]¶ Reset the current state.
Parameters: state (np.ndarray, None) – the state to set to the current state. Returns: The current state.

step
(action)[source]¶ Move the agent from its current state according to the action.
Parameters: action (np.ndarray) – the action to execute. Returns: The state reached by the agent executing action
in its current state, the reward obtained in the transition and a flag to signal if the next state is absorbing. Also an additional dictionary is returned (possibly empty).

stop
()[source]¶ Method used to stop an mdp. Useful when dealing with real-world environments, simulators, or when using OpenAI Gym rendering.

static
_bound
(x, min_value, max_value)¶ Method used to bound state and action variables.
Parameters:  x – the variable to bound;
 min_value – the minimum value;
 max_value – the maximum value;
Returns: The bounded variable.

info
¶ An object containing the info of the environment.
Type: Returns

seed
(seed)¶ Set the seed of the environment.
Parameters: seed (float) – the value of the seed.

Generators¶
Grid world¶

mushroom_rl.environments.generators.grid_world.
generate_grid_world
(grid, prob, pos_rew, neg_rew, gamma=0.9, horizon=100)[source]¶ This Grid World generator requires a .txt file specifying the shape of the grid world and its cells. There are five types of cells: ‘S’ is the starting position of the agent; ‘G’ is the goal state; ‘.’ is a normal cell; ‘*’ is a hole: when the agent steps on a hole, it receives a negative reward and the episode ends; ‘#’ is a wall: when the agent tries to step on a wall, it remains in its current state. The initial state distribution is uniform among all the provided initial states.
The grid is expected to be rectangular.
Parameters:  grid (str) – the path of the file containing the grid structure;
 prob (float) – probability of success of an action;
 pos_rew (float) – reward obtained in goal states;
 neg_rew (float) – reward obtained in “hole” states;
 gamma (float, .9) – discount factor;
 horizon (int, 100) – the horizon.
Returns: A FiniteMDP object built with the provided parameters.

mushroom_rl.environments.generators.grid_world.
parse_grid
(grid)[source]¶ Parse the grid file:
Parameters: grid (str) – the path of the file containing the grid structure; Returns: A list containing the grid structure.
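The grid file format described above can be illustrated with a toy parser (a plain-Python sketch, not the library's parse_grid):

```python
def parse_grid_text(text):
    # Split a grid specification into a list of rows of cell characters.
    grid = [list(row) for row in text.strip().splitlines()]
    assert all(len(row) == len(grid[0]) for row in grid), 'grid must be rectangular'
    return grid

# 'S' start, 'G' goal, '#' wall, '*' hole, '.' normal cell
grid = parse_grid_text("S..\n.#*\n..G")
```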

mushroom_rl.environments.generators.grid_world.
compute_probabilities
(grid_map, cell_list, prob)[source]¶ Compute the transition probability matrix.
Parameters:  grid_map (list) – list containing the grid structure;
 cell_list (list) – list of non-wall cells;
 prob (float) – probability of success of an action.
Returns: The transition probability matrix;

mushroom_rl.environments.generators.grid_world.
compute_reward
(grid_map, cell_list, pos_rew, neg_rew)[source]¶ Compute the reward matrix.
Parameters:  grid_map (list) – list containing the grid structure;
 cell_list (list) – list of non-wall cells;
 pos_rew (float) – reward obtained in goal states;
 neg_rew (float) – reward obtained in “hole” states;
Returns: The reward matrix.
Simple chain¶

mushroom_rl.environments.generators.simple_chain.
generate_simple_chain
(state_n, goal_states, prob, rew, mu=None, gamma=0.9, horizon=100)[source]¶ Simple chain generator.
Parameters:  state_n (int) – number of states;
 goal_states (list) – list of goal states;
 prob (float) – probability of success of an action;
 rew (float) – reward obtained in goal states;
 mu (np.ndarray) – initial state probability distribution;
 gamma (float, .9) – discount factor;
 horizon (int, 100) – the horizon.
Returns: A FiniteMDP object built with the provided parameters.
Taxi¶

mushroom_rl.environments.generators.taxi.
generate_taxi
(grid, prob=0.9, rew=(0, 1, 3, 15), gamma=0.99, horizon=inf)[source]¶ This Taxi generator requires a .txt file specifying the shape of the grid world and its cells. There are five types of cells: ‘S’ is the starting position where the agent is; ‘G’ is the goal state; ‘.’ is a normal cell; ‘F’ is a passenger: when the agent steps on a passenger cell, it picks the passenger up; ‘#’ is a wall: when the agent tries to step on a wall, it remains in its current state. The initial state distribution is uniform among all the provided initial states. The episode terminates when the agent reaches the goal state. The reward is always 0, except for the goal state, where it depends on the number of collected passengers. Each action has a certain probability of success and, if it fails, the agent moves in a direction perpendicular to the intended one.
The grid is expected to be rectangular.
This problem is inspired by: “Bayesian Q-Learning”. Dearden R. et al.. 1998.
Parameters:  grid (str) – the path of the file containing the grid structure;
 prob (float, .9) – probability of success of an action;
 rew (tuple, (0, 1, 3, 15)) – rewards obtained in goal states;
 gamma (float, .99) – discount factor;
 horizon (int, np.inf) – the horizon.
Returns: A FiniteMDP object built with the provided parameters.

mushroom_rl.environments.generators.taxi.
parse_grid
(grid)[source]¶ Parse the grid file:
Parameters: grid (str) – the path of the file containing the grid structure. Returns: A list containing the grid structure.

mushroom_rl.environments.generators.taxi.
compute_probabilities
(grid_map, cell_list, passenger_list, prob)[source]¶ Compute the transition probability matrix.
Parameters:  grid_map (list) – list containing the grid structure;
 cell_list (list) – list of non-wall cells;
 passenger_list (list) – list of passenger cells;
 prob (float) – probability of success of an action.
Returns: The transition probability matrix;

mushroom_rl.environments.generators.taxi.
compute_reward
(grid_map, cell_list, passenger_list, rew)[source]¶ Compute the reward matrix.
Parameters:  grid_map (list) – list containing the grid structure;
 cell_list (list) – list of non-wall cells;
 passenger_list (list) – list of passenger cells;
 rew (tuple) – rewards obtained in goal states.
Returns: The reward matrix.

mushroom_rl.environments.generators.taxi.
compute_mu
(grid_map, cell_list, passenger_list)[source]¶ Compute the initial states distribution.
Parameters:  grid_map (list) – list containing the grid structure;
 cell_list (list) – list of non-wall cells;
 passenger_list (list) – list of passenger cells.
Returns: The initial states distribution.
Features¶
The features in MushroomRL are 1D arrays computed by applying a specified function to a raw input, e.g. polynomial features of the state of an MDP. MushroomRL supports three types of features:
 basis functions;
 tensor basis functions;
 tiles.
The tensor basis functions are a PyTorch implementation of the standard basis functions. They are less straightforward than the standard ones, but they are faster to compute as they can exploit parallel computing, e.g. GPU acceleration and multi-core systems.
All the types of features are exposed by a single factory method Features that builds the one requested by the user.

mushroom_rl.features.features.
Features
(basis_list=None, tilings=None, tensor_list=None, n_outputs=None, function=None, device=None)[source]¶ Factory method to build the requested type of features. The types are mutually exclusive.
Possible features are tilings (tilings), basis functions (basis_list), tensor basis (tensor_list), and functional mappings (n_outputs and function).
The difference between basis_list and tensor_list is that the former is a list of Python classes, each one evaluating a single element of the feature vector, while the latter consists of a list of PyTorch modules that can be used to build a PyTorch network. The use of tensor_list is a faster way to compute features than basis_list and is suggested when the computation of the requested features is slow (see the Gaussian radial basis function implementation as an example). A functional mapping applies a function to the input, computing an n_outputs-dimensional vector, where the mapping is expressed by function. If function is not provided, the identity is used.
Parameters:  basis_list (list, None) – list of basis functions;
 tilings ([object, list], None) – single object or list of tilings;
 tensor_list (list, None) – list of dictionaries containing the instructions to build the requested tensors;
 n_outputs (int, None) – dimensionality of the feature mapping;
 function (object, None) – a callable function to be used as feature mapping. Only needed when using a functional mapping.
 device (int, None) – where to run the group of tensors. Only needed when using a list of tensors.
Returns: The class implementing the requested type of features.

mushroom_rl.features.features.
get_action_features
(phi_state, action, n_actions)[source]¶ Compute an array of size len(phi_state) * n_actions filled with zeros, except for the elements from len(phi_state) * action to len(phi_state) * (action + 1), which are filled with phi_state. This is used to compute state-action features.
Parameters:  phi_state (np.ndarray) – the feature of the state;
 action (np.ndarray) – the action whose features have to be computed;
 n_actions (int) – the number of actions.
Returns: The state-action features.
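The layout described above can be sketched without NumPy (plain lists stand in for np.ndarray; an illustration of the documented behaviour, not the library code):

```python
def action_features(phi_state, action, n_actions):
    # Zero vector of size len(phi_state) * n_actions, with phi_state copied
    # into the slice belonging to the given action index.
    phi = [0.0] * (len(phi_state) * n_actions)
    start = len(phi_state) * action
    phi[start:start + len(phi_state)] = phi_state
    return phi

print(action_features([0.5, 1.0], action=1, n_actions=3))
# [0.0, 0.0, 0.5, 1.0, 0.0, 0.0]
```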
The factory method returns a class that extends the abstract class FeatureImplementation.
The documentation for every feature type can be found here:
Basis¶
Fourier¶

class
mushroom_rl.features.basis.fourier.
FourierBasis
(low, delta, c, dimensions=None)[source]¶ Bases:
object
Class implementing Fourier basis functions. The value of the feature is computed using the formula:
\[\cos{\left(\pi\, c \cdot (X - m)/\Delta\right)}\]where X is the input, m is the vector of the minimum input values (one for each dimension), Delta is the vector of the maximum differences between two values of the input variables, i.e. delta = high - low, and c is the vector of weights for the state variables.

__init__
(low, delta, c, dimensions=None)[source]¶ Constructor.
Parameters:  low (np.ndarray) – vector of minimum values of the input variables;
 delta (np.ndarray) – vector of the maximum difference between two values of the input variables, i.e. delta = high - low;
 c (np.ndarray) – vector of weights for the state variables;
 dimensions (list, None) – list of the dimensions of the input to be considered by the feature.

static
generate
(low, high, n, dimensions=None)[source]¶ Factory method to build a set of Fourier basis functions.
Parameters:  low (np.ndarray) – vector of minimum values of the input variables;
 high (np.ndarray) – vector of maximum values of the input variables;
 n (int) – number of harmonics to consider for each state variable;
 dimensions (list, None) – list of the dimensions of the input to be considered by the features.
Returns: The list of the generated Fourier basis functions.
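Evaluating a single Fourier basis feature can be sketched in plain Python (an illustration under the assumption that the feature is the cosine of the weighted, normalized input, as in standard Fourier bases; not the library code):

```python
import math

def fourier_feature(x, low, delta, c):
    # cos(pi * sum_i c_i * (x_i - low_i) / delta_i): one Fourier basis element
    arg = sum(ci * (xi - li) / di for ci, xi, li, di in zip(c, x, low, delta))
    return math.cos(math.pi * arg)

v = fourier_feature(x=[0.5], low=[0.0], delta=[1.0], c=[1.0])
# cos(pi * 0.5), i.e. 0 up to floating-point error
```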

Gaussian RBF¶

class
mushroom_rl.features.basis.gaussian_rbf.
GaussianRBF
(mean, scale, dimensions=None)[source]¶ Bases:
object
Class implementing Gaussian radial basis functions. The value of the feature is computed using the formula:
\[e^{-\sum_i \dfrac{(X_i - \mu_i)^2}{\sigma_i}}\]where X is the input, mu is the mean vector and sigma is the scale parameter vector.

__init__
(mean, scale, dimensions=None)[source]¶ Constructor.
Parameters:  mean (np.ndarray) – the mean vector of the feature;
 scale (np.ndarray) – the scale vector of the feature;
 dimensions (list, None) – list of the dimensions of the input to be considered by the feature. The number of dimensions must match the dimensionality of mean and scale.

static
generate
(n_centers, low, high, dimensions=None)[source]¶ Factory method to build uniformly spaced Gaussian radial basis functions with a 25% overlap.
Parameters:  n_centers (list) – list of the number of radial basis functions to be used for each dimension.
 low (np.ndarray) – lowest value for each dimension;
 high (np.ndarray) – highest value for each dimension;
 dimensions (list, None) – list of the dimensions of the input to be considered by the feature. The number of dimensions must match the number of elements in n_centers and low.
Returns: The list of the generated radial basis functions.
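A single Gaussian RBF value, following the formula in the class description, can be sketched in plain Python (illustration only, not the library's NumPy implementation):

```python
import math

def gaussian_rbf(x, mean, scale):
    # exp(-sum_i (x_i - mu_i)^2 / sigma_i): peaks at 1 when x == mean
    return math.exp(-sum((xi - mi) ** 2 / si
                         for xi, mi, si in zip(x, mean, scale)))

v = gaussian_rbf(x=[0.0], mean=[0.0], scale=[1.0])  # peak value
```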

Polynomial¶

class
mushroom_rl.features.basis.polynomial.
PolynomialBasis
(dimensions=None, degrees=None)[source]¶ Bases:
object
Class implementing polynomial basis functions. The value of the feature is computed using the formula:
\[\prod X_i^{d_i}\]where X is the input and d is the vector of the exponents of the polynomial.
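A single polynomial feature as defined above can be sketched in plain Python (illustration only; empty dimension/degree lists give the constant feature, as in the constructor below):

```python
def polynomial_feature(x, dimensions, degrees):
    # prod_i x[dimensions[i]] ** degrees[i]; empty lists give the constant 1.
    v = 1.0
    for dim, deg in zip(dimensions, degrees):
        v *= x[dim] ** deg
    return v

v = polynomial_feature([2.0, 3.0], dimensions=[0, 1], degrees=[2, 1])
# 2^2 * 3 = 12.0
```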

__init__
(dimensions=None, degrees=None)[source]¶ Constructor. If both parameters are None, the constant feature is built.
Parameters:  dimensions (list, None) – list of the dimensions of the input to be considered by the feature;
 degrees (list, None) – list of the degrees of each dimension to be considered by the feature. It must match the number of elements of dimensions.

static
_compute_exponents
(order, n_variables)[source]¶ Find the exponents of a multivariate polynomial expression of order order and n_variables variables.
Parameters:  order (int) – the maximum order of the polynomial;
 n_variables (int) – the number of elements of the input vector.
Yields: The current exponent of the polynomial.

static
generate
(max_degree, input_size)[source]¶ Factory method to build a polynomial of order max_degree based on the first input_size dimensions of the input.
Parameters:  max_degree (int) – maximum degree of the polynomial;
 input_size (int) – size of the input.
Returns: The list of the generated polynomial basis functions.

Tensors¶
Gaussian tensor¶

class
mushroom_rl.features.tensors.gaussian_tensor.
PyTorchGaussianRBF
(mu, scale, dim)[source]¶ Bases:
sphinx.ext.autodoc.importer._MockObject
PyTorch module implementing a Gaussian radial basis function.

static
generate
(n_centers, low, high, dimensions=None)[source]¶ Factory method that generates the list of dictionaries to build the tensors representing a set of uniformly spaced Gaussian radial basis functions with a 25% overlap.
Parameters:  n_centers (list) – list of the number of radial basis functions to be used for each dimension;
 low (np.ndarray) – lowest value for each dimension;
 high (np.ndarray) – highest value for each dimension;
 dimensions (list, None) – list of the dimensions of the input to be considered by the feature. The number of dimensions must match the number of elements in n_centers and low.
Returns: The list of dictionaries as described above.

Tiles¶

class
mushroom_rl.features.tiles.tiles.
Tiles
(x_range, n_tiles, state_components=None)[source]¶ Bases:
object
Class implementing rectangular tiling. For each point in the state space, this class can be used to compute the index of the corresponding tile.

__init__
(x_range, n_tiles, state_components=None)[source]¶ Constructor.
Parameters:  x_range (list) – list of twoelements lists specifying the range of each state variable;
 n_tiles (list) – list of the number of tiles to be used for each dimension.
 state_components (list, None) – list of the dimensions of the input to be considered by the tiling. The number of elements must match the number of elements in x_range and n_tiles.

static
generate
(n_tilings, n_tiles, low, high, uniform=False)[source]¶ Factory method to build n_tilings tilings of n_tiles tiles with a range between low and high for each dimension.
Parameters:  n_tilings (int) – number of tilings;
 n_tiles (list) – number of tiles for each tilings for each dimension;
 low (np.ndarray) – lowest value for each dimension;
 high (np.ndarray) – highest value for each dimension.
 uniform (bool, False) – if True, the displacement for each tiling will be w/n_tilings, where w is the tile width. Otherwise, the displacement will be k*w/n_tilings, with k = 2i + 1 and i the dimension index.
Returns: The list of the generated tiles.
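Mapping a point to its tile index in one rectangular tiling can be sketched as follows (a plain-Python illustration of the idea; the clamping at the upper edge is an assumption of this sketch, not necessarily the library's behaviour):

```python
def tile_index(x, x_range, n_tiles):
    # For each dimension, find the tile coordinate, then flatten row-major.
    idx = 0
    for xi, (lo, hi), n in zip(x, x_range, n_tiles):
        coord = min(int((xi - lo) / (hi - lo) * n), n - 1)  # clamp upper edge
        idx = idx * n + coord
    return idx

i = tile_index([0.25, 0.9], x_range=[(0.0, 1.0), (0.0, 1.0)], n_tiles=[4, 4])
# dim 0 -> tile 1, dim 1 -> tile 3, flattened: 1 * 4 + 3 = 7
```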

Policy¶

class
mushroom_rl.policy.policy.
Policy
[source]¶ Bases:
mushroom_rl.core.serialization.Serializable
Interface representing a generic policy. A policy is a probability distribution that gives the probability of taking an action given a specified state. A policy is used by MushroomRL agents to interact with the environment.

__call__
(*args)[source]¶ Compute the probability of taking action in a certain state following the policy.
Parameters: *args (list) – list containing a state or a state and an action. Returns: The probability of all actions following the policy in the given state, if the list contains only the state; otherwise, the probability of the given action in the given state following the policy. If the action space is continuous, both state and action must be provided.

draw_action
(state)[source]¶ Sample an action in state using the policy.
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

reset
()[source]¶ Useful when the policy needs a special initialization at the beginning of an episode.

__init__
¶ Initialize self. See help(type(self)) for accurate signature.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.


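The interface above can be sketched as a minimal stand-alone class. This is illustrative plain Python, not the mushroom_rl implementation (in particular, the class name and the serialization machinery of Serializable are omitted): a uniform-random policy over a discrete action set implementing __call__, draw_action and reset.

```python
import numpy as np

class UniformRandomPolicy:
    """Minimal sketch of the Policy interface: a uniform distribution
    over n_actions discrete actions, independent of the state."""
    def __init__(self, n_actions):
        self._n_actions = n_actions

    def __call__(self, *args):
        # With only a state, return the probability of every action;
        # with (state, action), return the probability of that action.
        if len(args) == 1:
            return np.ones(self._n_actions) / self._n_actions
        return 1.0 / self._n_actions

    def draw_action(self, state):
        # Sample an action index according to the (uniform) distribution.
        return np.array([np.random.randint(self._n_actions)])

    def reset(self):
        # No per-episode internal state to re-initialize for this policy.
        pass
```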
class
mushroom_rl.policy.policy.
ParametricPolicy
[source]¶ Bases:
mushroom_rl.policy.policy.Policy
Interface for a generic parametric policy. A parametric policy is a policy that depends on a set of parameters, called the policy weights. If the policy is differentiable, the derivative of the probability for a specified state-action pair can be provided.

diff_log
(state, action)[source]¶ Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:
\[\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the gradient is computed
 action (np.ndarray) – the action where the gradient is computed
Returns: The gradient of the logarithm of the pdf w.r.t. the policy weights

diff
(state, action)[source]¶ Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:
\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the derivative is computed
 action (np.ndarray) – the action where the derivative is computed
Returns: The derivative w.r.t. the policy weights

set_weights
(weights)[source]¶ Setter.
Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy.

weights_size
¶ Property.
Returns: The size of the policy weights.

__call__
(*args)¶ Compute the probability of taking action in a certain state following the policy.
Parameters: *args (list) – list containing a state, or a state and an action. Returns: The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided.

__init__
¶ Initialize self. See help(type(self)) for accurate signature.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Sample an action in
state
using the policy.Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

reset
()¶ Useful when the policy needs a special initialization at the beginning of an episode.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

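The relation between diff and diff_log stated above (the likelihood ratio trick) can be checked numerically. The sketch below is stand-alone numpy code, not mushroom_rl code: a 1-D Gaussian policy whose mean is linear in the state with a single weight theta, with the analytic gradient compared against a central finite difference.

```python
import numpy as np

def pdf(theta, s, a, sigma=1.0):
    # Gaussian pdf of action a under a linear-in-state mean mu(s) = theta * s.
    mu = theta * s
    return np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def diff_log(theta, s, a, sigma=1.0):
    # Closed-form gradient of log pdf w.r.t. theta: (a - theta * s) * s / sigma^2.
    return (a - theta * s) * s / sigma ** 2

theta, s, a = 0.7, 2.0, 1.0

# Likelihood ratio trick: grad p = p * grad log p.
analytic = pdf(theta, s, a) * diff_log(theta, s, a)

# Central finite-difference approximation of grad p for comparison.
eps = 1e-6
numeric = (pdf(theta + eps, s, a) - pdf(theta - eps, s, a)) / (2.0 * eps)
```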
Deterministic policy¶

class
mushroom_rl.policy.deterministic_policy.
DeterministicPolicy
(mu)[source]¶ Bases:
mushroom_rl.policy.policy.ParametricPolicy
Simple parametric policy representing a deterministic policy. As deterministic policies are degenerate probability distributions where all the probability mass is on the deterministic action, they are not differentiable, even if the mean value approximator is differentiable.

__init__
(mu)[source]¶ Constructor.
Parameters: mu (Regressor) – the regressor representing the action to select in each state.

__call__
(state, action)[source]¶ Compute the probability of taking action in a certain state following the policy.
Parameters:  state (np.ndarray) – the state where the probability is computed;  action (np.ndarray) – the action whose probability is computed. Returns: The probability of taking the given action in the given state following the policy.

draw_action
(state)[source]¶ Sample an action in
state
using the policy.Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

set_weights
(weights)[source]¶ Setter.
Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy.

weights_size
¶ Property.
Returns: The size of the policy weights.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

diff
(state, action)¶ Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:
\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the derivative is computed
 action (np.ndarray) – the action where the derivative is computed
Returns: The derivative w.r.t. the policy weights

diff_log
(state, action)¶ Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:
\[\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the gradient is computed
 action (np.ndarray) – the action where the gradient is computed
Returns: The gradient of the logarithm of the pdf w.r.t. the policy weights

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

reset
()¶ Useful when the policy needs a special initialization at the beginning of an episode.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

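A deterministic parametric policy can be sketched with a linear "regressor" in a few lines. This is a stand-alone illustration with made-up names, not the DeterministicPolicy class itself: the action is W @ state, the policy weights are the entries of W, and the probability is a degenerate point mass.

```python
import numpy as np

class LinearDeterministicPolicy:
    """Sketch of a deterministic parametric policy: the action is
    W @ state, and the policy weights are the entries of W."""
    def __init__(self, n_states, n_actions):
        self._W = np.zeros((n_actions, n_states))

    def draw_action(self, state):
        # Deterministic: always return the regressor output.
        return self._W @ state

    def __call__(self, state, action):
        # Degenerate distribution: probability mass 1 on W @ state.
        return 1.0 if np.allclose(self._W @ state, action) else 0.0

    def set_weights(self, weights):
        self._W = weights.reshape(self._W.shape)

    def get_weights(self):
        return self._W.ravel()

    @property
    def weights_size(self):
        return self._W.size
```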
Gaussian policy¶

class
mushroom_rl.policy.gaussian_policy.
AbstractGaussianPolicy
[source]¶ Bases:
mushroom_rl.policy.policy.ParametricPolicy
Abstract class of Gaussian policies.

__call__
(state, action)[source]¶ Compute the probability of taking action in a certain state following the policy.
Parameters:  state (np.ndarray) – the state where the probability is computed;  action (np.ndarray) – the action whose probability is computed. Returns: The probability of taking the given action in the given state following the policy.

draw_action
(state)[source]¶ Sample an action in
state
using the policy.Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

__init__
¶ Initialize self. See help(type(self)) for accurate signature.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

diff
(state, action)¶ Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:
\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the derivative is computed
 action (np.ndarray) – the action where the derivative is computed
Returns: The derivative w.r.t. the policy weights

diff_log
(state, action)¶ Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:
\[\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the gradient is computed
 action (np.ndarray) – the action where the gradient is computed
Returns: The gradient of the logarithm of the pdf w.r.t. the policy weights

get_weights
()¶ Getter.
Returns: The current policy weights.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

reset
()¶ Useful when the policy needs a special initialization at the beginning of an episode.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

set_weights
(weights)¶ Setter.
Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy.

weights_size
¶ Property.
Returns: The size of the policy weights.


class
mushroom_rl.policy.gaussian_policy.
GaussianPolicy
(mu, sigma)[source]¶ Bases:
mushroom_rl.policy.gaussian_policy.AbstractGaussianPolicy
Gaussian policy. This is a differentiable policy for continuous action spaces. The policy samples an action in every state following a Gaussian distribution, where the mean is computed in the state and the covariance matrix is fixed.

__init__
(mu, sigma)[source]¶ Constructor.
Parameters:  mu (Regressor) – the regressor representing the mean w.r.t. the state;
 sigma (np.ndarray) – a square positive definite matrix representing the covariance matrix. The size of this matrix must be n x n, where n is the action dimensionality.

set_sigma
(sigma)[source]¶ Setter.
Parameters: sigma (np.ndarray) – the new covariance matrix. Must be a square positive definite matrix.

diff_log
(state, action)[source]¶ Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:
\[\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the gradient is computed
 action (np.ndarray) – the action where the gradient is computed
Returns: The gradient of the logarithm of the pdf w.r.t. the policy weights

set_weights
(weights)[source]¶ Setter.
Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy.

weights_size
¶ Property.
Returns: The size of the policy weights.

__call__
(state, action)¶ Compute the probability of taking action in a certain state following the policy.
Parameters:  state (np.ndarray) – the state where the probability is computed;  action (np.ndarray) – the action whose probability is computed. Returns: The probability of taking the given action in the given state following the policy.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

diff
(state, action)¶ Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:
\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the derivative is computed
 action (np.ndarray) – the action where the derivative is computed
Returns: The derivative w.r.t. the policy weights

draw_action
(state)¶ Sample an action in
state
using the policy.Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

reset
()¶ Useful when the policy needs a special initialization at the beginning of an episode.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

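For a Gaussian policy with a mean linear in the state, mu(s) = W s, and fixed covariance Sigma, the gradient of the log pdf w.r.t. the mean weights has the closed form Sigma^-1 (a - W s) s^T. The sketch below is stand-alone numpy code under that linear-mean assumption (it is not the GaussianPolicy implementation, which works with a generic Regressor), verified with a finite-difference check.

```python
import numpy as np

def log_pdf(W, s, a, Sigma_inv):
    # log N(a; W s, Sigma), dropping the normalization constant
    # (it does not depend on the weights W).
    d = a - W @ s
    return -0.5 * d @ Sigma_inv @ d

def diff_log(W, s, a, Sigma_inv):
    # Closed-form gradient w.r.t. the mean weights:
    # Sigma^{-1} (a - W s) s^T, with the same shape as W.
    return np.outer(Sigma_inv @ (a - W @ s), s)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
s, a = rng.normal(size=3), rng.normal(size=2)
Sigma_inv = np.linalg.inv(np.array([[0.5, 0.1],
                                    [0.1, 0.3]]))

grad = diff_log(W, s, a, Sigma_inv)

# Finite-difference check, entry by entry.
eps = 1e-6
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (log_pdf(Wp, s, a, Sigma_inv)
                     - log_pdf(Wm, s, a, Sigma_inv)) / (2.0 * eps)
```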

class
mushroom_rl.policy.gaussian_policy.
DiagonalGaussianPolicy
(mu, std)[source]¶ Bases:
mushroom_rl.policy.gaussian_policy.AbstractGaussianPolicy
Gaussian policy with learnable standard deviation. The covariance matrix is constrained to be diagonal, where the diagonal is the squared standard deviation vector. This is a differentiable policy for continuous action spaces. This policy is similar to the Gaussian policy, but the weights also include the standard deviation.

__init__
(mu, std)[source]¶ Constructor.
Parameters:  mu (Regressor) – the regressor representing the mean w.r.t. the state;
 std (np.ndarray) – a vector of standard deviations. The length of this vector must be equal to the action dimensionality.

set_std
(std)[source]¶ Setter.
Parameters: std (np.ndarray) – the new standard deviation vector. Its length must be equal to the action dimensionality, and all its entries must be positive.

diff_log
(state, action)[source]¶ Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:
\[\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the gradient is computed
 action (np.ndarray) – the action where the gradient is computed
Returns: The gradient of the logarithm of the pdf w.r.t. the policy weights

set_weights
(weights)[source]¶ Setter.
Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy.

weights_size
¶ Property.
Returns: The size of the policy weights.

__call__
(state, action)¶ Compute the probability of taking action in a certain state following the policy.
Parameters:  state (np.ndarray) – the state where the probability is computed;  action (np.ndarray) – the action whose probability is computed. Returns: The probability of taking the given action in the given state following the policy.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

diff
(state, action)¶ Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:
\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the derivative is computed
 action (np.ndarray) – the action where the derivative is computed
Returns: The derivative w.r.t. the policy weights

draw_action
(state)¶ Sample an action in
state
using the policy.Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

reset
()¶ Useful when the policy needs a special initialization at the beginning of an episode.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

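Since the covariance is diag(std^2), the extra gradient components w.r.t. the standard deviation vector have the element-wise closed form (a - mu)^2 / std^3 - 1 / std. The stand-alone numpy sketch below (not the DiagonalGaussianPolicy implementation) derives and checks this formula against a finite difference.

```python
import numpy as np

def log_pdf(mu, std, a):
    # log N(a; mu, diag(std^2)): sum of independent 1-D Gaussian log pdfs.
    return np.sum(-0.5 * ((a - mu) / std) ** 2 - np.log(std)
                  - 0.5 * np.log(2.0 * np.pi))

def diff_log_std(mu, std, a):
    # Element-wise gradient of log N(a; mu, diag(std^2)) w.r.t. std:
    # (a - mu)^2 / std^3 - 1 / std.
    return (a - mu) ** 2 / std ** 3 - 1.0 / std

mu = np.array([0.2, -0.5])
std = np.array([0.8, 1.3])
a = np.array([0.6, 0.1])

grad = diff_log_std(mu, std, a)

# Finite-difference check along each coordinate of std.
eps = 1e-6
num = np.array([
    (log_pdf(mu, std + eps * e, a) - log_pdf(mu, std - eps * e, a)) / (2.0 * eps)
    for e in np.eye(2)])
```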

class
mushroom_rl.policy.gaussian_policy.
StateStdGaussianPolicy
(mu, std, eps=1e-06)[source]¶ Bases:
mushroom_rl.policy.gaussian_policy.AbstractGaussianPolicy
Gaussian policy with learnable standard deviation. The covariance matrix is constrained to be diagonal, where the diagonal is the squared standard deviation, computed for each state. This is a differentiable policy for continuous action spaces. This policy is similar to the diagonal Gaussian policy, but a parametric regressor is used to compute the standard deviation, so the standard deviation depends on the current state.

__init__
(mu, std, eps=1e-06)[source]¶ Constructor.
Parameters:  mu (Regressor) – the regressor representing the mean w.r.t. the state;
 std (Regressor) – the regressor representing the standard deviations w.r.t. the state. The output dimensionality of the regressor must be equal to the action dimensionality;
 eps (float, 1e-6) – a positive constant added to the variance to ensure that it is always greater than zero.

diff_log
(state, action)[source]¶ Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:
\[\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the gradient is computed
 action (np.ndarray) – the action where the gradient is computed
Returns: The gradient of the logarithm of the pdf w.r.t. the policy weights

set_weights
(weights)[source]¶ Setter.
Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy.

weights_size
¶ Property.
Returns: The size of the policy weights.

__call__
(state, action)¶ Compute the probability of taking action in a certain state following the policy.
Parameters:  state (np.ndarray) – the state where the probability is computed;  action (np.ndarray) – the action whose probability is computed. Returns: The probability of taking the given action in the given state following the policy.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

diff
(state, action)¶ Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:
\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the derivative is computed
 action (np.ndarray) – the action where the derivative is computed
Returns: The derivative w.r.t. the policy weights

draw_action
(state)¶ Sample an action in
state
using the policy.Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

reset
()¶ Useful when the policy needs a special initialization at the beginning of an episode.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.


class
mushroom_rl.policy.gaussian_policy.
StateLogStdGaussianPolicy
(mu, log_std)[source]¶ Bases:
mushroom_rl.policy.gaussian_policy.AbstractGaussianPolicy
Gaussian policy with learnable standard deviation. The covariance matrix is constrained to be diagonal, where the diagonal is obtained by an exponential transformation of the logarithm of the standard deviation computed in each state. This is a differentiable policy for continuous action spaces. This policy is similar to the StateStdGaussianPolicy, but here the regressor represents the logarithm of the standard deviation.

diff_log
(state, action)[source]¶ Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:
\[\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the gradient is computed
 action (np.ndarray) – the action where the gradient is computed
Returns: The gradient of the logarithm of the pdf w.r.t. the policy weights

set_weights
(weights)[source]¶ Setter.
Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy.

weights_size
¶ Property.
Returns: The size of the policy weights.

__call__
(state, action)¶ Compute the probability of taking action in a certain state following the policy.
Parameters:  state (np.ndarray) – the state where the probability is computed;  action (np.ndarray) – the action whose probability is computed. Returns: The probability of taking the given action in the given state following the policy.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

diff
(state, action)¶ Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:
\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the derivative is computed
 action (np.ndarray) – the action where the derivative is computed
Returns: The derivative w.r.t. the policy weights

draw_action
(state)¶ Sample an action in
state
using the policy.Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

reset
()¶ Useful when the policy needs a special initialization at the beginning of an episode.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

Noise policy¶

class
mushroom_rl.policy.noise_policy.
OrnsteinUhlenbeckPolicy
(mu, sigma, theta, dt, x0=None)[source]¶ Bases:
mushroom_rl.policy.policy.ParametricPolicy
Ornstein-Uhlenbeck process as implemented in: https://github.com/openai/baselines/blob/master/baselines/ddpg/noise.py.
This policy is commonly used in the Deep Deterministic Policy Gradient algorithm.

__init__
(mu, sigma, theta, dt, x0=None)[source]¶ Constructor.
Parameters:  mu (Regressor) – the regressor representing the mean w.r.t. the state;
 sigma (np.ndarray) – average magnitude of the random fluctuations per square-root time;
 theta (float) – rate of mean reversion;
 dt (float) – time interval;
 x0 (np.ndarray, None) – initial values of noise.

__call__
(state, action)[source]¶ Compute the probability of taking action in a certain state following the policy.
Parameters:  state (np.ndarray) – the state where the probability is computed;  action (np.ndarray) – the action whose probability is computed. Returns: The probability of taking the given action in the given state following the policy.

draw_action
(state)[source]¶ Sample an action in
state
using the policy.Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

set_weights
(weights)[source]¶ Setter.
Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy.

weights_size
¶ Property.
Returns: The size of the policy weights.

reset
()[source]¶ Useful when the policy needs a special initialization at the beginning of an episode.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

diff
(state, action)¶ Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed w.r.t. the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:
\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the derivative is computed
 action (np.ndarray) – the action where the derivative is computed
Returns: The derivative w.r.t. the policy weights

diff_log
(state, action)¶ Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:
\[\nabla_{\theta}\log p(s,a)\]Parameters:  state (np.ndarray) – the state where the gradient is computed
 action (np.ndarray) – the action where the gradient is computed
Returns: The gradient of the logarithm of the pdf w.r.t. the policy weights

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

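The discretized Ornstein-Uhlenbeck update used in the referenced baselines implementation is an Euler-Maruyama step: x' = x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1). The sketch below is stand-alone numpy code with a constant mean mu (the class above instead takes a Regressor for mu), simulating a short noise trace as would be added to a DDPG action.

```python
import numpy as np

def ou_step(x, mu, theta, sigma, dt, rng):
    # One Euler-Maruyama step of the Ornstein-Uhlenbeck process:
    # x' = x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1).
    return (x + theta * (mu - x) * dt
            + sigma * np.sqrt(dt) * rng.standard_normal(x.shape))

# Simulate a short mean-reverting noise trace around mu = 0.
rng = np.random.default_rng(0)
x = np.zeros(1)
trace = []
for _ in range(1000):
    x = ou_step(x, mu=np.zeros(1), theta=0.15, sigma=0.2, dt=1e-2, rng=rng)
    trace.append(x[0])
```

With theta > 0 the noise reverts toward mu, giving temporally correlated exploration instead of independent Gaussian noise at each step.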
TD policy¶

class
mushroom_rl.policy.td_policy.
TDPolicy
[source]¶ Bases:
mushroom_rl.policy.policy.Policy

__call__
(*args)¶ Compute the probability of taking action in a certain state following the policy.
Parameters: *args (list) – list containing a state, or a state and an action. Returns: The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy. If the action space is continuous, state and action must be provided.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Sample an action in state using the policy.
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

reset
()¶ Useful when the policy needs a special initialization at the beginning of an episode.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.


class
mushroom_rl.policy.td_policy.
EpsGreedy
(epsilon)[source]¶ Bases:
mushroom_rl.policy.td_policy.TDPolicy
Epsilon greedy policy.

__init__
(epsilon)[source]¶ Constructor.
Parameters: epsilon (Parameter) – the exploration coefficient. It indicates the probability of performing a random action in the current step.
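For intuition, the decision rule behind EpsGreedy can be sketched in plain NumPy (an illustrative sketch, not the library implementation; the helper name `eps_greedy_action` is invented here):

```python
import numpy as np

def eps_greedy_action(q_values, epsilon, rng):
    # With probability epsilon pick a uniformly random action,
    # otherwise the greedy (argmax) one.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = np.array([0.1, 0.9, 0.3])
# epsilon = 0 is fully greedy, so action 1 is always selected.
assert eps_greedy_action(q, 0.0, rng) == 1
```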

__call__
(*args)[source]¶ Compute the probability of taking action in a certain state following the policy.
Parameters: *args (list) – list containing a state or a state and an action. Returns: The probability of all actions following the policy in the given state if the list contains only the state; otherwise, the probability of the given action in the given state following the policy. If the action space is continuous, both state and action must be provided.

draw_action
(state)[source]¶ Sample an action in state using the policy.
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

set_epsilon
(epsilon)[source]¶ Setter.
Parameters: epsilon (Parameter) – the exploration coefficient. It indicates the probability of performing a random action in the current step.

update
(*idx)[source]¶ Update the value of the epsilon parameter at the provided index (e.g. in case of different values of epsilon for each visited state according to the number of visits).
Parameters: *idx (list) – index of the parameter to be updated.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

get_q
()¶ Returns: The approximator used by the policy.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

reset
()¶ Useful when the policy needs a special initialization at the beginning of an episode.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

set_q
(approximator)¶ Parameters: approximator (object) – the approximator to use.


class
mushroom_rl.policy.td_policy.
Boltzmann
(beta)[source]¶ Bases:
mushroom_rl.policy.td_policy.TDPolicy
Boltzmann softmax policy.

__init__
(beta)[source]¶ Constructor.
Parameters: beta (Parameter) – the inverse of the temperature distribution. As the temperature approaches infinity, the policy becomes more and more random. As the temperature approaches 0.0, the policy becomes more and more greedy.
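The action distribution defined by the Boltzmann policy, pi(a|s) ∝ exp(beta · Q(s, a)), can be sketched in plain NumPy (illustrative only; the helper name `boltzmann_probs` is invented here):

```python
import numpy as np

def boltzmann_probs(q_values, beta):
    # pi(a|s) proportional to exp(beta * Q(s, a)); beta is the inverse temperature.
    z = beta * q_values
    z = z - z.max()           # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 2.0, 0.5])
# beta = 0 (infinite temperature) gives a uniform distribution;
# large beta (low temperature) concentrates on the greedy action.
assert np.allclose(boltzmann_probs(q, 0.0), 1 / 3)
assert boltzmann_probs(q, 50.0).argmax() == 1
```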

__call__
(*args)[source]¶ Compute the probability of taking action in a certain state following the policy.
Parameters: *args (list) – list containing a state or a state and an action. Returns: The probability of all actions following the policy in the given state if the list contains only the state; otherwise, the probability of the given action in the given state following the policy. If the action space is continuous, both state and action must be provided.

draw_action
(state)[source]¶ Sample an action in state using the policy.
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

set_beta
(beta)[source]¶ Setter.
Parameters: beta (Parameter) – the inverse of the temperature distribution.

update
(*idx)[source]¶ Update the value of the beta parameter at the provided index (e.g. in case of different values of beta for each visited state according to the number of visits).
Parameters: *idx (list) – index of the parameter to be updated.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

get_q
()¶ Returns: The approximator used by the policy.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

reset
()¶ Useful when the policy needs a special initialization at the beginning of an episode.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

set_q
(approximator)¶ Parameters: approximator (object) – the approximator to use.


class
mushroom_rl.policy.td_policy.
Mellowmax
(omega, beta_min=-10.0, beta_max=10.0)[source]¶ Bases:
mushroom_rl.policy.td_policy.Boltzmann
Mellowmax policy. “An Alternative Softmax Operator for Reinforcement Learning”. Asadi K. and Littman M. L., 2017.

__init__
(omega, beta_min=-10.0, beta_max=10.0)[source]¶ Constructor.
Parameters:  omega (Parameter) – the omega parameter of the policy from which beta of the Boltzmann policy is computed;
 beta_min (float, -10.) – one end of the bracketing interval for minimization with Brent’s method;
 beta_max (float, 10.) – the other end of the bracketing interval for minimization with Brent’s method.
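The mellowmax operator underlying this policy, mm_ω(Q) = log(mean(exp(ω · Q)))/ω, interpolates between the mean and the max of the action values. A plain-NumPy sketch (illustrative only; the actual policy additionally solves for beta by root bracketing with Brent's method, which is not shown here):

```python
import numpy as np

def mellowmax(q_values, omega):
    # mm_omega(Q) = log(mean(exp(omega * Q))) / omega,
    # computed in a numerically stable way via the log-sum-exp trick.
    z = omega * q_values
    m = z.max()
    return (m + np.log(np.exp(z - m).mean())) / omega

q = np.array([1.0, 2.0, 0.5])
# Mellowmax lies between the mean (omega -> 0) and the max (omega -> inf).
assert q.mean() < mellowmax(q, 5.0) < q.max()
```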

set_beta
(beta)[source]¶ Setter.
Parameters: beta (Parameter) – the inverse of the temperature distribution.

update
(*idx)[source]¶ Update the value of the beta parameter at the provided index (e.g. in case of different values of beta for each visited state according to the number of visits).
Parameters: *idx (list) – index of the parameter to be updated.

__call__
(*args)¶ Compute the probability of taking action in a certain state following the policy.
Parameters: *args (list) – list containing a state or a state and an action. Returns: The probability of all actions following the policy in the given state if the list contains only the state; otherwise, the probability of the given action in the given state following the policy. If the action space is continuous, both state and action must be provided.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Sample an action in state using the policy.
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

get_q
()¶ Returns: The approximator used by the policy.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

reset
()¶ Useful when the policy needs a special initialization at the beginning of an episode.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

set_q
(approximator)¶ Parameters: approximator (object) – the approximator to use.

Torch policy¶

class
mushroom_rl.policy.torch_policy.
TorchPolicy
(use_cuda)[source]¶ Bases:
mushroom_rl.policy.policy.Policy
Interface for a generic PyTorch policy. A PyTorch policy is a policy implemented as a neural network using PyTorch. Functions ending with ‘_t’ use tensors as input, and also as output when required.

__call__
(state, action)[source]¶ Compute the probability of taking action in a certain state following the policy.
Parameters:  state (np.ndarray) – the state;
 action (np.ndarray) – the action.
Returns: The probability of taking the given action in the given state following the policy.

draw_action
(state)[source]¶ Sample an action in state using the policy.
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

distribution
(state)[source]¶ Compute the policy distribution in the given states.
Parameters: state (np.ndarray) – the set of states where the distribution is computed. Returns: The torch distribution for the provided states.

entropy
(state=None)[source]¶ Compute the entropy of the policy.
Parameters: state (np.ndarray, None) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None. Returns: The value of the entropy of the policy.

draw_action_t
(state)[source]¶ Draw an action given a tensor.
Parameters: state (torch.Tensor) – set of states. Returns: The tensor of the actions to perform in each state.

log_prob_t
(state, action)[source]¶ Compute the logarithm of the probability of taking action in state.
Parameters:  state (torch.Tensor) – set of states.
 action (torch.Tensor) – set of actions.
Returns: The tensor of log-probability.

entropy_t
(state=None)[source]¶ Compute the entropy of the policy.
Parameters: state (torch.Tensor) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None. Returns: The tensor value of the entropy of the policy.

distribution_t
(state)[source]¶ Compute the policy distribution in the given states.
Parameters: state (torch.Tensor) – the set of states where the distribution is computed. Returns: The torch distribution for the provided states.

set_weights
(weights)[source]¶ Setter.
Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy.

parameters
()[source]¶ Returns the trainable policy parameters, as expected by torch optimizers.
Returns: List of parameters to be optimized.

reset
()[source]¶ Useful when the policy needs a special initialization at the beginning of an episode.

use_cuda
¶ True if the policy is using cuda_tensors.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.


class
mushroom_rl.policy.torch_policy.
GaussianTorchPolicy
(network, input_shape, output_shape, std_0=1.0, use_cuda=False, **params)[source]¶ Bases:
mushroom_rl.policy.torch_policy.TorchPolicy
Torch policy implementing a Gaussian policy with trainable standard deviation. The standard deviation is not state-dependent.

__init__
(network, input_shape, output_shape, std_0=1.0, use_cuda=False, **params)[source]¶ Constructor.
Parameters:  network (object) – the network class used to implement the mean regressor;
 input_shape (tuple) – the shape of the state space;
 output_shape (tuple) – the shape of the action space;
 std_0 (float, 1.) – initial standard deviation;
 params (dict) – parameters used by the network constructor.
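Because the standard deviation is state-independent, the entropy of such a Gaussian policy has a closed form, H = k/2 · (1 + log 2π) + Σ log σᵢ, which is why entropy_t can ignore the state. A plain-NumPy sketch of that formula (illustrative only; the helper name `gaussian_entropy` and the log-std parameterization are assumptions for this example):

```python
import numpy as np

def gaussian_entropy(log_sigma):
    # Closed-form entropy of a k-dimensional diagonal Gaussian with
    # state-independent standard deviations sigma_i = exp(log_sigma_i):
    # H = k/2 * (1 + log(2*pi)) + sum_i log_sigma_i
    k = len(log_sigma)
    return 0.5 * k * (1.0 + np.log(2 * np.pi)) + np.sum(log_sigma)

# A 1-D standard Gaussian (sigma = 1) has entropy 0.5 * (1 + log(2*pi)).
assert np.isclose(gaussian_entropy(np.zeros(1)), 0.5 * (1 + np.log(2 * np.pi)))
```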

draw_action_t
(state)[source]¶ Draw an action given a tensor.
Parameters: state (torch.Tensor) – set of states. Returns: The tensor of the actions to perform in each state.

log_prob_t
(state, action)[source]¶ Compute the logarithm of the probability of taking action in state.
Parameters:  state (torch.Tensor) – set of states.
 action (torch.Tensor) – set of actions.
Returns: The tensor of log-probability.

entropy_t
(state=None)[source]¶ Compute the entropy of the policy.
Parameters: state (torch.Tensor) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None. Returns: The tensor value of the entropy of the policy.

distribution_t
(state)[source]¶ Compute the policy distribution in the given states.
Parameters: state (torch.Tensor) – the set of states where the distribution is computed. Returns: The torch distribution for the provided states.

set_weights
(weights)[source]¶ Setter.
Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy.

parameters
()[source]¶ Returns the trainable policy parameters, as expected by torch optimizers.
Returns: List of parameters to be optimized.

__call__
(state, action)¶ Compute the probability of taking action in a certain state following the policy.
Parameters:  state (np.ndarray) – the state;
 action (np.ndarray) – the action.
Returns: The probability of taking the given action in the given state following the policy.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: **attr_dict – dictionary of attributes mapped to the method that should be used to save and load them. If a “!” character is added at the end of the method, the field will be saved only if full_save is set to True.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

distribution
(state)¶ Compute the policy distribution in the given states.
Parameters: state (np.ndarray) – the set of states where the distribution is computed. Returns: The torch distribution for the provided states.

draw_action
(state)¶ Sample an action in state using the policy.
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action sampled from the policy.

entropy
(state=None)¶ Compute the entropy of the policy.
Parameters: state (np.ndarray, None) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None. Returns: The value of the entropy of the policy.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (Path, string) – Relative or absolute path to the agent's save location. Returns: The loaded agent.

reset
()¶ Useful when the policy needs a special initialization at the beginning of an episode.

save
(path, full_save=False)¶ Serialize and save the object to the given path on disk.
Parameters:  path (Path, str) – Relative or absolute path to the object save location;
 full_save (bool) – Flag to specify the amount of data to save for MushroomRL data structures.

save_zip
(zip_file, full_save, folder='')¶ Serialize and save the agent to the given path on disk.
Parameters:  zip_file (ZipFile) – ZipFile where the object needs to be saved;
 full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
 folder (string, '') – subfolder to be used by the save method.

use_cuda
¶ True if the policy is using cuda_tensors.

Solvers¶
Dynamic programming¶

mushroom_rl.solvers.dynamic_programming.
value_iteration
(prob, reward, gamma, eps)[source]¶ Value iteration algorithm to solve a dynamic programming problem.
Parameters:  prob (np.ndarray) – transition probability matrix;
 reward (np.ndarray) – reward matrix;
 gamma (float) – discount factor;
 eps (float) – accuracy threshold.
Returns: The optimal value of each state.
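The fixed-point iteration performed by value iteration can be sketched in plain NumPy (an illustrative sketch, not the library implementation; the [state, action, next_state] indexing of prob and reward is an assumption for this example):

```python
import numpy as np

def value_iteration_sketch(prob, reward, gamma, eps):
    # Iterate V(s) <- max_a sum_s' p(s'|s,a) * (r(s,a,s') + gamma * V(s'))
    # until the sup-norm change drops below eps.
    n_states = prob.shape[0]
    value = np.zeros(n_states)
    while True:
        q = np.sum(prob * (reward + gamma * value), axis=2)
        new_value = q.max(axis=1)
        if np.max(np.abs(new_value - value)) < eps:
            return new_value
        value = new_value

# Two-state chain: action 1 in state 0 moves to state 1, where every
# transition yields reward 1.
prob = np.array([[[1., 0.], [0., 1.]],
                 [[0., 1.], [0., 1.]]])
reward = np.zeros((2, 2, 2))
reward[1, :, :] = 1.0
v = value_iteration_sketch(prob, reward, 0.9, 1e-8)
assert v[1] > v[0] > 0
```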

mushroom_rl.solvers.dynamic_programming.
policy_iteration
(prob, reward, gamma)[source]¶ Policy iteration algorithm to solve a dynamic programming problem.
Parameters:  prob (np.ndarray) – transition probability matrix;
 reward (np.ndarray) – reward matrix;
 gamma (float) – discount factor.
Returns: The optimal value of each state and the optimal policy.
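The alternation of evaluation and improvement steps performed by policy iteration can be sketched in plain NumPy (an illustrative sketch, not the library implementation; the [state, action, next_state] indexing of prob and reward is an assumption for this example):

```python
import numpy as np

def policy_iteration_sketch(prob, reward, gamma):
    # Alternate exact policy evaluation (a linear solve) and greedy
    # policy improvement until the policy stops changing.
    n_states = prob.shape[0]
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Evaluation: solve (I - gamma * P_pi) v = r_pi.
        p_pi = prob[np.arange(n_states), policy]                       # [s, s']
        r_pi = np.sum(p_pi * reward[np.arange(n_states), policy], axis=1)
        value = np.linalg.solve(np.eye(n_states) - gamma * p_pi, r_pi)
        # Improvement: act greedily with respect to the current value.
        q = np.sum(prob * (reward + gamma * value), axis=2)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return value, policy
        policy = new_policy

# Same two-state chain as before: the optimal move in state 0 is action 1.
prob = np.array([[[1., 0.], [0., 1.]],
                 [[0., 1.], [0., 1.]]])
reward = np.zeros((2, 2, 2))
reward[1, :, :] = 1.0
value, policy = policy_iteration_sketch(prob, reward, 0.9)
assert policy[0] == 1
```

Unlike value iteration, the evaluation step here is exact, so the value returned at termination satisfies the Bellman optimality equation up to linear-solver precision.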