Policy search
Policy gradient
- class REINFORCE(mdp_info, policy, optimizer)[source]
Bases:
PolicyGradient
REINFORCE algorithm. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”, Williams R. J., 1992. A usage sketch follows this method reference.
- __init__(mdp_info, policy, optimizer)[source]
Constructor.
- Parameters:
optimizer – the gradient optimizer.
- _compute_gradient(J)[source]
Return the gradient computed by the algorithm.
- Parameters:
J (list) – list of the cumulative discounted rewards for each episode in the dataset.
- Returns:
The gradient computed by the algorithm.
- _step_update(x, u, r)[source]
This function is called at each episode step while parsing the dataset.
- Parameters:
x (np.ndarray) – the state at the current step;
u (np.ndarray) – the action at the current step;
r (np.ndarray) – the reward at the current step.
- _episode_end_update()[source]
This function is called at the end of each episode while parsing the dataset. The implementation depends on the algorithm (e.g. REINFORCE updates some data structures).
- _init_update()[source]
This function is called at the beginning of each episode while parsing the dataset. The implementation depends on the algorithm (e.g. REINFORCE resets some data structures).
- _add_save_attr(**attr_dict)
Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library of the same name. If a “!” character is added at the end of the method name, the field will be saved only if full_save is set to True.
- Parameters:
**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
- _agent_preprocess(state)
Applies all the agent’s preprocessors to the state.
- Parameters:
state (Array) – the state where the agent is.
- Returns:
The preprocessed state.
- _parse(sample)
Utility to parse the sample.
- Parameters:
sample (list) – the current episode step.
- Returns:
A tuple containing state, action, reward, next state, and the absorbing and last flags. If features are provided, the state is preprocessed with them.
- _post_load()
This method can be overwritten to implement logic that is executed after the loading of the agent.
- _update_agent_preprocessor(state)
Updates the stats of all the agent’s preprocessors given the state.
- Parameters:
state (Array) – the state where the agent is.
- _update_parameters(J)
Update the parameters of the policy.
- Parameters:
J (list) – list of the cumulative discounted rewards for each episode in the dataset.
- add_agent_preprocessor(preprocessor)
Add preprocessor to the agent’s preprocessor list. The preprocessors are applied in order.
- Parameters:
preprocessor (object) – state preprocessor to be applied to state variables before feeding them to the agent.
- add_core_preprocessor(preprocessor)
Add preprocessor to the core’s preprocessor list. The preprocessors are applied in order.
- Parameters:
preprocessor (object) – state preprocessor to be applied to state variables before feeding them to the agent.
- copy()
- Returns:
A deepcopy of the agent.
- property core_preprocessors
Access to core’s state preprocessors stored in the agent.
- draw_action(state, policy_state=None)
Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
- Parameters:
state – the state where the agent is;
policy_state – the policy internal state.
- Returns:
The action to be executed.
- episode_start(initial_state, episode_info)
Called by the Core when a new episode starts.
- Parameters:
initial_state (Array) – vector representing the initial state of the environment;
episode_info (dict) – a dictionary containing the information at reset, such as context.
- Returns:
A tuple containing the policy initial state and, optionally, the policy parameters.
- episode_start_vectorized(initial_states, episode_info, start_mask)
Called by the VectorCore when a new episode starts.
- Parameters:
initial_states (Array) – the initial states of the environment;
episode_info (dict) – a dictionary containing the information at reset, such as context;
start_mask (Array) – boolean mask to select the environments that are starting a new episode.
- Returns:
A tuple containing the policy initial states and, optionally, the policy parameters.
- fit(dataset)
Fit step.
- Parameters:
dataset (Dataset) – the dataset.
- classmethod load(path)
Load and deserialize the agent from the given location on disk.
- Parameters:
path (Path, str) – Relative or absolute path to the agent’s save location.
- Returns:
The loaded agent.
- save(path, full_save=False)
Serialize and save the object to the given path on disk.
- Parameters:
path (Path, str) – Relative or absolute path to the object save location;
full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures.
- save_zip(zip_file, full_save, folder='')
Serialize and save the agent to the given path on disk.
- Parameters:
zip_file (ZipFile) – ZipFile where the object needs to be saved;
full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
folder (string, '') – subfolder to be used by the save method.
- set_logger(logger)
Setter that can be used to pass a logger to the algorithm.
- Parameters:
logger (Logger) – the logger to be used by the algorithm.
- stop()
Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.
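A minimal usage sketch for REINFORCE on the LQR toy environment, assuming the public API of a recent MushroomRL release (import paths are assumptions and vary across versions, e.g. AdaptiveOptimizer lives under mushroom_rl.utils.optimizers in 1.x and mushroom_rl.rl_utils.optimizers in 2.x):

    import numpy as np

    from mushroom_rl.algorithms.policy_search import REINFORCE
    from mushroom_rl.approximators import Regressor
    from mushroom_rl.approximators.parametric import LinearApproximator
    from mushroom_rl.core import Core
    from mushroom_rl.environments import LQR
    from mushroom_rl.policy import StateStdGaussianPolicy
    from mushroom_rl.utils.optimizers import AdaptiveOptimizer

    # Toy problem: a 1-dimensional LQR MDP.
    mdp = LQR.generate(dimensions=1)

    # Linear approximators for the mean and standard deviation of a
    # state-dependent Gaussian policy.
    mu = Regressor(LinearApproximator,
                   input_shape=mdp.info.observation_space.shape,
                   output_shape=mdp.info.action_space.shape)
    sigma = Regressor(LinearApproximator,
                      input_shape=mdp.info.observation_space.shape,
                      output_shape=mdp.info.action_space.shape)
    sigma.set_weights(2. * np.ones(sigma.weights_size))
    policy = StateStdGaussianPolicy(mu, sigma)

    # REINFORCE fits on complete episodes, so learn with n_episodes_per_fit.
    agent = REINFORCE(mdp.info, policy, AdaptiveOptimizer(eps=.01))
    core = Core(agent, mdp)
    core.learn(n_episodes=100, n_episodes_per_fit=25)

    # The agent can be serialized and restored with save/load.
    agent.save('/tmp/reinforce_lqr', full_save=True)
    agent = REINFORCE.load('/tmp/reinforce_lqr')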
- class GPOMDP(mdp_info, policy, optimizer)[source]
Bases:
PolicyGradient
GPOMDP algorithm. “Infinite-Horizon Policy-Gradient Estimation”, Baxter J. and Bartlett P. L., 2001. A sketch of the gradient estimate follows this method reference.
- __init__(mdp_info, policy, optimizer)[source]
Constructor.
- Parameters:
optimizer – the gradient optimizer.
- _compute_gradient(J)[source]
Return the gradient computed by the algorithm.
- Parameters:
J (list) – list of the cumulative discounted rewards for each episode in the dataset.
- Returns:
The gradient computed by the algorithm.
- _step_update(x, u, r)[source]
This function is called at each episode step while parsing the dataset.
- Parameters:
x (np.ndarray) – the state at the current step;
u (np.ndarray) – the action at the current step;
r (np.ndarray) – the reward at the current step.
- _episode_end_update()[source]
This function is called at the end of each episode while parsing the dataset. The implementation depends on the algorithm (e.g. REINFORCE updates some data structures).
- _init_update()[source]
This function is called at the beginning of each episode while parsing the dataset. The implementation depends on the algorithm (e.g. REINFORCE resets some data structures).
- _add_save_attr(**attr_dict)
Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library of the same name. If a “!” character is added at the end of the method name, the field will be saved only if full_save is set to True.
- Parameters:
**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
- _agent_preprocess(state)
Applies all the agent’s preprocessors to the state.
- Parameters:
state (Array) – the state where the agent is.
- Returns:
The preprocessed state.
- _parse(sample)
Utility to parse the sample.
- Parameters:
sample (list) – the current episode step.
- Returns:
A tuple containing state, action, reward, next state, and the absorbing and last flags. If features are provided, the state is preprocessed with them.
- _post_load()
This method can be overwritten to implement logic that is executed after the loading of the agent.
- _update_agent_preprocessor(state)
Updates the stats of all the agent’s preprocessors given the state.
- Parameters:
state (Array) – the state where the agent is.
- _update_parameters(J)
Update the parameters of the policy.
- Parameters:
J (list) – list of the cumulative discounted rewards for each episode in the dataset.
- add_agent_preprocessor(preprocessor)
Add preprocessor to the agent’s preprocessor list. The preprocessors are applied in order.
- Parameters:
preprocessor (object) – state preprocessor to be applied to state variables before feeding them to the agent.
- add_core_preprocessor(preprocessor)
Add preprocessor to the core’s preprocessor list. The preprocessors are applied in order.
- Parameters:
preprocessor (object) – state preprocessor to be applied to state variables before feeding them to the agent.
- copy()
- Returns:
A deepcopy of the agent.
- property core_preprocessors
Access to core’s state preprocessors stored in the agent.
- draw_action(state, policy_state=None)
Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
- Parameters:
state – the state where the agent is;
policy_state – the policy internal state.
- Returns:
The action to be executed.
- episode_start(initial_state, episode_info)
Called by the Core when a new episode starts.
- Parameters:
initial_state (Array) – vector representing the initial state of the environment;
episode_info (dict) – a dictionary containing the information at reset, such as context.
- Returns:
A tuple containing the policy initial state and, optionally, the policy parameters.
- episode_start_vectorized(initial_states, episode_info, start_mask)
Called by the VectorCore when a new episode starts.
- Parameters:
initial_states (Array) – the initial states of the environment;
episode_info (dict) – a dictionary containing the information at reset, such as context;
start_mask (Array) – boolean mask to select the environments that are starting a new episode.
- Returns:
A tuple containing the policy initial states and, optionally, the policy parameters.
- fit(dataset)
Fit step.
- Parameters:
dataset (Dataset) – the dataset.
- classmethod load(path)
Load and deserialize the agent from the given location on disk.
- Parameters:
path (Path, str) – Relative or absolute path to the agent’s save location.
- Returns:
The loaded agent.
- save(path, full_save=False)
Serialize and save the object to the given path on disk.
- Parameters:
path (Path, str) – Relative or absolute path to the object save location;
full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures.
- save_zip(zip_file, full_save, folder='')
Serialize and save the agent to the given path on disk.
- Parameters:
zip_file (ZipFile) – ZipFile where the object needs to be saved;
full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
folder (string, '') – subfolder to be used by the save method.
- set_logger(logger)
Setter that can be used to pass a logger to the algorithm.
- Parameters:
logger (Logger) – the logger to be used by the algorithm.
- stop()
Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.
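For intuition about what _compute_gradient computes, here is a hedged NumPy sketch of the single-episode GPOMDP estimate: each reward is credited only to the actions that preceded it, which is what distinguishes GPOMDP from REINFORCE. The function name is hypothetical and the per-step baseline used by the actual implementation is omitted for brevity:

    import numpy as np

    def gpomdp_gradient(scores, rewards, gamma):
        # scores: (T, K) array, row t is grad log pi(u_t | x_t) w.r.t. the
        # K policy parameters; rewards: (T,) array; gamma: discount factor.
        # (Illustrative helper; the baseline of the real algorithm is omitted.)
        cum_scores = np.cumsum(scores, axis=0)  # score of actions up to step t
        discounted = gamma ** np.arange(len(rewards)) * rewards
        # Weight each discounted reward by the cumulative score of the actions
        # taken before it, then sum over the steps of the episode.
        return (cum_scores * discounted[:, None]).sum(axis=0)

Averaged over the episodes in the dataset, this is roughly the quantity that _update_parameters feeds to the optimizer.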
- class eNAC(mdp_info, policy, optimizer, critic_features=None)[source]
Bases:
PolicyGradient
Episodic Natural Actor Critic algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G. and Peters J., 2013. A construction sketch follows this method reference.
- __init__(mdp_info, policy, optimizer, critic_features=None)[source]
Constructor.
- Parameters:
critic_features (Features, None) – features used by the critic.
- _compute_gradient(J)[source]
Return the gradient computed by the algorithm.
- Parameters:
J (list) – list of the cumulative discounted rewards for each episode in the dataset.
- Returns:
The gradient computed by the algorithm.
- _step_update(x, u, r)[source]
This function is called at each episode step while parsing the dataset.
- Parameters:
x (np.ndarray) – the state at the current step;
u (np.ndarray) – the action at the current step;
r (np.ndarray) – the reward at the current step.
- _episode_end_update()[source]
This function is called at the end of each episode while parsing the dataset. The implementation depends on the algorithm (e.g. REINFORCE updates some data structures).
- _init_update()[source]
This function is called at the beginning of each episode while parsing the dataset. The implementation depends on the algorithm (e.g. REINFORCE resets some data structures).
- _add_save_attr(**attr_dict)
Add attributes that should be saved for an agent. For every attribute, it is necessary to specify the method to be used to save and load it. Available methods are: numpy, mushroom, torch, json, pickle, primitive and none. The primitive method can be used to store primitive attributes, while the none method always skips the attribute, but ensures that it is initialized to None after the load. The mushroom method can be used with classes that implement the Serializable interface. All the other methods use the library of the same name. If a “!” character is added at the end of the method name, the field will be saved only if full_save is set to True.
- Parameters:
**attr_dict – dictionary of attributes mapped to the method that should be used to save and load them.
- _agent_preprocess(state)
Applies all the agent’s preprocessors to the state.
- Parameters:
state (Array) – the state where the agent is.
- Returns:
The preprocessed state.
- _parse(sample)
Utility to parse the sample.
- Parameters:
sample (list) – the current episode step.
- Returns:
A tuple containing state, action, reward, next state, and the absorbing and last flags. If features are provided, the state is preprocessed with them.
- _post_load()
This method can be overwritten to implement logic that is executed after the loading of the agent.
- _update_agent_preprocessor(state)
Updates the stats of all the agent’s preprocessors given the state.
- Parameters:
state (Array) – the state where the agent is.
- _update_parameters(J)
Update the parameters of the policy.
- Parameters:
J (list) – list of the cumulative discounted rewards for each episode in the dataset.
- add_agent_preprocessor(preprocessor)
Add preprocessor to the agent’s preprocessor list. The preprocessors are applied in order.
- Parameters:
preprocessor (object) – state preprocessor to be applied to state variables before feeding them to the agent.
- add_core_preprocessor(preprocessor)
Add preprocessor to the core’s preprocessor list. The preprocessors are applied in order.
- Parameters:
preprocessor (object) – state preprocessor to be applied to state variables before feeding them to the agent.
- copy()
- Returns:
A deepcopy of the agent.
- property core_preprocessors
Access to core’s state preprocessors stored in the agent.
- draw_action(state, policy_state=None)
Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
- Parameters:
state – the state where the agent is;
policy_state – the policy internal state.
- Returns:
The action to be executed.
- episode_start(initial_state, episode_info)
Called by the Core when a new episode starts.
- Parameters:
initial_state (Array) – vector representing the initial state of the environment;
episode_info (dict) – a dictionary containing the information at reset, such as context.
- Returns:
A tuple containing the policy initial state and, optionally, the policy parameters.
- episode_start_vectorized(initial_states, episode_info, start_mask)
Called by the VectorCore when a new episode starts.
- Parameters:
initial_states (Array) – the initial states of the environment;
episode_info (dict) – a dictionary containing the information at reset, such as context;
start_mask (Array) – boolean mask to select the environments that are starting a new episode.
- Returns:
A tuple containing the policy initial states and, optionally, the policy parameters.
- fit(dataset)
Fit step.
- Parameters:
dataset (Dataset) – the dataset.
- classmethod load(path)
Load and deserialize the agent from the given location on disk.
- Parameters:
path (Path, str) – Relative or absolute path to the agent’s save location.
- Returns:
The loaded agent.
- save(path, full_save=False)
Serialize and save the object to the given path on disk.
- Parameters:
path (Path, str) – Relative or absolute path to the object save location;
full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures.
- save_zip(zip_file, full_save, folder='')
Serialize and save the agent to the given path on disk.
- Parameters:
zip_file (ZipFile) – ZipFile where the object needs to be saved;
full_save (bool) – flag to specify the amount of data to save for MushroomRL data structures;
folder (string, '') – subfolder to be used by the save method.
- set_logger(logger)
Setter that can be used to pass a logger to the algorithm.
- Parameters:
logger (Logger) – the logger to be used by the algorithm.
- stop()
Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.
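Constructing eNAC mirrors the examples above, with the optional critic feature map passed at construction. A hedged sketch, assuming the Features/PolynomialBasis helpers of recent MushroomRL releases (exact names and import paths may differ across versions) and reusing the mdp and policy built in the REINFORCE example:

    from mushroom_rl.algorithms.policy_search import eNAC
    from mushroom_rl.features import Features
    from mushroom_rl.features.basis import PolynomialBasis
    from mushroom_rl.utils.optimizers import AdaptiveOptimizer

    # Optional critic features: a degree-1 polynomial basis over the state,
    # used when fitting the value offset in the natural gradient estimation.
    basis = PolynomialBasis.generate(1, mdp.info.observation_space.shape[0])
    critic_features = Features(basis_list=basis)

    agent = eNAC(mdp.info, policy, AdaptiveOptimizer(eps=.01),
                 critic_features=critic_features)

Omitting critic_features (the default None) makes eNAC estimate the natural gradient from the accumulated scores alone.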