Policy

A policy defines how the agent behaves: it maps the current state to an action, either deterministically or by sampling from a probability distribution over the action space. It is invoked by the agent through draw_action(), and it is the object that most learning algorithms optimize.

Two mixins add optional capabilities that can be combined with a policy: HasWeights equips it with a set of trainable weights (used by policy-search and black-box optimization algorithms), while HasGradient additionally provides the gradient of the log-probability required by policy-gradient methods.

MushroomRL provides several families of policies:

Deterministic policies return a single action for each state;
Gaussian policies are differentiable parametric policies that sample from a Gaussian distribution;
TD policies are value-based policies that select the action from a Q-function (e.g. epsilon-greedy or Boltzmann);
Torch policies are implemented as neural networks and support tensor computation for deep RL;
Movement primitives are trajectory generators implementing DMPs and ProMPs;
Vector policies wrap a population of policies for vectorized black-box optimization.

Policies in MushroomRL can depend on the past in two orthogonal ways. A stateful policy (StatefulPolicy) carries a latent internal state, updated at every step and stored in the dataset because it cannot be reconstructed (e.g. a recurrent hidden state or Ornstein-Uhlenbeck noise). The context is instead a deterministic function of the observed trajectory (e.g. a window of stacked observations); being reconstructable from the stored transitions, it is assembled on the fly by the HistoryManager rather than stored as policy state.
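The difference between the two can be illustrated with a small sketch. The `ObservationWindow` class and the `rebuild_context` function below are hypothetical helpers written for this example, not the HistoryManager API: they show that a window of stacked observations can be recomputed from the stored observations alone, which is why such a context does not need to be saved as policy state.

```python
from collections import deque


class ObservationWindow:
    """Hypothetical helper: keeps the last `size` observations as a context."""

    def __init__(self, size):
        self._buffer = deque(maxlen=size)

    def reset(self):
        # A new episode starts with an empty history.
        self._buffer.clear()

    def push(self, observation):
        # Append the newest observation; the oldest one is dropped when full.
        self._buffer.append(observation)
        return list(self._buffer)


def rebuild_context(observations, size, t):
    """Recompute the context at step t directly from the stored trajectory."""
    return list(observations[max(0, t - size + 1):t + 1])


trajectory = [0.1, 0.2, 0.3, 0.4]
window = ObservationWindow(size=3)
# Context assembled online, one step at a time.
online = [window.push(obs) for obs in trajectory]
# Context rebuilt afterwards from the stored observations only.
offline = [rebuild_context(trajectory, 3, t) for t in range(len(trajectory))]
```

Both lists contain the same windows. A recurrent hidden state or a noise sample, by contrast, could not be recovered this way.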

class Policy(*args, **kwargs)[source]

Bases: MushroomObject

Interface representing a generic policy. A policy is a probability distribution that gives the probability of taking an action given a specified state. A policy is used by mushroom agents to interact with the environment.

__call__(state, action)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:

state – state where you want to evaluate the policy density;
action – action where you want to evaluate the policy density.

Returns:

If only the state is provided, the probability of all actions following the policy in the given state; otherwise, the probability of the given action in the given state following the policy. If the action space is continuous, both state and action must be provided.

draw_action(state, **kwargs)[source]

Sample an action in state using the policy.

Parameters:

state – the state where the agent is;
**kwargs – additional per-timestep conditioning inputs assembled by the agent; policies that do not consume them can ignore the keyword arguments.

Returns:

The action sampled from the policy.

reset()[source]

Useful when the policy needs a special initialization at the beginning of an episode.

Returns:: The initial policy state (by default None).

reset_vectorized(start_mask)[source]

Reset the policy for the environments selected by start_mask at the beginning of an episode.

Parameters:: start_mask – boolean mask selecting the environments that are starting a new episode.
Returns:: The initial policy states (by default None).

stop()[source]: Called at the end of a run to reset any transient internal state. No-op by default.

property is_stateful: Whether the policy carries an internal state that is updated step-by-step.

class StatefulPolicy(*args, **kwargs)[source]

Bases: Policy

Interface representing a stateful policy, i.e. a policy carrying a latent internal state (e.g. the hidden state of a recurrent network, the noise of an Ornstein-Uhlenbeck process, or the phase of a movement primitive) that is updated at every step.

The current policy state is stored inside the policy and exposed through the policy_state property, so that the Core can record it into the dataset for logging and learning. The query methods, instead, take the policy state explicitly and never touch the stored one.

__init__(policy_state_shape)[source]

Constructor.

Parameters:: policy_state_shape (tuple) – the shape of the internal state of the policy.

property is_stateful: Whether the policy carries an internal state that is updated step-by-step.

property policy_state: The current internal state of the policy.

draw_action(state, policy_state=None, **kwargs)[source]

Sample an action in state using the policy.

When policy_state is not provided, the policy uses and updates its internal state. When it is provided, the policy uses the given state and leaves the internal one untouched (functional evaluation).

Parameters:

state – the state where the agent is;
policy_state (None) – the internal state of the policy. If None, the stored internal state is used and updated;
**kwargs – additional per-timestep conditioning inputs assembled by the agent; policies that do not consume them can ignore the keyword arguments.

Returns:

The action sampled from the policy.

reset()[source]

Reset the internal state of the policy at the beginning of an episode. Implementations must set self._policy_state and return it.

Returns:: The initial policy state.

reset_vectorized(start_mask)[source]

Reset the internal state of the policy for the environments selected by start_mask, leaving the other environments untouched. Implementations must set self._policy_state and return it.

Parameters:: start_mask – boolean mask selecting the environments that are starting a new episode.
Returns:: The batch of policy states after the masked reset.
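As a rough illustration of the masked reset, the sketch below uses plain Python lists in place of the batched arrays a real implementation would use. The function name `masked_reset` and the `initial_state` argument are assumptions made for this example.

```python
def masked_reset(policy_states, start_mask, initial_state):
    """Return a new batch where only the masked entries are reset."""
    return [initial_state if starting else state
            for state, starting in zip(policy_states, start_mask)]


states = [0.7, -0.2, 1.5]
mask = [True, False, True]  # environments 0 and 2 start a new episode
new_states = masked_reset(states, mask, initial_state=0.0)
```

The entry of the environment that is still mid-episode is carried over unchanged.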

stop()[source]: Clear the internal state at the end of a run, so that the next run reinitializes it from scratch (and can therefore run with a different number of environments or a different vectorization mode).

_draw_action(state, policy_state, **kwargs)[source]

Sample an action in state given the policy state, returning the next policy state. This is the functional core of draw_action() and must not mutate the internal state.

Parameters:

state – the state where the agent is;
policy_state – the internal state of the policy;
**kwargs – additional per-timestep conditioning inputs.

Returns:

A tuple containing the sampled action and the next policy state.
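The relation between draw_action() and _draw_action() can be sketched with a toy class. `CounterPolicy` is a stand-in written for this example and does not inherit from any MushroomRL class; it only mimics the contract described above, under the assumption that the functional call returns the action alone.

```python
class CounterPolicy:
    """Toy stand-in (not a MushroomRL class) mimicking the stateful contract."""

    def __init__(self):
        self._policy_state = None

    @property
    def policy_state(self):
        return self._policy_state

    def reset(self):
        # Set the stored state and return it, as required of implementations.
        self._policy_state = 0
        return self._policy_state

    def _draw_action(self, state, policy_state, **kwargs):
        # Functional core: reads only its arguments, never self._policy_state.
        action = state + policy_state
        return action, policy_state + 1

    def draw_action(self, state, policy_state=None, **kwargs):
        if policy_state is None:
            # Stateful call: use and update the stored state.
            action, self._policy_state = self._draw_action(
                state, self._policy_state, **kwargs)
            return action
        # Functional call: the stored state is left untouched.
        action, _ = self._draw_action(state, policy_state, **kwargs)
        return action


policy = CounterPolicy()
policy.reset()
first = policy.draw_action(10)                  # uses stored state 0, stores 1
probe = policy.draw_action(10, policy_state=5)  # stored state stays 1
```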

class HasWeights[source]

Bases: object

Mixin adding a set of trainable parameters (the policy weights) to a policy. It is meant to be combined with a Policy subclass (e.g. class MyPolicy(Policy, HasWeights) or class MyPolicy(StatefulPolicy, HasWeights)); on its own it is not a policy. If the policy is also differentiable, use HasGradient instead.

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

property weights_size

Property.

Returns:: The size of the policy weights.
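A minimal sketch of how these three accessors are typically used by a black-box optimizer follows. `LinearPolicySketch` and `perturb` are hypothetical names introduced for this example; the class is a plain Python stand-in rather than a real Policy subclass.

```python
class LinearPolicySketch:
    """Toy stand-in exposing the three weight accessors described above."""

    def __init__(self, n_features):
        self._w = [0.0] * n_features

    def set_weights(self, weights):
        if len(weights) != self.weights_size:
            raise ValueError("wrong number of weights")
        self._w = list(weights)

    def get_weights(self):
        return list(self._w)

    @property
    def weights_size(self):
        return len(self._w)

    def draw_action(self, state):
        # Deterministic linear action: dot product of weights and state.
        return sum(w * s for w, s in zip(self._w, state))


def perturb(policy, deltas):
    """One black-box step: read the weights, shift them, write them back."""
    theta = policy.get_weights()
    policy.set_weights([t + d for t, d in zip(theta, deltas)])


policy = LinearPolicySketch(n_features=2)
perturb(policy, [0.5, -1.0])
action = policy.draw_action([2.0, 1.0])
```

The optimizer never needs to know how the weights are used inside the policy; it only sees a flat vector of size weights_size.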

class HasGradient[source]

Bases: HasWeights

Mixin for a parametric policy that is also differentiable, i.e. one for which the gradient of the log-probability w.r.t. the policy weights can be computed. It extends HasWeights with the derivative of the log-probability; policies that carry weights but are not differentiable should use HasWeights directly.

diff_log(state, action, policy_state=None)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

\[\nabla_{\theta}\log p(s,a)\]

Parameters:

state – the state where the gradient is computed;
action – the action where the gradient is computed;
policy_state – the internal state of the policy.

Returns:

The gradient of the logarithm of the pdf w.r.t. the policy weights

diff(state, action, policy_state=None)[source]

Compute the derivative of the probability density function, in the specified state and action pair. Normally it is computed from the derivative of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

\[\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)\]

Parameters:

state – the state where the derivative is computed;
action – the action where the derivative is computed;
policy_state – the internal state of the policy.

Returns:

The derivative w.r.t. the policy weights
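The likelihood ratio identity can be checked numerically on a simple case. The sketch below assumes a one-dimensional Gaussian policy with mean theta * s and fixed standard deviation; the free functions `pdf`, `diff_log` and `diff` are written for this example and only mirror the method names above.

```python
import math


def pdf(theta, s, a, sigma=1.0):
    """Density of a 1-D Gaussian policy with mean theta * s."""
    z = (a - theta * s) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def diff_log(theta, s, a, sigma=1.0):
    """Analytic gradient of log p w.r.t. theta."""
    return (a - theta * s) * s / sigma ** 2


def diff(theta, s, a, sigma=1.0):
    """Gradient of p obtained through the likelihood ratio identity."""
    return pdf(theta, s, a, sigma) * diff_log(theta, s, a, sigma)


theta, s, a, h = 0.3, 2.0, 1.0, 1e-6
# Central finite difference of the density itself.
numeric = (pdf(theta + h, s, a) - pdf(theta - h, s, a)) / (2.0 * h)
analytic = diff(theta, s, a)
```

The two values agree up to the finite-difference error.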

Deterministic policy

class DeterministicPolicy(*args, **kwargs)[source]

Bases: Policy, HasWeights

Simple parametric policy representing a deterministic policy. As deterministic policies are degenerate probability functions where all the probability mass is on the deterministic action, they are not differentiable, even if the mean value approximator is differentiable.

__init__(mu)[source]

Constructor.

Parameters:: mu (Regressor) – the regressor representing the action to select in each state.

get_regressor()[source]

Getter.

Returns:: The regressor that is used to map state to actions.

__call__(state, action)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:

state – state where you want to evaluate the policy density;
action – action where you want to evaluate the policy density.

Returns:

If only the state is provided, the probability of all actions following the policy in the given state; otherwise, the probability of the given action in the given state following the policy. If the action space is continuous, both state and action must be provided.

draw_action(state)[source]

Sample an action in state using the policy.

Parameters:

state – the state where the agent is;
**kwargs – additional per-timestep conditioning inputs assembled by the agent; policies that do not consume them can ignore the keyword arguments.

Returns:

The action sampled from the policy.

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

property weights_size

Property.

Returns:: The size of the policy weights.

Gaussian policy

class AbstractGaussianPolicy(*args, **kwargs)[source]

Bases: Policy, HasGradient

Abstract class of Gaussian policies.

__call__(state, action)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:

state – state where you want to evaluate the policy density;
action – action where you want to evaluate the policy density.

Returns:

If only the state is provided, the probability of all actions following the policy in the given state; otherwise, the probability of the given action in the given state following the policy. If the action space is continuous, both state and action must be provided.

draw_action(state)[source]

Sample an action in state using the policy.

Parameters:

state – the state where the agent is;
**kwargs – additional per-timestep conditioning inputs assembled by the agent; policies that do not consume them can ignore the keyword arguments.

Returns:

The action sampled from the policy.

class GaussianPolicy(*args, **kwargs)[source]

Bases: AbstractGaussianPolicy

Gaussian policy. This is a differentiable policy for continuous action spaces. The policy samples an action in every state following a Gaussian distribution, where the mean is computed in the state and the covariance matrix is fixed.

__init__(mu, sigma)[source]

Constructor.

Parameters:

mu (Regressor) – the regressor representing the mean w.r.t. the state;
sigma (np.ndarray) – a square positive definite matrix representing the covariance matrix. The size of this matrix must be n x n, where n is the action dimensionality.

set_sigma(sigma)[source]

Setter.

Parameters:: sigma (np.ndarray) – the new covariance matrix. Must be a square positive definite matrix.

diff_log(state, action, policy_state=None)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

\[\nabla_{\theta}\log p(s,a)\]

Parameters:

state – the state where the gradient is computed;
action – the action where the gradient is computed;
policy_state – the internal state of the policy.

Returns:

The gradient of the logarithm of the pdf w.r.t. the policy weights

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

property weights_size

Property.

Returns:: The size of the policy weights.

class DiagonalGaussianPolicy(*args, **kwargs)[source]

Bases: AbstractGaussianPolicy

Gaussian policy with learnable standard deviation. The covariance matrix is constrained to be diagonal, with the squared standard deviation vector on its diagonal. This is a differentiable policy for continuous action spaces. This policy is similar to the Gaussian policy, but the weights also include the standard deviation.

__init__(mu, std)[source]

Constructor.

Parameters:

mu (Regressor) – the regressor representing the mean w.r.t. the state;
std (np.ndarray) – a vector of standard deviations. The length of this vector must be equal to the action dimensionality.
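To make the role of the standard deviation vector concrete, the sketch below evaluates the log-density of a Gaussian with covariance diag(std ** 2) and its gradient with respect to each standard deviation. It assumes the standard deviations themselves are the learnable parameters; the function names are hypothetical and the code is independent of the library.

```python
import math


def log_pdf(mean, std, action):
    """Log-density of a Gaussian with covariance diag(std ** 2)."""
    total = 0.0
    for m, s, a in zip(mean, std, action):
        total += (-0.5 * ((a - m) / s) ** 2
                  - math.log(s) - 0.5 * math.log(2.0 * math.pi))
    return total


def diff_log_std(mean, std, action):
    """Gradient of the log-density w.r.t. each standard deviation."""
    return [((a - m) ** 2 - s ** 2) / s ** 3
            for m, s, a in zip(mean, std, action)]


mean, std, action = [0.0, 1.0], [0.5, 2.0], [0.2, 0.0]
grad = diff_log_std(mean, std, action)

# Finite-difference check on the second standard deviation.
h = 1e-6
numeric = (log_pdf(mean, [0.5, 2.0 + h], action)
           - log_pdf(mean, [0.5, 2.0 - h], action)) / (2.0 * h)
```

The gradient is negative whenever the squared deviation of the action from the mean is smaller than the variance, i.e. the update would shrink the standard deviation.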

set_std(std)[source]

Setter.

Parameters:: std (np.ndarray) – the new vector of standard deviations. The length of this vector must be equal to the action dimensionality.

diff_log(state, action, policy_state=None)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

\[\nabla_{\theta}\log p(s,a)\]

Parameters:

state – the state where the gradient is computed;
action – the action where the gradient is computed;
policy_state – the internal state of the policy.

Returns:

The gradient of the logarithm of the pdf w.r.t. the policy weights

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

property weights_size

Property.

Returns:: The size of the policy weights.

class StateStdGaussianPolicy(*args, **kwargs)[source]

Bases: AbstractGaussianPolicy

Gaussian policy with learnable standard deviation. The covariance matrix is constrained to be diagonal, and its diagonal is the squared standard deviation, which is computed for each state. This is a differentiable policy for continuous action spaces. This policy is similar to the diagonal Gaussian policy, but a parametric regressor is used to compute the standard deviation, so the standard deviation depends on the current state.

__init__(mu, std, eps=1e-06)[source]

Constructor.

Parameters:

mu (Regressor) – the regressor representing the mean w.r.t. the state;
std (Regressor) – the regressor representing the standard deviations w.r.t. the state. The output dimensionality of the regressor must be equal to the action dimensionality;
eps (float, 1e-6) – a positive constant added to the variance to ensure that it is always greater than zero.

diff_log(state, action, policy_state=None)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

\[\nabla_{\theta}\log p(s,a)\]

Parameters:

state – the state where the gradient is computed;
action – the action where the gradient is computed;
policy_state – the internal state of the policy.

Returns:

The gradient of the logarithm of the pdf w.r.t. the policy weights

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

property weights_size

Property.

Returns:: The size of the policy weights.

class StateLogStdGaussianPolicy(*args, **kwargs)[source]

Bases: AbstractGaussianPolicy

Gaussian policy with learnable standard deviation. The covariance matrix is constrained to be diagonal, and its diagonal is computed by an exponential transformation of the logarithm of the standard deviation computed in each state. This is a differentiable policy for continuous action spaces. This policy is similar to the state std Gaussian policy, but here the regressor represents the logarithm of the standard deviation.

__init__(mu, log_std)[source]

Constructor.

Parameters:

mu (Regressor) – the regressor representing the mean w.r.t. the state;
log_std (Regressor) – a regressor representing the logarithm of the standard deviation w.r.t. the state. The output dimensionality of the regressor must be equal to the action dimensionality.

diff_log(state, action, policy_state=None)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

\[\nabla_{\theta}\log p(s,a)\]

Parameters:

state – the state where the gradient is computed;
action – the action where the gradient is computed;
policy_state – the internal state of the policy.

Returns:

The gradient of the logarithm of the pdf w.r.t. the policy weights

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

property weights_size

Property.

Returns:: The size of the policy weights.

Noise policy

class OrnsteinUhlenbeckPolicy(*args, **kwargs)[source]

Bases: StatefulPolicy, HasWeights

Ornstein-Uhlenbeck process as implemented in: https://github.com/openai/baselines/blob/master/baselines/ddpg/noise.py.

This policy is commonly used in the Deep Deterministic Policy Gradient algorithm.

__init__(mu, sigma, theta, dt, x0=None)[source]

Constructor.

Parameters:

mu (Regressor) – the regressor representing the mean w.r.t. the state;
sigma (torch.tensor) – average magnitude of the random fluctuations per square-root time;
theta (float) – rate of mean reversion;
dt (float) – time interval;
x0 (torch.tensor, None) – initial values of noise.
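The parameters above enter the usual discretized Ornstein-Uhlenbeck update. The sketch below follows the update rule of the linked baselines implementation, written with plain Python lists; `ou_step` is a hypothetical function name, and how the policy combines the noise with the regressor output is not shown here.

```python
import math
import random


def ou_step(x_prev, mu, sigma, theta, dt, rng):
    """One Euler-Maruyama step of an Ornstein-Uhlenbeck process."""
    return [x + theta * (m - x) * dt + sg * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            for x, m, sg in zip(x_prev, mu, sigma)]


rng = random.Random(0)

# With sigma = 0 the process just decays towards mu at rate theta.
noiseless = ou_step([1.0], [0.0], [0.0], theta=0.5, dt=0.1, rng=rng)

# With sigma > 0 consecutive samples are correlated through x_prev.
x = [0.0]
path = []
for _ in range(5):
    x = ou_step(x, [0.0], [0.2], theta=0.15, dt=0.01, rng=rng)
    path.append(x[0])
```

The function takes the previous noise value explicitly and returns the next one, which matches the functional style of _draw_action().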

__call__(state, action=None, policy_state=None)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:

state – state where you want to evaluate the policy density;
action – action where you want to evaluate the policy density.

Returns:

If only the state is provided, the probability of all actions following the policy in the given state; otherwise, the probability of the given action in the given state following the policy. If the action space is continuous, both state and action must be provided.

_draw_action(state, policy_state)[source]

Sample an action in state given the policy state, returning the next policy state. This is the functional core of draw_action() and must not mutate the internal state.

Parameters:

state – the state where the agent is;
policy_state – the internal state of the policy;
**kwargs – additional per-timestep conditioning inputs.

Returns:

A tuple containing the sampled action and the next policy state.

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

property weights_size

Property.

Returns:: The size of the policy weights.

reset()[source]

Reset the internal state of the policy at the beginning of an episode. Implementations must set self._policy_state and return it.

Returns:: The initial policy state.

reset_vectorized(start_mask)[source]

Reset the internal state of the policy for the environments selected by start_mask, leaving the other environments untouched. Implementations must set self._policy_state and return it.

Parameters:: start_mask – boolean mask selecting the environments that are starting a new episode.
Returns:: The batch of policy states after the masked reset.

class ClippedGaussianPolicy(*args, **kwargs)[source]

Bases: Policy, HasWeights

Clipped Gaussian policy, as used in:

“Addressing Function Approximation Error in Actor-Critic Methods”. Fujimoto S. et al.. 2018.

This is a non-differentiable policy for continuous action spaces. The policy samples an action in every state following a Gaussian distribution, where the mean is computed in the state and the covariance matrix is fixed. The action is then clipped to the given action range. This policy is not a truncated Gaussian: it simply clips any action component that falls outside the boundaries, which is why it is not differentiable.

__init__(mu, sigma, low, high)[source]

Constructor.

Parameters:

mu (Regressor) – the regressor representing the mean w.r.t. the state;
sigma (torch.tensor) – a square positive definite matrix representing the covariance matrix. The size of this matrix must be n x n, where n is the action dimensionality;
low (torch.tensor) – a vector containing the minimum action for each component;
high (torch.tensor) – a vector containing the maximum action for each component.
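The sample-then-clip behaviour can be sketched as follows. As a simplification, the example assumes independent components (a diagonal covariance given as a vector of standard deviations) and uses plain Python lists instead of tensors; `draw_clipped` is a hypothetical name.

```python
import random


def draw_clipped(mean, std, low, high, rng):
    """Sample each component from a Gaussian, then clip it to [low, high]."""
    action = []
    for m, s, lo, hi in zip(mean, std, low, high):
        sample = rng.gauss(m, s)
        action.append(min(max(sample, lo), hi))
    return action


rng = random.Random(0)
samples = [draw_clipped([0.0], [5.0], [-1.0], [1.0], rng)[0]
           for _ in range(1000)]
# Samples that were clipped end up exactly on one of the bounds.
at_bounds = sum(1 for a in samples if a in (-1.0, 1.0))
```

Clipping piles probability mass exactly on the bounds, whereas a truncated Gaussian would renormalize the density inside the interval.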

__call__(state, action=None)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:

state – state where you want to evaluate the policy density;
action – action where you want to evaluate the policy density.

Returns:

If only the state is provided, the probability of all actions following the policy in the given state; otherwise, the probability of the given action in the given state following the policy. If the action space is continuous, both state and action must be provided.

draw_action(state)[source]

Sample an action in state using the policy.

Parameters:

state – the state where the agent is;
**kwargs – additional per-timestep conditioning inputs assembled by the agent; policies that do not consume them can ignore the keyword arguments.

Returns:

The action sampled from the policy.

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

property weights_size

Property.

Returns:: The size of the policy weights.

TD policy

class TDPolicy(*args, **kwargs)[source]

Bases: Policy

__init__(backend='numpy')[source]

Constructor.

Parameters:: backend (str, 'numpy') – name of the array backend used by the policy.

set_q(approximator)[source]

Parameters:: approximator (object) – the approximator to use.

get_q()[source]

Returns:: The approximator used by the policy.

class EpsGreedy(*args, **kwargs)[source]

Bases: TDPolicy

Epsilon greedy policy.

__init__(epsilon, backend='numpy')[source]

Constructor.

Parameters:

epsilon ([float, Parameter]) – the exploration coefficient. It indicates the probability of performing a random action in the current step;
backend (str, 'numpy') – name of the array backend used by the policy.
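The role of epsilon can be sketched with a self-contained example that operates on a list of Q-values for a single state. The function names are hypothetical, and splitting the greedy probability evenly among tied actions is an assumption of this sketch.

```python
import random


def eps_greedy_probabilities(q_values, epsilon):
    """Action probabilities: uniform with prob. epsilon, greedy otherwise."""
    n = len(q_values)
    best = max(q_values)
    greedy = [i for i, q in enumerate(q_values) if q == best]
    probs = [epsilon / n] * n
    for i in greedy:
        # The remaining mass is shared among the greedy actions.
        probs[i] += (1.0 - epsilon) / len(greedy)
    return probs


def eps_greedy_draw(q_values, epsilon, rng):
    """Sample an action index following the probabilities above."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    return rng.choice([i for i, q in enumerate(q_values) if q == best])


probs = eps_greedy_probabilities([1.0, 3.0, 2.0], epsilon=0.3)
greedy_action = eps_greedy_draw([1.0, 3.0, 2.0], 0.0, random.Random(0))
```

Note that the greedy action can also be selected by the random branch, so its total probability is 1 - epsilon + epsilon / n.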

__call__(*args)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:

state – state where you want to evaluate the policy density;
action – action where you want to evaluate the policy density.

Returns:

If only the state is provided, the probability of all actions following the policy in the given state; otherwise, the probability of the given action in the given state following the policy. If the action space is continuous, both state and action must be provided.

draw_action(state)[source]

Sample an action in state using the policy.

Parameters:

state – the state where the agent is;
**kwargs – additional per-timestep conditioning inputs assembled by the agent; policies that do not consume them can ignore the keyword arguments.

Returns:

The action sampled from the policy.

set_epsilon(epsilon)[source]

Setter.

Parameters:

epsilon ([float, Parameter]) – the exploration coefficient. It indicates the probability of performing a random action in the current step.

update(*idx)[source]

Update the value of the epsilon parameter at the provided index (e.g. in case of different values of epsilon for each visited state according to the number of visits).

Parameters:: *idx (list) – index of the parameter to be updated.

class Boltzmann(*args, **kwargs)[source]

Bases: TDPolicy

Boltzmann softmax policy.

__init__(beta, backend='numpy')[source]

Constructor.

Parameters:

beta ([float, Parameter]) – the inverse of the temperature distribution. As the temperature approaches infinity, the policy becomes more and more random. As the temperature approaches 0.0, the policy becomes more and more greedy;
backend (str, 'numpy') – name of the array backend used by the policy.
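The effect of beta can be seen in a small softmax sketch over the Q-values of a single state. `boltzmann_probabilities` is a hypothetical function written for this example, using only the standard library.

```python
import math


def boltzmann_probabilities(q_values, beta):
    """Softmax over beta * Q, shifted by the largest logit for stability."""
    logits = [beta * q for q in q_values]
    shift = max(logits)
    weights = [math.exp(logit - shift) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


# beta = 0 corresponds to infinite temperature: all actions equally likely.
uniform = boltzmann_probabilities([1.0, 2.0, 3.0], beta=0.0)
# A large beta corresponds to a temperature close to 0: almost greedy.
sharp = boltzmann_probabilities([1.0, 2.0, 3.0], beta=50.0)
```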

__call__(*args)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:

state – state where you want to evaluate the policy density;
action – action where you want to evaluate the policy density.

Returns:

If only the state is provided, the probability of all actions following the policy in the given state; otherwise, the probability of the given action in the given state following the policy. If the action space is continuous, both state and action must be provided.

draw_action(state)[source]

Sample an action in state using the policy.

Parameters:

state – the state where the agent is;
**kwargs – additional per-timestep conditioning inputs assembled by the agent; policies that do not consume them can ignore the keyword arguments.

Returns:

The action sampled from the policy.

set_beta(beta)[source]

Setter.

Parameters:: beta ([float, Parameter]) – the inverse of the temperature distribution.

update(*idx)[source]

Update the value of the beta parameter at the provided index (e.g. in case of different values of beta for each visited state according to the number of visits).

Parameters:: *idx (list) – index of the parameter to be updated.

class Mellowmax(*args, **kwargs)[source]

Bases: Boltzmann

Mellowmax policy. “An Alternative Softmax Operator for Reinforcement Learning”. Asadi K. and Littman M.L.. 2017.

class MellowmaxParameter(*args, **kwargs)[source]

Bases: Parameter

__init__(outer, omega, beta_min, beta_max)[source]

Constructor.

Parameters:

value (float) – initial value of the parameter;
min_value (float, None) – minimum value that the parameter can reach when decreasing;
max_value (float, None) – maximum value that the parameter can reach when increasing;
size (tuple, (1,)) – shape of the matrix of parameters; this shape can be used to have a single parameter for each state or state-action tuple.
log_table (bool, False) – if True, the parameter is logged also when it is backed by a table with more than one element. By default tabular parameters are not logged, as logging a per-state or per-state-action value on every update is too expensive.

__call__(state)[source]

Update and return the parameter in the provided index.

Parameters:: *idx (list) – index of the parameter to return.
Returns:: The updated parameter in the provided index.

__init__(omega, beta_min=-10.0, beta_max=10.0, backend='numpy')[source]

Constructor.

Parameters:

omega (Parameter) – the omega parameter of the policy from which beta of the Boltzmann policy is computed;
beta_min (float, -10.) – one end of the bracketing interval for minimization with Brent’s method;
beta_max (float, 10.) – the other end of the bracketing interval for minimization with Brent’s method;
backend (str, 'numpy') – name of the array backend used by the policy.
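How a Boltzmann beta can be derived from omega is sketched below, following the maximum-entropy mellowmax construction of the cited paper: beta is chosen so that the expected Q-value under the Boltzmann policy equals the mellowmax of the Q-values. This is an illustration under stated assumptions, not the library code: plain bisection stands in for Brent's method, and all function names are hypothetical.

```python
import math


def mellowmax(q_values, omega):
    """mm_omega(q) = log(mean(exp(omega * q))) / omega, computed stably."""
    shift = max(omega * q for q in q_values)
    mean_exp = sum(math.exp(omega * q - shift) for q in q_values) / len(q_values)
    return (shift + math.log(mean_exp)) / omega


def solve_beta(q_values, omega, beta_min=-10.0, beta_max=10.0, iters=200):
    """Find beta such that sum_a exp(beta * d_a) * d_a = 0, d_a = q_a - mm."""
    mm = mellowmax(q_values, omega)
    diffs = [q - mm for q in q_values]

    def f(beta):
        return sum(math.exp(beta * d) * d for d in diffs)

    lo, hi = beta_min, beta_max
    if f(lo) * f(hi) > 0:
        raise ValueError("the root is not bracketed by [beta_min, beta_max]")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # f is non-decreasing in beta: keep the half that brackets the root.
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def policy_probabilities(q_values, beta):
    """Boltzmann probabilities for the given beta."""
    shift = max(beta * q for q in q_values)
    weights = [math.exp(beta * q - shift) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q = [1.0, 2.0, 3.0]
beta = solve_beta(q, omega=1.0)
probs = policy_probabilities(q, beta)
expected_q = sum(p * v for p, v in zip(probs, q))
```

In this example the interval [beta_min, beta_max] must contain the root; if it does not, the sketch raises an error instead of returning a value.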

set_beta(beta)[source]

Setter.

Parameters:: beta ([float, Parameter]) – the inverse of the temperature distribution.

update(*idx)[source]

Update the value of the beta parameter at the provided index (e.g. in case of different values of beta for each visited state according to the number of visits).

Parameters:: *idx (list) – index of the parameter to be updated.

Torch policy

class TorchPolicy(*args, **kwargs)[source]

Bases: Policy

Interface for a generic PyTorch policy. A PyTorch policy is a policy implemented as a neural network using PyTorch. Its methods operate directly on torch tensors.

__call__(state, action)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:

state – state where you want to evaluate the policy density;
action – action where you want to evaluate the policy density.

Returns:

If only the state is provided, the probability of all actions following the policy in the given state; otherwise, the probability of the given action in the given state following the policy. If the action space is continuous, both state and action must be provided.

draw_with_log_prob(state)[source]

Sample an action in state using the reparametrization trick and compute its log probability. Since the action is sampled through the reparametrization trick, gradients can flow through both the action and its log probability.

Parameters:: state (torch.Tensor) – the set of states where the action is sampled.
Returns:: The sampled action and its log probability.

log_prob(state, action)[source]

Compute the logarithm of the probability of taking action in state.

Parameters:

state (torch.Tensor) – set of states;
action (torch.Tensor) – set of actions.

Returns:

The tensor of log-probability.

entropy(state=None)[source]

Compute the entropy of the policy.

Parameters:: state (torch.Tensor, None) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None.
Returns:: The value of the entropy of the policy.
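A Gaussian with a state-independent standard deviation is one case where the entropy has a closed form that does not involve the state. The sketch below uses the standard formula for a diagonal Gaussian; `diagonal_gaussian_entropy` is a hypothetical function written with the standard library rather than torch.

```python
import math


def diagonal_gaussian_entropy(std):
    """Entropy of N(mean, diag(std ** 2)); it does not depend on the mean."""
    return sum(0.5 + 0.5 * math.log(2.0 * math.pi) + math.log(s) for s in std)


unit = diagonal_gaussian_entropy([1.0])
wide = diagonal_gaussian_entropy([1.0, 2.0])
```

Since the mean never appears in the formula, no state is needed to evaluate it.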

distribution(state)[source]

Compute the policy distribution in the given states.

Parameters:: state (torch.Tensor) – the set of states where the distribution is computed.
Returns:: The torch distribution for the provided states.

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

parameters()[source]

Returns the trainable policy parameters, as expected by torch optimizers.

Returns:: List of parameters to be optimized.

class GaussianTorchPolicy(*args, **kwargs)[source]

Bases: TorchPolicy

Torch policy implementing a Gaussian policy with trainable standard deviation. The standard deviation is not state-dependent.

__init__(network, input_shape, output_shape, std_0=1.0, **params)[source]

Constructor.

Parameters:

network (object) – the network class used to implement the mean regressor;
input_shape (tuple) – the shape of the state space;
output_shape (tuple) – the shape of the action space;
std_0 (float, 1.) – initial standard deviation;
**params – parameters used by the network constructor.

draw_action(state)[source]

Sample an action in state using the policy.

Parameters:

state – the state where the agent is;
**kwargs – additional per-timestep conditioning inputs assembled by the agent; policies that do not consume them can ignore the keyword arguments.

Returns:

The action sampled from the policy.

draw_with_log_prob(state)[source]

Sample an action in state using the reparametrization trick and compute its log probability. Since the action is sampled through the reparametrization trick, gradients can flow through both the action and its log probability.

Parameters:: state (torch.Tensor) – the set of states where the action is sampled.
Returns:: The sampled action and its log probability.

log_prob(state, action)[source]

Compute the logarithm of the probability of taking action in state.

Parameters:

state (torch.Tensor) – set of states;
action (torch.Tensor) – set of actions.

Returns:

The tensor of log-probability.

entropy(state=None)[source]

Compute the entropy of the policy.

Parameters:: state (torch.Tensor, None) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None.
Returns:: The value of the entropy of the policy.

distribution(state)[source]

Compute the policy distribution in the given states.

Parameters:: state (torch.Tensor) – the set of states where the distribution is computed.
Returns:: The torch distribution for the provided states.

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

parameters()[source]

Returns the trainable policy parameters, as expected by torch optimizers.

Returns:: List of parameters to be optimized.

class BoltzmannTorchPolicy(*args, **kwargs)[source]

Bases: TorchPolicy

Torch policy implementing a Boltzmann policy.

__init__(network, input_shape, output_shape, beta, **params)[source]

Constructor.

Parameters:

network (object) – the network class used to compute the action preferences (logits);
input_shape (tuple) – the shape of the state space;
output_shape (tuple) – the shape of the action space;
beta ([float, Parameter]) – the inverse temperature of the distribution. As the temperature approaches infinity (beta approaches 0.0), the policy becomes more and more random. As the temperature approaches 0.0 (beta approaches infinity), the policy becomes more and more greedy.
**params – parameters used by the network constructor.
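The effect of beta can be sketched with a softmax over scaled scores. The example assumes the network outputs are used as unnormalized action preferences; boltzmann_probs is a hypothetical helper written in plain Python, not the library's code.

```python
import math


def boltzmann_probs(logits, beta):
    """Softmax over beta * logits (illustrative helper)."""
    scaled = [beta * q for q in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 3.0]
uniform = boltzmann_probs(logits, beta=0.0)   # infinite temperature
greedy = boltzmann_probs(logits, beta=100.0)  # temperature close to zero
```

With beta equal to 0.0 every action gets the same probability, while a large beta concentrates almost all the mass on the highest-scoring action.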

draw_action(state)[source]

Sample an action in state using the policy.

Parameters:

state – the state where the agent is;
**kwargs – additional per-timestep conditioning inputs assembled by the agent; policies that do not consume them can ignore the keyword arguments.

Returns:

The action sampled from the policy.

draw_with_log_prob(state)[source]

Sample an action in state using the reparametrization trick and compute its log probability. Since the action is sampled through the reparametrization trick, gradients can flow through both the action and its log probability.

Parameters:: state (torch.Tensor) – the set of states where the action is sampled.
Returns:: The sampled action and its log probability.

log_prob(state, action)[source]

Compute the logarithm of the probability of taking action in state.

Parameters:

state (torch.Tensor) – set of states;
action (torch.Tensor) – set of actions.

Returns:

The tensor of log-probability.

entropy(state=None)[source]

Compute the entropy of the policy.

Parameters:: state (torch.Tensor, None) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None.
Returns:: The value of the entropy of the policy.

distribution(state)[source]

Compute the policy distribution in the given states.

Parameters:: state (torch.Tensor) – the set of states where the distribution is computed.
Returns:: The torch distribution for the provided states.

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

parameters()[source]

Returns the trainable policy parameters, as expected by torch optimizers.

Returns:: List of parameters to be optimized.

class SquashedGaussianTorchPolicy(*args, **kwargs)[source]

Bases: TorchPolicy

Torch policy implementing a Gaussian policy squashed by a tanh and remapped to a bounded action range, as used by the Soft Actor-Critic algorithm. The squashing and the corresponding change-of-variables are handled by the SquashedGaussian distribution.

__init__(mu_approximator, sigma_approximator, min_a, max_a, log_std_min, log_std_max)[source]

Constructor.

Parameters:

mu_approximator (Approximator) – a regressor computing the mean given a state;
sigma_approximator (Approximator) – a regressor computing the log standard deviation given a state;
min_a (np.ndarray) – a vector specifying the minimum action value for each component;
max_a (np.ndarray) – a vector specifying the maximum action value for each component;
log_std_min ([float, Parameter]) – min value for the policy log std;
log_std_max ([float, Parameter]) – max value for the policy log std.
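A minimal one-dimensional sketch of the squashing and of the change-of-variables correction is given below. It assumes a hard clip of the log standard deviation and an affine remapping of tanh to [min_a, max_a]; the actual SquashedGaussian distribution may differ in these details, and both function names are hypothetical.

```python
import math


def squash(u, min_a, max_a):
    """Map an unbounded sample u to [min_a, max_a] through a tanh."""
    return min_a + 0.5 * (max_a - min_a) * (math.tanh(u) + 1.0)


def squashed_log_prob(u, mean, log_std, min_a, max_a, log_std_min, log_std_max):
    """Log-density of the squashed action for one action component."""
    # Assumed: the log standard deviation is clipped to the allowed range.
    log_std = min(max(log_std, log_std_min), log_std_max)
    std = math.exp(log_std)
    gauss = -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    # Change of variables: subtract log |da/du| of the squashing map.
    jacobian = 0.5 * (max_a - min_a) * (1.0 - math.tanh(u) ** 2)
    return gauss - math.log(jacobian)


a = squash(0.3, min_a=-2.0, max_a=2.0)
lp = squashed_log_prob(0.3, 0.0, 0.0, -2.0, 2.0, -20.0, 2.0)
```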

draw_action(state)[source]

Sample an action in state using the policy.

Parameters:

state – the state where the agent is;
**kwargs – additional per-timestep conditioning inputs assembled by the agent; policies that do not consume them can ignore the keyword arguments.

Returns:

The action sampled from the policy.

draw_with_log_prob(state)[source]

Sample an action in state using the reparametrization trick and compute its log probability. Since the action is sampled through the reparametrization trick, gradients can flow through both the action and its log probability.

Parameters:: state (torch.Tensor) – the set of states where the action is sampled.
Returns:: The sampled action and its log probability.

log_prob(state, action)[source]

Compute the logarithm of the probability of taking action in state.

Parameters:

state (torch.Tensor) – set of states;
action (torch.Tensor) – set of actions.

Returns:

The tensor of log-probability.

entropy(state=None)[source]

Compute the entropy of the policy.

Parameters:: state (torch.Tensor, None) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None.
Returns:: The value of the entropy of the policy.

distribution(state)[source]

Compute the policy distribution in the given states.

Parameters:: state (torch.Tensor) – the set of states where the distribution is computed.
Returns:: The torch distribution for the provided states.

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

parameters()[source]

Returns the trainable policy parameters, as expected by torch optimizers.

Returns:: List of parameters to be optimized.

Stateful Torch policy

class StatefulTorchPolicy(*args, **kwargs)[source]

Bases: StatefulPolicy, TorchPolicy

Interface for a stateful PyTorch policy, i.e. a TorchPolicy carrying a latent internal state (e.g. the hidden state of a recurrent network). draw_action relies on the stored internal state (see StatefulPolicy), while the query methods take the policy state and the sequence lengths explicitly, so they never depend on the stored one.

draw_with_log_prob(state, policy_state, lengths, **kwargs)[source]

Sample an action using the reparametrization trick and compute its log probability.

Parameters:

state (torch.Tensor) – the set of states where the action is sampled;
policy_state (torch.Tensor) – the policy internal states;
lengths (torch.Tensor) – the length of each input sequence;
**kwargs – additional per-timestep conditioning inputs.

Returns:

The sampled action, its log probability and the next policy state.

log_prob(state, action, policy_state, lengths, **kwargs)[source]

Compute the logarithm of the probability of taking action in state.

Parameters:

state (torch.Tensor) – set of states;
action (torch.Tensor) – set of actions;
policy_state (torch.Tensor) – the policy internal states;
lengths (torch.Tensor) – the length of each input sequence;
**kwargs – additional per-timestep conditioning inputs.

Returns:

The tensor of log-probability.

distribution(state, policy_state, lengths, **kwargs)[source]

Compute the policy distribution in the given states.

Parameters:

state (torch.Tensor) – the set of states where the distribution is computed;
policy_state (torch.Tensor) – the policy internal states;
lengths (torch.Tensor) – the length of each input sequence;
**kwargs – additional per-timestep conditioning inputs.

Returns:

The torch distribution for the provided states.
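One common reason for passing lengths together with a batch of sequences is that the sequences are padded to a common length. The sketch below assumes that layout, which is not confirmed by this reference, and averages per-step values over the valid steps only; masked_mean is a hypothetical helper.

```python
def masked_mean(values, lengths):
    """Average per-step values over the valid steps of padded sequences.

    values: equally long lists, one per sequence, padded at the end;
    lengths: number of valid steps in each sequence.
    """
    total, count = 0.0, 0
    for seq, length in zip(values, lengths):
        total += sum(seq[:length])  # ignore the padding after `length`
        count += length
    return total / count


log_probs = [[-1.0, -2.0, 0.0], [-3.0, 0.0, 0.0]]  # trailing zeros are padding
mean_lp = masked_mean(log_probs, lengths=[2, 1])
```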

class RecurrentGaussianTorchPolicy(*args, **kwargs)[source]

Bases: StatefulTorchPolicy

Torch policy implementing a Gaussian policy whose mean is computed by a recurrent network. The hidden state of the network is the latent policy state, carried step-by-step at inference time and provided explicitly to the query methods together with the sequence lengths.

__init__(network, input_shape, output_shape, policy_state_shape, std_0=1.0, log_std_min=-20, log_std_max=2, **params)[source]

Constructor.

Parameters:

network (object) – the network class used to implement the mean regressor. Its forward must return (action_mean, next_policy_state);
input_shape (tuple) – the shape of the state space;
output_shape (tuple) – the shape of the action space (the network internally also receives policy_state_shape as its second output shape);
policy_state_shape (tuple) – the shape of the hidden state of the recurrent network;
std_0 (float, 1.) – initial standard deviation;
log_std_min ([float, Parameter], -20) – min value for the policy log std;
log_std_max ([float, Parameter], 2) – max value for the policy log std;
**params – parameters used by the network constructor.

draw_with_log_prob(state, policy_state, lengths, **kwargs)[source]

Sample an action using the reparametrization trick and compute its log probability.

Parameters:

state (torch.Tensor) – the set of states where the action is sampled;
policy_state (torch.Tensor) – the policy internal states;
lengths (torch.Tensor) – the length of each input sequence;
**kwargs – additional per-timestep conditioning inputs.

Returns:

The sampled action, its log probability and the next policy state.

log_prob(state, action, policy_state, lengths, **kwargs)[source]

Compute the logarithm of the probability of taking action in state.

Parameters:

state (torch.Tensor) – set of states;
action (torch.Tensor) – set of actions;
policy_state (torch.Tensor) – the policy internal states;
lengths (torch.Tensor) – the length of each input sequence;
**kwargs – additional per-timestep conditioning inputs.

Returns:

The tensor of log-probability.

entropy(state=None)[source]

Compute the entropy of the policy.

Parameters:: state (torch.Tensor, None) – the set of states to consider. If the entropy of the policy can be computed in closed form, then state can be None.
Returns:: The value of the entropy of the policy.

distribution(state, policy_state, lengths, **kwargs)[source]

Compute the policy distribution in the given states.

Parameters:

state (torch.Tensor) – the set of states where the distribution is computed;
policy_state (torch.Tensor) – the policy internal states;
lengths (torch.Tensor) – the length of each input sequence;
**kwargs – additional per-timestep conditioning inputs.

Returns:

The torch distribution for the provided states.

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

parameters()[source]

Returns the trainable policy parameters, as expected by torch optimizers.

Returns:: List of parameters to be optimized.

reset()[source]

Reset the internal state of the policy at the beginning of an episode. Implementations must set self._policy_state and return it.

Returns:: The initial policy state.

reset_vectorized(start_mask)[source]

Reset the internal state of the policy for the environments selected by start_mask, leaving the other environments untouched. Implementations must set self._policy_state and return it.

Parameters:: start_mask – boolean mask selecting the environments that are starting a new episode.
Returns:: The batch of policy states after the masked reset.
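The masked reset can be pictured with the plain-Python sketch below, which resets only the selected entries of a batch of states. The zero initial state and the list-based layout are assumptions made for illustration; a real policy would also store the result in self._policy_state.

```python
def reset_vectorized(policy_state, start_mask, initial_state):
    """Return a batch where the masked entries are reset to initial_state."""
    return [
        list(initial_state) if start else state
        for state, start in zip(policy_state, start_mask)
    ]


batch = [[0.5, 0.1], [0.9, 0.4], [0.2, 0.7]]
batch = reset_vectorized(batch, [True, False, True], initial_state=[0.0, 0.0])
```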

_draw_action(state, policy_state, action_history=None)[source]

Sample an action in state given the policy state, returning the next policy state. This is the functional core of draw_action() and must not mutate the internal state.

Parameters:

state – the state where the agent is;
policy_state – the internal state of the policy;
**kwargs – additional per-timestep conditioning inputs.

Returns:

A tuple containing the sampled action and the next policy state.
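The split between the functional core and the stateful wrapper can be sketched as follows. CounterPolicy is a hypothetical toy class, and the wrapper is simplified to return only the action; how draw_action() is actually implemented in StatefulPolicy is not shown in this reference.

```python
class CounterPolicy:
    """Hypothetical stateful policy whose internal state is a step counter."""

    def reset(self):
        self._policy_state = 0
        return self._policy_state

    def _draw_action(self, state, policy_state):
        # Pure function: reads its arguments only and returns the next state.
        action = state * (policy_state + 1)
        return action, policy_state + 1

    def draw_action(self, state):
        # The wrapper is the only place where the stored state is updated.
        action, self._policy_state = self._draw_action(state, self._policy_state)
        return action


policy = CounterPolicy()
policy.reset()
actions = [policy.draw_action(2.0) for _ in range(3)]
```

Keeping _draw_action() free of side effects means it can be called with any policy state, for example one read back from a dataset, without disturbing the state used for acting.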

Movement primitives

class ProMP(*args, **kwargs)[source]

Bases: StatefulPolicy, HasWeights

Class representing a Probabilistic Movement Primitive (ProMP). Specifically, this class represents the low-level Gaussian time-dependent policy.

Differently from the original implementation of ProMPs, an arbitrary regressor can be used to compute the mean from time features. By using a non-linear regressor, the theory behind conditioning might not hold.

__init__(mu, phi, duration, sigma=None, periodic=False)[source]

Constructor.

Parameters:

mu (Regressor) – the regressor representing the mean at each time step;
phi (Features) – basis functions used as time features;
duration (int) – duration of the movement in number of steps;
sigma (np.ndarray, None) – a square positive definite matrix representing the covariance matrix. The size of this matrix must be n x n, where n is the action dimensionality. If not specified, the policy returns the mean value;
periodic (bool, False) – whether the movement represented is periodic or not. If True, the duration parameter represents the duration of a period, and the phase variable increases continuously.

__call__(state, action, policy_state=None)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:

state – state where you want to evaluate the policy density;
action – action where you want to evaluate the policy density.

Returns:

The probability of all actions following the policy in the given state if only the state is provided, else the probability of the given action in the given state following the policy. If the action space is continuous, both state and action must be provided.

_draw_action(state, policy_state)[source]

Sample an action in state given the policy state, returning the next policy state. This is the functional core of draw_action() and must not mutate the internal state.

Parameters:

state – the state where the agent is;
policy_state – the internal state of the policy;
**kwargs – additional per-timestep conditioning inputs.

Returns:

A tuple containing the sampled action and the next policy state.

update_time(state, policy_state)[source]

Method that updates the time counter. Can be overridden to introduce complex state-dependent behaviors.

Parameters:: state (np.ndarray) – the current state of the system.

_compute_phase(state, policy_state)[source]

Method that computes the phase variable. It can be overridden to implement a state-dependent phase.

Parameters:: state (np.ndarray) – the current state of the system.
Returns:: The current value of the phase variable
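A plausible default for the phase is the ratio between the step counter and the duration. The sketch below assumes that the phase is held at 1.0 once a non-periodic movement ends, which is not stated in this reference; compute_phase is a hypothetical stand-alone helper.

```python
def compute_phase(step, duration, periodic=False):
    """Map a step counter to a phase value (illustrative helper)."""
    phase = step / duration
    if not periodic:
        phase = min(phase, 1.0)  # assumed: hold the final phase after the end
    return phase


one_shot = [compute_phase(t, 4) for t in range(7)]
cyclic = [compute_phase(t, 4, periodic=True) for t in range(7)]
```

In the periodic case the phase keeps growing past 1.0, so periodic basis functions are needed for the movement to repeat.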

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

property weights_size

Property.

Returns:: The size of the policy weights.

set_duration(duration)[source]: Set the duration of the movement.

reset()[source]

Reset the internal state of the policy at the beginning of an episode. Implementations must set self._policy_state and return it.

Returns:: The initial policy state.

reset_vectorized(start_mask)[source]

Reset the internal state of the policy for the environments selected by start_mask, leaving the other environments untouched. Implementations must set self._policy_state and return it.

Parameters:: start_mask – boolean mask selecting the environments that are starting a new episode.
Returns:: The batch of policy states after the masked reset.

class DMP(*args, **kwargs)[source]

Bases: StatefulPolicy, HasWeights

Class representing a Dynamic Movement Primitive (DMP).

Differently from the original implementation of DMP, an arbitrary regressor can be used to compute the mean from the phase variable.

The internal state of the dynamical system, i.e. the canonical velocity v, the phase x, and the transformation system variables z and y, is stored in the policy state as a single array stacking the four variables, with shape (4,) + action_shape.

__init__(mu, phi, goal, dt, tau, alpha_v, beta_v, alpha_z, beta_z)[source]

Constructor.

Parameters:: policy_state_shape (tuple) – the shape of the internal state of the policy.

__call__(state, action, policy_state=None)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters:

state – state where you want to evaluate the policy density;
action – action where you want to evaluate the policy density.

Returns:

The probability of all actions following the policy in the given state if only the state is provided, else the probability of the given action in the given state following the policy. If the action space is continuous, both state and action must be provided.

_draw_action(state, policy_state)[source]

Sample an action in state given the policy state, returning the next policy state. This is the functional core of draw_action() and must not mutate the internal state.

Parameters:

state – the state where the agent is;
policy_state – the internal state of the policy;
**kwargs – additional per-timestep conditioning inputs.

Returns:

A tuple containing the sampled action and the next policy state.

update_system(state, policy_state)[source]

Method that updates the dynamical system. Can be overridden to introduce complex state-dependent behaviors.

Parameters:

state (np.ndarray) – the current state of the environment;
policy_state (np.ndarray) – the internal state of the DMP, stacking the [v, x, z, y] variables.

Returns:

The updated internal state of the DMP and its y variable (the action).
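To give an idea of what one update of the stacked [v, x, z, y] state may look like, the sketch below performs an explicit Euler step of a one-dimensional DMP. It assumes the classic second-order canonical system and takes the forcing term as a plain number instead of computing it from mu and phi; the equations and integration scheme used by the library may differ, and dmp_step is a hypothetical helper.

```python
def dmp_step(policy_state, goal, forcing, dt, tau,
             alpha_v, beta_v, alpha_z, beta_z):
    """One explicit Euler step of a one-dimensional DMP (illustrative only)."""
    v, x, z, y = policy_state
    # Canonical system: second-order dynamics driving the phase x to the goal.
    dv = alpha_v * (beta_v * (goal - x) - v) / tau
    dx = v / tau
    # Transformation system: spring-damper around the goal plus a forcing term.
    dz = (alpha_z * (beta_z * (goal - y) - z) + forcing) / tau
    dy = z / tau
    next_state = [v + dv * dt, x + dx * dt, z + dz * dt, y + dy * dt]
    return next_state, next_state[3]


state = [0.0, 0.0, 0.0, 0.0]
for _ in range(2000):
    state, action = dmp_step(state, goal=1.0, forcing=0.0, dt=0.01, tau=1.0,
                             alpha_v=25.0, beta_v=6.25,
                             alpha_z=25.0, beta_z=6.25)
```

With a zero forcing term and critically damped gains, the y variable settles at the goal, which is the attractor behavior the learned forcing term is meant to shape.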

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

property weights_size

Property.

Returns:: The size of the policy weights.

_split_variables(policy_state)[source]: Return a view of the internal state with the [v, x, z, y] variables on the leading axis, both for a single state of shape (4,) + action_shape and a batched one of shape (n_envs, 4) + action_shape. In-place updates on the unpacked variables write back into policy_state.

reset()[source]

Reset the internal state of the policy at the beginning of an episode. Implementations must set self._policy_state and return it.

Returns:: The initial policy state.

reset_vectorized(start_mask)[source]

Reset the internal state of the policy for the environments selected by start_mask, leaving the other environments untouched. Implementations must set self._policy_state and return it.

Parameters:: start_mask – boolean mask selecting the environments that are starting a new episode.
Returns:: The batch of policy states after the masked reset.

Vector policy

class VectorPolicy(*args, **kwargs)[source]

Bases: Policy, HasWeights

Policy wrapping a vector of independent copies of a base policy, each one with its own weights. It is used by black-box optimization algorithms to evaluate a population of parameterizations in parallel, one per environment. Each wrapped policy manages its own internal state (if stateful), so no policy state is threaded through this wrapper.

__init__(policy, n_envs)[source]

Constructor.

Parameters:

policy (HasWeights) – base policy to copy;
n_envs (int) – number of environments, i.e. the number of copies of the base policy.

draw_action(state)[source]

Sample an action in state using the policy.

Parameters:

state – the state where the agent is;
**kwargs – additional per-timestep conditioning inputs assembled by the agent; policies that do not consume them can ignore the keyword arguments.

Returns:

The action sampled from the policy.
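The wrapping described for this class can be pictured with the sketch below, in which each environment gets its own copy of a base policy and its own weights. Both classes are hypothetical stand-ins, and the one-entry-per-environment weight layout is an assumption rather than the documented format.

```python
import copy


class LinearPolicy:
    """Hypothetical base policy: the action is weight * state."""

    def __init__(self, weight):
        self._weight = weight

    def set_weights(self, weights):
        self._weight = weights

    def get_weights(self):
        return self._weight

    def draw_action(self, state):
        return self._weight * state


class VectorOfPolicies:
    """Illustrative wrapper holding n_envs independent copies of a policy."""

    def __init__(self, policy, n_envs):
        self._policies = [copy.deepcopy(policy) for _ in range(n_envs)]

    def set_weights(self, weights):
        # Assumed layout: one entry of `weights` per environment.
        for p, w in zip(self._policies, weights):
            p.set_weights(w)

    def draw_action(self, state):
        # One state per environment, each handled by its own copy.
        return [p.draw_action(s) for p, s in zip(self._policies, state)]


vector = VectorOfPolicies(LinearPolicy(1.0), n_envs=3)
vector.set_weights([1.0, 2.0, 3.0])
actions = vector.draw_action([1.0, 1.0, 2.0])
```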

set_weights(weights)[source]

Setter.

Parameters:: weights (np.ndarray) – the vector of the new weights to be used by the policy.

get_weights()[source]

Getter.

Returns:: The current policy weights.

property weights_size

Property.

Returns:: The size of the policy weights.

reset()[source]

Useful when the policy needs a special initialization at the beginning of an episode.

Returns:: The initial policy state (by default None).

reset_vectorized(start_mask)[source]

Reset the policy for the environments selected by start_mask at the beginning of an episode.

Parameters:: start_mask – boolean mask selecting the environments that are starting a new episode.
Returns:: The initial policy states (by default None).

stop()[source]: Called at the end of a run to reset any transient internal state. No-op by default.