# Policy

class mushroom.policy.policy.Policy[source]

Bases: object

Interface representing a generic policy. A policy is a probability distribution that gives the probability of taking an action given a specified state. A policy is used by mushroom agents to interact with the environment.

__call__(*args)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters: *args (list) – list containing a state, or a state and an action. If the action space is continuous, both state and action must be provided.
Returns: The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy.
draw_action(state)[source]

Sample an action in state using the policy.

Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action sampled from the policy.
reset()[source]

Useful when the policy needs a special initialization at the beginning of an episode.

__init__

Initialize self. See help(type(self)) for accurate signature.
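To make the interface concrete, a minimal uniform-random policy over a discrete action space might look as follows. This is a standalone sketch: `UniformDiscretePolicy` and its `n_actions` parameter are hypothetical illustrations, not part of mushroom.

```python
import numpy as np

class UniformDiscretePolicy:
    """Toy policy: every action is equally likely in every state."""
    def __init__(self, n_actions):
        self._n_actions = n_actions

    def __call__(self, *args):
        # Only the state -> vector with the probability of every action;
        # state and action -> probability of that single action.
        if len(args) == 1:
            return np.ones(self._n_actions) / self._n_actions
        return 1. / self._n_actions

    def draw_action(self, state):
        # The state is ignored: sampling is uniform everywhere.
        return np.array([np.random.randint(self._n_actions)])

    def reset(self):
        pass  # no per-episode state to re-initialize
```

Note how `__call__` mirrors the documented behavior: with one argument it returns the full distribution, with two it returns a single probability.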

class mushroom.policy.policy.ParametricPolicy[source]

Interface for a generic parametric policy. A parametric policy is a policy that depends on a set of parameters, called the policy weights. If the policy is differentiable, the derivative of the probability for a specified state-action pair can be provided.

diff_log(state, action)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

$\nabla_{\theta}\log p(s,a)$
Parameters: state (np.ndarray) – the state where the gradient is computed; action (np.ndarray) – the action where the gradient is computed.
Returns: The gradient of the logarithm of the pdf w.r.t. the policy weights.
diff(state, action)[source]

Compute the derivative of the probability density function in the specified state-action pair. Usually, it is computed from the gradient of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

$\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)$
Parameters: state (np.ndarray) – the state where the derivative is computed; action (np.ndarray) – the action where the derivative is computed.
Returns: The derivative w.r.t. the policy weights.
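The likelihood ratio identity can be checked numerically for a one-dimensional Gaussian whose mean is the only policy weight. This is an illustrative sketch independent of mushroom; names like `gauss_pdf` are made up for the example.

```python
import numpy as np

def gauss_pdf(a, theta, sigma=1.0):
    # N(a; theta, sigma^2) density
    return np.exp(-0.5 * ((a - theta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

theta, a = 0.3, 1.1
p = gauss_pdf(a, theta)

# Analytic gradient of log p w.r.t. theta: (a - theta) / sigma^2
diff_log = (a - theta) / 1.0 ** 2
diff = p * diff_log  # likelihood ratio trick: grad p = p * grad log p

# Finite-difference check of d p / d theta
eps = 1e-6
diff_fd = (gauss_pdf(a, theta + eps) - gauss_pdf(a, theta - eps)) / (2 * eps)
```

The two gradients agree to numerical precision, which is exactly what `diff` exploits when it delegates to `diff_log`.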
set_weights(weights)[source]

Setter.

Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy
get_weights()[source]

Getter.

Returns: The current policy weights
weights_size

Property.

Returns: The size of the policy weights
__call__(*args)

Compute the probability of taking action in a certain state following the policy.

Parameters: *args (list) – list containing a state, or a state and an action. If the action space is continuous, both state and action must be provided.
Returns: The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy.
__init__

Initialize self. See help(type(self)) for accurate signature.

draw_action(state)

Sample an action in state using the policy.

Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action sampled from the policy.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

## Gaussian policy

class mushroom.policy.gaussian_policy.GaussianPolicy(mu, sigma)[source]

Gaussian policy. This is a differentiable policy for continuous action spaces. The policy samples an action in every state following a Gaussian distribution, where the mean is computed as a function of the state and the covariance matrix is fixed.

__init__(mu, sigma)[source]

Constructor.

Parameters: mu (Regressor) – the regressor representing the mean w.r.t. the state; sigma (np.ndarray) – a square positive definite matrix representing the covariance matrix. The size of this matrix must be n x n, where n is the action dimensionality.
set_sigma(sigma)[source]

Setter.

Parameters: sigma (np.ndarray) – the new covariance matrix. Must be a square positive definite matrix.
__call__(state, action)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters: state (np.ndarray) – the state where the probability is computed; action (np.ndarray) – the action whose probability is computed.
Returns: The probability of taking the given action in the given state following the policy.
draw_action(state)[source]

Sample an action in state using the policy.

Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action sampled from the policy.
diff_log(state, action)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

$\nabla_{\theta}\log p(s,a)$
Parameters: state (np.ndarray) – the state where the gradient is computed; action (np.ndarray) – the action where the gradient is computed.
Returns: The gradient of the logarithm of the pdf w.r.t. the policy weights.
set_weights(weights)[source]

Setter.

Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy
get_weights()[source]

Getter.

Returns: The current policy weights
weights_size

Property.

Returns: The size of the policy weights
diff(state, action)

Compute the derivative of the probability density function in the specified state-action pair. Usually, it is computed from the gradient of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

$\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)$
Parameters: state (np.ndarray) – the state where the derivative is computed; action (np.ndarray) – the action where the derivative is computed.
Returns: The derivative w.r.t. the policy weights.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.
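When the mean is linear in the weights, mu(s) = Phi(s) theta, the gradient returned by diff_log has a simple closed form. The sketch below verifies it against finite differences; it is a standalone numpy illustration under that linearity assumption, not mushroom's implementation.

```python
import numpy as np

def diff_log_gaussian(theta, phi, action, sigma_inv):
    # Mean linear in the weights: mu = phi @ theta, with phi the
    # (action_dim x n_weights) feature matrix of the state.
    mu = phi @ theta
    # Gradient of log N(a; mu, Sigma) w.r.t. theta: phi^T Sigma^{-1} (a - mu)
    return phi.T @ sigma_inv @ (action - mu)

rng = np.random.default_rng(0)
phi = rng.normal(size=(2, 3))        # 2-d action, 3 policy weights
theta = rng.normal(size=3)
action = rng.normal(size=2)
sigma_inv = np.linalg.inv(np.diag([0.5, 2.0]))

grad = diff_log_gaussian(theta, phi, action, sigma_inv)

# Finite-difference check on the log-pdf (constants drop out of the gradient)
def log_pdf(th):
    d = action - phi @ th
    return -0.5 * d @ sigma_inv @ d

eps = 1e-6
fd = np.array([(log_pdf(theta + eps * e) - log_pdf(theta - eps * e)) / (2 * eps)
               for e in np.eye(3)])
```

Since the covariance is fixed, only the mean weights appear in the gradient, which is why `weights_size` for this policy equals the size of the mean regressor's weights.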

class mushroom.policy.gaussian_policy.DiagonalGaussianPolicy(mu, std)[source]

Gaussian policy with learnable standard deviation. The covariance matrix is constrained to be diagonal, with the squared standard deviation vector on the diagonal. This is a differentiable policy for continuous action spaces. This policy is similar to the Gaussian policy, but its weights also include the standard deviation.

__init__(mu, std)[source]

Constructor.

Parameters: mu (Regressor) – the regressor representing the mean w.r.t. the state; std (np.ndarray) – a vector of standard deviations. The length of this vector must be equal to the action dimensionality.
set_std(std)[source]

Setter.

Parameters: std (np.ndarray) – the new vector of standard deviations. Its length must be equal to the action dimensionality, and every entry must be positive.
__call__(state, action)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters: state (np.ndarray) – the state where the probability is computed; action (np.ndarray) – the action whose probability is computed.
Returns: The probability of taking the given action in the given state following the policy.
draw_action(state)[source]

Sample an action in state using the policy.

Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action sampled from the policy.
diff_log(state, action)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

$\nabla_{\theta}\log p(s,a)$
Parameters: state (np.ndarray) – the state where the gradient is computed; action (np.ndarray) – the action where the gradient is computed.
Returns: The gradient of the logarithm of the pdf w.r.t. the policy weights.
set_weights(weights)[source]

Setter.

Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy
get_weights()[source]

Getter.

Returns: The current policy weights
weights_size

Property.

Returns: The size of the policy weights
diff(state, action)

Compute the derivative of the probability density function in the specified state-action pair. Usually, it is computed from the gradient of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

$\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)$
Parameters: state (np.ndarray) – the state where the derivative is computed; action (np.ndarray) – the action where the derivative is computed.
Returns: The derivative w.r.t. the policy weights.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

class mushroom.policy.gaussian_policy.StateStdGaussianPolicy(mu, std, eps=1e-06)[source]

Gaussian policy with learnable standard deviation. The covariance matrix is constrained to be diagonal, where the diagonal is the squared standard deviation computed for each state. This is a differentiable policy for continuous action spaces. This policy is similar to the diagonal Gaussian policy, but a parametric regressor is used to compute the standard deviation, so the standard deviation depends on the current state.

__init__(mu, std, eps=1e-06)[source]

Constructor.

Parameters: mu (Regressor) – the regressor representing the mean w.r.t. the state; std (Regressor) – the regressor representing the standard deviations w.r.t. the state. The output dimensionality of the regressor must be equal to the action dimensionality; eps (float, 1e-6) – a positive constant added to the variance to ensure that it is always strictly positive.
__call__(state, action)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters: state (np.ndarray) – the state where the probability is computed; action (np.ndarray) – the action whose probability is computed.
Returns: The probability of taking the given action in the given state following the policy.
draw_action(state)[source]

Sample an action in state using the policy.

Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action sampled from the policy.
diff_log(state, action)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

$\nabla_{\theta}\log p(s,a)$
Parameters: state (np.ndarray) – the state where the gradient is computed; action (np.ndarray) – the action where the gradient is computed.
Returns: The gradient of the logarithm of the pdf w.r.t. the policy weights.
set_weights(weights)[source]

Setter.

Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy
get_weights()[source]

Getter.

Returns: The current policy weights
weights_size

Property.

Returns: The size of the policy weights
diff(state, action)

Compute the derivative of the probability density function in the specified state-action pair. Usually, it is computed from the gradient of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

$\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)$
Parameters: state (np.ndarray) – the state where the derivative is computed; action (np.ndarray) – the action where the derivative is computed.
Returns: The derivative w.r.t. the policy weights.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

class mushroom.policy.gaussian_policy.StateLogStdGaussianPolicy(mu, log_std)[source]

Gaussian policy with learnable standard deviation. The covariance matrix is constrained to be diagonal, where the diagonal is computed by an exponential transformation of the logarithm of the standard deviation computed in each state. This is a differentiable policy for continuous action spaces. This policy is similar to the state-std Gaussian policy, but here the regressor represents the logarithm of the standard deviation.

__init__(mu, log_std)[source]

Constructor.

Parameters: mu (Regressor) – the regressor representing the mean w.r.t. the state; log_std (Regressor) – a regressor representing the logarithm of the standard deviation w.r.t. the state. The output dimensionality of the regressor must be equal to the action dimensionality.
__call__(state, action)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters: state (np.ndarray) – the state where the probability is computed; action (np.ndarray) – the action whose probability is computed.
Returns: The probability of taking the given action in the given state following the policy.
draw_action(state)[source]

Sample an action in state using the policy.

Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action sampled from the policy.
diff_log(state, action)[source]

Compute the gradient of the logarithm of the probability density function, in the specified state and action pair, i.e.:

$\nabla_{\theta}\log p(s,a)$
Parameters: state (np.ndarray) – the state where the gradient is computed; action (np.ndarray) – the action where the gradient is computed.
Returns: The gradient of the logarithm of the pdf w.r.t. the policy weights.
set_weights(weights)[source]

Setter.

Parameters: weights (np.ndarray) – the vector of the new weights to be used by the policy
get_weights()[source]

Getter.

Returns: The current policy weights
weights_size

Property.

Returns: The size of the policy weights
diff(state, action)

Compute the derivative of the probability density function in the specified state-action pair. Usually, it is computed from the gradient of the logarithm of the probability density function, exploiting the likelihood ratio trick, i.e.:

$\nabla_{\theta}p(s,a)=p(s,a)\nabla_{\theta}\log p(s,a)$
Parameters: state (np.ndarray) – the state where the derivative is computed; action (np.ndarray) – the action where the derivative is computed.
Returns: The derivative w.r.t. the policy weights.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.
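The point of parameterizing the logarithm of the standard deviation is that the regressor may output any real number while the resulting standard deviation stays strictly positive, so no eps constant or projection is needed. A one-line illustration:

```python
import numpy as np

log_std = np.array([-3.0, 0.0, 2.5])  # unconstrained regressor outputs
std = np.exp(log_std)                 # always strictly positive
variance = std ** 2                   # diagonal of the covariance matrix
```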

## TD policy

class mushroom.policy.td_policy.TDPolicy[source]
__init__()[source]

Constructor.

set_q(approximator)[source]
Parameters: approximator (object) – the approximator to use.
get_q()[source]
Returns: The approximator used by the policy.
__call__(*args)

Compute the probability of taking action in a certain state following the policy.

Parameters: *args (list) – list containing a state, or a state and an action. If the action space is continuous, both state and action must be provided.
Returns: The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy.
draw_action(state)

Sample an action in state using the policy.

Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action sampled from the policy.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

class mushroom.policy.td_policy.EpsGreedy(epsilon)[source]

Epsilon greedy policy.

__init__(epsilon)[source]

Constructor.

Parameters: epsilon (Parameter) – the exploration coefficient. It indicates the probability of performing a random action in the current step.
__call__(*args)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters: *args (list) – list containing a state, or a state and an action. If the action space is continuous, both state and action must be provided.
Returns: The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy.
draw_action(state)[source]

Sample an action in state using the policy.

Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action sampled from the policy.
set_epsilon(epsilon)[source]

Setter.

Parameters: epsilon (Parameter) – the exploration coefficient. It indicates the probability of performing a random action in the current step.
update(*idx)[source]

Update the value of the epsilon parameter at the provided index (e.g. in case of different values of epsilon for each visited state according to the number of visits).

Parameters: *idx (list) – index of the parameter to be updated.
get_q()
Returns: The approximator used by the policy.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

set_q(approximator)
Parameters: approximator (object) – the approximator to use.
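The core of an epsilon-greedy action draw can be sketched with plain numpy. This is an illustration, not mushroom's implementation: `q_values` stands in for the predictions of the approximator set via set_q.

```python
import numpy as np

def eps_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a uniform random action,
    otherwise pick one of the greedy actions."""
    if rng.uniform() < epsilon:
        return rng.integers(len(q_values))
    # Break ties among maximal Q-values at random.
    best = np.flatnonzero(q_values == q_values.max())
    return rng.choice(best)
```

Random tie-breaking among maximal Q-values matters early in learning, when many action values are still equal.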
class mushroom.policy.td_policy.Boltzmann(beta)[source]

Boltzmann softmax policy.

__init__(beta)[source]

Constructor.

Parameters: beta (Parameter) – the inverse of the temperature of the distribution. As the temperature approaches infinity, the policy becomes more and more random. As the temperature approaches 0.0, the policy becomes more and more greedy.
__call__(*args)[source]

Compute the probability of taking action in a certain state following the policy.

Parameters: *args (list) – list containing a state, or a state and an action. If the action space is continuous, both state and action must be provided.
Returns: The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy.
draw_action(state)[source]

Sample an action in state using the policy.

Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action sampled from the policy.
get_q()
Returns: The approximator used by the policy.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

set_q(approximator)
Parameters: approximator (object) – the approximator to use.
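The Boltzmann action distribution is a softmax of the Q-values scaled by beta. A minimal numerically stable sketch (standalone, with `q_values` standing in for the approximator's predictions):

```python
import numpy as np

def boltzmann_probs(q_values, beta):
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels in the normalization.
    z = beta * (q_values - q_values.max())
    e = np.exp(z)
    return e / e.sum()
```

As beta grows, the distribution concentrates on the greedy action; as beta approaches 0, it approaches the uniform distribution.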
class mushroom.policy.td_policy.Mellowmax(omega, beta_min=-10.0, beta_max=10.0)[source]

Mellowmax policy. “An Alternative Softmax Operator for Reinforcement Learning”. Asadi K. and Littman M.L., 2017.

__init__(omega, beta_min=-10.0, beta_max=10.0)[source]

Constructor.

Parameters: omega (Parameter) – the omega parameter of the policy from which beta of the Boltzmann policy is computed; beta_min (float, -10.) – one end of the bracketing interval for minimization with Brent’s method; beta_max (float, 10.) – the other end of the bracketing interval for minimization with Brent’s method.
__call__(*args)

Compute the probability of taking action in a certain state following the policy.

Parameters: *args (list) – list containing a state, or a state and an action. If the action space is continuous, both state and action must be provided.
Returns: The probability of all actions following the policy in the given state if the list contains only the state, else the probability of the given action in the given state following the policy.
draw_action(state)

Sample an action in state using the policy.

Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action sampled from the policy.
get_q()
Returns: The approximator used by the policy.
reset()

Useful when the policy needs a special initialization at the beginning of an episode.

set_q(approximator)
Parameters: approximator (object) – the approximator to use.
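The mellowmax operator from the cited paper is mm_omega(Q) = log((1/n) sum_i exp(omega * Q_i)) / omega, a quantity that lies strictly between the mean and the max of the Q-values for positive omega; the policy then searches (with Brent's method, bracketed by beta_min and beta_max) for the beta of a Boltzmann distribution consistent with it. Only the operator itself is sketched here, in a stabilized log-sum-exp form:

```python
import numpy as np

def mellowmax(q_values, omega):
    n = len(q_values)
    # Stabilized: mm = max(q) + log(mean(exp(omega * (q - max(q))))) / omega
    m = q_values.max()
    return m + np.log(np.exp(omega * (q_values - m)).sum() / n) / omega
```

As omega approaches 0, mellowmax approaches the mean of the Q-values; as omega grows, it approaches the max.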