Actor-Critic
Classical Actor-Critic Methods

class
mushroom.algorithms.actor_critic.classic_actor_critic.
COPDAC_Q
(policy, mu, mdp_info, alpha_theta, alpha_omega, alpha_v, value_function_features=None, policy_features=None)[source]¶ Bases:
mushroom.algorithms.agent.Agent
Compatible off-policy deterministic actor-critic algorithm. “Deterministic Policy Gradient Algorithms”. Silver D. et al.. 2014.

__init__
(policy, mu, mdp_info, alpha_theta, alpha_omega, alpha_v, value_function_features=None, policy_features=None)[source]¶ Constructor.
Parameters:  policy (Policy) – any exploration policy, possibly using the deterministic policy as mean regressor;
 mu (Regressor) – regressor that describes the deterministic policy to be learned, i.e., the deterministic mapping between state and action;
 alpha_theta (Parameter) – learning rate for policy update;
 alpha_omega (Parameter) – learning rate for the advantage function;
 alpha_v (Parameter) – learning rate for the value function;
 value_function_features (Features, None) – features used by the value function approximator;
 policy_features (Features, None) – features used by the policy.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

stop
()¶ Method used to stop an agent. Useful when dealing with real world environments, simulators, or to cleanup environments internals after a core learn/evaluate to enforce consistency.


class
mushroom.algorithms.actor_critic.classic_actor_critic.
StochasticAC
(policy, mdp_info, alpha_theta, alpha_v, lambda_par=0.9, value_function_features=None, policy_features=None)[source]¶ Bases:
mushroom.algorithms.agent.Agent
Stochastic actor-critic in the episodic setting, as presented in: “Model-Free Reinforcement Learning with Continuous Action in Practice”. Degris T. et al.. 2012.

__init__
(policy, mdp_info, alpha_theta, alpha_v, lambda_par=0.9, value_function_features=None, policy_features=None)[source]¶ Constructor.
Parameters:  policy (ParametricPolicy) – a differentiable stochastic policy;
 mdp_info – information about the MDP;
 alpha_theta (Parameter) – learning rate for policy update;
 alpha_v (Parameter) – learning rate for the value function;
 lambda_par (float, 0.9) – trace decay parameter;
 value_function_features (Features, None) – features used by the value function approximator;
 policy_features (Features, None) – features used by the policy.
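
To illustrate what a trace decay parameter such as lambda_par controls, the following minimal sketch performs a TD(λ)-style critic update with an accumulating eligibility trace over linear features. It is an illustration under stated assumptions, not the library's implementation: the function `td_lambda_step`, the feature vectors and the discount factor `gamma` are hypothetical names introduced here.

```python
def td_lambda_step(w, e, phi, phi_next, reward, gamma, lambda_par, alpha_v):
    """One illustrative critic update with an accumulating eligibility trace.

    w, e      -- value weights and eligibility trace (lists of floats)
    phi       -- features of the current state
    phi_next  -- features of the next state
    """
    v = sum(wi * pi for wi, pi in zip(w, phi))
    v_next = sum(wi * pi for wi, pi in zip(w, phi_next))
    delta = reward + gamma * v_next - v  # TD error
    # Decay the old trace by gamma * lambda_par, then add the current features.
    e = [gamma * lambda_par * ei + pi for ei, pi in zip(e, phi)]
    # Move the weights along the trace, scaled by the TD error.
    w = [wi + alpha_v * delta * ei for wi, ei in zip(w, e)]
    return w, e, delta


w, e = [0.0, 0.0], [0.0, 0.0]
w, e, delta = td_lambda_step(w, e, [1.0, 0.0], [0.0, 1.0], 1.0, 0.99, 0.9, 0.1)
```

With lambda_par close to 0 the trace forgets earlier states almost immediately; with values close to 1, credit from a TD error is spread further back along the trajectory.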

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

stop
()¶ Method used to stop an agent. Useful when dealing with real world environments, simulators, or to cleanup environments internals after a core learn/evaluate to enforce consistency.


class
mushroom.algorithms.actor_critic.classic_actor_critic.
StochasticAC_AVG
(policy, mdp_info, alpha_theta, alpha_v, alpha_r, lambda_par=0.9, value_function_features=None, policy_features=None)[source]¶ Bases:
mushroom.algorithms.agent.Agent
Stochastic actor-critic in the average reward setting, as presented in: “Model-Free Reinforcement Learning with Continuous Action in Practice”. Degris T. et al.. 2012.

__init__
(policy, mdp_info, alpha_theta, alpha_v, alpha_r, lambda_par=0.9, value_function_features=None, policy_features=None)[source]¶ Constructor.
Parameters:  policy (ParametricPolicy) – a differentiable stochastic policy;
 mdp_info – information about the MDP;
 alpha_theta (Parameter) – learning rate for policy update;
 alpha_v (Parameter) – learning rate for the value function;
 alpha_r (Parameter) – learning rate for the reward trace;
 lambda_par (float, 0.9) – trace decay parameter;
 value_function_features (Features, None) – features used by the value function approximator;
 policy_features (Features, None) – features used by the policy.
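
As a rough sketch of the role of alpha_r in the average reward setting, the TD error is computed against a running estimate of the average reward instead of using a discount factor, and that estimate is then moved toward the observed reward. The function name `average_reward_td_error` and the scalar value estimates are hypothetical and chosen only for illustration.

```python
def average_reward_td_error(reward, r_bar, v, v_next, alpha_r):
    """Illustrative average-reward TD error and reward-trace update."""
    # No discount factor: the reward is measured relative to the running average.
    delta = reward - r_bar + v_next - v
    # The average reward estimate is updated with its own learning rate.
    r_bar = r_bar + alpha_r * delta
    return delta, r_bar


delta, r_bar = average_reward_td_error(reward=1.0, r_bar=0.0, v=0.0, v_next=0.0, alpha_r=0.1)
```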

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

stop
()¶ Method used to stop an agent. Useful when dealing with real world environments, simulators, or to cleanup environments internals after a core learn/evaluate to enforce consistency.

Deep Actor-Critic Methods

class
mushroom.algorithms.actor_critic.deep_actor_critic.
DeepAC
(policy, mdp_info, actor_optimizer, parameters)[source]¶ Bases:
mushroom.algorithms.agent.Agent
Base class for algorithms that use the reparametrization trick, such as SAC, DDPG and TD3.

__init__
(policy, mdp_info, actor_optimizer, parameters)[source]¶ Constructor.
Parameters:  actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
 parameters – policy parameters to be optimized.

_optimize_actor_parameters
(loss)[source]¶ Method used to update actor parameters to maximize a given loss.
Parameters: loss (torch.tensor) – the loss computed by the algorithm.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

stop
()¶ Method used to stop an agent. Useful when dealing with real world environments, simulators, or to cleanup environments internals after a core learn/evaluate to enforce consistency.


class
mushroom.algorithms.actor_critic.deep_actor_critic.
A2C
(mdp_info, policy, critic_params, actor_optimizer, ent_coeff, max_grad_norm=None, critic_fit_params=None)[source]¶ Bases:
mushroom.algorithms.actor_critic.deep_actor_critic.deep_actor_critic.DeepAC
Advantage Actor Critic algorithm (A2C). Synchronous version of the A3C algorithm. “Asynchronous Methods for Deep Reinforcement Learning”. Mnih V. et al.. 2016.

__init__
(mdp_info, policy, critic_params, actor_optimizer, ent_coeff, max_grad_norm=None, critic_fit_params=None)[source]¶ Constructor.
Parameters:  policy (TorchPolicy) – torch policy to be learned by the algorithm;
 critic_params (dict) – parameters of the critic approximator to build;
 actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
 ent_coeff (float, 0) – coefficient for the entropy penalty;
 max_grad_norm (float, None) – maximum norm for gradient clipping. If None, no clipping will be performed, unless specified otherwise in actor_optimizer;
 critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
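
The effect of max_grad_norm can be sketched as global-norm gradient clipping: if the Euclidean norm of the full gradient exceeds the threshold, the gradient is rescaled to have exactly that norm, and with None nothing is changed. The helper below is a hypothetical pure-Python illustration on a flat list of gradient values. In a PyTorch setting this operation is usually delegated to torch.nn.utils.clip_grad_norm_, and whether the library does so here is an assumption.

```python
import math


def clip_grad_norm(grads, max_grad_norm=None):
    """Return a copy of grads rescaled so its L2 norm is at most max_grad_norm."""
    if max_grad_norm is None:
        return list(grads)  # no clipping requested
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_grad_norm:
        return list(grads)  # already within the allowed norm
    scale = max_grad_norm / total_norm
    return [g * scale for g in grads]


clipped = clip_grad_norm([3.0, 4.0], max_grad_norm=1.0)  # norm 5 rescaled to norm 1
```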

_optimize_actor_parameters
(loss)¶ Method used to update actor parameters to maximize a given loss.
Parameters: loss (torch.tensor) – the loss computed by the algorithm.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

stop
()¶ Method used to stop an agent. Useful when dealing with real world environments, simulators, or to cleanup environments internals after a core learn/evaluate to enforce consistency.


class
mushroom.algorithms.actor_critic.deep_actor_critic.
DDPG
(mdp_info, policy_class, policy_params, batch_size, initial_replay_size, max_replay_size, tau, critic_params, actor_params, actor_optimizer, policy_delay=1, critic_fit_params=None)[source]¶ Bases:
mushroom.algorithms.actor_critic.deep_actor_critic.deep_actor_critic.DeepAC
Deep Deterministic Policy Gradient algorithm. “Continuous Control with Deep Reinforcement Learning”. Lillicrap T. P. et al.. 2016.

__init__
(mdp_info, policy_class, policy_params, batch_size, initial_replay_size, max_replay_size, tau, critic_params, actor_params, actor_optimizer, policy_delay=1, critic_fit_params=None)[source]¶ Constructor.
Parameters:  policy_class (Policy) – class of the policy;
 policy_params (dict) – parameters of the policy to build;
 batch_size (int) – the number of samples in a batch;
 initial_replay_size (int) – the number of samples to collect before starting the learning;
 max_replay_size (int) – the maximum number of samples in the replay memory;
 tau (float) – value of coefficient for soft updates;
 actor_params (dict) – parameters of the actor approximator to build;
 critic_params (dict) – parameters of the critic approximator to build;
 actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
 policy_delay (int, 1) – the number of critic updates after which an actor update is performed;
 critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
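
Two of the parameters above lend themselves to a short sketch: tau, the coefficient of the soft (Polyak) update of the target weights, and policy_delay, the spacing between actor updates measured in critic updates. Both helpers are hypothetical illustrations working on plain lists. The exact update schedule used by the library is an assumption here.

```python
def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau of the way toward the online weight."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_weights, online_weights)]


def actor_update_steps(n_critic_updates, policy_delay=1):
    """Indices (1-based) of the critic updates that are followed by an actor update."""
    return [k for k in range(1, n_critic_updates + 1) if k % policy_delay == 0]


target = soft_update([0.0, 0.0], [1.0, 2.0], tau=0.1)
steps = actor_update_steps(6, policy_delay=2)
```

A small tau makes the target networks track the online networks slowly, which tends to stabilize the bootstrapped critic targets.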

_next_q
(next_state, absorbing)[source]¶ Parameters:  next_state (np.ndarray) – the states where next action has to be evaluated;
 absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns: Action-values returned by the critic for next_state and the action returned by the actor.
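
A minimal sketch of how next-state action-values and absorbing flags are commonly combined into a bootstrapped critic target is given below. It assumes that the value of an absorbing state is masked to zero; the function `td_targets` and the discount factor `gamma` are hypothetical and not taken from the library.

```python
def td_targets(rewards, q_next, absorbing, gamma):
    """Bootstrapped targets r + gamma * Q(s', a'), with Q masked at absorbing states."""
    return [
        r + gamma * (0.0 if done else q)
        for r, q, done in zip(rewards, q_next, absorbing)
    ]


targets = td_targets([1.0, 1.0], [10.0, 10.0], [False, True], gamma=0.9)
```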

_optimize_actor_parameters
(loss)¶ Method used to update actor parameters to maximize a given loss.
Parameters: loss (torch.tensor) – the loss computed by the algorithm.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

stop
()¶ Method used to stop an agent. Useful when dealing with real world environments, simulators, or to cleanup environments internals after a core learn/evaluate to enforce consistency.


class
mushroom.algorithms.actor_critic.deep_actor_critic.
TD3
(mdp_info, policy_class, policy_params, batch_size, initial_replay_size, max_replay_size, tau, critic_params, actor_params, actor_optimizer, policy_delay=2, noise_std=0.2, noise_clip=0.5, critic_fit_params=None)[source]¶ Bases:
mushroom.algorithms.actor_critic.deep_actor_critic.ddpg.DDPG
Twin Delayed DDPG algorithm. “Addressing Function Approximation Error in Actor-Critic Methods”. Fujimoto S. et al.. 2018.

__init__
(mdp_info, policy_class, policy_params, batch_size, initial_replay_size, max_replay_size, tau, critic_params, actor_params, actor_optimizer, policy_delay=2, noise_std=0.2, noise_clip=0.5, critic_fit_params=None)[source]¶ Constructor.
Parameters:  policy_class (Policy) – class of the policy;
 policy_params (dict) – parameters of the policy to build;
 batch_size (int) – the number of samples in a batch;
 initial_replay_size (int) – the number of samples to collect before starting the learning;
 max_replay_size (int) – the maximum number of samples in the replay memory;
 tau (float) – value of coefficient for soft updates;
 critic_params (dict) – parameters of the critic approximator to build;
 actor_params (dict) – parameters of the actor approximator to build;
 actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
 policy_delay (int, 2) – the number of critic updates after which an actor update is performed;
 noise_std (float, 0.2) – standard deviation of the noise used for policy smoothing;
 noise_clip (float, 0.5) – maximum absolute value for policy smoothing noise;
 critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
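
The interplay of noise_std and noise_clip in policy smoothing can be sketched for a scalar action as follows: Gaussian noise is sampled, clipped to the interval [-noise_clip, noise_clip], added to the target action, and the result is kept inside the action bounds. The function name, the bounds `low` and `high`, and the `rng` argument are hypothetical and introduced only for this example.

```python
import random


def smoothed_target_action(action, noise_std=0.2, noise_clip=0.5,
                           low=-1.0, high=1.0, rng=random):
    """Add clipped Gaussian noise to a scalar action and respect the action bounds."""
    noise = rng.gauss(0.0, noise_std)
    noise = max(-noise_clip, min(noise_clip, noise))  # bound the perturbation
    return max(low, min(high, action + noise))        # stay inside the action space


rng = random.Random(0)
samples = [smoothed_target_action(0.0, rng=rng) for _ in range(5)]
```

Clipping the noise keeps the smoothed action close to the one proposed by the target actor, so the critic target is averaged over a small neighbourhood only.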

_next_q
(next_state, absorbing)[source]¶ Parameters:  next_state (np.ndarray) – the states where next action has to be evaluated;
 absorbing (np.ndarray) – the absorbing flag for the states in next_state.
Returns: Action-values returned by the critic for next_state and the action returned by the actor.

_optimize_actor_parameters
(loss)¶ Method used to update actor parameters to maximize a given loss.
Parameters: loss (torch.tensor) – the loss computed by the algorithm.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

stop
()¶ Method used to stop an agent. Useful when dealing with real world environments, simulators, or to cleanup environments internals after a core learn/evaluate to enforce consistency.


class
mushroom.algorithms.actor_critic.deep_actor_critic.
SAC
(mdp_info, batch_size, initial_replay_size, max_replay_size, warmup_transitions, tau, lr_alpha, actor_mu_params, actor_sigma_params, actor_optimizer, critic_params, target_entropy=None, critic_fit_params=None)[source]¶ Bases:
mushroom.algorithms.actor_critic.deep_actor_critic.deep_actor_critic.DeepAC
Soft Actor-Critic algorithm. “Soft Actor-Critic Algorithms and Applications”. Haarnoja T. et al.. 2019.

__init__
(mdp_info, batch_size, initial_replay_size, max_replay_size, warmup_transitions, tau, lr_alpha, actor_mu_params, actor_sigma_params, actor_optimizer, critic_params, target_entropy=None, critic_fit_params=None)[source]¶ Constructor.
Parameters:  batch_size (int) – the number of samples in a batch;
 initial_replay_size (int) – the number of samples to collect before starting the learning;
 max_replay_size (int) – the maximum number of samples in the replay memory;
 warmup_transitions (int) – number of samples to accumulate in the replay memory to start the policy fitting;
 tau (float) – value of coefficient for soft updates;
 lr_alpha (float) – learning rate for the entropy coefficient;
 actor_mu_params (dict) – parameters of the actor mean approximator to build;
 actor_sigma_params (dict) – parameters of the actor sigma approximator to build;
 actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
 critic_params (dict) – parameters of the critic approximator to build;
 target_entropy (float, None) – target entropy for the policy; if None, a default value is computed;
 critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
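
The text above does not state how the default target entropy is computed. A common heuristic, used in the Soft Actor-Critic paper, is the negative of the action dimensionality, and the sketch below assumes it. It also shows one plain gradient step on the logarithm of the entropy coefficient with learning rate lr_alpha. Both helpers, and the use of a plain gradient step instead of a full optimizer, are hypothetical simplifications.

```python
def resolve_target_entropy(target_entropy, action_dim):
    """Use the given target entropy, or fall back to -action_dim (assumed heuristic)."""
    if target_entropy is None:
        return -float(action_dim)
    return float(target_entropy)


def update_log_alpha(log_alpha, log_probs, target_entropy, lr_alpha):
    """One gradient step on the loss -log_alpha * (mean(log_probs) + target_entropy)."""
    mean_log_prob = sum(log_probs) / len(log_probs)
    grad = -(mean_log_prob + target_entropy)  # derivative w.r.t. log_alpha
    return log_alpha - lr_alpha * grad


target = resolve_target_entropy(None, action_dim=2)
log_alpha = update_log_alpha(0.0, [3.0, 3.0], target, lr_alpha=0.1)
```

When the policy entropy (the negated mean log-probability) falls below the target, the coefficient grows and the entropy bonus weighs more; when the entropy is above the target, the coefficient shrinks.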

_optimize_actor_parameters
(loss)¶ Method used to update actor parameters to maximize a given loss.
Parameters: loss (torch.tensor) – the loss computed by the algorithm.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

stop
()¶ Method used to stop an agent. Useful when dealing with real world environments, simulators, or to cleanup environments internals after a core learn/evaluate to enforce consistency.


class
mushroom.algorithms.actor_critic.deep_actor_critic.
TRPO
(mdp_info, policy, critic_params, ent_coeff=0.0, max_kl=0.001, lam=1.0, n_epochs_line_search=10, n_epochs_cg=10, cg_damping=0.01, cg_residual_tol=1e-10, quiet=True, critic_fit_params=None) Bases:
mushroom.algorithms.agent.Agent
Trust Region Policy Optimization algorithm. “Trust Region Policy Optimization”. Schulman J. et al.. 2015.

__init__
(mdp_info, policy, critic_params, ent_coeff=0.0, max_kl=0.001, lam=1.0, n_epochs_line_search=10, n_epochs_cg=10, cg_damping=0.01, cg_residual_tol=1e-10, quiet=True, critic_fit_params=None) Constructor.
Parameters:  policy (TorchPolicy) – torch policy to be learned by the algorithm;
 critic_params (dict) – parameters of the critic approximator to build;
 ent_coeff (float, 0.0) – coefficient for the entropy penalty;
 max_kl (float, 0.001) – maximum KL divergence allowed for every policy update;
 lam (float, 1.0) – lambda coefficient used by generalized advantage estimation;
 n_epochs_line_search (int, 10) – maximum number of iterations of the line search algorithm;
 n_epochs_cg (int, 10) – maximum number of iterations of the conjugate gradient algorithm;
 cg_damping (float, 1e-2) – damping factor for the conjugate gradient algorithm;
 cg_residual_tol (float, 1e-10) – conjugate gradient residual tolerance;
 quiet (bool, True) – if True, the algorithm will not print debug information;
 critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
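
The three conjugate gradient parameters can be read against a generic conjugate gradient solver, sketched below in pure Python. It approximately solves (A + cg_damping * I) x = b using only matrix-vector products, runs for at most n_epochs_cg iterations, and stops early once the squared residual norm drops below cg_residual_tol. How the library applies the damping and the stopping test is an assumption. The function and the small test matrix are hypothetical.

```python
def conjugate_gradient(matvec, b, n_epochs_cg=10, cg_damping=1e-2, cg_residual_tol=1e-10):
    """Approximately solve (A + cg_damping * I) x = b, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - (A + damping * I) x, with x = 0
    p = list(b)  # current search direction
    r_dot_r = sum(ri * ri for ri in r)
    for _ in range(n_epochs_cg):
        if r_dot_r < cg_residual_tol:
            break  # residual small enough
        Ap = [api + cg_damping * pi for api, pi in zip(matvec(p), p)]
        alpha = r_dot_r / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        new_r_dot_r = sum(ri * ri for ri in r)
        beta = new_r_dot_r / r_dot_r
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        r_dot_r = new_r_dot_r
    return x


A = [[4.0, 1.0], [1.0, 3.0]]  # small symmetric positive definite example


def matvec(v):
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]


x = conjugate_gradient(matvec, [1.0, 2.0], cg_damping=0.0)
```

In trust region methods the matrix is typically a Fisher information matrix that is never formed explicitly, which is why the solver only needs a matrix-vector product.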

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

stop
()¶ Method used to stop an agent. Useful when dealing with real world environments, simulators, or to cleanup environments internals after a core learn/evaluate to enforce consistency.


class
mushroom.algorithms.actor_critic.deep_actor_critic.
PPO
(mdp_info, policy, critic_params, actor_optimizer, n_epochs_policy, batch_size, eps_ppo, lam, quiet=True, critic_fit_params=None)[source]¶ Bases:
mushroom.algorithms.agent.Agent
Proximal Policy Optimization algorithm. “Proximal Policy Optimization Algorithms”. Schulman J. et al.. 2017.

__init__
(mdp_info, policy, critic_params, actor_optimizer, n_epochs_policy, batch_size, eps_ppo, lam, quiet=True, critic_fit_params=None)[source]¶ Constructor.
Parameters:  policy (TorchPolicy) – torch policy to be learned by the algorithm;
 critic_params (dict) – parameters of the critic approximator to build;
 actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
 n_epochs_policy (int) – number of policy updates for every dataset;
 batch_size (int) – size of minibatches for every optimization step;
 eps_ppo (float) – value for probability ratio clipping;
 lam (float) – lambda coefficient used by generalized advantage estimation;
 quiet (bool, True) – if True, the algorithm will not print debug information;
 critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.
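
As a hedged illustration of eps_ppo and lam, the sketch below computes generalized advantage estimates for a single trajectory and evaluates the clipped surrogate objective for one sample. The function names, the discount factor `gamma` and the layout of `values` (one entry per state plus a final bootstrap value) are assumptions made for this example and do not describe the library's internals.

```python
def gae(rewards, values, gamma, lam):
    """Generalized advantage estimation for one trajectory.

    values holds len(rewards) + 1 entries; the last one is the bootstrap
    value of the final state (0.0 if that state is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running                 # discounted sum of TD errors
        advantages[t] = running
    return advantages


def clipped_surrogate(ratio, advantage, eps_ppo):
    """Per-sample clipped objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps_ppo, min(1.0 + eps_ppo, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


advantages = gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
objective = clipped_surrogate(ratio=1.5, advantage=1.0, eps_ppo=0.2)
```

With lam equal to 0 the advantage reduces to the one-step TD error, while lam equal to 1 gives the full Monte Carlo return minus the value baseline. The clipping removes the incentive to move the probability ratio outside the interval [1 - eps_ppo, 1 + eps_ppo].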

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

stop
()¶ Method used to stop an agent. Useful when dealing with real world environments, simulators, or to cleanup environments internals after a core learn/evaluate to enforce consistency.
