Actor-Critic

Classical Actor-Critic Methods

class COPDAC_Q(*args, **kwargs)[source]

Bases: Agent

Compatible off-policy deterministic actor-critic algorithm. “Deterministic Policy Gradient Algorithms” Silver D. et al. 2014.

__init__(mdp_info, policy, mu, alpha_theta, alpha_omega, alpha_v, value_function_features=None)[source]

Constructor.

Parameters:

mu (Regressor) – regressor that describe the deterministic policy to be learned i.e., the deterministic mapping between state and action.
alpha_theta ([float, Parameter]) – learning rate for policy update;
alpha_omega ([float, Parameter]) – learning rate for the advantage function;
alpha_v ([float, Parameter]) – learning rate for the value function;
value_function_features (Features, None) – features used by the value function approximator;

class StochasticAC(*args, **kwargs)[source]

Bases: Agent

Stochastic Actor critic in the episodic setting as presented in: “Model-Free Reinforcement Learning with Continuous Action in Practice” Degris T. et al. 2012.

__init__(mdp_info, policy, alpha_theta, alpha_v, lambda_par=0.9, value_function_features=None)[source]

Constructor.

Parameters:

alpha_theta ([float, Parameter]) – learning rate for policy update;
alpha_v ([float, Parameter]) – learning rate for the value function;
lambda_par ([float, Parameter], .9) – trace decay parameter;
value_function_features (Features, None) – features used by the value function approximator.

episode_start(initial_state, episode_info)[source]

Called by the Core when a new episode starts.

Parameters:

initial_state (Array) – vector representing the initial state of the environment.
episode_info (dict) – a dictionary containing the information at reset, such as context.

Returns:

A tuple containing the policy initial state and, optionally, the policy parameters

class StochasticAC_AVG(*args, **kwargs)[source]

Bases: StochasticAC

Stochastic Actor critic in the average reward setting as presented in: “Model-Free Reinforcement Learning with Continuous Action in Practice”. Degris T. et al.. 2012.

__init__(mdp_info, policy, alpha_theta, alpha_v, alpha_r, lambda_par=0.9, value_function_features=None)[source]

Constructor.

Parameters:: alpha_r (Parameter) – learning rate for the reward trace.

Deep Actor-Critic Methods

class OnPolicyDeepAC(*args, **kwargs)[source]: Bases: Agent

class DeepAC(*args, **kwargs)[source]

Bases: Agent

Base class for off policy deep actor-critic algorithms. These algorithms use the reparametrization trick, such as SAC, DDPG and TD3.

__init__(mdp_info, policy, actor_optimizer, parameters, backend='torch')[source]

Constructor.

Parameters:

actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
parameters (list) – policy parameters to be optimized.

_optimize_actor_parameters(loss)[source]

Method used to update actor parameters to maximize a given loss.

Parameters:: loss (torch.tensor) – the loss computed by the algorithm.

class A2C(*args, **kwargs)[source]

Bases: DeepAC

Advantage Actor Critic algorithm (A2C). Synchronous version of the A3C algorithm. “Asynchronous Methods for Deep Reinforcement Learning”. Mnih V. et al. 2016.

__init__(mdp_info, policy, actor_optimizer, critic_params, ent_coeff, max_grad_norm=None, critic_fit_params=None)[source]

Constructor.

Parameters:

policy (TorchPolicy) – torch policy to be learned by the algorithm;
actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
critic_params (dict) – parameters of the critic approximator to build;
ent_coeff ([float, Parameter], 0) – coefficient for the entropy penalty;
max_grad_norm (float, None) – maximum norm for gradient clipping. If None, no clipping will be performed, unless specified otherwise in actor_optimizer;
critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.

class DDPG(*args, **kwargs)[source]

Bases: DeepAC

Deep Deterministic Policy Gradient algorithm. “Continuous Control with Deep Reinforcement Learning”. Lillicrap T. P. et al. 2016.

__init__(mdp_info, policy_class, policy_params, actor_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, tau, policy_delay=1, critic_fit_params=None, actor_predict_params=None, critic_predict_params=None)[source]

Constructor.

Parameters:

policy_class (Policy) – class of the policy;
policy_params (dict) – parameters of the policy to build;
actor_params (dict) – parameters of the actor approximator to build;
actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
critic_params (dict) – parameters of the critic approximator to build;
batch_size ([int, Parameter]) – the number of samples in a batch;
initial_replay_size (int) – the number of samples to collect before starting the learning;
max_replay_size (int) – the maximum number of samples in the replay memory;
tau ((float, Parameter)) – value of coefficient for soft updates;
policy_delay ([int, Parameter], 1) – the number of updates of the critic after which an actor update is implemented;
critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator;
actor_predict_params (dict, None) – parameters for the prediction with the actor approximator;
critic_predict_params (dict, None) – parameters for the prediction with the critic approximator.

_next_q(next_state, absorbing)[source]

Parameters:

next_state (torch.Tensor) – the states where next action has to be evaluated;
absorbing (torch.Tensor) – the absorbing flag for the states in next_state.

Returns:

Action-values returned by the critic for next_state and the action returned by the actor.

class TD3(*args, **kwargs)[source]

Bases: DDPG

Twin Delayed DDPG algorithm. “Addressing Function Approximation Error in Actor-Critic Methods”. Fujimoto S. et al. 2018.

__init__(mdp_info, policy_class, policy_params, actor_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, tau, policy_delay=2, noise_std=0.2, noise_clip=0.5, critic_fit_params=None)[source]

Constructor.

Parameters:

policy_class (Policy) – class of the policy;
policy_params (dict) – parameters of the policy to build;
actor_params (dict) – parameters of the actor approximator to build;
actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
critic_params (dict) – parameters of the critic approximator to build;
batch_size ([int, Parameter]) – the number of samples in a batch;
initial_replay_size (int) – the number of samples to collect before starting the learning;
max_replay_size (int) – the maximum number of samples in the replay memory;
tau ([float, Parameter]) – value of coefficient for soft updates;
policy_delay ([int, Parameter], 2) – the number of updates of the critic after which an actor update is implemented;
noise_std ([float, Parameter], .2) – standard deviation of the noise used for policy smoothing;
noise_clip ([float, Parameter], .5) – maximum absolute value for policy smoothing noise;
critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.

_next_q(next_state, absorbing)[source]

Parameters:

next_state (torch.Tensor) – the states where next action has to be evaluated;
absorbing (torch.Tensor) – the absorbing flag for the states in next_state.

Returns:

Action-values returned by the critic for next_state and the action returned by the actor.

class SAC(*args, **kwargs)[source]

Bases: DeepAC

Soft Actor-Critic algorithm. “Soft Actor-Critic Algorithms and Applications”. Haarnoja T. et al. 2019.

__init__(mdp_info, actor_mu_params, actor_sigma_params, actor_optimizer, critic_params, batch_size, initial_replay_size, max_replay_size, warmup_transitions, tau, lr_alpha, use_log_alpha_loss=False, log_std_min=-20, log_std_max=2, target_entropy=None, critic_fit_params=None)[source]

Constructor.

Parameters:

actor_mu_params (dict) – parameters of the actor mean approximator to build;
actor_sigma_params (dict) – parameters of the actor sigma approximator to build;
actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
critic_params (dict) – parameters of the critic approximator to build;
batch_size ((int, Parameter)) – the number of samples in a batch;
initial_replay_size (int) – the number of samples to collect before starting the learning;
max_replay_size (int) – the maximum number of samples in the replay memory;
warmup_transitions ([int, Parameter]) – number of samples to accumulate in the replay memory to start the policy fitting;
tau ([float, Parameter]) – value of coefficient for soft updates;
lr_alpha ([float, Parameter]) – Learning rate for the entropy coefficient;
use_log_alpha_loss (bool, False) – whether to use the original implementation loss or the one from the paper;
log_std_min ([float, Parameter]) – Min value for the policy log std;
log_std_max ([float, Parameter]) – Max value for the policy log std;
target_entropy (float, None) – target entropy for the policy, if None a default value is computed;
critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.

_next_q(next_state, absorbing)[source]

Parameters:

next_state (torch.Tensor) – the states where next action has to be evaluated;
absorbing (torch.Tensor) – the absorbing flag for the states in next_state.

Returns:

Action-values returned by the critic for next_state and the action returned by the actor.

class TRPO(*args, **kwargs)[source]

Bases: OnPolicyDeepAC

Trust Region Policy optimization algorithm. “Trust Region Policy Optimization”. Schulman J. et al. 2015.

__init__(mdp_info, policy, critic_params, ent_coeff=0.0, max_kl=0.001, lam=1.0, n_epochs_line_search=10, n_epochs_cg=10, cg_damping=0.01, cg_residual_tol=1e-10, critic_fit_params=None, backend='torch')[source]

Constructor.

Parameters:

policy (TorchPolicy) – torch policy to be learned by the algorithm
critic_params (dict) – parameters of the critic approximator to build;
ent_coeff ([float, Parameter], 0) – coefficient for the entropy penalty;
max_kl ([float, Parameter], .001) – maximum kl allowed for every policy update;
float (lam) – lambda coefficient used by generalized advantage estimation;
n_epochs_line_search ([int, Parameter], 10) – maximum number of iterations of the line search algorithm;
n_epochs_cg ([int, Parameter], 10) – maximum number of iterations of the conjugate gradient algorithm;
cg_damping ([float, Parameter], 1e-2) – damping factor for the conjugate gradient algorithm;
cg_residual_tol ([float, Parameter], 1e-10) – conjugate gradient residual tolerance;
critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.

class PPO(*args, **kwargs)[source]

Bases: OnPolicyDeepAC

Proximal Policy Optimization algorithm. “Proximal Policy Optimization Algorithms”. Schulman J. et al. 2017.

__init__(mdp_info, policy, actor_optimizer, critic_params, n_epochs_policy, batch_size, eps_ppo, lam, ent_coeff=0.0, critic_fit_params=None)[source]

Constructor.

Parameters:

policy (TorchPolicy) – torch policy to be learned by the algorithm
actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
critic_params (dict) – parameters of the critic approximator to build;
n_epochs_policy ([int, Parameter]) – number of policy updates for every dataset;
batch_size ([int, Parameter]) – size of minibatches for every optimization step
eps_ppo ([float, Parameter]) – value for probability ratio clipping;
lam ([float, Parameter], 1.) – lambda coefficient used by generalized advantage estimation;
ent_coeff ([float, Parameter], 1.) – coefficient for the entropy regularization term;
critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.

class PPO_BPTT(*args, **kwargs)[source]

Bases: OnPolicyDeepAC

Backpropagation trough time extension of the Proximal Policy Optimization algorithm. “Proximal Policy Optimization Algorithms”. Schulman J. et al. 2017.

__init__(mdp_info, policy, actor_optimizer, critic_params, n_epochs_policy, batch_size, eps_ppo, lam, dim_env_state, ent_coeff=0.0, critic_fit_params=None, truncation_length=5, history_length=1, action_history_length=0)[source]

Constructor.

Parameters:

policy (TorchPolicy) – torch policy to be learned by the algorithm
actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
critic_params (dict) – parameters of the critic approximator to build;
n_epochs_policy ([int, Parameter]) – number of policy updates for every dataset;
batch_size ([int, Parameter]) – size of minibatches for every optimization step
eps_ppo ([float, Parameter]) – value for probability ratio clipping;
lam ([float, Parameter], 1.) – lambda coefficient used by generalized advantage estimation;
ent_coeff ([float, Parameter], 1.) – coefficient for the entropy regularization term;
critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator;
truncation_length (int, 5) – truncation length of the backpropagation through time;
history_length (int, 1) – number of observations stacked as input at each timestep; when greater than 1 the agent stacks the observation history online and the same window is rebuilt for each sequence timestep during the fit, so each sequence entry becomes a (history_length, *obs_shape) stack;
action_history_length (int, 0) – number of previous actions fed to the policy at each timestep; when greater than 0 the agent assembles the previous-action window online and the same window is rebuilt for each sequence timestep during the fit.

static compute_gae(V, s, pi_h, ss, pi_hn, lengths, r, absorbing, last, gamma, lam, action_history=None, next_action_history=None)[source]

Function to compute Generalized Advantage Estimation (GAE) and new value function target over a dataset.

“High-Dimensional Continuous Control Using Generalized Advantage Estimation”. Schulman J. et al.. 2016.

Parameters:

V (Regressor) – the current value function regressor;
s (torch.Tensor) – the set of states in which we want to evaluate the advantage;
ss (torch.Tensor) – the set of next states in which we want to evaluate the advantage;
r (torch.Tensor) – the reward obtained in each transition from state s to state ss;
absorbing (torch.Tensor) – an array of boolean flags indicating if the reached state is absorbing;
last (torch.Tensor) – an array of boolean flags indicating if the reached state is the last of the trajectory;
gamma (float) – the discount factor of the considered problem;
lam (float) – the value for the lamba coefficient used by GEA algorithm.

Returns:

The new estimate for the value function of the next state and the estimated generalized advantage.

class RudinPPO(*args, **kwargs)[source]

Bases: PPO

Extended PPO algorithm Introducing gradient clipping and adaptive learning rate based on KL divergence. “Learning to walk in minutes using massively parallel deep reinforcement learning” Rudin N. et al. 2022.

__init__(mdp_info, policy, actor_optimizer, critic_params, n_epochs_policy, batch_size, eps_ppo, lam, ent_coeff=0.0, critic_fit_params=None, clip_grad_norm=1.0, schedule='adaptive', desired_kl=0.01)[source]

Constructor.

Parameters:

policy (TorchPolicy) – torch policy to be learned by the algorithm
actor_optimizer (dict) – parameters to specify the actor optimizer algorithm;
critic_params (dict) – parameters of the critic approximator to build;
n_epochs_policy ([int, Parameter]) – number of policy updates for every dataset;
batch_size ([int, Parameter]) – size of minibatches for every optimization step
eps_ppo ([float, Parameter]) – value for probability ratio clipping;
lam ([float, Parameter], 1.) – lambda coefficient used by generalized advantage estimation;
ent_coeff ([float, Parameter], 1.) – coefficient for the entropy regularization term;
critic_fit_params (dict, None) – parameters of the fitting algorithm of the critic approximator.