Value-Based

Value-based algorithms learn a value function. Because the policy is derived from that value function instead of being learned explicitly, they are also called critic-only methods.

TD

These are classical temporal-difference (TD) algorithms for discrete actions. They cover both tabular methods and function approximation.

class SARSA(*args, **kwargs)[source]

Bases: TD

SARSA algorithm.

__init__(mdp_info, policy, learning_rate)[source]

Constructor.

Parameters:

approximator – the approximator to use to fit the Q-function;
learning_rate (Parameter) – the learning rate.

_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:

state (np.ndarray) – state;
action (np.ndarray) – action;
reward (np.ndarray) – reward;
next_state (np.ndarray) – next state;
absorbing (np.ndarray) – absorbing flag.
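
As a rough illustration of the kind of update this method performs, the textbook SARSA rule can be sketched with plain NumPy. This is not the library's code: the helper name `sarsa_update`, the explicit `next_action` argument, and the scalar `alpha` and `gamma` values are assumptions made for the example.

```python
import numpy as np

def sarsa_update(q, state, action, reward, next_state, next_action,
                 absorbing, alpha=0.1, gamma=0.9):
    """One textbook SARSA update, applied in place (hypothetical helper)."""
    # No bootstrapping from an absorbing (terminal) next state.
    q_next = 0.0 if absorbing else q[next_state, next_action]
    td_error = reward + gamma * q_next - q[state, action]
    q[state, action] += alpha * td_error
    return q

# Toy Q-table with 3 states and 2 actions.
q = np.zeros((3, 2))
q[1, 0] = 1.0
sarsa_update(q, state=0, action=1, reward=1.0,
             next_state=1, next_action=0, absorbing=False)
```

The target uses the action actually chosen in the next state, which is what makes the rule on-policy.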

class SARSALambda(*args, **kwargs)[source]

Bases: TD

The SARSA(lambda) algorithm for finite MDPs.

__init__(mdp_info, policy, learning_rate, lambda_coeff, trace='replacing')[source]

Constructor.

Parameters:

lambda_coeff ([float, Parameter]) – eligibility trace coefficient;
trace (str, 'replacing') – type of eligibility trace to use.

_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:

state (np.ndarray) – state;
action (np.ndarray) – action;
reward (np.ndarray) – reward;
next_state (np.ndarray) – next state;
absorbing (np.ndarray) – absorbing flag.

episode_start(initial_state, episode_info)[source]

Called by the Core when a new episode starts.

Parameters:

initial_state (Array) – vector representing the initial state of the environment.
episode_info (dict) – a dictionary containing the information at reset, such as context.

Returns:

A tuple containing the policy initial state and, optionally, the policy parameters
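
The role of `lambda_coeff` and `trace` can be illustrated with a textbook SARSA(lambda) step. This is a minimal sketch under stated assumptions: the helper name, the `"accumulating"` alternative to `"replacing"`, the scalar `alpha` and `gamma`, and the choice to clear the trace when an episode starts are illustrative and not taken from the library source.

```python
import numpy as np

def sarsa_lambda_step(q, e, s, a, r, s_next, a_next, absorbing,
                      alpha=0.5, gamma=0.9, lam=0.8, trace="replacing"):
    """One textbook SARSA(lambda) step, in place (hypothetical helper)."""
    q_next = 0.0 if absorbing else q[s_next, a_next]
    delta = r + gamma * q_next - q[s, a]
    if trace == "replacing":
        e[s, a] = 1.0        # reset the eligibility of the visited pair
    elif trace == "accumulating":
        e[s, a] += 1.0       # let repeated visits add up
    else:
        raise ValueError(f"unknown trace type: {trace}")
    q += alpha * delta * e   # every eligible pair shares the TD error
    e *= gamma * lam         # decay all eligibilities

q = np.zeros((2, 2))
e = np.zeros((2, 2))         # cleared at the start of each episode
sarsa_lambda_step(q, e, 0, 0, 0.0, 1, 0, absorbing=False)
sarsa_lambda_step(q, e, 1, 0, 1.0, 1, 0, absorbing=True)
```

After the second step the reward is propagated back to the first state-action pair in proportion to its decayed eligibility.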

class ExpectedSARSA(*args, **kwargs)[source]

Bases: TD

Expected SARSA algorithm. “A theoretical and empirical analysis of Expected Sarsa” Seijen H. V. et al. 2009.

__init__(mdp_info, policy, learning_rate)[source]

Constructor.

Parameters:

approximator – the approximator to use to fit the Q-function;
learning_rate (Parameter) – the learning rate.

_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:

state (np.ndarray) – state;
action (np.ndarray) – action;
reward (np.ndarray) – reward;
next_state (np.ndarray) – next state;
absorbing (np.ndarray) – absorbing flag.
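
Expected SARSA replaces the sampled next action value with its expectation under the policy. The sketch below assumes an epsilon-greedy policy; the helper names and the scalar `alpha`, `gamma`, and `epsilon` are hypothetical and only serve the illustration.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of an epsilon-greedy policy (assumed policy)."""
    probs = np.full(len(q_row), epsilon / len(q_row))
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

def expected_sarsa_update(q, s, a, r, s_next, absorbing,
                          epsilon=0.1, alpha=0.1, gamma=0.9):
    """One textbook Expected SARSA update, in place (hypothetical helper)."""
    if absorbing:
        q_next = 0.0
    else:
        # Expectation of Q(s', .) under the policy instead of a sample.
        q_next = float(epsilon_greedy_probs(q[s_next], epsilon) @ q[s_next])
    q[s, a] += alpha * (r + gamma * q_next - q[s, a])
    return q

q = np.zeros((2, 2))
q[1] = [1.0, 3.0]
expected_sarsa_update(q, 0, 0, 0.0, 1, False,
                      epsilon=0.2, alpha=0.5, gamma=1.0)
```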

class QLearning(*args, **kwargs)[source]

Bases: TD

Q-Learning algorithm. “Learning from Delayed Rewards”. Watkins C.J.C.H. 1989.

__init__(mdp_info, policy, learning_rate)[source]

Constructor.

Parameters:

approximator – the approximator to use to fit the Q-function;
learning_rate (Parameter) – the learning rate.

_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:

state (np.ndarray) – state;
action (np.ndarray) – action;
reward (np.ndarray) – reward;
next_state (np.ndarray) – next state;
absorbing (np.ndarray) – absorbing flag.
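
For comparison with the on-policy variants, the textbook Q-Learning rule bootstraps from the greedy value of the next state. The following is a sketch only; `q_learning_update` and its scalar `alpha` and `gamma` arguments are assumptions, not the library's signature.

```python
import numpy as np

def q_learning_update(q, state, action, reward, next_state, absorbing,
                      alpha=0.1, gamma=0.9):
    """One textbook Q-Learning update, in place (hypothetical helper)."""
    # Off-policy target: greedy value of the next state, 0 if terminal.
    q_next = 0.0 if absorbing else np.max(q[next_state])
    q[state, action] += alpha * (reward + gamma * q_next - q[state, action])
    return q

q = np.zeros((2, 2))
q[1] = [0.5, 2.0]
q_learning_update(q, 0, 0, 1.0, 1, False, alpha=0.5, gamma=0.5)
```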

class QLambda(*args, **kwargs)[source]

Bases: TD

Q(Lambda) algorithm. “Learning from Delayed Rewards”. Watkins C.J.C.H. 1989.

__init__(mdp_info, policy, learning_rate, lambda_coeff, trace='replacing')[source]

Constructor.

Parameters:

lambda_coeff ([float, Parameter]) – eligibility trace coefficient;
trace (str, 'replacing') – type of eligibility trace to use.

_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:

state (np.ndarray) – state;
action (np.ndarray) – action;
reward (np.ndarray) – reward;
next_state (np.ndarray) – next state;
absorbing (np.ndarray) – absorbing flag.

episode_start(initial_state, episode_info)[source]

Called by the Core when a new episode starts.

Parameters:

initial_state (Array) – vector representing the initial state of the environment.
episode_info (dict) – a dictionary containing the information at reset, such as context.

Returns:

A tuple containing the policy initial state and, optionally, the policy parameters

class DoubleQLearning(*args, **kwargs)[source]

Bases: TD

Double Q-Learning algorithm. “Double Q-Learning”. Hasselt H. V. 2010.

__init__(mdp_info, policy, learning_rate)[source]

Constructor.

Parameters:

approximator – the approximator to use to fit the Q-function;
learning_rate (Parameter) – the learning rate.

_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:

state (np.ndarray) – state;
action (np.ndarray) – action;
reward (np.ndarray) – reward;
next_state (np.ndarray) – next state;
absorbing (np.ndarray) – absorbing flag.
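
The idea behind Double Q-Learning is to decouple action selection from action evaluation by keeping two tables. A minimal sketch of the textbook rule follows; the helper name, the coin flip through `rng`, and the scalar `alpha` and `gamma` are assumptions made for the example.

```python
import numpy as np

def double_q_update(q_a, q_b, s, a, r, s_next, absorbing, rng,
                    alpha=0.5, gamma=0.9):
    """One textbook Double Q-Learning update, in place (hypothetical helper)."""
    # Pick at random which table is updated; the other one evaluates.
    if rng.random() < 0.5:
        q_upd, q_eval = q_a, q_b
    else:
        q_upd, q_eval = q_b, q_a
    if absorbing:
        q_next = 0.0
    else:
        best = int(np.argmax(q_upd[s_next]))  # selected by the updated table
        q_next = q_eval[s_next, best]         # valued by the other table
    q_upd[s, a] += alpha * (r + gamma * q_next - q_upd[s, a])

q_a = np.zeros((2, 2))
q_b = np.zeros((2, 2))
q_a[1] = [0.0, 1.0]
q_b[1] = [2.0, 0.0]
double_q_update(q_a, q_b, 0, 0, 1.0, 1, False, np.random.default_rng(0))
```

In this toy setup the two tables disagree about the best next action, so the cross-evaluated bootstrap value is 0 whichever table is updated, whereas a plain max over either table would be positive.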

class SpeedyQLearning(*args, **kwargs)[source]

Bases: TD

Speedy Q-Learning algorithm. “Speedy Q-Learning” Ghavamzadeh et al. 2011.

__init__(mdp_info, policy, learning_rate)[source]

Constructor.

Parameters:

approximator – the approximator to use to fit the Q-function;
learning_rate (Parameter) – the learning rate.

_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:

state (np.ndarray) – state;
action (np.ndarray) – action;
reward (np.ndarray) – reward;
next_state (np.ndarray) – next state;
absorbing (np.ndarray) – absorbing flag.

class RLearning(*args, **kwargs)[source]

Bases: TD

R-Learning algorithm. “A Reinforcement Learning Method for Maximizing Undiscounted Rewards”. Schwartz A. 1993.

__init__(mdp_info, policy, learning_rate, beta)[source]

Constructor.

Parameters:

beta ([float, Parameter]) – beta coefficient.

_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:

state (np.ndarray) – state;
action (np.ndarray) – action;
reward (np.ndarray) – reward;
next_state (np.ndarray) – next state;
absorbing (np.ndarray) – absorbing flag.
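
R-Learning targets the average reward, so the update tracks an average-reward estimate alongside the Q-table. The sketch below follows the textbook form of the rule; the helper name, the explicit `rho` argument, and the scalar `alpha` and `beta` are assumptions and may differ from how the library stores these quantities.

```python
import numpy as np

def r_learning_update(q, rho, state, action, reward, next_state, absorbing,
                      alpha=0.1, beta=0.01):
    """One textbook R-Learning update; returns the new rho (hypothetical helper)."""
    q_next = 0.0 if absorbing else q[next_state].max()
    # Undiscounted TD error relative to the average reward estimate rho.
    q[state, action] += alpha * (reward - rho + q_next - q[state, action])
    # rho is adjusted only when the action taken is greedy in `state`.
    q_max = q[state].max()
    if q[state, action] == q_max:
        rho += beta * (reward - rho + q_next - q_max)
    return rho

q = np.zeros((2, 2))
rho = r_learning_update(q, 0.0, 0, 0, 1.0, 1, False, alpha=0.5, beta=0.1)
```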

class WeightedQLearning(*args, **kwargs)[source]

Bases: TD

Weighted Q-Learning algorithm. “Estimating the Maximum Expected Value through Gaussian Approximation” D’Eramo C. et al. 2016.

__init__(mdp_info, policy, learning_rate, sampling=True, precision=1000)[source]

Constructor.

Parameters:

sampling (bool, True) – use the approximated version to speed up the computation;
precision (int, 1000) – number of samples to use in the approximated version.

_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:

state (np.ndarray) – state;
action (np.ndarray) – action;
reward (np.ndarray) – reward;
next_state (np.ndarray) – next state;
absorbing (np.ndarray) – absorbing flag.

_next_q(next_state)[source]

Parameters:

next_state (np.ndarray) – the state where next action has to be evaluated.

Returns:

The weighted estimator value in next_state.
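
One way to read the `sampling` and `precision` options is as a Monte Carlo approximation of the weighted estimator: each action value is weighted by the estimated probability that it is the maximum. The sketch below assumes Gaussian value estimates with known means and standard deviations; the helper name and its arguments are hypothetical and not the library's internals.

```python
import numpy as np

def weighted_estimate(means, sigmas, precision=1000, rng=None):
    """Sampled weighted estimator of the maximum expected value (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    # Draw `precision` joint samples of the action values.
    samples = rng.normal(means, sigmas, size=(precision, len(means)))
    # Fraction of samples in which each action comes out on top.
    winners = np.argmax(samples, axis=1)
    probs = np.bincount(winners, minlength=len(means)) / precision
    return float(probs @ means)

means = np.array([0.0, 1.0])
estimate = weighted_estimate(means, np.array([1.0, 1.0]),
                             rng=np.random.default_rng(0))
```

The result always lies between the smallest and the largest mean, and approaches the plain maximum as the standard deviations shrink.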

class MaxminQLearning(*args, **kwargs)[source]

Bases: TD

Maxmin Q-Learning algorithm without replay memory. “Maxmin Q-learning: Controlling the Estimation Bias of Q-learning” Lan Q. et al. 2020.

__init__(mdp_info, policy, learning_rate, n_tables)[source]

Constructor.

Parameters:

n_tables (int) – number of tables in the ensemble.

_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:

state (np.ndarray) – state;
action (np.ndarray) – action;
reward (np.ndarray) – reward;
next_state (np.ndarray) – next state;
absorbing (np.ndarray) – absorbing flag.
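
The Maxmin target takes the element-wise minimum over the ensemble of tables before maximizing over actions. A minimal sketch, assuming the ensemble is stacked in one array of shape `(n_tables, n_states, n_actions)`; the helper name and the scalar `gamma` are illustrative.

```python
import numpy as np

def maxmin_target(q_tables, reward, next_state, absorbing, gamma=0.9):
    """Textbook Maxmin Q-Learning target (hypothetical helper)."""
    if absorbing:
        return reward
    # Minimum over tables for each action, then maximum over actions.
    q_min = q_tables[:, next_state, :].min(axis=0)
    return reward + gamma * q_min.max()

# Two tables, one state, two actions.
q_tables = np.array([[[1.0, 4.0]],
                     [[3.0, 2.0]]])
target = maxmin_target(q_tables, reward=0.0, next_state=0,
                       absorbing=False, gamma=0.5)
```

With more tables the minimum becomes more pessimistic, which is how `n_tables` trades overestimation against underestimation.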

class RQLearning(*args, **kwargs)[source]

Bases: TD

RQ-Learning algorithm. “Exploiting Structure and Uncertainty of Bellman Updates in Markov Decision Processes”. Tateo D. et al. 2017.

__init__(mdp_info, policy, learning_rate, off_policy=False, beta=None, delta=None)[source]

Constructor.

Parameters:

off_policy (bool, False) – whether to use the off-policy setting or the online one;
beta ([float, Parameter], None) – beta coefficient;
delta ([float, Parameter], None) – delta coefficient.

_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:

state (np.ndarray) – state;
action (np.ndarray) – action;
reward (np.ndarray) – reward;
next_state (np.ndarray) – next state;
absorbing (np.ndarray) – absorbing flag.

_next_q(next_state)[source]

Parameters:

next_state (np.ndarray) – the state where next action has to be evaluated.

Returns:

The weighted estimator value in ‘next_state’.

class SARSALambdaContinuous(*args, **kwargs)[source]

Bases: TD

Continuous version of SARSA(lambda) algorithm.

__init__(mdp_info, policy, approximator, learning_rate, lambda_coeff, approximator_params=None)[source]

Constructor.

Parameters:

lambda_coeff ([float, Parameter]) – eligibility trace coefficient.

_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:

state (np.ndarray) – state;
action (np.ndarray) – action;
reward (np.ndarray) – reward;
next_state (np.ndarray) – next state;
absorbing (np.ndarray) – absorbing flag.

episode_start(initial_state, episode_info)[source]

Called by the Core when a new episode starts.

Parameters:

initial_state (Array) – vector representing the initial state of the environment.
episode_info (dict) – a dictionary containing the information at reset, such as context.

Returns:

A tuple containing the policy initial state and, optionally, the policy parameters

class TrueOnlineSARSALambda(*args, **kwargs)[source]

Bases: TD

True Online SARSA(lambda) with linear function approximation. “True Online TD(lambda)” Seijen H. V. et al. 2014.

__init__(mdp_info, policy, learning_rate, lambda_coeff, approximator_params=None)[source]

Constructor.

Parameters:

lambda_coeff ([float, Parameter]) – eligibility trace coefficient.

_update(state, action, reward, next_state, absorbing)[source]

Update the Q-table.

Parameters:

state (np.ndarray) – state;
action (np.ndarray) – action;
reward (np.ndarray) – reward;
next_state (np.ndarray) – next state;
absorbing (np.ndarray) – absorbing flag.

episode_start(initial_state, episode_info)[source]

Called by the Core when a new episode starts.

Parameters:

initial_state (Array) – vector representing the initial state of the environment.
episode_info (dict) – a dictionary containing the information at reset, such as context.

Returns:

A tuple containing the policy initial state and, optionally, the policy parameters

Batch TD

These are batch TD methods, which learn the Q-function from a dataset of interactions with the environment.

class FQI(*args, **kwargs)[source]

Bases: BatchTD

Fitted Q-Iteration algorithm. “Tree-Based Batch Mode Reinforcement Learning”, Ernst D. et al. 2005.

__init__(mdp_info, policy, approximator, n_iterations, approximator_params=None, fit_params=None, quiet=False)[source]

Constructor.

Parameters:

n_iterations ([int, Parameter]) – number of iterations to perform for training;
quiet (bool, False) – whether to show the progress bar or not.
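
The role of `n_iterations` can be illustrated with a tiny Fitted Q-Iteration loop. This is a sketch under strong assumptions: a tabular "regressor" that averages targets per state-action pair stands in for the approximator, and the function name, the transition tuple layout, and `gamma` are hypothetical.

```python
import numpy as np

def fitted_q_iteration(transitions, n_states, n_actions, n_iterations,
                       gamma=0.9):
    """Tabular FQI sketch; transitions are (s, a, r, s_next, absorbing)."""
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iterations):
        sums = np.zeros_like(q)
        counts = np.zeros_like(q)
        for s, a, r, s_next, absorbing in transitions:
            # Regression target built from the previous iteration's Q.
            target = r if absorbing else r + gamma * q[s_next].max()
            sums[s, a] += target
            counts[s, a] += 1
        # "Fit" step: average the targets of each visited pair.
        q = np.where(counts > 0, sums / np.maximum(counts, 1), q)
    return q

dataset = [(0, 0, 0.0, 1, False), (1, 0, 1.0, 1, True)]
q = fitted_q_iteration(dataset, n_states=2, n_actions=1, n_iterations=2)
```

Each iteration propagates reward information one step further back through the dataset, so the number of iterations bounds the effective planning horizon.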

class DoubleFQI(*args, **kwargs)[source]

Bases: FQI

Double Fitted Q-Iteration algorithm. “Estimating the Maximum Expected Value in Continuous Reinforcement Learning Problems” D’Eramo C. et al. 2017.

__init__(mdp_info, policy, approximator, n_iterations, approximator_params=None, fit_params=None, quiet=False)[source]

Constructor.

Parameters:

n_iterations ([int, Parameter]) – number of iterations to perform for training;
quiet (bool, False) – whether to show the progress bar or not.

class BoostedFQI(*args, **kwargs)[source]

Bases: FQI

Boosted Fitted Q-Iteration algorithm. “Boosted Fitted Q-Iteration” Tosatto S. et al. 2017.

__init__(mdp_info, policy, approximator, n_iterations, approximator_params=None, fit_params=None, quiet=False)[source]

Constructor.

Parameters:

n_iterations ([int, Parameter]) – number of iterations to perform for training;
quiet (bool, False) – whether to show the progress bar or not.

class LSPI(*args, **kwargs)[source]

Bases: BatchTD

Least-Squares Policy Iteration algorithm. “Least-Squares Policy Iteration”. Lagoudakis M. G. and Parr R. 2003.

__init__(mdp_info, policy, approximator_params=None, epsilon=0.01, fit_params=None)[source]

Constructor.

Parameters:

epsilon ([float, Parameter], 1e-2) – termination coefficient.

DQN

These methods are value-based deep reinforcement learning approaches. Most are variations of the DQN algorithm.

class AbstractDQN(*args, **kwargs)[source]

Bases: Agent

Abstract class for every DQN-based approach.

__init__(mdp_info, policy, approximator, approximator_params, batch_size, target_update_frequency, replay_memory=None, initial_replay_size=500, max_replay_size=5000, fit_params=None, predict_params=None, clip_reward=False, history_length=1)[source]

Constructor.

Parameters:

approximator (object) – the approximator to use to fit the Q-function;
approximator_params (dict) – parameters of the approximator to build;
batch_size ([int, Parameter]) – the number of samples in a batch;
target_update_frequency (int) – the number of samples collected between each update of the target network;
replay_memory ([dict, ReplayMemory, PrioritizedReplayMemory, None]) – if a dict, must have keys ‘class’ and ‘params’ and the class will be instantiated with mdp_info and agent_info; if already an instance, it is used directly; if None a default ReplayMemory is created;
initial_replay_size (int) – the number of samples to collect before starting the learning;
max_replay_size (int) – the maximum number of samples in the replay memory;
fit_params (dict, None) – parameters of the fitting algorithm of the approximator;
predict_params (dict, None) – parameters for the prediction with the approximator;
clip_reward (bool, False) – whether to clip the reward or not;
history_length (int, 1) – number of consecutive observations stacked as policy input.

_update_target()[source]: Update the target network.

_next_q(next_state, absorbing)[source]

Parameters:

next_state (torch.Tensor) – the states where next action has to be evaluated;
absorbing (torch.Tensor) – the absorbing flag for the states in next_state.

Returns:

Maximum action-value for each state in next_state.
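
To show how such a next-state value typically enters the regression target, here is a batch sketch in NumPy. The documented method works on `torch.Tensor` inputs; the NumPy arrays, the helper name `dqn_target`, and the `gamma` argument are assumptions chosen to keep the example dependency-light.

```python
import numpy as np

def dqn_target(q_next, reward, absorbing, gamma=0.99):
    """Batch DQN-style target from target-network outputs (sketch).

    q_next has shape (batch, n_actions); reward and absorbing have
    shape (batch,).
    """
    max_q = q_next.max(axis=1)
    # Absorbing next states contribute no bootstrap value.
    max_q = np.where(absorbing, 0.0, max_q)
    return reward + gamma * max_q

q_next = np.array([[1.0, 2.0], [3.0, 0.0]])
targets = dqn_target(q_next, reward=np.array([1.0, 1.0]),
                     absorbing=np.array([False, True]), gamma=0.5)
```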

class DQN(*args, **kwargs)[source]

Bases: AbstractDQN

Deep Q-Network algorithm. “Human-Level Control Through Deep Reinforcement Learning”. Mnih V. et al. 2015.

_next_q(next_state, absorbing)[source]

Parameters:

next_state (torch.Tensor) – the states where next action has to be evaluated;
absorbing (torch.Tensor) – the absorbing flag for the states in next_state.

Returns:

Maximum action-value for each state in next_state.

class DoubleDQN(*args, **kwargs)[source]

Bases: DQN

Double DQN algorithm. “Deep Reinforcement Learning with Double Q-Learning”. Hasselt H. V. et al. 2016.

_next_q(next_state, absorbing)[source]

Parameters:

next_state (torch.Tensor) – the states where next action has to be evaluated;
absorbing (torch.Tensor) – the absorbing flag for the states in next_state.

Returns:

Maximum action-value for each state in next_state.
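
In the Double DQN paper, the next-state value is obtained by selecting the action with the online network and evaluating it with the target network. A NumPy sketch of that rule follows; the helper name and the use of precomputed network outputs as inputs are assumptions, not the library's signature.

```python
import numpy as np

def double_dqn_next_q(q_online_next, q_target_next, absorbing):
    """Double DQN next-state value for a batch (hypothetical helper)."""
    # Greedy action according to the online network ...
    best = np.argmax(q_online_next, axis=1)
    # ... evaluated with the target network.
    q = q_target_next[np.arange(len(best)), best]
    return np.where(absorbing, 0.0, q)

q_online_next = np.array([[1.0, 5.0], [2.0, 0.0]])
q_target_next = np.array([[10.0, 3.0], [4.0, 7.0]])
next_q = double_dqn_next_q(q_online_next, q_target_next,
                           absorbing=np.array([False, False]))
```

Note that the result is not the maximum of the target network outputs: in the first row the online network picks action 1, so the value 3.0 is used even though the target network rates action 0 higher.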

class AveragedDQN(*args, **kwargs)[source]

Bases: AbstractDQN

Averaged-DQN algorithm. “Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning”. Anschel O. et al. 2017.

__init__(mdp_info, policy, approximator, n_approximators, **params)[source]

Constructor.

Parameters:

n_approximators (int) – the number of target approximators to store.

_update_target()[source]: Update the target network.

_next_q(next_state, absorbing)[source]

Parameters:

next_state (torch.Tensor) – the states where next action has to be evaluated;
absorbing (torch.Tensor) – the absorbing flag for the states in next_state.

Returns:

Maximum action-value for each state in next_state.
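
In Averaged-DQN, the outputs of the stored target approximators are averaged before the maximum over actions is taken. A minimal NumPy sketch, assuming the outputs are stacked in an array of shape `(n_approximators, batch, n_actions)`; the helper name is hypothetical.

```python
import numpy as np

def averaged_next_q(q_targets_next, absorbing):
    """Averaged-DQN next-state value for a batch (hypothetical helper)."""
    # Average over the stored target approximators first ...
    q_mean = q_targets_next.mean(axis=0)
    # ... then take the greedy value, masking absorbing states.
    return np.where(absorbing, 0.0, q_mean.max(axis=1))

q_targets_next = np.array([[[2.0, 0.0]],
                           [[0.0, 4.0]]])
next_q = averaged_next_q(q_targets_next, absorbing=np.array([False]))
```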

class MaxminDQN(*args, **kwargs)[source]

Bases: DQN

MaxminDQN algorithm. “Maxmin Q-learning: Controlling the Estimation Bias of Q-learning” Lan Q. et al. 2020.

__init__(mdp_info, policy, approximator, n_approximators, **params)[source]

Constructor.

Parameters:

n_approximators (int) – the number of approximators in the ensemble.

_update_target()[source]: Update the target network.

class DuelingDQN(*args, **kwargs)[source]

Bases: DQN

Dueling DQN algorithm. “Dueling Network Architectures for Deep Reinforcement Learning” Wang Z. et al. 2016.

__init__(mdp_info, policy, approximator_params, avg_advantage=True, **params)[source]

Constructor.
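
The dueling architecture combines a state-value stream and an advantage stream into Q-values. The sketch below shows the two aggregation rules from the Dueling DQN paper; reading `avg_advantage` as the switch between them is an assumption based on the parameter name, and the helper itself is hypothetical.

```python
import numpy as np

def dueling_q(value, advantage, avg_advantage=True):
    """Combine value (batch, 1) and advantage (batch, n_actions) streams."""
    if avg_advantage:
        # Subtract the mean advantage (the variant the paper recommends).
        baseline = advantage.mean(axis=1, keepdims=True)
    else:
        # Subtract the maximum advantage instead.
        baseline = advantage.max(axis=1, keepdims=True)
    return value + advantage - baseline

value = np.array([[1.0]])
advantage = np.array([[0.0, 2.0]])
q_avg = dueling_q(value, advantage, avg_advantage=True)
q_max = dueling_q(value, advantage, avg_advantage=False)
```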

class CategoricalDQN(*args, **kwargs)[source]

Bases: AbstractCategoricalDQN

Categorical DQN algorithm. “A Distributional Perspective on Reinforcement Learning” Bellemare M. et al. 2017.

__init__(mdp_info, policy, approximator_params, n_atoms, v_min, v_max, **params)[source]

Constructor.

Parameters:

n_atoms (int) – number of atoms;
v_min (float) – minimum value of value-function;
v_max (float) – maximum value of value-function.
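
The three parameters define the fixed support of the categorical return distribution: `n_atoms` evenly spaced values between `v_min` and `v_max`. A minimal sketch of how Q-values follow from predicted atom probabilities, with a hypothetical helper name:

```python
import numpy as np

def categorical_q(probs, v_min, v_max):
    """Expected return from atom probabilities of shape (..., n_atoms)."""
    n_atoms = probs.shape[-1]
    # Evenly spaced support z_i = v_min + i * (v_max - v_min) / (n_atoms - 1).
    atoms = np.linspace(v_min, v_max, n_atoms)
    return probs @ atoms

probs = np.array([[0.25, 0.5, 0.25],
                  [0.0, 0.0, 1.0]])
q_values = categorical_q(probs, v_min=-1.0, v_max=1.0)
```

Returns outside `[v_min, v_max]` cannot be represented, so the bounds should cover the range of returns expected in the task.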

class NoisyDQN(*args, **kwargs)[source]

Bases: DQN

Noisy DQN algorithm. “Noisy networks for exploration” Fortunato M. et al. 2018.

__init__(mdp_info, policy, approximator_params, **params)[source]

Constructor.

class QuantileDQN(*args, **kwargs)[source]

Bases: AbstractDQN

Quantile Regression DQN algorithm. “Distributional Reinforcement Learning with Quantile Regression” Dabney W. et al. 2018.

__init__(mdp_info, policy, approximator_params, n_quantiles, **params)[source]

Constructor.

Parameters:

n_quantiles (int) – number of quantiles.
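
In quantile regression DQN, the network predicts `n_quantiles` return quantiles per action at fixed quantile midpoints, and the Q-value is their mean. The sketch below illustrates both quantities; the helper names and the array layout `(batch, n_actions, n_quantiles)` are assumptions.

```python
import numpy as np

def quantile_midpoints(n_quantiles):
    """Midpoints tau_i = (2i + 1) / (2N) of N equal-probability bins."""
    return (2 * np.arange(n_quantiles) + 1) / (2.0 * n_quantiles)

def quantile_q(quantiles):
    """Q-values as the mean over the last (quantile) axis."""
    return quantiles.mean(axis=-1)

taus = quantile_midpoints(4)
quantiles = np.array([[[0.0, 2.0], [1.0, 5.0]]])  # 1 state, 2 actions
greedy_action = int(np.argmax(quantile_q(quantiles), axis=1)[0])
```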

class Rainbow(*args, **kwargs)[source]

Bases: AbstractCategoricalDQN

Rainbow algorithm. “Rainbow: Combining Improvements in Deep Reinforcement Learning” Hessel M. et al. 2018.

__init__(mdp_info, policy, approximator_params, n_atoms, v_min, v_max, n_steps_return, alpha_coeff, beta, sigma_coeff=0.5, **params)[source]

Constructor.

Parameters:

n_atoms (int) – number of atoms;
v_min (float) – minimum value of value-function;
v_max (float) – maximum value of value-function;
n_steps_return (int) – the number of steps to consider to compute the n-return;
alpha_coeff (float) – prioritization exponent for prioritized experience replay;
beta (Parameter) – importance sampling coefficient for prioritized experience replay;
sigma_coeff (float, .5) – sigma0 coefficient for noise initialization in noisy layers.
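
The roles of `n_steps_return`, `alpha_coeff`, and `beta` can be illustrated with the standard formulas for n-step returns and prioritized experience replay. This is a sketch only: the helper names and the scalar `beta` are assumptions, and the library's replay memory may compute these quantities differently.

```python
import numpy as np

def n_step_return(rewards, gamma):
    """Discounted sum of the first n rewards (bootstrap term omitted)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def per_probabilities(priorities, alpha_coeff):
    """Sampling probabilities P(i) = p_i^alpha / sum_j p_j^alpha."""
    scaled = priorities ** alpha_coeff
    return scaled / scaled.sum()

def importance_weights(probs, beta):
    """Importance-sampling weights (N * P(i))^-beta, scaled to max 1."""
    weights = (len(probs) * probs) ** (-beta)
    return weights / weights.max()

ret = n_step_return([1.0, 1.0, 1.0], gamma=0.5)
probs = per_probabilities(np.array([1.0, 3.0]), alpha_coeff=1.0)
weights = importance_weights(probs, beta=1.0)
```

With `alpha_coeff` equal to 0 sampling becomes uniform, and with `beta` equal to 1 the weights fully correct the bias introduced by prioritization.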