Policy search¶
Policy gradient¶

class mushroom.algorithms.policy_search.policy_gradient.REINFORCE(policy, mdp_info, learning_rate, features=None)[source]¶
Bases: mushroom.algorithms.policy_search.policy_gradient.policy_gradient.PolicyGradient
REINFORCE algorithm. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”, Williams R. J., 1992.
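The gradient estimate REINFORCE computes can be sketched in a few lines of NumPy. This is an illustration, not the library's internal code: the linear-Gaussian policy, the undiscounted returns, and the component-wise baseline are all assumptions made for the example.

```python
import numpy as np

def grad_log_pi(theta, x, a, sigma=1.0):
    # Hypothetical linear-Gaussian policy a ~ N(theta . x, sigma^2):
    # grad_theta log pi(a|x) = (a - theta . x) * x / sigma^2
    return (a - theta @ x) * x / sigma**2

def reinforce_gradient(episodes, theta):
    # One row per episode: sum of grad-log-policy terms over its steps.
    g = np.array([sum(grad_log_pi(theta, x, a) for x, a, _ in ep)
                  for ep in episodes])
    # Cumulative (here undiscounted) reward J of each episode.
    J = np.array([sum(r for _, _, r in ep) for ep in episodes])
    # Component-wise variance-reducing baseline: E[g^2 J] / E[g^2].
    b = (g**2 * J[:, None]).mean(0) / (g**2).mean(0)
    # REINFORCE estimate: average of g * (J - baseline) over episodes.
    return (g * (J[:, None] - b)).mean(0)
```

Each episode here is a list of `(state, action, reward)` triples, mirroring what `_step_update` receives while the dataset is parsed.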

__init__(policy, mdp_info, learning_rate, features=None)[source]¶
Constructor.
Parameters: learning_rate (float) – the learning rate.

_compute_gradient(J)[source]¶
Return the gradient computed by the algorithm.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

_step_update(x, u, r)[source]¶
This function is called at each episode step while parsing the dataset.
Parameters:
 x (np.ndarray) – the state at the current step;
 u (np.ndarray) – the action at the current step;
 r (np.ndarray) – the reward at the current step.

_episode_end_update()[source]¶
This function is called at the end of each episode while parsing the dataset. The implementation is algorithm-dependent (e.g. REINFORCE updates some data structures).

_init_update()[source]¶
This function is called at the beginning of each episode while parsing the dataset. The implementation is algorithm-dependent (e.g. REINFORCE resets some data structures).

_parse(sample)¶
Utility to parse the sample.
Parameters: sample (list) – the current episode step.
Returns: A tuple containing state, action, reward, next state, absorbing and last flag. If features are provided, the state is preprocessed with them.

_update_parameters(J)¶
Update the parameters of the policy.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

draw_action(state)¶
Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action to be executed.

episode_start()¶
Called by the agent when a new episode starts.

fit(dataset)¶
Fit step.
Parameters: dataset (list) – the dataset.

stop()¶
Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class mushroom.algorithms.policy_search.policy_gradient.GPOMDP(policy, mdp_info, learning_rate, features=None)[source]¶
Bases: mushroom.algorithms.policy_search.policy_gradient.policy_gradient.PolicyGradient
GPOMDP algorithm. “Infinite-Horizon Policy-Gradient Estimation”, Baxter J. and Bartlett P. L., 2001.
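GPOMDP reduces variance relative to REINFORCE by pairing each discounted reward only with the grad-log-policy terms of the steps that precede it. A minimal NumPy sketch (assuming a hypothetical linear-Gaussian policy, and omitting the per-step baselines a full implementation would use):

```python
import numpy as np

def grad_log_pi(theta, x, a, sigma=1.0):
    # Hypothetical linear-Gaussian policy a ~ N(theta . x, sigma^2).
    return (a - theta @ x) * x / sigma**2

def gpomdp_gradient(episodes, theta, gamma=0.99):
    grads = []
    for ep in episodes:
        g_cum = np.zeros_like(theta)  # sum_{k<=t} grad log pi_k
        g_ep = np.zeros_like(theta)
        for t, (x, a, r) in enumerate(ep):
            g_cum = g_cum + grad_log_pi(theta, x, a)
            # Each discounted reward sees only the grad-log terms
            # accumulated up to its own step.
            g_ep = g_ep + gamma**t * r * g_cum
        grads.append(g_ep)
    return np.mean(grads, axis=0)
```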

__init__(policy, mdp_info, learning_rate, features=None)[source]¶
Constructor.
Parameters: learning_rate (float) – the learning rate.

_compute_gradient(J)[source]¶
Return the gradient computed by the algorithm.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

_step_update(x, u, r)[source]¶
This function is called at each episode step while parsing the dataset.
Parameters:
 x (np.ndarray) – the state at the current step;
 u (np.ndarray) – the action at the current step;
 r (np.ndarray) – the reward at the current step.

_episode_end_update()[source]¶
This function is called at the end of each episode while parsing the dataset. The implementation is algorithm-dependent (e.g. REINFORCE updates some data structures).

_init_update()[source]¶
This function is called at the beginning of each episode while parsing the dataset. The implementation is algorithm-dependent (e.g. REINFORCE resets some data structures).

_parse(sample)¶
Utility to parse the sample.
Parameters: sample (list) – the current episode step.
Returns: A tuple containing state, action, reward, next state, absorbing and last flag. If features are provided, the state is preprocessed with them.

_update_parameters(J)¶
Update the parameters of the policy.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

draw_action(state)¶
Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action to be executed.

episode_start()¶
Called by the agent when a new episode starts.

fit(dataset)¶
Fit step.
Parameters: dataset (list) – the dataset.

stop()¶
Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class mushroom.algorithms.policy_search.policy_gradient.eNAC(policy, mdp_info, learning_rate, features=None, critic_features=None)[source]¶
Bases: mushroom.algorithms.policy_search.policy_gradient.policy_gradient.PolicyGradient
Episodic Natural Actor-Critic algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J., 2013.
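The core of episodic NAC can be sketched as a least-squares regression: episodic returns are regressed on the summed grad-log-policy features plus a constant baseline, and the slope part of the solution is the natural-gradient estimate. The helper below is an illustration under those assumptions, not the library's code:

```python
import numpy as np

def enac_gradient(sum_grad_log, J):
    """sum_grad_log: (episodes, params) matrix of summed grad-log-policy
    terms per episode; J: vector of episodic returns."""
    # Append a constant feature acting as the baseline.
    phi = np.hstack([sum_grad_log, np.ones((len(J), 1))])
    # Least-squares fit of J on the features; the slope coefficients
    # are the natural-gradient estimate, the last entry the baseline.
    w, *_ = np.linalg.lstsq(phi, J, rcond=None)
    return w[:-1]
```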

__init__(policy, mdp_info, learning_rate, features=None, critic_features=None)[source]¶
Constructor.
Parameters: critic_features (Features, None) – features used by the critic.

_compute_gradient(J)[source]¶
Return the gradient computed by the algorithm.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

_step_update(x, u, r)[source]¶
This function is called at each episode step while parsing the dataset.
Parameters:
 x (np.ndarray) – the state at the current step;
 u (np.ndarray) – the action at the current step;
 r (np.ndarray) – the reward at the current step.

_episode_end_update()[source]¶
This function is called at the end of each episode while parsing the dataset. The implementation is algorithm-dependent (e.g. REINFORCE updates some data structures).

_init_update()[source]¶
This function is called at the beginning of each episode while parsing the dataset. The implementation is algorithm-dependent (e.g. REINFORCE resets some data structures).

_parse(sample)¶
Utility to parse the sample.
Parameters: sample (list) – the current episode step.
Returns: A tuple containing state, action, reward, next state, absorbing and last flag. If features are provided, the state is preprocessed with them.

_update_parameters(J)¶
Update the parameters of the policy.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

draw_action(state)¶
Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action to be executed.

episode_start()¶
Called by the agent when a new episode starts.

fit(dataset)¶
Fit step.
Parameters: dataset (list) – the dataset.

stop()¶
Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.

Black-box optimization¶

class mushroom.algorithms.policy_search.black_box_optimization.RWR(distribution, policy, mdp_info, beta, features=None)[source]¶
Bases: mushroom.algorithms.policy_search.black_box_optimization.black_box_optimization.BlackBoxOptimization
Reward-Weighted Regression algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J., 2013.
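RWR turns episodic returns into weights via the exponential transformation and refits the parameter distribution by weighted maximum likelihood. A minimal sketch for a full-covariance Gaussian distribution (an assumption; the library supports other distribution classes):

```python
import numpy as np

def rwr_update(Jep, theta, beta=1.0):
    """Jep: (episodes,) returns; theta: (episodes, params) sampled
    policy parameters. Returns the refit Gaussian's mean and covariance."""
    # Exponential reward transformation, shifted for numerical stability.
    d = np.exp(beta * (Jep - Jep.max()))
    w = d / d.sum()
    # Weighted maximum-likelihood fit of a Gaussian over the parameters.
    mu = w @ theta
    diff = theta - mu
    cov = (w[:, None] * diff).T @ diff
    return mu, cov
```

Higher `beta` concentrates the weights on the best-performing episodes, making the update greedier.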

__init__(distribution, policy, mdp_info, beta, features=None)[source]¶
Constructor.
Parameters: beta (float) – the temperature for the exponential reward transformation.

_update(Jep, theta)[source]¶
Function that implements the update routine of the distribution parameters. Every black-box algorithm should implement this function with the proper update.
Parameters:
 Jep (np.ndarray) – a vector containing the J of the considered trajectories;
 theta (np.ndarray) – a matrix of the policy parameters of the considered trajectories.

draw_action(state)¶
Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action to be executed.

episode_start()¶
Called by the agent when a new episode starts.

fit(dataset)¶
Fit step.
Parameters: dataset (list) – the dataset.

stop()¶
Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class mushroom.algorithms.policy_search.black_box_optimization.PGPE(distribution, policy, mdp_info, learning_rate, features=None)[source]¶
Bases: mushroom.algorithms.policy_search.black_box_optimization.black_box_optimization.BlackBoxOptimization
Policy Gradients with Parameter-Based Exploration (PGPE) algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J., 2013.
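PGPE ascends the gradient of the expected return with respect to the parameters of the sampling distribution rather than the policy itself. A sketch of the mean-gradient for a diagonal Gaussian distribution, with a mean-return baseline (both are assumptions of this example):

```python
import numpy as np

def pgpe_mu_gradient(Jep, theta, mu, sigma):
    """Jep: (episodes,) returns; theta: (episodes, params) sampled
    parameters; mu, sigma: current diagonal-Gaussian distribution."""
    # grad_mu log N(theta | mu, diag(sigma^2)) = (theta - mu) / sigma^2,
    # weighted by baseline-subtracted returns and averaged over episodes.
    b = Jep.mean()
    return (((theta - mu) / sigma**2) * (Jep - b)[:, None]).mean(0)
```

The resulting vector would then be scaled by the learning rate and added to `mu`; an analogous expression updates `sigma`.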

__init__(distribution, policy, mdp_info, learning_rate, features=None)[source]¶
Constructor.
Parameters: learning_rate (Parameter) – the learning rate for the gradient step.

_update(Jep, theta)[source]¶
Function that implements the update routine of the distribution parameters. Every black-box algorithm should implement this function with the proper update.
Parameters:
 Jep (np.ndarray) – a vector containing the J of the considered trajectories;
 theta (np.ndarray) – a matrix of the policy parameters of the considered trajectories.

draw_action(state)¶
Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action to be executed.

episode_start()¶
Called by the agent when a new episode starts.

fit(dataset)¶
Fit step.
Parameters: dataset (list) – the dataset.

stop()¶
Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.


class mushroom.algorithms.policy_search.black_box_optimization.REPS(distribution, policy, mdp_info, eps, features=None)[source]¶
Bases: mushroom.algorithms.policy_search.black_box_optimization.black_box_optimization.BlackBoxOptimization
Episodic Relative Entropy Policy Search algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J., 2013.
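Episodic REPS picks a temperature eta by minimizing a convex dual constrained by `eps`, then reweights episodes by `exp(J / eta)` before refitting the distribution. The sketch below replaces the dual solver a real implementation would use with a simple grid search, purely for illustration:

```python
import numpy as np

def reps_weights(Jep, eps=0.5):
    """Return normalized per-episode weights for the distribution refit."""
    Js = Jep - Jep.max()            # shift returns for numerical stability
    etas = np.logspace(-3, 3, 200)  # grid search in place of a dual solver
    # Dual, up to an eta-independent constant:
    # g(eta) = eta * eps + eta * log mean exp(J / eta)
    duals = etas * eps + etas * np.log(
        np.mean(np.exp(Js[None, :] / etas[:, None]), axis=1))
    eta = etas[np.argmin(duals)]
    d = np.exp(Js / eta)
    return d / d.sum()
```

A smaller `eps` forces a larger eta, flattening the weights and keeping each new distribution closer to the previous one.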

__init__(distribution, policy, mdp_info, eps, features=None)[source]¶
Constructor.
Parameters: eps (float) – the maximum admissible value for the Kullback-Leibler divergence between the new distribution and the previous one at each update step.

_update(Jep, theta)[source]¶
Function that implements the update routine of the distribution parameters. Every black-box algorithm should implement this function with the proper update.
Parameters:
 Jep (np.ndarray) – a vector containing the J of the considered trajectories;
 theta (np.ndarray) – a matrix of the policy parameters of the considered trajectories.

draw_action(state)¶
Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is.
Returns: The action to be executed.

episode_start()¶
Called by the agent when a new episode starts.

fit(dataset)¶
Fit step.
Parameters: dataset (list) – the dataset.

stop()¶
Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate run to enforce consistency.
