Policy search¶
Policy gradient¶

class mushroom_rl.algorithms.policy_search.policy_gradient.REINFORCE(mdp_info, policy, learning_rate, features=None)[source]¶
Bases: mushroom_rl.algorithms.policy_search.policy_gradient.policy_gradient.PolicyGradient
REINFORCE algorithm. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”, Williams R. J., 1992.

__init__(mdp_info, policy, learning_rate, features=None)[source]¶ Constructor.
Parameters: learning_rate (float) – the learning rate.

_compute_gradient(J)[source]¶ Return the gradient computed by the algorithm.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

_step_update(x, u, r)[source]¶ This function is called at each episode step when parsing the dataset.
Parameters:  x (np.ndarray) – the state at the current step;
 u (np.ndarray) – the action at the current step;
 r (np.ndarray) – the reward at the current step.

_episode_end_update()[source]¶ This function is called at the end of each episode when parsing the dataset. The implementation depends on the algorithm (e.g. REINFORCE updates some data structures).

_init_update()[source]¶ This function is called at the beginning of each episode when parsing the dataset. The implementation depends on the algorithm (e.g. REINFORCE resets some data structures).

_add_save_attr(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: attr_dict (dict) – dictionary of attributes mapped to the method that should be used to save and load them.

_parse(sample)¶ Utility to parse the sample.
Parameters: sample (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag. If provided, the state is preprocessed with the features.

_post_load()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

_update_parameters(J)¶ Update the parameters of the policy.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

copy()¶ Returns: A deep copy of the agent.

draw_action(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start()¶ Called by the agent when a new episode starts.

fit(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod load(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (string) – relative or absolute path to the agent's save location. Returns: The loaded agent.

save(path)¶ Serialize and save the agent to the given path on disk.
Parameters: path (string) – relative or absolute path to the agent's save location.

stop()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up the environment's internals after a core learn/evaluate run to enforce consistency.
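To make the role of _compute_gradient concrete, here is a minimal NumPy sketch of the REINFORCE estimator with a variance-minimizing per-component baseline. The function name and array layout are hypothetical, not part of the mushroom_rl API; the per-episode sums of the policy score are assumed to have been accumulated by the step updates.

```python
import numpy as np

def reinforce_gradient(psi, J):
    """Monte-Carlo policy-gradient estimate with a per-component baseline.

    psi: (n_episodes, n_params) array; each row is the sum over an episode
         of grad log pi(u | x) collected step by step.
    J:   (n_episodes,) array of cumulative discounted episode returns.
    """
    psi = np.asarray(psi, dtype=float)
    J = np.asarray(J, dtype=float)
    # Variance-minimizing baseline, computed component-wise:
    # b_i = E[psi_i^2 * J] / E[psi_i^2]
    num = np.mean(psi ** 2 * J[:, None], axis=0)
    den = np.mean(psi ** 2, axis=0)
    baseline = np.where(den > 0, num / np.maximum(den, 1e-12), 0.0)
    # Average of the scores weighted by the baseline-corrected returns
    return np.mean(psi * (J[:, None] - baseline), axis=0)
```

Note that with a constant return across episodes the baseline absorbs it entirely and the estimate vanishes, which is the variance reduction the baseline is for.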


class mushroom_rl.algorithms.policy_search.policy_gradient.GPOMDP(mdp_info, policy, learning_rate, features=None)[source]¶
Bases: mushroom_rl.algorithms.policy_search.policy_gradient.policy_gradient.PolicyGradient
GPOMDP algorithm. “Infinite-Horizon Policy-Gradient Estimation”, Baxter J. and Bartlett P. L., 2001.

__init__(mdp_info, policy, learning_rate, features=None)[source]¶ Constructor.
Parameters: learning_rate (float) – the learning rate.

_compute_gradient(J)[source]¶ Return the gradient computed by the algorithm.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

_step_update(x, u, r)[source]¶ This function is called at each episode step when parsing the dataset.
Parameters:  x (np.ndarray) – the state at the current step;
 u (np.ndarray) – the action at the current step;
 r (np.ndarray) – the reward at the current step.

_episode_end_update()[source]¶ This function is called at the end of each episode when parsing the dataset. The implementation depends on the algorithm (e.g. REINFORCE updates some data structures).

_init_update()[source]¶ This function is called at the beginning of each episode when parsing the dataset. The implementation depends on the algorithm (e.g. REINFORCE resets some data structures).

_add_save_attr(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: attr_dict (dict) – dictionary of attributes mapped to the method that should be used to save and load them.

_parse(sample)¶ Utility to parse the sample.
Parameters: sample (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag. If provided, the state is preprocessed with the features.

_post_load()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

_update_parameters(J)¶ Update the parameters of the policy.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

copy()¶ Returns: A deep copy of the agent.

draw_action(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start()¶ Called by the agent when a new episode starts.

fit(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod load(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (string) – relative or absolute path to the agent's save location. Returns: The loaded agent.

save(path)¶ Serialize and save the agent to the given path on disk.
Parameters: path (string) – relative or absolute path to the agent's save location.

stop()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up the environment's internals after a core learn/evaluate run to enforce consistency.
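GPOMDP differs from REINFORCE in that it credits each reward only with the score terms accumulated up to that step, and it uses a step-wise baseline. A minimal NumPy sketch for a batch of equal-length episodes (the name, array layout, and equal-horizon assumption are illustrative, not the mushroom_rl API):

```python
import numpy as np

def gpomdp_gradient(dlogpi, r, gamma):
    """GPOMDP estimate for a batch of equal-length episodes.

    dlogpi: (n_episodes, horizon, n_params) per-step grad log pi(u_t | x_t).
    r:      (n_episodes, horizon) per-step rewards.
    gamma:  discount factor.
    """
    dlogpi = np.asarray(dlogpi, dtype=float)
    r = np.asarray(r, dtype=float)
    horizon = dlogpi.shape[1]
    psi = np.cumsum(dlogpi, axis=1)               # sum_{k<=t} grad log pi_k
    disc_r = (gamma ** np.arange(horizon)) * r    # gamma^t r_t
    # Step-wise baseline b_t = E[psi_t^2 * gamma^t r_t] / E[psi_t^2]
    num = np.mean(psi ** 2 * disc_r[..., None], axis=0)
    den = np.mean(psi ** 2, axis=0)
    b = np.where(den > 0, num / np.maximum(den, 1e-12), 0.0)
    # Sum the baseline-corrected per-step terms, then average over episodes
    return np.mean(np.sum(psi * (disc_r[..., None] - b), axis=1), axis=0)
```

Restricting the score sum to steps k ≤ t is what gives GPOMDP lower variance than REINFORCE on long horizons: later rewards cannot be correlated with earlier action noise.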


class mushroom_rl.algorithms.policy_search.policy_gradient.eNAC(mdp_info, policy, learning_rate, features=None, critic_features=None)[source]¶
Bases: mushroom_rl.algorithms.policy_search.policy_gradient.policy_gradient.PolicyGradient
Episodic Natural Actor Critic algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J., 2013.

__init__(mdp_info, policy, learning_rate, features=None, critic_features=None)[source]¶ Constructor.
Parameters: critic_features (Features, None) – features used by the critic.

_compute_gradient(J)[source]¶ Return the gradient computed by the algorithm.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

_step_update(x, u, r)[source]¶ This function is called at each episode step when parsing the dataset.
Parameters:  x (np.ndarray) – the state at the current step;
 u (np.ndarray) – the action at the current step;
 r (np.ndarray) – the reward at the current step.

_episode_end_update()[source]¶ This function is called at the end of each episode when parsing the dataset. The implementation depends on the algorithm (e.g. REINFORCE updates some data structures).

_init_update()[source]¶ This function is called at the beginning of each episode when parsing the dataset. The implementation depends on the algorithm (e.g. REINFORCE resets some data structures).

_add_save_attr(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: attr_dict (dict) – dictionary of attributes mapped to the method that should be used to save and load them.

_parse(sample)¶ Utility to parse the sample.
Parameters: sample (list) – the current episode step. Returns: A tuple containing state, action, reward, next state, absorbing and last flag. If provided, the state is preprocessed with the features.

_post_load()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

_update_parameters(J)¶ Update the parameters of the policy.
Parameters: J (list) – list of the cumulative discounted rewards for each episode in the dataset.

copy()¶ Returns: A deep copy of the agent.

draw_action(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start()¶ Called by the agent when a new episode starts.

fit(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod load(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (string) – relative or absolute path to the agent's save location. Returns: The loaded agent.

save(path)¶ Serialize and save the agent to the given path on disk.
Parameters: path (string) – relative or absolute path to the agent's save location.

stop()¶ Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up the environment's internals after a core learn/evaluate run to enforce consistency.
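In the episodic NAC formulation described in the survey cited above, the natural gradient is obtained by regressing the episode returns on the summed policy scores, augmented with a constant feature that absorbs the baseline. A minimal least-squares sketch (function name and array layout are hypothetical, not the mushroom_rl API):

```python
import numpy as np

def enac_gradient(psi, J):
    """Episodic natural-gradient estimate via least squares.

    psi: (n_episodes, n_params) summed grad log pi per episode.
    J:   (n_episodes,) episode returns.
    Regresses J on [psi, 1]: the coefficients on psi are the natural
    gradient; the last coefficient estimates the baseline.
    """
    psi = np.asarray(psi, dtype=float)
    J = np.asarray(J, dtype=float)
    phi = np.hstack([psi, np.ones((psi.shape[0], 1))])
    w, *_ = np.linalg.lstsq(phi, J, rcond=None)
    return w[:-1], w[-1]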

BlackBox optimization¶

class
mushroom_rl.algorithms.policy_search.black_box_optimization.
RWR
(mdp_info, distribution, policy, beta, features=None)[source]¶ Bases:
mushroom_rl.algorithms.policy_search.black_box_optimization.black_box_optimization.BlackBoxOptimization
RewardWeighted Regression algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J.. 2013.

__init__
(mdp_info, distribution, policy, beta, features=None)[source]¶ Constructor.
Parameters: beta (float) – the temperature for the exponential reward transformation.

_update
(Jep, theta)[source]¶ Function that implements the update routine of distribution parameters. Every black box algorithms should implement this function with the proper update.
Parameters:  Jep (np.ndarray) – a vector containing the J of the considered trajectories;
 theta (np.ndarray) – a matrix of policy parameters of the considered trajectories.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: attr_dict (dict) – dictionary of attributes mapped to the method that should be used to save and load them.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (string) – Relative or absolute path to the agents save location. Returns: The loaded agent.

save
(path)¶ Serialize and save the agent to the given path on disk.
Parameters: path (string) – Relative or absolute path to the agents save location.

stop
()¶ Method used to stop an agent. Useful when dealing with real world environments, simulators, or to cleanup environments internals after a core learn/evaluate to enforce consistency.


class
mushroom_rl.algorithms.policy_search.black_box_optimization.
PGPE
(mdp_info, distribution, policy, learning_rate, features=None)[source]¶ Bases:
mushroom_rl.algorithms.policy_search.black_box_optimization.black_box_optimization.BlackBoxOptimization
Policy Gradient with Parameter Exploration algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J.. 2013.

__init__
(mdp_info, distribution, policy, learning_rate, features=None)[source]¶ Constructor.
Parameters: learning_rate (Parameter) – the learning rate for the gradient step.

_update
(Jep, theta)[source]¶ Function that implements the update routine of distribution parameters. Every black box algorithms should implement this function with the proper update.
Parameters:  Jep (np.ndarray) – a vector containing the J of the considered trajectories;
 theta (np.ndarray) – a matrix of policy parameters of the considered trajectories.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: attr_dict (dict) – dictionary of attributes mapped to the method that should be used to save and load them.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (string) – Relative or absolute path to the agents save location. Returns: The loaded agent.

save
(path)¶ Serialize and save the agent to the given path on disk.
Parameters: path (string) – Relative or absolute path to the agents save location.

stop
()¶ Method used to stop an agent. Useful when dealing with real world environments, simulators, or to cleanup environments internals after a core learn/evaluate to enforce consistency.


class
mushroom_rl.algorithms.policy_search.black_box_optimization.
REPS
(mdp_info, distribution, policy, eps, features=None)[source]¶ Bases:
mushroom_rl.algorithms.policy_search.black_box_optimization.black_box_optimization.BlackBoxOptimization
Episodic Relative Entropy Policy Search algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J.. 2013.

__init__
(mdp_info, distribution, policy, eps, features=None)[source]¶ Constructor.
Parameters: eps (float) – the maximum admissible value for the KullbackLeibler divergence between the new distribution and the previous one at each update step.

_update
(Jep, theta)[source]¶ Function that implements the update routine of distribution parameters. Every black box algorithms should implement this function with the proper update.
Parameters:  Jep (np.ndarray) – a vector containing the J of the considered trajectories;
 theta (np.ndarray) – a matrix of policy parameters of the considered trajectories.

_add_save_attr
(**attr_dict)¶ Add attributes that should be saved for an agent.
Parameters: attr_dict (dict) – dictionary of attributes mapped to the method that should be used to save and load them.

_post_load
()¶ This method can be overwritten to implement logic that is executed after the loading of the agent.

copy
()¶ Returns: A deepcopy of the agent.

draw_action
(state)¶ Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).
Parameters: state (np.ndarray) – the state where the agent is. Returns: The action to be executed.

episode_start
()¶ Called by the agent when a new episode starts.

fit
(dataset)¶ Fit step.
Parameters: dataset (list) – the dataset.

classmethod
load
(path)¶ Load and deserialize the agent from the given location on disk.
Parameters: path (string) – Relative or absolute path to the agents save location. Returns: The loaded agent.

save
(path)¶ Serialize and save the agent to the given path on disk.
Parameters: path (string) – Relative or absolute path to the agents save location.

stop
()¶ Method used to stop an agent. Useful when dealing with real world environments, simulators, or to cleanup environments internals after a core learn/evaluate to enforce consistency.
