Policy search¶

Policy gradient¶

class mushroom_rl.algorithms.policy_search.policy_gradient.REINFORCE(mdp_info, policy, learning_rate, features=None)[source]¶

Bases: mushroom_rl.algorithms.policy_search.policy_gradient.policy_gradient.PolicyGradient

REINFORCE algorithm. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”, Williams R. J., 1992.

__init__(mdp_info, policy, learning_rate, features=None)[source]¶

Constructor.

Parameters:	learning_rate (float) – the learning rate.

_compute_gradient(J)[source]¶

Return the gradient computed by the algorithm.

Parameters:	J (list) – list of the cumulative discounted rewards for each episode in the dataset.
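As an illustration of what such a gradient computation can look like, the following is a minimal sketch of a REINFORCE-style estimator with a component-wise variance-minimizing baseline. It is not the library's implementation: the function name `reinforce_gradient` and the argument `sum_d_log_pi` (one vector per episode, holding the sum over steps of the gradient of the log-policy) are assumptions made for this example.

```python
def reinforce_gradient(J, sum_d_log_pi):
    """Estimate the policy gradient from per-episode returns.

    J: list of cumulative discounted rewards, one per episode.
    sum_d_log_pi: list of score vectors, one per episode (hypothetical input).
    """
    n_episodes = len(J)
    n_params = len(sum_d_log_pi[0])
    grad = [0.0] * n_params
    for k in range(n_params):
        # Baseline for component k: weighted average of the returns,
        # with weights given by the squared score.
        num = sum(s[k] ** 2 * j for s, j in zip(sum_d_log_pi, J))
        den = sum(s[k] ** 2 for s in sum_d_log_pi)
        baseline = num / den if den > 0 else 0.0
        # Average of score * (return - baseline) over the episodes.
        grad[k] = sum(
            s[k] * (j - baseline) for s, j in zip(sum_d_log_pi, J)
        ) / n_episodes
    return grad


J = [1.0, 3.0]
scores = [[1.0, 0.0], [-1.0, 2.0]]
grad = reinforce_gradient(J, scores)  # [-1.0, 0.0]
```

The baseline does not change the expected value of the estimate; it is only there to reduce its variance.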

_step_update(x, u, r)[source]¶

This function is called, when parsing the dataset, at each episode step.

Parameters:	x (np.ndarray) – the state at the current step; u (np.ndarray) – the action at the current step; r (np.ndarray) – the reward at the current step.

_episode_end_update()[source]¶: This function is called, when parsing the dataset, at the end of each episode. The implementation is dependent on the algorithm (e.g. REINFORCE updates some data structures).

_init_update()[source]¶: This function is called, when parsing the dataset, at the beginning of each episode. The implementation is dependent on the algorithm (e.g. REINFORCE resets some data structures).
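The order in which the three hooks fire while the dataset is parsed can be sketched as follows. This is an illustrative stand-in, not the library code: the class `EpisodeHooks`, its `calls` log, and the method `parse_dataset` are hypothetical, and each sample is assumed to be a tuple of state, action, reward, next state, absorbing flag and last flag.

```python
class EpisodeHooks:
    """Hypothetical stand-in for an agent exposing the three update hooks."""

    def __init__(self, gamma):
        self.gamma = gamma
        self.calls = []  # records the order in which the hooks fire

    def _init_update(self):
        self.calls.append("init")

    def _step_update(self, x, u, r):
        self.calls.append("step")

    def _episode_end_update(self):
        self.calls.append("end")

    def parse_dataset(self, dataset):
        """Walk the dataset once; return the discounted return of each episode."""
        J = []
        J_episode, discount = 0.0, 1.0
        self._init_update()
        for x, u, r, x_next, absorbing, last in dataset:
            J_episode += discount * r
            discount *= self.gamma
            self._step_update(x, u, r)
            if last:
                # The episode is over: finalize it, then reset for the next one.
                self._episode_end_update()
                J.append(J_episode)
                J_episode, discount = 0.0, 1.0
                self._init_update()
        return J


# Two episodes of lengths 2 and 1; states and actions are placeholders.
dataset = [
    (0, 0, 1.0, 1, False, False),
    (1, 0, 1.0, 2, True, True),
    (0, 1, 4.0, 1, True, True),
]
agent = EpisodeHooks(gamma=0.5)
J = agent.parse_dataset(dataset)  # [1.5, 4.0]
```

In this sketch the reset hook also runs once after the final episode, which keeps the loop body uniform at the cost of one redundant call.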

_add_save_attr(**attr_dict)¶

Add attributes that should be saved for an agent.

Parameters:	attr_dict (dict) – dictionary of attributes mapped to the method that should be used to save and load them.

_parse(sample)¶

Utility to parse the sample.

Parameters:	sample (list) – the current episode step.
Returns:	A tuple containing state, action, reward, next state, absorbing and last flag. If provided, `state` is preprocessed with the features.
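A minimal sketch of this kind of parsing, assuming each sample is a six-element tuple and that the features are any callable applied to the state. The names `parse_sample` and `poly_features` are hypothetical and only serve the example.

```python
def parse_sample(sample, features=None):
    """Unpack one step of the dataset, optionally preprocessing the state."""
    state, action, reward, next_state, absorbing, last = sample
    if features is not None:
        # Only `state` is transformed; `next_state` is returned unchanged.
        state = features(state)
    return state, action, reward, next_state, absorbing, last


def poly_features(s):
    """Hypothetical feature map: polynomial expansion of a scalar state."""
    return [1.0, s, s * s]


sample = (2.0, 1, -0.5, 3.0, False, True)
parsed = parse_sample(sample, features=poly_features)
```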

_post_load()¶: This method can be overridden to implement logic that is executed after the loading of the agent.

_update_parameters(J)¶

Update the parameters of the policy.

Parameters:	J (list) – list of the cumulative discounted rewards for each episode in the dataset.

copy()¶

Returns:	A deepcopy of the agent.

draw_action(state)¶

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:	state (np.ndarray) – the state where the agent is.
Returns:	The action to be executed.

episode_start()¶: Called by the agent when a new episode starts.

fit(dataset)¶

Fit step.

Parameters:	dataset (list) – the dataset.

classmethod load(path)¶

Load and deserialize the agent from the given location on disk.

Parameters:	path (string) – Relative or absolute path to the agent's save location.
Returns:	The loaded agent.

save(path)¶

Serialize and save the agent to the given path on disk.

Parameters:	path (string) – Relative or absolute path to the agent's save location.

stop()¶: Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate call to enforce consistency.

class mushroom_rl.algorithms.policy_search.policy_gradient.GPOMDP(mdp_info, policy, learning_rate, features=None)[source]¶

Bases: mushroom_rl.algorithms.policy_search.policy_gradient.policy_gradient.PolicyGradient

GPOMDP algorithm. “Infinite-Horizon Policy-Gradient Estimation”, Baxter J. and Bartlett P. L., 2001.

__init__(mdp_info, policy, learning_rate, features=None)[source]¶

Constructor.

Parameters:	learning_rate (float) – the learning rate.

_compute_gradient(J)[source]¶

Return the gradient computed by the algorithm.

Parameters:	J (list) – list of the cumulative discounted rewards for each episode in the dataset.
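For intuition, a GPOMDP-style estimator weights each discounted reward only by the scores of the actions taken up to that step. The sketch below is a simplified version that omits the baseline and is not the library's implementation; the function name `gpomdp_gradient` and the episode layout (a list of `(score, reward)` pairs, where `score` is the gradient of the log-policy at that step) are assumptions.

```python
def gpomdp_gradient(episodes, gamma):
    """Baseline-free GPOMDP-style gradient estimate.

    episodes: list of episodes, each a list of (score, reward) pairs,
              where score is a list of floats (hypothetical layout).
    """
    n_params = len(episodes[0][0][0])
    grad = [0.0] * n_params
    for episode in episodes:
        cum_score = [0.0] * n_params  # sum of the scores up to the current step
        discount = 1.0
        for score, reward in episode:
            cum_score = [c + s for c, s in zip(cum_score, score)]
            for k in range(n_params):
                # The reward at step t is paired only with scores of steps <= t.
                grad[k] += discount * reward * cum_score[k]
            discount *= gamma
    return [g / len(episodes) for g in grad]


episodes = [
    [([1.0], 1.0), ([2.0], 2.0)],
    [([-1.0], 0.0), ([1.0], 4.0)],
]
grad = gpomdp_gradient(episodes, gamma=0.5)  # [2.0]
```

Compared with a REINFORCE-style estimate, dropping the scores of future actions removes terms with zero mean, which typically lowers the variance.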

_step_update(x, u, r)[source]¶

This function is called, when parsing the dataset, at each episode step.

Parameters:	x (np.ndarray) – the state at the current step; u (np.ndarray) – the action at the current step; r (np.ndarray) – the reward at the current step.

_episode_end_update()[source]¶: This function is called, when parsing the dataset, at the end of each episode. The implementation is dependent on the algorithm (e.g. REINFORCE updates some data structures).

_init_update()[source]¶: This function is called, when parsing the dataset, at the beginning of each episode. The implementation is dependent on the algorithm (e.g. REINFORCE resets some data structures).

_add_save_attr(**attr_dict)¶

Add attributes that should be saved for an agent.

Parameters:	attr_dict (dict) – dictionary of attributes mapped to the method that should be used to save and load them.

_parse(sample)¶

Utility to parse the sample.

Parameters:	sample (list) – the current episode step.
Returns:	A tuple containing state, action, reward, next state, absorbing and last flag. If provided, `state` is preprocessed with the features.

_post_load()¶: This method can be overridden to implement logic that is executed after the loading of the agent.

_update_parameters(J)¶

Update the parameters of the policy.

Parameters:	J (list) – list of the cumulative discounted rewards for each episode in the dataset.

copy()¶

Returns:	A deepcopy of the agent.

draw_action(state)¶

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:	state (np.ndarray) – the state where the agent is.
Returns:	The action to be executed.

episode_start()¶: Called by the agent when a new episode starts.

fit(dataset)¶

Fit step.

Parameters:	dataset (list) – the dataset.

classmethod load(path)¶

Load and deserialize the agent from the given location on disk.

Parameters:	path (string) – Relative or absolute path to the agent's save location.
Returns:	The loaded agent.

save(path)¶

Serialize and save the agent to the given path on disk.

Parameters:	path (string) – Relative or absolute path to the agent's save location.

stop()¶: Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate call to enforce consistency.

class mushroom_rl.algorithms.policy_search.policy_gradient.eNAC(mdp_info, policy, learning_rate, features=None, critic_features=None)[source]¶

Bases: mushroom_rl.algorithms.policy_search.policy_gradient.policy_gradient.PolicyGradient

Episodic Natural Actor Critic algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J., 2013.

__init__(mdp_info, policy, learning_rate, features=None, critic_features=None)[source]¶

Constructor.

Parameters:	critic_features (Features, None) – features used by the critic.

_compute_gradient(J)[source]¶

Return the gradient computed by the algorithm.

Parameters:	J (list) – list of the cumulative discounted rewards for each episode in the dataset.
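One common formulation of the episodic natural actor-critic obtains the natural gradient by regressing the episode returns on the summed scores plus a constant critic feature. The sketch below covers only the single-parameter case, where this reduces to the slope of a simple linear regression. It is an assumption-laden illustration, not the library's implementation; `enac_natural_gradient_1d` and `psi` are hypothetical names.

```python
def enac_natural_gradient_1d(J, psi):
    """Slope of the least-squares fit J ~ w * psi + b (one policy parameter).

    J: list of cumulative discounted rewards, one per episode.
    psi: list of summed scores, one scalar per episode (hypothetical input).
    """
    n = len(J)
    mean_psi = sum(psi) / n
    mean_J = sum(J) / n
    cov = sum((p - mean_psi) * (j - mean_J) for p, j in zip(psi, J))
    var = sum((p - mean_psi) ** 2 for p in psi)
    # w plays the role of the natural gradient; the intercept b acts as
    # the critic's constant value estimate and is discarded.
    return cov / var


psi = [-1.0, 0.0, 1.0]
J = [1.0, 3.0, 5.0]
nat_grad = enac_natural_gradient_1d(J, psi)  # 2.0, since J = 2 * psi + 3
```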

_step_update(x, u, r)[source]¶

This function is called, when parsing the dataset, at each episode step.

Parameters:	x (np.ndarray) – the state at the current step; u (np.ndarray) – the action at the current step; r (np.ndarray) – the reward at the current step.

_episode_end_update()[source]¶: This function is called, when parsing the dataset, at the end of each episode. The implementation is dependent on the algorithm (e.g. REINFORCE updates some data structures).

_init_update()[source]¶: This function is called, when parsing the dataset, at the beginning of each episode. The implementation is dependent on the algorithm (e.g. REINFORCE resets some data structures).

_add_save_attr(**attr_dict)¶

Add attributes that should be saved for an agent.

Parameters:	attr_dict (dict) – dictionary of attributes mapped to the method that should be used to save and load them.

_parse(sample)¶

Utility to parse the sample.

Parameters:	sample (list) – the current episode step.
Returns:	A tuple containing state, action, reward, next state, absorbing and last flag. If provided, `state` is preprocessed with the features.

_post_load()¶: This method can be overridden to implement logic that is executed after the loading of the agent.

_update_parameters(J)¶

Update the parameters of the policy.

Parameters:	J (list) – list of the cumulative discounted rewards for each episode in the dataset.

copy()¶

Returns:	A deepcopy of the agent.

draw_action(state)¶

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:	state (np.ndarray) – the state where the agent is.
Returns:	The action to be executed.

episode_start()¶: Called by the agent when a new episode starts.

fit(dataset)¶

Fit step.

Parameters:	dataset (list) – the dataset.

classmethod load(path)¶

Load and deserialize the agent from the given location on disk.

Parameters:	path (string) – Relative or absolute path to the agent's save location.
Returns:	The loaded agent.

save(path)¶

Serialize and save the agent to the given path on disk.

Parameters:	path (string) – Relative or absolute path to the agent's save location.

stop()¶: Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate call to enforce consistency.

Black-Box optimization¶

class mushroom_rl.algorithms.policy_search.black_box_optimization.RWR(mdp_info, distribution, policy, beta, features=None)[source]¶

Bases: mushroom_rl.algorithms.policy_search.black_box_optimization.black_box_optimization.BlackBoxOptimization

Reward-Weighted Regression algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J., 2013.

__init__(mdp_info, distribution, policy, beta, features=None)[source]¶

Constructor.

Parameters:	beta (float) – the temperature for the exponential reward transformation.

_update(Jep, theta)[source]¶

Function that implements the update routine of the distribution parameters. Every black-box algorithm should implement this function with the proper update.

Parameters:	Jep (np.ndarray) – a vector containing the J of the considered trajectories; theta (np.ndarray) – a matrix of policy parameters of the considered trajectories.
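To make the role of `beta` concrete, the following sketch turns the returns `Jep` into exponential weights and uses them for a weighted mean of the sampled parameters `theta`. It is a simplified illustration under stated assumptions, not the library's update: plain lists stand in for arrays, only the mean of the search distribution is refit, and `rwr_weights` and `weighted_mean` are hypothetical helpers.

```python
import math


def rwr_weights(Jep, beta):
    """Exponential reward transformation with temperature beta."""
    j_max = max(Jep)
    # Subtracting the maximum leaves the normalized weights unchanged
    # and keeps the exponentials from overflowing.
    return [math.exp(beta * (j - j_max)) for j in Jep]


def weighted_mean(theta, weights):
    """Weighted average of the parameter vectors in theta (one per row)."""
    total = sum(weights)
    n_params = len(theta[0])
    return [
        sum(w * t[k] for w, t in zip(weights, theta)) / total
        for k in range(n_params)
    ]


Jep = [0.0, 1.0, 2.0]
theta = [[0.0], [2.0], [4.0]]

weights = rwr_weights(Jep, beta=1.0)
new_mean = weighted_mean(theta, weights)

# With beta = 0 every trajectory counts equally, giving the plain mean.
mean_uniform = weighted_mean(theta, rwr_weights(Jep, beta=0.0))
```

Larger values of `beta` concentrate the weight on the best trajectories, so the refit mean moves further towards their parameters.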

_add_save_attr(**attr_dict)¶

Add attributes that should be saved for an agent.

Parameters:	attr_dict (dict) – dictionary of attributes mapped to the method that should be used to save and load them.

_post_load()¶: This method can be overridden to implement logic that is executed after the loading of the agent.

copy()¶

Returns:	A deepcopy of the agent.

draw_action(state)¶

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:	state (np.ndarray) – the state where the agent is.
Returns:	The action to be executed.

episode_start()¶: Called by the agent when a new episode starts.

fit(dataset)¶

Fit step.

Parameters:	dataset (list) – the dataset.

classmethod load(path)¶

Load and deserialize the agent from the given location on disk.

Parameters:	path (string) – Relative or absolute path to the agent's save location.
Returns:	The loaded agent.

save(path)¶

Serialize and save the agent to the given path on disk.

Parameters:	path (string) – Relative or absolute path to the agent's save location.

stop()¶: Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate call to enforce consistency.

class mushroom_rl.algorithms.policy_search.black_box_optimization.PGPE(mdp_info, distribution, policy, learning_rate, features=None)[source]¶

Bases: mushroom_rl.algorithms.policy_search.black_box_optimization.black_box_optimization.BlackBoxOptimization

Policy Gradient with Parameter Exploration algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J., 2013.

__init__(mdp_info, distribution, policy, learning_rate, features=None)[source]¶

Constructor.

Parameters:	learning_rate (Parameter) – the learning rate for the gradient step.

_update(Jep, theta)[source]¶

Function that implements the update routine of the distribution parameters. Every black-box algorithm should implement this function with the proper update.

Parameters:	Jep (np.ndarray) – a vector containing the J of the considered trajectories; theta (np.ndarray) – a matrix of policy parameters of the considered trajectories.
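A parameter-exploring gradient step can be sketched for the simplest case: a one-dimensional Gaussian search distribution with fixed standard deviation, where only the mean is updated. This is an illustration under those assumptions, not the library's update; `pgpe_step`, `mu` and `sigma` are names introduced for the example, and the learning rate is taken to be a plain float.

```python
def pgpe_step(mu, sigma, theta, Jep, learning_rate):
    """One gradient-ascent step on the mean of a 1-D Gaussian over parameters."""
    # Score of each sampled parameter with respect to the mean:
    # d/dmu log N(theta | mu, sigma^2) = (theta - mu) / sigma^2.
    scores = [(t - mu) / sigma ** 2 for t in theta]
    # Variance-minimizing baseline, weighted by the squared scores.
    num = sum(s ** 2 * j for s, j in zip(scores, Jep))
    den = sum(s ** 2 for s in scores)
    baseline = num / den if den > 0 else 0.0
    grad = sum(s * (j - baseline) for s, j in zip(scores, Jep)) / len(Jep)
    return mu + learning_rate * grad


theta = [-1.0, 1.0]  # sampled policy parameters, one per trajectory
Jep = [0.0, 2.0]     # return obtained with each sample
new_mu = pgpe_step(mu=0.0, sigma=1.0, theta=theta, Jep=Jep, learning_rate=0.5)
# new_mu == 0.5: the mean moves towards the better-performing sample.
```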

_add_save_attr(**attr_dict)¶

Add attributes that should be saved for an agent.

Parameters:	attr_dict (dict) – dictionary of attributes mapped to the method that should be used to save and load them.

_post_load()¶: This method can be overridden to implement logic that is executed after the loading of the agent.

copy()¶

Returns:	A deepcopy of the agent.

draw_action(state)¶

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:	state (np.ndarray) – the state where the agent is.
Returns:	The action to be executed.

episode_start()¶: Called by the agent when a new episode starts.

fit(dataset)¶

Fit step.

Parameters:	dataset (list) – the dataset.

classmethod load(path)¶

Load and deserialize the agent from the given location on disk.

Parameters:	path (string) – Relative or absolute path to the agent's save location.
Returns:	The loaded agent.

save(path)¶

Serialize and save the agent to the given path on disk.

Parameters:	path (string) – Relative or absolute path to the agent's save location.

stop()¶: Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate call to enforce consistency.

class mushroom_rl.algorithms.policy_search.black_box_optimization.REPS(mdp_info, distribution, policy, eps, features=None)[source]¶

Bases: mushroom_rl.algorithms.policy_search.black_box_optimization.black_box_optimization.BlackBoxOptimization

Episodic Relative Entropy Policy Search algorithm. “A Survey on Policy Search for Robotics”, Deisenroth M. P., Neumann G., Peters J., 2013.

__init__(mdp_info, distribution, policy, eps, features=None)[source]¶

Constructor.

Parameters:	eps (float) – the maximum admissible value for the Kullback-Leibler divergence between the new distribution and the previous one at each update step.
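The effect of `eps` can be illustrated with the standard episodic REPS dual: minimizing it over the temperature `eta` gives sample weights whose Kullback-Leibler divergence from the uniform empirical distribution equals `eps`, provided the minimizer lies inside the search interval. The sketch below is not the library's implementation. It uses a plain ternary search instead of a numerical optimizer, and `reps_dual`, `minimize_dual` and `reps_weights` are hypothetical helpers.

```python
import math


def reps_dual(eta, eps, Jep):
    """Dual function g(eta), shifted by max(Jep) for numerical stability."""
    j_max = max(Jep)
    mean_exp = sum(math.exp((j - j_max) / eta) for j in Jep) / len(Jep)
    return eta * eps + eta * math.log(mean_exp) + j_max


def minimize_dual(eps, Jep, lo=1e-3, hi=1e3, iters=200):
    """Ternary search for the minimizer of the convex dual over [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if reps_dual(m1, eps, Jep) < reps_dual(m2, eps, Jep):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def reps_weights(eta, Jep):
    """Exponential weights of the trajectories at temperature eta."""
    j_max = max(Jep)
    return [math.exp((j - j_max) / eta) for j in Jep]


eps = 0.1
Jep = [0.0, 1.0, 2.0, 3.0]
eta = minimize_dual(eps, Jep)
weights = reps_weights(eta, Jep)

# KL divergence between the reweighted samples and the uniform distribution.
p = [w / sum(weights) for w in weights]
kl = sum(pi * math.log(len(p) * pi) for pi in p if pi > 0)
```

A smaller `eps` yields a larger `eta` and flatter weights, so the new distribution stays closer to the previous one.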

_update(Jep, theta)[source]¶

Function that implements the update routine of the distribution parameters. Every black-box algorithm should implement this function with the proper update.

Parameters:	Jep (np.ndarray) – a vector containing the J of the considered trajectories; theta (np.ndarray) – a matrix of policy parameters of the considered trajectories.

_add_save_attr(**attr_dict)¶

Add attributes that should be saved for an agent.

Parameters:	attr_dict (dict) – dictionary of attributes mapped to the method that should be used to save and load them.

_post_load()¶: This method can be overridden to implement logic that is executed after the loading of the agent.

copy()¶

Returns:	A deepcopy of the agent.

draw_action(state)¶

Return the action to execute in the given state. It is the action returned by the policy or the action set by the algorithm (e.g. in the case of SARSA).

Parameters:	state (np.ndarray) – the state where the agent is.
Returns:	The action to be executed.

episode_start()¶: Called by the agent when a new episode starts.

fit(dataset)¶

Fit step.

Parameters:	dataset (list) – the dataset.

classmethod load(path)¶

Load and deserialize the agent from the given location on disk.

Parameters:	path (string) – Relative or absolute path to the agent's save location.
Returns:	The loaded agent.

save(path)¶

Serialize and save the agent to the given path on disk.

Parameters:	path (string) – Relative or absolute path to the agent's save location.

stop()¶: Method used to stop an agent. Useful when dealing with real-world environments or simulators, or to clean up environment internals after a core learn/evaluate call to enforce consistency.