Solvers¶
Dynamic programming¶
mushroom_rl.solvers.dynamic_programming.value_iteration(prob, reward, gamma, eps)[source]¶
Value iteration algorithm to solve a dynamic programming problem.
Parameters: - prob (np.ndarray) – transition probability matrix;
- reward (np.ndarray) – reward matrix;
- gamma (float) – discount factor;
- eps (float) – accuracy threshold.
Returns: The optimal value of each state.
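As a concrete illustration, the Bellman optimality backup this solver performs can be sketched in plain NumPy on a toy MDP. The `(n_states, n_actions, n_states)` layout assumed below for `prob` and `reward`, and the function name `value_iteration_sketch`, are assumptions for this sketch, not the library's internals:

```python
import numpy as np

def value_iteration_sketch(prob, reward, gamma, eps):
    # prob[s, a, s']: transition probabilities; reward[s, a, s']: rewards.
    n_states = prob.shape[0]
    value = np.zeros(n_states)
    while True:
        # Expected immediate reward r(s, a), then the Bellman optimality backup.
        r_sa = np.einsum('ijk,ijk->ij', prob, reward)
        q = r_sa + gamma * prob @ value        # shape (n_states, n_actions)
        new_value = q.max(axis=1)
        if np.max(np.abs(new_value - value)) < eps:
            return new_value
        value = new_value

# Toy two-state MDP: in state 0, action 1 moves to the absorbing state 1
# with reward 1; every other transition stays in place with reward 0.
prob = np.zeros((2, 2, 2))
prob[0, 0, 0] = prob[0, 1, 1] = prob[1, 0, 1] = prob[1, 1, 1] = 1.
reward = np.zeros((2, 2, 2))
reward[0, 1, 1] = 1.
v = value_iteration_sketch(prob, reward, 0.9, 1e-6)   # v ≈ [1., 0.]
```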
mushroom_rl.solvers.dynamic_programming.policy_iteration(prob, reward, gamma)[source]¶
Policy iteration algorithm to solve a dynamic programming problem.
Parameters: - prob (np.ndarray) – transition probability matrix;
- reward (np.ndarray) – reward matrix;
- gamma (float) – discount factor.
Returns: The optimal value of each state and the optimal policy.
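The evaluation/improvement loop can likewise be sketched in NumPy, under the same assumed array layout as above (an assumption for illustration, not the library's internal convention):

```python
import numpy as np

def policy_iteration_sketch(prob, reward, gamma):
    # prob[s, a, s']: transition probabilities; reward[s, a, s']: rewards.
    n_states = prob.shape[0]
    idx = np.arange(n_states)
    r_sa = np.einsum('ijk,ijk->ij', prob, reward)   # expected reward r(s, a)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi.
        p_pi, r_pi = prob[idx, policy], r_sa[idx, policy]
        value = np.linalg.solve(np.eye(n_states) - gamma * p_pi, r_pi)
        # Greedy policy improvement; stop when the policy is stable.
        new_policy = (r_sa + gamma * prob @ value).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return value, policy
        policy = new_policy

# Toy two-state MDP: in state 0, action 1 moves to the absorbing state 1
# with reward 1; every other transition stays in place with reward 0.
prob = np.zeros((2, 2, 2))
prob[0, 0, 0] = prob[0, 1, 1] = prob[1, 0, 1] = prob[1, 1, 1] = 1.
reward = np.zeros((2, 2, 2))
reward[0, 1, 1] = 1.
value, policy = policy_iteration_sketch(prob, reward, 0.9)
```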
Car-On-Hill brute-force solver¶
mushroom_rl.solvers.car_on_hill.step(mdp, state, action)[source]¶
Perform a step in the tree.
Parameters: - mdp (CarOnHill) – the Car-On-Hill environment;
- state (np.array) – the state;
- action (np.array) – the action.
Returns: The resulting transition from executing action in state.
mushroom_rl.solvers.car_on_hill.bfs(mdp, frontier, k, max_k)[source]¶
Perform Breadth-First tree search.
Parameters: - mdp (CarOnHill) – the Car-On-Hill environment;
- frontier (list) – the states at the frontier of the BFS;
- k (int) – the current depth of the tree;
- max_k (int) – maximum depth to consider.
Returns: A tuple containing a flag for the algorithm ending, and the updated depth of the tree.
mushroom_rl.solvers.car_on_hill.solve_car_on_hill(mdp, states, actions, gamma, max_k=50)[source]¶
Solver of the Car-On-Hill environment.
Parameters: - mdp (CarOnHill) – the Car-On-Hill environment;
- states (np.ndarray) – the states;
- actions (np.ndarray) – the actions;
- gamma (float) – the discount factor;
- max_k (int, 50) – maximum depth to consider.
Returns: The Q-value for each state-action pair.
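The brute-force idea — expand the deterministic dynamics up to a depth cap and back up the discounted rewards — can be illustrated on a toy deterministic chain. The `step_fn` signature and the depth-first recursion below are hypothetical stand-ins for illustration, not the library's BFS implementation:

```python
def brute_force_q(step_fn, state, action, gamma, k=0, max_k=50):
    # step_fn(state, action) -> (next_state, reward, absorbing) is a
    # hypothetical stand-in for the environment's deterministic dynamics.
    next_state, reward, absorbing = step_fn(state, action)
    if absorbing or k >= max_k:
        return reward
    # Deterministic Bellman backup over the two discrete actions.
    best = max(brute_force_q(step_fn, next_state, a, gamma, k + 1, max_k)
               for a in (0, 1))
    return reward + gamma * best

# Toy chain: action 1 moves right, action 0 stays; reaching state 3 is
# absorbing and pays reward 1, everything else pays 0.
def step_fn(s, a):
    s_next = s + 1 if a == 1 else s
    if s_next >= 3:
        return s_next, 1.0, True
    return s_next, 0.0, False

q = brute_force_q(step_fn, 0, 1, 0.9, max_k=20)   # two steps to the goal: ≈ 0.81
```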
LQR solver¶
mushroom_rl.solvers.lqr.compute_lqr_feedback_gain(lqr, max_iterations=100)[source]¶
Computes the optimal gain matrix K.
Parameters: - lqr (LQR) – LQR environment;
- max_iterations (int) – max iterations for convergence.
Returns: Feedback gain matrix K.
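One standard way to obtain K with a fixed iteration budget — plausibly what an iterative solver like this does — is to iterate the discrete-time Riccati equation. The A, B, Q, R matrices below are passed explicitly because how the LQR object stores them is not shown here; this is a sketch of the technique, not the library's code:

```python
import numpy as np

def lqr_gain_sketch(A, B, Q, R, max_iterations=100):
    # Iterate P <- Q + A'PA - A'PB (R + B'PB)^{-1} B'PA, then read off
    # K = (R + B'PB)^{-1} B'PA, so that the control law is a = -K s.
    P = Q.copy()
    for _ in range(max_iterations):
        BtPA = B.T @ P @ A
        P = Q + A.T @ P @ A - BtPA.T @ np.linalg.solve(R + B.T @ P @ B, BtPA)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Scalar sanity check: with A = B = Q = R = 1 the Riccati fixed point is
# the golden ratio, and K converges to (sqrt(5) - 1) / 2 ≈ 0.618.
K = lqr_gain_sketch(np.eye(1), np.eye(1), np.eye(1), np.eye(1))
```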
mushroom_rl.solvers.lqr.compute_lqr_P(lqr, K)[source]¶
Computes the P matrix for a given gain matrix K.
Parameters: - lqr (LQR) – LQR environment;
- K (np.ndarray) – controller matrix.
Returns: The P matrix of the value function.
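For a fixed (not necessarily optimal) gain, P solves a discrete Lyapunov equation for the closed-loop dynamics A - BK. A vectorised sketch, again with the problem matrices passed explicitly as an assumption about the setup:

```python
import numpy as np

def lqr_P_sketch(A, B, Q, R, K):
    # Solve P = Q + K'RK + (A - BK)' P (A - BK) by vectorisation:
    # flatten P and invert (I - M' kron M') with M the closed-loop matrix.
    M = A - B @ K
    n = A.shape[0]
    rhs = (Q + K.T @ R @ K).reshape(-1)
    P = np.linalg.solve(np.eye(n * n) - np.kron(M.T, M.T), rhs)
    return P.reshape(n, n)

# Scalar check: at the optimal gain K = (sqrt(5) - 1) / 2 for
# A = B = Q = R = 1, P comes out as the golden ratio (1 + sqrt(5)) / 2.
K = np.array([[(np.sqrt(5) - 1) / 2]])
P = lqr_P_sketch(np.eye(1), np.eye(1), np.eye(1), np.eye(1), K)
```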
mushroom_rl.solvers.lqr.compute_lqr_V(s, lqr, K)[source]¶
Computes the value function at a state s, with the given gain matrix K.
Parameters: - s (np.ndarray) – state;
- lqr (LQR) – LQR environment;
- K (np.ndarray) – controller matrix.
Returns: The value function at s.
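Assuming the reward is the negative quadratic cost -(s'Qs + a'Ra) — a sign convention assumed here, common when LQR is posed as an RL problem — the value of the linear policy a = -Ks is a negative quadratic form in the P matrix of compute_lqr_P:

```python
import numpy as np

def lqr_V_sketch(s, P):
    # V(s) = -s' P s under the assumed negative-cost reward convention.
    return -s.T @ P @ s

v = lqr_V_sketch(np.array([1.0, 1.0]), np.diag([2.0, 3.0]))   # -> -5.0
```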
mushroom_rl.solvers.lqr.compute_lqr_V_gaussian_policy(s, lqr, K, Sigma)[source]¶
Computes the value function at a state s, with the given gain matrix K and covariance Sigma.
Parameters: - s (np.ndarray) – state;
- lqr (LQR) – LQR environment;
- K (np.ndarray) – controller matrix;
- Sigma (np.ndarray) – covariance matrix.
Returns: The value function at s.
mushroom_rl.solvers.lqr.compute_lqr_Q(s, a, lqr, K)[source]¶
Computes the state-action value function Q at a state-action pair (s, a), with the given gain matrix K.
Parameters: - s (np.ndarray) – state;
- a (np.ndarray) – action;
- lqr (LQR) – LQR environment;
- K (np.ndarray) – controller matrix.
Returns: The Q function at s, a.
mushroom_rl.solvers.lqr.compute_lqr_Q_gaussian_policy(s, a, lqr, K, Sigma)[source]¶
Computes the state-action value function Q at a state-action pair (s, a), with the given gain matrix K and covariance Sigma.
Parameters: - s (np.ndarray) – state;
- a (np.ndarray) – action;
- lqr (LQR) – LQR environment;
- K (np.ndarray) – controller matrix;
- Sigma (np.ndarray) – covariance matrix.
Returns: The Q function at (s, a).
mushroom_rl.solvers.lqr.compute_lqr_V_gaussian_policy_gradient_K(s, lqr, K, Sigma)[source]¶
Computes the gradient of the objective function J (equal to the value function V) at state s w.r.t. the controller matrix K, under the current policy parameters K and Sigma, where J(s, K, Sigma) = V(s, K, Sigma).
Parameters: - s (np.ndarray) – state;
- lqr (LQR) – LQR environment;
- K (np.ndarray) – controller matrix;
- Sigma (np.ndarray) – covariance matrix.
Returns: The gradient of J w.r.t. K.
mushroom_rl.solvers.lqr.compute_lqr_Q_gaussian_policy_gradient_K(s, a, lqr, K, Sigma)[source]¶
Computes the gradient of the state-action value function Q at the state-action pair (s, a) w.r.t. the controller matrix K, under the current policy parameters K and Sigma.
Parameters: - s (np.ndarray) – state;
- a (np.ndarray) – action;
- lqr (LQR) – LQR environment;
- K (np.ndarray) – controller matrix;
- Sigma (np.ndarray) – covariance matrix.
Returns: The gradient of Q w.r.t. K.