Solvers

Dynamic programming

value_iteration(prob, reward, gamma, eps)[source]

Value iteration algorithm to solve a dynamic programming problem.

Parameters:
  • prob (np.ndarray) – transition probability matrix;

  • reward (np.ndarray) – reward matrix;

  • gamma (float) – discount factor;

  • eps (float) – accuracy threshold.

Returns:

The optimal value of each state.
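
As a guide to what value_iteration computes, the following self-contained sketch applies the Bellman optimality backup on a toy two-state, two-action MDP. The prob[s, a, s'] / reward[s, a, s'] tensor layout is an illustrative assumption and is not specified by this documentation.

    import numpy as np

    # Toy MDP: prob[s, a, s'] and reward[s, a, s'] (assumed layout).
    prob = np.array([[[0.9, 0.1], [0.2, 0.8]],
                     [[0.7, 0.3], [0.05, 0.95]]])
    reward = np.zeros((2, 2, 2))
    reward[:, :, 1] = 1.                       # reward for landing in state 1
    gamma, eps = 0.9, 1e-6

    value = np.zeros(prob.shape[0])
    while True:
        # Bellman optimality backup: V(s) <- max_a sum_s' P(s'|s,a) [r + gamma V(s')]
        q = np.einsum('ijk,ijk->ij', prob, reward + gamma * value[None, None, :])
        new_value = q.max(axis=1)
        if np.max(np.abs(new_value - value)) < eps:
            break
        value = new_value
    print(value)                               # optimal value of each state

The iteration stops when the sup-norm change of the value estimate falls below eps, mirroring the accuracy threshold parameter above.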

policy_iteration(prob, reward, gamma)[source]

Policy iteration algorithm to solve a dynamic programming problem.

Parameters:
  • prob (np.ndarray) – transition probability matrix;

  • reward (np.ndarray) – reward matrix;

  • gamma (float) – discount factor.

Returns:

The optimal value of each state and the optimal policy.
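
A companion sketch of policy iteration on the same toy MDP, using exact policy evaluation (a linear solve) followed by greedy improvement; again, the tensor layout is an assumption made for illustration.

    import numpy as np

    prob = np.array([[[0.9, 0.1], [0.2, 0.8]],
                     [[0.7, 0.3], [0.05, 0.95]]])
    reward = np.zeros((2, 2, 2))
    reward[:, :, 1] = 1.
    gamma = 0.9
    n_states = prob.shape[0]

    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        p_pi = prob[np.arange(n_states), policy]
        r_pi = np.einsum('ik,ik->i', p_pi, reward[np.arange(n_states), policy])
        value = np.linalg.solve(np.eye(n_states) - gamma * p_pi, r_pi)

        # Policy improvement: act greedily with respect to the evaluated value.
        q = np.einsum('ijk,ijk->ij', prob, reward + gamma * value[None, None, :])
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    print(value, policy)                       # optimal values and optimal policy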

Car-On-Hill brute-force solver

step(mdp, state, action)[source]

Perform a single environment step used to expand the search tree.

Parameters:
  • mdp (CarOnHill) – the Car-On-Hill environment;

  • state (np.ndarray) – the state;

  • action (np.ndarray) – the action.

Returns:

The transition that results from executing action in state.

bfs(mdp, frontier, k, max_k)[source]

Perform Breadth-First tree search.

Parameters:
  • mdp (CarOnHill) – the Car-On-Hill environment;

  • frontier (list) – the states at the frontier of the BFS;

  • k (int) – the current depth of the tree;

  • max_k (int) – maximum depth to consider.

Returns:

A tuple containing a flag indicating whether the search has ended, and the updated depth of the tree.

solve_car_on_hill(mdp, states, actions, gamma, max_k=50)[source]

Brute-force solver for the Car-On-Hill environment.

Parameters:
  • mdp (CarOnHill) – the Car-On-Hill environment;

  • states (np.ndarray) – the states;

  • actions (np.ndarray) – the actions;

  • gamma (float) – the discount factor;

  • max_k (int, 50) – maximum depth to consider.

Returns:

The Q-value for each state-action tuple.
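
A hypothetical usage sketch of the brute-force solver. The import paths and the CarOnHill constructor follow the MushroomRL package layout, which is an assumption, and the element-wise pairing of states with actions as well as the grid itself are illustrative choices.

    import numpy as np

    # Assumed MushroomRL-style imports; adjust to the actual package layout.
    from mushroom_rl.environments import CarOnHill
    from mushroom_rl.solvers.car_on_hill import solve_car_on_hill

    mdp = CarOnHill()                          # assumed default constructor

    # Illustrative grid over (position, velocity); each state is paired with
    # both discrete actions, so states and actions have the same length.
    grid = np.array([[p, v] for p in np.linspace(-1., 1., 5)
                            for v in np.linspace(-3., 3., 5)])
    states = np.repeat(grid, 2, axis=0)
    actions = np.tile(np.array([[0], [1]]), (len(grid), 1))

    q = solve_car_on_hill(mdp, states, actions, gamma=0.95, max_k=50)
    # One Q-value per (state, action) pair, obtained by the breadth-first
    # tree search described above.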

LQR solver

compute_lqr_feedback_gain(lqr, max_iterations=100)[source]

Computes the optimal feedback gain matrix K of the LQR problem.

Parameters:
  • lqr (LQR) – LQR environment;

  • max_iterations (int, 100) – maximum number of iterations allowed for convergence.

Returns:

Feedback gain matrix K.
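
The sketch below shows the kind of discounted Riccati iteration such a computation typically performs, assuming dynamics s' = A s + B a, quadratic cost s' Q s + a' R a, discount gamma, and a control law a = -K s. It illustrates the math only and is not the library implementation.

    import numpy as np

    def riccati_feedback_gain(A, B, Q, R, gamma, max_iterations=100, tol=1e-10):
        """Iterate the discounted Riccati update and return the gain K."""
        P = np.copy(Q)
        for _ in range(max_iterations):
            # Gain induced by the current P, then re-evaluate P for that gain.
            K = gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)
            P_new = Q + K.T @ R @ K + gamma * (A - B @ K).T @ P @ (A - B @ K)
            if np.max(np.abs(P_new - P)) < tol:
                return K
            P = P_new
        return K

    A = np.array([[1., 0.1], [0., 1.]])
    B = np.array([[0.], [0.1]])
    K = riccati_feedback_gain(A, B, Q=np.eye(2), R=np.array([[0.1]]), gamma=0.9)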

compute_lqr_P(lqr, K)[source]

Computes the P matrix for a given gain matrix K.

Parameters:
  • lqr (LQR) – LQR environment;

  • K (np.ndarray) – controller matrix.

Returns:

The P matrix of the value function.
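
For a fixed gain K with control a = -K s, P is the fixed point of a discounted Lyapunov-style equation, P = Q + K'RK + gamma (A - BK)' P (A - BK). A minimal sketch under the same cost/dynamics assumptions as above; the fixed-point iteration converges whenever gamma times the squared spectral radius of A - BK is below one.

    import numpy as np

    def lqr_P_for_gain(A, B, Q, R, K, gamma, n_iterations=1000):
        """Fixed-point iteration for the value matrix P of the law a = -K s."""
        A_cl = A - B @ K                   # closed-loop dynamics matrix
        cost = Q + K.T @ R @ K             # per-step quadratic cost under K
        P = np.zeros_like(Q)
        for _ in range(n_iterations):
            P = cost + gamma * A_cl.T @ P @ A_cl
        return P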

compute_lqr_V(s, lqr, K)[source]

Computes the value function at a state s, with the given gain matrix K.

Parameters:
  • s (np.ndarray) – state;

  • lqr (LQR) – LQR environment;

  • K (np.ndarray) – controller matrix.

Returns:

The value function at s.
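
Assuming the reward is the negated quadratic cost (a sign convention not stated here), the value of the deterministic policy a = -K s is simply the quadratic form V(s) = -s' P s, where P would be obtained as in the previous sketch for the gain K:

    import numpy as np

    def lqr_V(s, P):
        # Value under a = -K s, assuming reward = -(s'Qs + a'Ra).
        s = np.atleast_1d(s)
        return -s @ P @ s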

compute_lqr_V_gaussian_policy(s, lqr, K, Sigma)[source]

Computes the value function at a state s, with the given gain matrix K and covariance Sigma.

Parameters:
  • s (np.ndarray) – state;

  • lqr (LQR) – LQR environment;

  • K (np.ndarray) – controller matrix;

  • Sigma (np.ndarray) – covariance matrix.

Returns:

The value function at s.
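
For a Gaussian policy a ~ N(-K s, Sigma), the exploration noise only adds a constant offset to the quadratic value. Under the same assumptions one can derive V(s) = -(s' P s + b) with b = (tr(R Sigma) + gamma tr(B' P B Sigma)) / (1 - gamma), which the following sketch implements:

    import numpy as np

    def lqr_V_gaussian(s, B, R, P, Sigma, gamma):
        # Discounted penalty accumulated by the exploration noise at every step.
        b = (np.trace(R @ Sigma)
             + gamma * np.trace(B.T @ P @ B @ Sigma)) / (1. - gamma)
        s = np.atleast_1d(s)
        return -(s @ P @ s + b)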

compute_lqr_Q(s, a, lqr, K)[source]

Computes the state-action value function Q at a state-action pair (s, a), with the given gain matrix K.

Parameters:
  • s (np.ndarray) – state;

  • a (np.ndarray) – action;

  • lqr (LQR) – LQR environment;

  • K (np.ndarray) – controller matrix.

Returns:

The Q function at s, a.
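
The state-action value follows from one Bellman step of the same quadratic model: Q(s, a) = -(s' Q s + a' R a) - gamma (A s + B a)' P (A s + B a), where P is the value matrix of the gain K. A sketch under the same assumptions (Q below denotes the state-cost matrix):

    import numpy as np

    def lqr_Q(s, a, A, B, Q, R, P, gamma):
        # One-step backup with the quadratic value -s'Ps as the continuation.
        s, a = np.atleast_1d(s), np.atleast_1d(a)
        s_next = A @ s + B @ a
        return -(s @ Q @ s + a @ R @ a + gamma * s_next @ P @ s_next)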

compute_lqr_Q_gaussian_policy(s, a, lqr, K, Sigma)[source]

Computes the state-action value function Q at a state-action pair (s, a), with the given gain matrix K and covariance Sigma.

Parameters:
  • s (np.ndarray) – state;

  • a (np.ndarray) – action;

  • lqr (LQR) – LQR environment;

  • K (np.ndarray) – controller matrix;

  • Sigma (np.ndarray) – covariance matrix.

Returns:

The Q function at (s, a).

compute_lqr_V_gaussian_policy_gradient_K(s, lqr, K, Sigma)[source]

Computes the gradient w.r.t. the controller matrix K of the objective function J at state s, with the current policy parameters K and Sigma, where J(s, K, Sigma) equals the value function V(s, K, Sigma).

Parameters:
  • s (np.ndarray) – state;

  • lqr (LQR) – LQR environment;

  • K (np.ndarray) – controller matrix;

  • Sigma (np.ndarray) – covariance matrix.

Returns:

The gradient of J w.r.t. K.
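
Analytic gradients like this one are easy to validate numerically. The helper below computes a central finite difference w.r.t. K for any scalar-valued function of K; the commented usage shows how it could be compared against the analytic gradient (the call and variable names there are assumptions, not part of this documentation).

    import numpy as np

    def finite_difference_grad_K(V_fn, K, h=1e-6):
        """Central finite difference of a scalar-valued V_fn(K) w.r.t. K."""
        grad = np.zeros_like(K)
        for i in range(K.shape[0]):
            for j in range(K.shape[1]):
                dK = np.zeros_like(K)
                dK[i, j] = h
                grad[i, j] = (V_fn(K + dK) - V_fn(K - dK)) / (2. * h)
        return grad

    # Hypothetical check (names assumed):
    # numeric = finite_difference_grad_K(
    #     lambda K_: compute_lqr_V_gaussian_policy(s, lqr, K_, Sigma), K)
    # analytic = compute_lqr_V_gaussian_policy_gradient_K(s, lqr, K, Sigma)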

compute_lqr_Q_gaussian_policy_gradient_K(s, a, lqr, K, Sigma)[source]

Computes the gradient of the state-action value function Q at the state-action pair (s, a), w.r.t. the controller matrix K, with the current policy parameters K and Sigma.

Parameters:
  • s (np.ndarray) – state;

  • a (np.ndarray) – action;

  • lqr (LQR) – LQR environment;

  • K (np.ndarray) – controller matrix;

  • Sigma (np.ndarray) – covariance matrix.

Returns:

The gradient of Q w.r.t. K.