How to make a simple experiment

The main purpose of MushroomRL is to simplify the scripting of RL experiments. A standard example of a script to run an experiment in MushroomRL, consists of:

an initial part where the setting of the experiment are specified;
a middle part where the experiment is run;
a final part where operations like evaluation, plot and save can be done.

A RL experiment consists of:

a MDP;
an agent;
a core.

A MDP is the problem to be solved by the agent. It contains the function to move the agent in the environment according to the provided action. The MDP can be simply created with:

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

from mushroom_rl.algorithms.value import FQI
from mushroom_rl.core import Core
from mushroom_rl.environments import CarOnHill
from mushroom_rl.policy import EpsGreedy
from mushroom_rl.utils.dataset import compute_J
from mushroom_rl.utils.parameters import Parameter

mdp = CarOnHill()

A MushroomRL agent is the algorithm that is run to learn in the MDP. It consists of a policy approximator and of the methods to improve the policy during the learning. It also contains the features to extract in the case of MDP with continuous state and action spaces. An agent can be defined this way:

# Policy
epsilon = Parameter(value=1.)
pi = EpsGreedy(epsilon=epsilon)

# Approximator
approximator_params = dict(input_shape=mdp.info.observation_space.shape,
                           n_actions=mdp.info.action_space.n,
                           n_estimators=50,
                           min_samples_split=5,
                           min_samples_leaf=2)
approximator = ExtraTreesRegressor

# Agent
agent = FQI(mdp.info, pi, approximator, n_iterations=20,
            approximator_params=approximator_params)

This piece of code creates the policy followed by the agent (e.g. \(\varepsilon\)-greedy) with \(\varepsilon = 1\). Then, the policy approximator is created specifying the parameters to create it and the class (in this case, the ExtraTreesRegressor class of scikit-learn is used). Eventually, the agent is created calling the algorithm class and providing the approximator and the policy, together with parameters used by the algorithm.

To run the experiment, the core module has to be used. This module requires the agent and the MDP object and contains the function to learn in the MDP and evaluate the learned policy. It can be created with:

core = Core(agent, mdp)

Once the core has been created, the agent can be trained collecting a dataset and fitting the policy:

core.learn(n_episodes=1000, n_episodes_per_fit=1000)

In this case, the agent’s policy is fitted only once, after that 1000 episodes have been collected. This is a common practice in batch RL algorithms such as FQI where, initially, samples are randomly collected and then the policy is fitted using the whole dataset of collected samples.

Eventually, some operations to evaluate the learned policy can be done. This way the user can, for instance, compute the performance of the agent through the collected rewards during an evaluation run. Fixing \(\varepsilon = 0\), the greedy policy is applied starting from the provided initial states, then the average cumulative discounted reward is returned.

pi.set_epsilon(Parameter(0.))
initial_state = np.array([[-.5, 0.]])
dataset = core.evaluate(initial_states=initial_state)

print(compute_J(dataset, gamma=mdp.info.gamma))