How to make an advanced experiment¶
Continuous MDPs are a challenging class of problems to solve in RL. In these problems, a tabular regressor is not enough to approximate the Q-function, since there are an infinite number of states/actions. The solution to solve them is to use a function approximator (e.g. neural network) fed with the raw values of states and actions. In the case a linear approximator is used, it is convenient to enlarge the input space with the space of non-linear features extracted from the raw values. This way, the linear approximator is often able to solve the MDPs, despite its simplicity. Many RL algorithms rely on the use of a linear approximator to solve a MDP, therefore the use of features is very important. This tutorial shows how to solve a continuous MDP in MushroomRL using an algorithm that requires the use of a linear approximator.
Initially, the MDP and the policy are created:
import numpy as np from mushroom_rl.algorithms.value import SARSALambdaContinuous from mushroom_rl.approximators.parametric import LinearApproximator from mushroom_rl.core import Core from mushroom_rl.features import Features from mushroom_rl.features.tiles import Tiles from mushroom_rl.policy import EpsGreedy from mushroom_rl.utils.callbacks import CollectDataset from mushroom_rl.utils.parameters import Parameter from mushroom_rl.environments import Gym # MDP mdp = Gym(name='MountainCar-v0', horizon=np.inf, gamma=1.) # Policy epsilon = Parameter(value=0.) pi = EpsGreedy(epsilon=epsilon)
This is an environment created with the MushroomRL interface to the OpenAI Gym
library. Each environment offered by OpenAI Gym can be created this way simply
providing the corresponding id in the
name parameter, except for the Atari
that are managed by a separate class.
After the creation of the MDP, the tiles features are created:
n_tilings = 10 tilings = Tiles.generate(n_tilings, [10, 10], mdp.info.observation_space.low, mdp.info.observation_space.high) features = Features(tilings=tilings) approximator_params = dict(input_shape=(features.size,), output_shape=(mdp.info.action_space.n,), n_actions=mdp.info.action_space.n)
In this example, we use sparse coding by means of tiles features. The
generate method generates
n_tilings grids of 10x10 tilings evenly spaced
(the way the tilings are created is explained in “Reinforcement Learning: An Introduction”,
Sutton & Barto, 1998). Eventually, the grid is passed to the
factory method that returns the features class.
MushroomRL offers other type of features such a radial basis functions and polynomial features. The former have also a faster implementation written in Tensorflow that can be used transparently.
Then, the agent is created as usual, but this time passing the feature to it. It is important to notice that the learning rate is divided by the number of tilings for the correctness of the update (see “Reinforcement Learning: An Introduction”, Sutton & Barto, 1998 for details). After that, the learning is run as usual:
learning_rate = Parameter(.1 / n_tilings) agent = SARSALambdaContinuous(mdp.info, pi, LinearApproximator, approximator_params=approximator_params, learning_rate=learning_rate, lambda_coeff=.9, features=features) # Algorithm collect_dataset = CollectDataset() callbacks = [collect_dataset] core = Core(agent, mdp, callbacks_episode=callbacks) # Train core.learn(n_episodes=100, n_steps_per_fit=1)
To visualize the learned policy the rendering method of OpenAI Gym is used. To
activate the rendering in the environments that supports it, it is necessary to