How to use the Environment interface
Here we explain in detail how to use the MushroomRL Environment interface. First, we explain how to use the registration interface, which enables the construction of environments from a string specification. Then we construct a toy environment to show how to add new MushroomRL environments.
Old-school environment creation
In MushroomRL, environments are simply classes that extend the Environment interface. To create an environment, you can call its constructor. You can build the Segway environment as follows:
from mushroom_rl.environments import Segway
env = Segway()
Some environments may have a constructor that is too low-level, and you may want to generate a vanilla version of the environment using as few parameters as possible.
An example is the Linear Quadratic Regulator (LQR) environment, which requires a set of matrices to define the linear dynamics and the quadratic cost function. To provide an easier interface, the generate class method is exposed. To generate a simple 3-dimensional LQR problem, with identity transition and action matrices and a trivial quadratic cost function, you can use:
from mushroom_rl.environments import LQR
env = LQR.generate(dimensions=3)
See the documentation of LQR.generate for all the available parameters and their effects.
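To make the "identity matrices and quadratic cost" description concrete, here is a minimal NumPy sketch of one noiseless step of such a system. It is an illustration only, not the LQR class itself: the helper lqr_step is hypothetical, and the identity cost matrices Q and R are an assumption, since the values actually chosen by LQR.generate are not stated here.

```python
import numpy as np

dimensions = 3
A = np.eye(dimensions)  # identity transition matrix
B = np.eye(dimensions)  # identity action matrix
Q = np.eye(dimensions)  # state cost matrix (assumed, for illustration only)
R = np.eye(dimensions)  # action cost matrix (assumed, for illustration only)


def lqr_step(x, u):
    """Return the next state and reward of a noiseless linear-quadratic system."""
    # The reward is the negative quadratic cost of the current state and action
    reward = -(x @ Q @ x + u @ R @ u)
    # Linear dynamics: x' = A x + B u
    next_x = A @ x + B @ u
    return next_x, reward


x = np.array([1.0, 0.0, -1.0])
u = np.array([0.5, 0.5, 0.5])
next_x, reward = lqr_step(x, u)
```

With identity A and B, the action is simply added to the state, which is why this configuration is a convenient default for quick experiments.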
Environment registration
Since version 1.7.0, it is possible to register MushroomRL environments and build them by specifying only their name.
You can list the registered environments as follows:
from mushroom_rl.core import Environment
env_list = Environment.list_registered()
print(env_list)
Every registered environment can be built using its name. For example, to create the ShipSteering environment you can use:
env = Environment.make('ShipSteering')
To build some environments, you may need to pass additional parameters.
An example is the Gym environment, which wraps most OpenAI Gym environments, except the Atari ones, which use the Atari environment to implement proper preprocessing.
If you want to build the Pendulum-v1 gym environment, you need to pass the environment name as a parameter:
env = Environment.make('Gym', 'Pendulum-v1')
However, for environments that are interfaces to other libraries, such as Gym, Atari, or DMControl, a notation with a dot separator is supported. For example, to create the pendulum you can also use:
env = Environment.make('Gym.Pendulum-v1')
Or, to create the hopper environment with the hop task from the DeepMind Control Suite, you can use:
env = Environment.make('DMControl.hopper.hop')
If an environment implements the generate method, it will be used to build the environment instead of the constructor. As the generate method is a higher-level interface than the constructor, it requires fewer parameters.
To generate the 3-dimensional LQR problem mentioned in the previous section you can use:
env = Environment.make('LQR', dimensions=3)
Finally, you can register new environments. Suppose that you have created the environment class MyNewEnv, which extends the base Environment class. You can register the environment as follows:
MyNewEnv.register()
You can put this line of code after the class declaration, or in the __init__.py file of your library.
If you do so, the environment is registered the first time you import the file. Notice that this registration is not saved on disk, so you need to register the environment every time the Python interpreter is started.
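The registration behaviour described in this section (name lookup, dot notation, and preferring generate over the constructor) can be pictured with a small pure-Python sketch. This is a simplified illustration under our own assumptions, not MushroomRL's implementation: the classes RegistrySketch, WrapperEnv, and MatrixEnv are hypothetical.

```python
class RegistrySketch:
    """Toy stand-in for a registry-enabled base class (not MushroomRL code)."""
    _registered = {}

    @classmethod
    def register(cls):
        # Store the class under its own name; registering twice is harmless
        RegistrySketch._registered.setdefault(cls.__name__, cls)

    @staticmethod
    def list_registered():
        return list(RegistrySketch._registered)

    @staticmethod
    def make(name, *args, **kwargs):
        # 'Wrapper.a.b' is treated like make('Wrapper', 'a', 'b')
        name, *prefix_args = name.split('.')
        env_cls = RegistrySketch._registered[name]
        # Prefer the higher-level generate method when the class defines one
        builder = getattr(env_cls, 'generate', env_cls)
        return builder(*prefix_args, *args, **kwargs)


class WrapperEnv(RegistrySketch):
    def __init__(self, domain, task=None):
        self.domain, self.task = domain, task


class MatrixEnv(RegistrySketch):
    def __init__(self, A):
        self.A = A

    @classmethod
    def generate(cls, dimensions):
        # Build an identity matrix as nested lists
        return cls([[float(i == j) for j in range(dimensions)]
                    for i in range(dimensions)])


WrapperEnv.register()
MatrixEnv.register()

env = RegistrySketch.make('WrapperEnv.hopper.hop')
lqr_like = RegistrySketch.make('MatrixEnv', dimensions=2)
```

Because the registry in this sketch is a plain in-memory dictionary, it is empty again in every new interpreter session, which mirrors the note above about registration not being saved on disk.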
Creating a new environment
We show you an example of how to construct a MushroomRL environment. We create a simple room environment, with discrete actions, continuous state space, and mildly stochastic dynamics. The objective is to move the agent from any point of the room towards the goal point. The agent takes a penalty at every step equal to the distance to the objective. When the agent reaches the goal the episode ends. The agent can move in the room by using one of the 4 discrete actions, North, South, West, East.
First of all, we import all the required classes: NumPy for working with arrays, the Environment interface, the MDPInfo structure, which contains the basic information about the environment, and the Box and Discrete classes, which describe the observation and action spaces.
Since we also implement a simple visualization function, we import the Viewer class, a Pygame wrapper that can be used to easily render RL environments.
import numpy as np
from mushroom_rl.core import Environment, MDPInfo
from mushroom_rl.rl_utils.spaces import Box, Discrete
from mushroom_rl.utils.viewer import Viewer
Now, we can create the environment class.
We first extend the environment class and create the constructor:
class RoomToyEnv(Environment):
    def __init__(self, size=5., goal=(2.5, 2.5), goal_radius=0.6):
        # Save important environment information
        self._size = size
        self._goal = np.array(goal)
        self._goal_radius = goal_radius

        # Create the action space.
        action_space = Discrete(4)  # 4 actions: N, S, W, E

        # Create the observation space. It's a 2D box of dimension (size x size).
        # You can also specify low and high arrays, if every component has different limits
        shape = (2,)
        observation_space = Box(0, size, shape)

        # Create the MDPInfo structure, needed by the environment interface
        mdp_info = MDPInfo(observation_space, action_space, gamma=0.99, horizon=100, dt=0.1)

        super().__init__(mdp_info)

        # Create an instance attribute to store the current state
        self._state = None

        # Create the viewer
        self._viewer = Viewer(size, size)
It’s important to notice that the superclass constructor needs the information stored in the MDPInfo structure.
This structure contains the action and observation spaces, the discount factor gamma, and the horizon.
The horizon is used to cut trajectories that are too long. When the horizon is reached, the episode is terminated; however, the final state might not be absorbing. The absorbing state flag is set explicitly in the environment step function.
Also, notice that the Environment superclass has no notion of the environment state, so we need to store it ourselves. That’s why we create the self._state variable and initialize it to None.
Other environment information, such as the goal position and radius, is stored in instance attributes.
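The difference between reaching the horizon and reaching an absorbing state can be sketched with a generic episode loop. This is an illustration under our own assumptions, not the code MushroomRL's Core runs: the Chain environment, the rollout function, and the last flag name are hypothetical.

```python
class Chain:
    """Toy 1D environment: the state is a counter, the goal is absorbing."""

    def __init__(self, goal):
        self._goal = goal
        self._state = None

    def reset(self):
        self._state = 0
        return self._state

    def step(self, action):
        self._state += action
        # Only the environment knows whether a state is absorbing
        absorbing = self._state >= self._goal
        return self._state, -1.0, absorbing, {}


def rollout(env, policy, horizon):
    """Collect one episode, ending at an absorbing state or at the horizon."""
    state = env.reset()
    samples = []
    for t in range(horizon):
        action = policy(state)
        next_state, reward, absorbing, _ = env.step(action)
        # The episode ends in both cases, but only one of them is absorbing
        last = absorbing or t == horizon - 1
        samples.append((state, action, reward, next_state, absorbing, last))
        state = next_state
        if last:
            break
    return samples


cut_by_horizon = rollout(Chain(goal=10), policy=lambda s: 1, horizon=3)
ended_at_goal = rollout(Chain(goal=2), policy=lambda s: 1, horizon=10)
```

In the first rollout the final sample has last set but absorbing unset, so a learning algorithm may still bootstrap from the final state; in the second rollout both flags are set.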
Now we implement the reset function. This function is called at the beginning of every episode. It’s possible to force the initial state. For this reason, we have to manage two scenarios: when the initial state is given and when it is set to None. If the initial state is not given, we sample randomly among the valid states.
    def reset(self, state=None):
        if state is None:
            # Generate randomly a state inside the state space, but not inside the goal
            self._state = np.random.rand(2) * self._size

            # Check if it's inside the goal radius and repeat the sample if necessary
            while np.linalg.norm(self._state - self._goal) < self._goal_radius:
                self._state = np.random.rand(2) * self._size
        else:
            # If an initial state is provided, set it and return, after checking it's valid.
            assert np.all(state < self._size) and np.all(state > 0)
            assert np.linalg.norm(state - self._goal) > self._goal_radius
            self._state = state

        # Return the current state
        return self._state
Now it’s time to implement the step function, which specifies the transition function of the environment, computes the reward, and signals absorbing states, i.e. states where every action keeps you in the same state and yields 0 reward. When an absorbing state is reached, we cut the trajectory: its value is always 0, so no further exploration is needed.
    def step(self, action):
        # Convert the action into a N, S, W, E movement
        movement = np.zeros(2)
        if action == 0:
            movement[1] += 0.1
        elif action == 1:
            movement[1] -= 0.1
        elif action == 2:
            movement[0] -= 0.1
        elif action == 3:
            movement[0] += 0.1
        else:
            # Raise the error: asserting on an exception instance never fails
            raise ValueError('The environment has only 4 actions')

        # Apply the movement with some noise:
        self._state += movement + np.random.randn(2)*0.05

        # Clip the state space inside the boundaries.
        low = self.info.observation_space.low
        high = self.info.observation_space.high

        self._state = Environment._bound(self._state, low, high)

        # Compute the distance from the goal
        goal_distance = np.linalg.norm(self._state - self._goal)

        # Compute the reward as distance penalty from goal
        reward = -goal_distance

        # Set the absorbing flag if goal is reached
        absorbing = goal_distance < self._goal_radius

        # Return all the information + empty dictionary (used to pass additional information)
        return self._state, reward, absorbing, {}
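The reason the absorbing flag matters to the learner can be seen in the one-step bootstrapped target used by many value-based algorithms. The sketch below is a generic, Q-learning-style illustration and not MushroomRL code; the function td_target is hypothetical.

```python
def td_target(reward, next_q_values, absorbing, gamma=0.99):
    """One-step bootstrapped target for a value-based update."""
    # The value of an absorbing state is 0 by definition, so only the
    # immediate reward contributes when the transition ends in one
    bootstrap = 0.0 if absorbing else max(next_q_values)
    return reward + gamma * bootstrap


regular = td_target(-1.0, [-2.0, -3.0], absorbing=False, gamma=0.5)
terminal = td_target(-1.0, [-2.0, -3.0], absorbing=True, gamma=0.5)
```

If the flag were never set at the goal, the target would keep bootstrapping from the value estimates of the goal state, and the agent would not learn that the penalty stops there.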
Finally, we implement the render function using our Viewer class. This class wraps Pygame to provide an easy visualization tool for 2D Reinforcement Learning environments. The viewer class has many functionalities, but here we simply draw two circles representing the agent and the goal area:
    def render(self, record=False):
        # Draw a red circle for the agent
        self._viewer.circle(self._state, 0.1, color=(255, 0, 0))

        # Draw a green circle for the goal
        self._viewer.circle(self._goal, self._goal_radius, color=(0, 255, 0))

        # Get the image if the record flag is set to true
        frame = self._viewer.get_frame() if record else None

        # Display the image for the control time (0.1 seconds)
        self._viewer.display(self.info.dt)

        return frame
For more information about the viewer, refer to the class documentation.
To complete our environment, we can also register it, as described in the previous section of this tutorial:
# Register the class
RoomToyEnv.register()
Learning in the toy environment
Now that we have created our environment, we try to solve it using Reinforcement Learning. The following code uses the True Online SARSA-Lambda algorithm with a linear approximator over tiles features.
We first import all necessary classes and utilities, then we construct the environment (we set the seed for reproducibility).
if __name__ == '__main__':
    from mushroom_rl.core import Core
    from mushroom_rl.algorithms.value import TrueOnlineSARSALambda
    from mushroom_rl.policy import EpsGreedy
    from mushroom_rl.features import Features
    from mushroom_rl.features.tiles import Tiles
    from mushroom_rl.rl_utils.parameters import Parameter

    # Set the seed
    np.random.seed(1)

    # Create the toy environment with default parameters
    env = Environment.make('RoomToyEnv')
We now create the agent: an epsilon-greedy policy on top of a linear approximator using tiles features, similar to the one used in the Mountain Car experiment from R. Sutton's book.
    epsilon = Parameter(value=0.1)
    pi = EpsGreedy(epsilon=epsilon)

    # Creating a simple agent using linear approximator with tiles
    n_tilings = 5
    tilings = Tiles.generate(n_tilings, [10, 10],
                             env.info.observation_space.low,
                             env.info.observation_space.high)
    features = Features(tilings=tilings)

    learning_rate = Parameter(.1 / n_tilings)

    approximator_params = dict(input_shape=(features.size,),
                               output_shape=(env.info.action_space.n,),
                               n_actions=env.info.action_space.n)

    agent = TrueOnlineSARSALambda(env.info, pi,
                                  approximator_params=approximator_params,
                                  features=features,
                                  learning_rate=learning_rate,
                                  lambda_coeff=.9)
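To give an intuition for what tiles features compute, here is a simplified pure-Python sketch of tile coding over a 2D box. It is not the Tiles implementation used above: the function active_tiles is hypothetical, and the uniform offset between tilings is an assumption made for illustration.

```python
def active_tiles(state, low, high, n_tiles, n_tilings):
    """Return one active tile index per tiling for a point inside a box."""
    tiles_per_tiling = 1
    for n in n_tiles:
        tiles_per_tiling *= n

    indices = []
    for k in range(n_tilings):
        flat = 0
        for x, lo, hi, n in zip(state, low, high, n_tiles):
            width = (hi - lo) / n
            # Shift each tiling by a fraction of a tile width (assumed offset scheme)
            shifted = x - lo + k * width / n_tilings
            # Clamp so that points near the upper bound stay in the last tile
            cell = min(int(shifted / width), n - 1)
            flat = flat * n + cell
        # Each tiling owns its own block of feature indices
        indices.append(k * tiles_per_tiling + flat)
    return indices


tiles = active_tiles((0.45, 0.0), low=(0.0, 0.0), high=(5.0, 5.0),
                     n_tiles=[10, 10], n_tilings=5)
```

In this sketch every state activates exactly n_tilings binary features, one per tiling. This is a plausible reason for dividing the learning rate by n_tilings in the code above: the effective step size of the linear update then stays close to 0.1 regardless of how many tilings are used.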
Finally, using the Core class we set up an RL experiment. We first evaluate the initial policy for 3 episodes on the environment. Then we learn the task using the algorithm built above for 20000 steps.
In the end, we evaluate the learned policy for 3 more episodes.
    core = Core(agent, env)

    # Visualize initial policy for 3 episodes
    dataset = core.evaluate(n_episodes=3, render=True)

    # Print the average objective value before learning
    J = np.mean(dataset.discounted_return)
    print(f'Objective function before learning: {J}')

    # Train
    core.learn(n_steps=20000, n_steps_per_fit=1, render=False)

    # Visualize results for 3 episodes
    dataset = core.evaluate(n_episodes=3, render=True)

    # Print the average objective value after learning
    J = np.mean(dataset.discounted_return)
    print(f'Objective function after learning: {J}')
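The objective J printed above is the discounted return averaged over the evaluation episodes. As a sketch of what this quantity measures, assuming each episode is available as a plain list of rewards (the helper discounted_return below is hypothetical and does not use the dataset API), it can be computed as follows:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma to the power of its time step."""
    total = 0.0
    # Accumulate backwards so each earlier step applies one more discount factor
    for r in reversed(rewards):
        total = r + gamma * total
    return total


episodes = [[-1.0, -1.0, -1.0], [-2.0, 0.0]]
J = sum(discounted_return(ep, gamma=0.5) for ep in episodes) / len(episodes)
```

Since the reward in the room environment is the negative distance from the goal, J is always negative there, and values closer to zero indicate that the agent reaches the goal faster.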