How to use the Environment interface

Here we explain in detail the usage of the MushroomRL Environment interface. First, we explain how to construct environments and how to use the registration interface, which enables building environments from a string specification. Then we construct a toy environment to show how to add new environments to MushroomRL.

Old-school environment creation

In MushroomRL, environments are simply classes that extend the Environment interface. To create an environment, you just call its constructor. For example, you can build the Segway environment as follows:

from mushroom_rl.environments import Segway

env = Segway()

Some environments may have a constructor which is too low level, and you may want to generate a vanilla version of it using as few parameters as possible. An example is the Linear Quadratic Regulator (LQR) environment, which requires a set of matrices to define the linear dynamics and the quadratic cost function. To provide an easier interface, the generate class method is exposed. To generate a simple 3-dimensional LQR problem, with identity transition and action matrices and a trivial quadratic cost function, you can use:

from mushroom_rl.environments import LQR

env = LQR.generate(dimensions=3)

See the documentation of LQR.generate for all the available parameters and their effects.
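If you just want a quick reminder of the available options, you can also print the docstring directly from a Python shell:

from mushroom_rl.environments import LQR

# Show the documentation of the generate class method, listing all its parameters
help(LQR.generate)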

Environment registration

From version 1.7.0, it is possible to register MushroomRL environments and build the environment by specifying only the name.

You can list the registered environments as follows:

from mushroom_rl.core import Environment

env_list = Environment.list_registered()
print(env_list)

Every registered environment can be built using its name. For example, to create the ShipSteering environment you can use:

env = Environment.make('ShipSteering')

To build environments, you may need to pass additional parameters. An example is the Gym environment, which wraps most OpenAI Gym environments, except the Atari ones; those are handled by the dedicated Atari environment, which implements the proper frame preprocessing.

If you want to build the Pendulum-v1 Gym environment, you need to pass its name as a parameter:

env = Environment.make('Gym', 'Pendulum-v1')

However, for environments that are interfaces to other libraries, such as Gym, Atari or DMControl, a dot-separated notation is supported. For example, to create the pendulum you can also use:

env = Environment.make('Gym.Pendulum-v1')

Or, to create the hopper environment with the hop task from the DeepMind Control Suite, you can use:

env = Environment.make('DMControl.hopper.hop')

If an environment implements the generate method, it will be used to build the environment instead of the constructor. As generate is a higher-level interface than the constructor, it typically requires fewer parameters.

To generate the 3-dimensional LQR problem mentioned in the previous section you can use:

env = Environment.generate('LQR', dimensions=3)

Finally, you can register new environments. Suppose that you have created the environment class MyNewEnv, which extends the base Environment class. You can register the environment as follows:

MyNewEnv.register()

You can put this line of code after the class declaration, or in the __init__.py file of your library: this way, the environment is registered the first time the file is imported. Notice that this registration is not saved on disk, so you need to register the environment every time the Python interpreter is started.
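For instance, a minimal sketch of such an __init__.py could look as follows (the my_library package and my_new_env module names are purely hypothetical placeholders):

# my_library/__init__.py  (hypothetical package layout)
from my_library.my_new_env import MyNewEnv

# Registering at import time makes 'MyNewEnv' available to
# Environment.make as soon as the package is imported
MyNewEnv.register()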

Creating a new environment

We now show an example of how to construct a MushroomRL environment. We create a simple room environment with discrete actions, a continuous state space, and mildly stochastic dynamics. The objective is to move the agent from any point of the room towards the goal point. At every step, the agent receives a penalty equal to its distance from the goal. When the agent reaches the goal, the episode ends. The agent can move in the room using one of four discrete actions: North, South, West, and East.

First of all, we import all the required classes: NumPy for working with arrays, and the Environment interface together with the MDPInfo structure, which contains the basic information about the environment. We also import the Box and Discrete classes to define the observation and action spaces.

Given that we are implementing a simple visualization function, we also import the Viewer class, a Pygame wrapper that can be used to easily render RL environments.

import numpy as np

from mushroom_rl.core import Environment, MDPInfo
from mushroom_rl.utils.spaces import Box, Discrete

from mushroom_rl.utils.viewer import Viewer

Now, we can create the environment class.

We first extend the environment class and create the constructor:

class RoomToyEnv(Environment):
    def __init__(self, size=5., goal=(2.5, 2.5), goal_radius=0.6):

        # Save important environment information
        self._size = size
        self._goal = np.array(goal)
        self._goal_radius = goal_radius

        # Create the action space.
        action_space = Discrete(4)  # 4 actions: N, S, W, E

        # Create the observation space. It's a 2D box of dimension (size x size).
        # You can also specify low and high arrays if every component has different limits
        shape = (2,)
        observation_space = Box(0, size, shape)

        # Create the MDPInfo structure, needed by the environment interface
        mdp_info = MDPInfo(observation_space, action_space, gamma=0.99, horizon=100, dt=0.1)

        super().__init__(mdp_info)

        # Create a state class variable to store the current state
        self._state = None

        # Create the viewer
        self._viewer = Viewer(size, size)

It’s important to notice that the superclass constructor needs the information stored in the MDPInfo structure. This structure contains the action and observation spaces, the discount factor gamma, the horizon, and the control timestep dt. The horizon is used to cut trajectories when they are too long: when the horizon is reached, the episode is terminated, even though the last state might not be absorbing. The absorbing state flag is set explicitly in the environment step function. Also, notice that the Environment superclass has no notion of the environment state, so we need to store it ourselves. That’s why we create the self._state variable and initialize it to None. Other environment information, such as the goal position and radius, is stored in class variables.
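As a quick sanity check (a minimal sketch, assuming the constructor above has been defined), you can instantiate the environment and read these values back through the info property:

env = RoomToyEnv()

# The MDPInfo fields passed to the superclass are exposed via env.info
print(env.info.gamma)                   # discount factor, 0.99
print(env.info.horizon)                 # horizon, 100
print(env.info.observation_space.low)   # lower bound of the 2D box
print(env.info.observation_space.high)  # upper bound of the 2D box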

Now we implement the reset function. This function is called at the beginning of every episode. It’s possible to force the initial state. For this reason, we have to manage two scenarios: when the initial state is given and when it is set to None. If the initial state is not given, we sample randomly among the valid states.

    def reset(self, state=None):

        if state is None:
            # Randomly generate a state inside the state space, but outside the goal area
            self._state = np.random.rand(2) * self._size

            # Check if it's inside the goal radius and repeat the sample if necessary
            while np.linalg.norm(self._state - self._goal) < self._goal_radius:
                self._state = np.random.rand(2) * self._size
        else:
            # If an initial state is provided, set it and return, after checking it's valid.
            assert np.all(state < self._size) and np.all(state > 0)
            assert np.linalg.norm(state - self._goal) > self._goal_radius
            self._state = state

        # Return the current state
        return self._state

Now it’s time to implement the step function, which specifies the transition function of the environment, computes the reward, and signals absorbing states, i.e. states where every action keeps the agent in the same state with 0 reward. When an absorbing state is reached, we cut the trajectory, as its value function is always 0 and no further exploration is needed.

    def step(self, action):
        # Convert the action into a N, S, W, E movement
        movement = np.zeros(2)
        if action == 0:
            movement[1] += 0.1
        elif action == 1:
            movement[1] -= 0.1
        elif action == 2:
            movement[0] -= 0.1
        elif action == 3:
            movement[0] += 0.1
        else:
            raise ValueError('The environment has only 4 actions')

        # Apply the movement with some noise:
        self._state += movement + np.random.randn(2)*0.05

        # Clip the state inside the boundaries of the observation space.
        low = self.info.observation_space.low
        high = self.info.observation_space.high

        self._state = Environment._bound(self._state, low, high)

        # Compute the distance from the goal
        goal_distance = np.linalg.norm(self._state - self._goal)

        # Compute the reward as distance penalty from goal
        reward = -goal_distance

        # Set the absorbing flag if goal is reached
        absorbing = goal_distance < self._goal_radius

        # Return all the information + empty dictionary (used to pass additional information)
        return self._state, reward, absorbing, {}

Finally, we implement the render function using our Viewer class. This class wraps Pygame to provide an easy visualization tool for 2D Reinforcement Learning environments. The Viewer class has many functionalities, but here we simply draw two circles representing the agent and the goal area:

    def render(self, record=False):
        # Draw a red circle for the agent
        self._viewer.circle(self._state, 0.1, color=(255, 0, 0))

        # Draw a green circle for the goal
        self._viewer.circle(self._goal, self._goal_radius, color=(0, 255, 0))

        # Get the image if the record flag is set to true
        frame = self._viewer.get_frame() if record else None

        # Display the image for the control time (0.1 seconds)
        self._viewer.display(self.info.dt)

        return frame

For more information about the viewer, refer to the class documentation.

To conclude our environment, it’s also possible to register it, as described in the previous section of this tutorial:

# Register the class
RoomToyEnv.register()
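Before plugging the environment into a learning algorithm, it can be useful to sanity-check it with a short manual rollout, for example from a Python shell. The following is a minimal sketch using a fixed action index, with no learning involved:

env = RoomToyEnv()

state = env.reset()
for _ in range(10):
    # Always move North (action index 0) and inspect the transition
    next_state, reward, absorbing, _ = env.step(np.array([0]))
    print(next_state, reward, absorbing)

    if absorbing:
        break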

Learning in the toy environment

Now that we have created our environment, we try to solve it using Reinforcement Learning. The following code uses the True Online SARSA(λ) algorithm with a linear approximator over tile features.

We first import all necessary classes and utilities, then we construct the environment (we set the seed for reproducibility).

if __name__ == '__main__':
    from mushroom_rl.core import Core
    from mushroom_rl.algorithms.value import TrueOnlineSARSALambda
    from mushroom_rl.policy import EpsGreedy
    from mushroom_rl.features import Features
    from mushroom_rl.features.tiles import Tiles
    from mushroom_rl.utils.parameters import Parameter
    from mushroom_rl.utils.dataset import compute_J

    # Set the seed
    np.random.seed(1)

    # Create the toy environment with default parameters
    env = Environment.make('RoomToyEnv')

We then create the agent, which uses an epsilon-greedy policy over a linear approximator with tile features, similar to the one used in the Mountain Car experiment from Sutton and Barto’s book.

    # Using an epsilon-greedy policy
    epsilon = Parameter(value=0.1)
    pi = EpsGreedy(epsilon=epsilon)

    # Creating a simple agent using linear approximator with tiles
    n_tilings = 5
    tilings = Tiles.generate(n_tilings, [10, 10],
                             env.info.observation_space.low,
                             env.info.observation_space.high)
    features = Features(tilings=tilings)

    learning_rate = Parameter(.1 / n_tilings)

    approximator_params = dict(input_shape=(features.size,),
                               output_shape=(env.info.action_space.n,),
                               n_actions=env.info.action_space.n)

    agent = TrueOnlineSARSALambda(env.info, pi,
                                  approximator_params=approximator_params,
                                  features=features,
                                  learning_rate=learning_rate,
                                  lambda_coeff=.9)

Finally, we set up an RL experiment using the Core class. We first evaluate the initial policy for three episodes on the environment. Then we learn the task for 20000 steps using the algorithm built above. In the end, we evaluate the learned policy for three more episodes.

    # Reinforcement learning experiment
    core = Core(agent, env)

    # Visualize initial policy for 3 episodes
    dataset = core.evaluate(n_episodes=3, render=True)

    # Print the average objective value before learning
    J = np.mean(compute_J(dataset, env.info.gamma))
    print(f'Objective function before learning: {J}')

    # Train
    core.learn(n_steps=20000, n_steps_per_fit=1, render=False)

    # Visualize results for 3 episodes
    dataset = core.evaluate(n_episodes=3, render=True)

    # Print the average objective value after learning
    J = np.mean(compute_J(dataset, env.info.gamma))
    print(f'Objective function after learning: {J}')