How to use the Environment interface

Here we explain in detail the usage of the MushroomRL Environment interface. First, we explain how to use the registration interface, which enables the construction of environments from a string specification. Then we construct a toy environment to show how to add new MushroomRL environments.

Old-school environment creation

In MushroomRL, environments are simply classes that extend the Environment interface. To create an environment, you call its constructor. For example, you can build the Segway environment as follows:

from mushroom_rl.environments import Segway

env = Segway()

Some environments may have a constructor that is too low-level, and you may want to generate a vanilla version of the environment using as few parameters as possible. An example is the Linear Quadratic Regulator (LQR) environment, which requires a set of matrices to define the linear dynamics and the quadratic cost function. To provide an easier interface, the generate class method is exposed. To generate a simple 3-dimensional LQR problem, with identity transition and action matrices and a trivial quadratic cost function, you can use:

from mushroom_rl.environments import LQR

env = LQR.generate(dimensions=3)

See the documentation of LQR.generate for all the available parameters and their effects.
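The pattern behind generate can be sketched as a class method that fills in the low-level constructor arguments with simple defaults. The ToyLQR class below is hypothetical and is not MushroomRL's LQR. Its argument names and default matrices are assumptions chosen only for illustration.

```python
def identity(n):
    # Build an n x n identity matrix as nested lists
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


class ToyLQR:
    """Hypothetical stand-in for an environment with a low-level constructor."""

    def __init__(self, A, B, Q, R):
        # Low-level constructor: the caller supplies every matrix explicitly
        self.A, self.B, self.Q, self.R = A, B, Q, R

    @classmethod
    def generate(cls, dimensions):
        # High-level factory: identity dynamics and identity cost matrices
        return cls(A=identity(dimensions), B=identity(dimensions),
                   Q=identity(dimensions), R=identity(dimensions))


env = ToyLQR.generate(dimensions=3)
print(len(env.A), len(env.A[0]))
```

The constructor stays available for users who need full control, while generate covers the common case with a single argument.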

Environment registration

From version 1.7.0, it is possible to register MushroomRL environments and build the environment by specifying only the name.

You can list the registered environments as follows:

from mushroom_rl.core import Environment

env_list = Environment.list_registered()
print(env_list)

Every registered environment can be built using its name. For example, to create the ShipSteering environment you can use:

env = Environment.make('ShipSteering')

To build some environments, you may need to pass additional parameters. An example is the Gymnasium environment, which wraps most Gymnasium environments. The exception is the Atari ones, which use the dedicated Atari environment to implement proper preprocessing.

If you want to build the Pendulum-v1 Gymnasium environment, you need to pass the environment name as a parameter:

env = Environment.make('Gymnasium', 'Pendulum-v1')

However, for environments that are interfaces to other libraries, such as Gymnasium, Atari or DMControl, a notation with a dot separator is also supported. For example, to create the pendulum you can also use:

env = Environment.make('Gymnasium.Pendulum-v1')

Or, to create the hopper environment with the hop task from the DeepMind Control Suite, you can use:

env = Environment.make('DMControl.hopper.hop')

If an environment implements the generate method, it will be used to build the environment instead of the constructor. As the generate method is a higher-level interface than the constructor, it requires fewer parameters.

To generate the 3-dimensional LQR problem mentioned in the previous section you can use:

env = Environment.make('LQR', dimensions=3)

Finally, you can register new environments. Suppose that you have created the environment class MyNewEnv, which extends the base Environment class. You can register the environment as follows:

MyNewEnv.register()

You can put this line of code after the class declaration, or in the __init__.py file of your library. If you do so, the environment is registered the first time you import the file. Notice that this registration is not saved on disk, so you need to register the environment every time the Python interpreter is started.
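The registration mechanism described in this section can be pictured as an in-memory dictionary from class names to classes. The following is a simplified sketch under that assumption, not MushroomRL's actual implementation. ToyEnvironment, Wrapper and Chain are hypothetical names introduced only for this example.

```python
class ToyEnvironment:
    """Hypothetical base class with a name-based registry (illustration only)."""

    _registered = {}

    @classmethod
    def register(cls):
        # Store the class under its own name; nothing is written to disk
        ToyEnvironment._registered[cls.__name__] = cls

    @staticmethod
    def list_registered():
        return list(ToyEnvironment._registered.keys())

    @staticmethod
    def make(name, *args, **kwargs):
        # 'Wrapper.task' is treated like make('Wrapper', 'task')
        if '.' in name:
            name, *prefix = name.split('.')
            args = tuple(prefix) + args
        env_cls = ToyEnvironment._registered[name]
        # Prefer the higher-level generate method when the class defines one
        builder = getattr(env_cls, 'generate', env_cls)
        return builder(*args, **kwargs)


class Wrapper(ToyEnvironment):
    def __init__(self, task):
        self.task = task


class Chain(ToyEnvironment):
    def __init__(self, transitions):
        self.transitions = transitions

    @classmethod
    def generate(cls, length):
        return cls(transitions=[(i, i + 1) for i in range(length - 1)])


Wrapper.register()
Chain.register()

a = ToyEnvironment.make('Wrapper', 'Pendulum-v1')
b = ToyEnvironment.make('Wrapper.Pendulum-v1')
c = ToyEnvironment.make('Chain', length=3)
print(a.task, b.task, c.transitions)
```

Because the dictionary lives in the running interpreter, the register calls have to be executed again in every new Python session, which is why placing them next to the class declaration is convenient.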

Creating a new environment

We now show an example of how to construct a MushroomRL environment. We create a simple room environment, with discrete actions, a continuous state space, and mildly stochastic dynamics. The objective is to move the agent from any point of the room towards the goal point. At every step, the agent receives a penalty equal to its distance from the goal. When the agent reaches the goal the episode ends. The agent can move in the room by using one of 4 discrete actions: North, South, West, East.

First of all, we import all the required classes: NumPy for working with arrays, the Environment interface, and the MDPInfo structure, which contains the basic information about the environment.

Given that we are implementing a simple visualization function, we also import the Viewer class, a Pygame wrapper that can be used to easily render RL environments.

import numpy as np

from mushroom_rl.core import Environment, MDPInfo
from mushroom_rl.rl_utils.spaces import Box, Discrete

from mushroom_rl.utils.viewer import Viewer

Now, we can create the environment class.

We first extend the environment class and create the constructor:

class RoomToyEnv(Environment):
    def __init__(self, size=5., goal=(2.5, 2.5), goal_radius=0.6):

        # Save important environment information
        self._size = size
        self._goal = np.array(goal)
        self._goal_radius = goal_radius

        # Create the action space.
        action_space = Discrete(4)  # 4 actions: N, S, W, E

        # Create the observation space. It's a 2D box of dimension (size x size).
        # You can also specify low and high arrays, if every component has different limits
        shape = (2,)
        observation_space = Box(0, size, shape)

        # Create the MDPInfo structure, needed by the environment interface
        mdp_info = MDPInfo(observation_space, action_space, gamma=0.99, horizon=100, dt=0.1)

        super().__init__(mdp_info)

        # Create an instance attribute to store the current state
        self._state = None

        # Create the viewer
        self._viewer = Viewer(size, size)

It’s important to notice that the superclass constructor needs the information stored in the MDPInfo structure. This structure contains the action and observation spaces, the discount factor gamma, the horizon, and the control timestep dt. The horizon is used to cut trajectories that are too long. When the horizon is reached the episode is terminated; however, the state might not be absorbing. The absorbing state flag is explicitly set in the environment step function. Also, notice that the Environment superclass has no notion of the environment state, so we need to store it ourselves. That’s why we create the self._state variable and initialize it to None. Other environment information, such as the goal position and radius, is stored in instance attributes.
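The difference between an episode cut by the horizon and an absorbing state matters for value-based learning. As a minimal sketch, consider a generic one-step bootstrapped target (this is a textbook formula, not code taken from MushroomRL, and the value estimate below is made up for the example):

```python
def td_target(reward, next_value, gamma, absorbing):
    # An absorbing next state has value 0 by definition, so do not bootstrap
    return reward if absorbing else reward + gamma * next_value


gamma = 0.99
next_value = -3.0  # made-up value estimate for the next state

# Goal reached: the state is absorbing, the target is the reward alone
goal_target = td_target(-0.5, next_value, gamma, absorbing=True)

# Horizon reached away from the goal: the episode is cut, but the state
# is not absorbing, so the target still bootstraps from the next state
cut_target = td_target(-2.0, next_value, gamma, absorbing=False)

print(goal_target, cut_target)
```

Marking a horizon cut as absorbing would wrongly tell the learner that no further penalty follows, which is why the two signals are kept separate.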

Now we implement the reset function. This function is called at the beginning of every episode. It’s possible to force the initial state. For this reason, we have to manage two scenarios: when the initial state is given and when it is set to None. If the initial state is not given, we sample randomly among the valid states.

    def reset(self, state=None):

        if state is None:
            # Generate randomly a state inside the state space, but not inside the goal
            self._state = np.random.rand(2) * self._size

            # Check if it's inside the goal radius and repeat the sample if necessary
            while np.linalg.norm(self._state - self._goal) < self._goal_radius:
                self._state = np.random.rand(2) * self._size
        else:
            # If an initial state is provided, set it and return, after checking it's valid.
            assert np.all(state < self._size) and np.all(state > 0)
            assert np.linalg.norm(state - self._goal) > self._goal_radius
            self._state = state

        # Return a copy: self._state is mutated in place by step()
        return self._state.copy(), {}

Now it’s time to implement the step function, which specifies the transition function of the environment, computes the reward, and signals absorbing states, i.e. states where every action keeps you in the same state and yields 0 reward. When an absorbing state is reached we cut the trajectory, as its value is always 0 and no further exploration is needed.

    def step(self, action):
        # Convert the action into a N, S, W, E movement
        movement = np.zeros(2)
        if action == 0:
            movement[1] += 0.1
        elif action == 1:
            movement[1] -= 0.1
        elif action == 2:
            movement[0] -= 0.1
        elif action == 3:
            movement[0] += 0.1
        else:
            raise ValueError('The environment has only 4 actions')

        # Apply the movement with some noise:
        self._state += movement + np.random.randn(2)*0.05

        # Clip the state space inside the boundaries.
        low = self.info.observation_space.low
        high = self.info.observation_space.high

        self._state = self._bound(self._state, low, high)

        # Compute distance from goal
        goal_distance = np.linalg.norm(self._state - self._goal)

        # Compute the reward as distance penalty from goal
        reward = -goal_distance

        # Set the absorbing flag if goal is reached
        absorbing = goal_distance < self._goal_radius

        # Return a copy of the state (see reset()) + empty dictionary (used to pass additional information)
        return self._state.copy(), reward, absorbing, {}

Warning

reset/step must never return a reference to an internal buffer that your environment mutates in place on a later call. Core keeps the array returned by one call as the state for the next call, and also stores it in the Dataset. If you keep a persistent buffer for performance (e.g. to avoid reallocating an observation array every step, or when wrapping a simulator that reuses its own buffer) and hand it out by reference, mutating it on the following step silently corrupts the transition you already returned. Always return a copy of any internally mutated array, e.g. self._state.copy() for NumPy, self._state.clone() for PyTorch. MuJoCo, PyBullet and MultiprocessEnvironment follow this pattern and are good references. The same rule applies to reset_all/step_all of a VectorizedEnvironment.
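The failure mode described in this warning can be reproduced in a few lines of NumPy. BufferEnv below is a hypothetical class written only for this demonstration; its copy flag switches between returning the internal buffer by reference and returning a copy.

```python
import numpy as np


class BufferEnv:
    """Hypothetical environment that reuses one internal state buffer."""

    def __init__(self, copy):
        self._state = np.zeros(2)
        self._copy = copy

    def step(self):
        self._state += 1.0  # in-place update of the internal buffer
        return self._state.copy() if self._copy else self._state


unsafe = BufferEnv(copy=False)
first = unsafe.step()   # the caller stores this as part of a transition
unsafe.step()           # the next step overwrites the stored array
print(first)            # no longer the state returned by the first step

safe = BufferEnv(copy=True)
kept = safe.step()
safe.step()
print(kept)             # still the state returned by the first step
```

In the unsafe case no error is raised: the stored transition simply changes after the fact, which is what makes this bug hard to notice.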

Finally, we implement the render function using our Viewer class. This class wraps Pygame to provide an easy visualization tool for 2D Reinforcement Learning algorithms. The viewer class has many functionalities, but here we simply draw two circles representing the agent and the goal area:

    def render(self, record=False):
        # Draw a red circle for the agent
        self._viewer.circle(self._state, 0.1, color=(255, 0, 0))

        # Draw a green circle for the goal
        self._viewer.circle(self._goal, self._goal_radius, color=(0, 255, 0))

        # Get the image if the record flag is set to true
        frame = self._viewer.get_frame() if record else None

        # Display the image for the control time (0.1 seconds)
        self._viewer.display(self.info.dt)

        return frame

For more information about the viewer, refer to the class documentation.

To conclude our environment, it’s also possible to register it as specified in the previous section of this tutorial:

# Register the class
RoomToyEnv.register()

Learning in the toy environment

Now that we have created our environment, we try to solve it using Reinforcement Learning. The following code uses the True Online SARSA-Lambda algorithm, exploiting a tiles approximator.

We first import all necessary classes and utilities, then we construct the environment (we set the seed for reproducibility).

if __name__ == '__main__':
    from mushroom_rl.core import Core
    from mushroom_rl.algorithms.value import TrueOnlineSARSALambda
    from mushroom_rl.policy import EpsGreedy
    from mushroom_rl.features import Features
    from mushroom_rl.features.tiles import Tiles
    from mushroom_rl.rl_utils.parameters import Parameter

    # Set the seed
    np.random.seed(1)

    # Create the toy environment with default parameters
    env = Environment.make('RoomToyEnv')

We then create the agent: an epsilon-greedy policy on top of a linear approximator with tile features, similar to the one used in the Mountain Car experiment from Sutton and Barto's book.

    epsilon = Parameter(value=0.1)
    pi = EpsGreedy(epsilon=epsilon)

    # Creating a simple agent using linear approximator with tiles
    n_tilings = 5
    tilings = Tiles.generate(n_tilings, [10, 10],
                             env.info.observation_space.low,
                             env.info.observation_space.high)
    features = Features(tilings=tilings)

    learning_rate = Parameter(.1 / n_tilings)

    approximator_params = dict(input_shape=(features.size,),
                               output_shape=(env.info.action_space.n,),
                               n_actions=env.info.action_space.n,
                               phi=features)

    agent = TrueOnlineSARSALambda(env.info, pi,
                                  approximator_params=approximator_params,
                                  learning_rate=learning_rate,
                                  lambda_coeff=.9)
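To give an intuition of what tile features compute, here is a simplified tile coder for a 2D state. It is a sketch under the assumption of uniformly shifted tilings and is not the implementation used by MushroomRL's Tiles class. The function active_tiles and its arguments are hypothetical.

```python
def active_tiles(state, low, high, n_tiles, n_tilings):
    """Return one active tile index per tiling for a 2D state.

    Simplified coder: tiling k is shifted by k / n_tilings of a tile width.
    """
    indices = []
    tiles_per_tiling = n_tiles[0] * n_tiles[1]
    for k in range(n_tilings):
        coords = []
        for x, lo, hi, n in zip(state, low, high, n_tiles):
            width = (hi - lo) / n
            shifted = (x - lo) + width * k / n_tilings
            # Clamp so that shifted points near the upper border stay in range
            coords.append(min(int(shifted / width), n - 1))
        indices.append(k * tiles_per_tiling + coords[0] * n_tiles[1] + coords[1])
    return indices


n_tilings = 5
tiles = active_tiles([2.5, 2.5], low=[0., 0.], high=[5., 5.],
                     n_tiles=[10, 10], n_tilings=n_tilings)
print(tiles)
print(len(tiles) == n_tilings)
```

Each state activates exactly one tile per tiling, so n_tilings weights are updated at every step. Dividing the learning rate by n_tilings, as done above with .1 / n_tilings, keeps the overall change of the prediction at roughly 0.1 times the error.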

Finally, using the Core class we set up an RL experiment. We first evaluate the initial policy for 3 episodes on the environment. Then we learn the task for 20000 steps using the algorithm built above. In the end, we evaluate the learned policy for 3 more episodes.

    core = Core(agent, env)

    # Visualize initial policy for 3 episodes
    dataset = core.evaluate(n_episodes=3, render=True)

    # Print the average objective value before learning
    J = np.mean(dataset.discounted_return)
    print(f'Objective function before learning: {J}')

    # Train
    core.learn(n_steps=20000, n_steps_per_fit=1, render=False)

    # Visualize results for 3 episodes
    dataset = core.evaluate(n_episodes=3, render=True)

    # Print the average objective value after learning
    J = np.mean(dataset.discounted_return)
    print(f'Objective function after learning: {J}')
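The objective printed above is the average discounted return over the evaluation episodes. As a minimal sketch of that quantity, independent of the Dataset class, the helper below computes the discounted sum of rewards for one episode. The function name and the reward sequences are made up for the example.

```python
def discounted_return(rewards, gamma):
    # J = sum_t gamma^t * r_t, accumulated backwards from the last reward
    J = 0.0
    for r in reversed(rewards):
        J = r + gamma * J
    return J


episodes = [[-1.0, -1.0, -1.0], [-2.0, -1.0]]  # made-up reward sequences
returns = [discounted_return(ep, gamma=0.5) for ep in episodes]
print(sum(returns) / len(returns))
```

Since every reward in the toy environment is a negative distance, a value of J closer to zero after learning indicates that the agent reaches the goal faster.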

Vectorized environments

Some environments can step many copies of the same problem in parallel, which lets the agent collect samples much faster. This is common when the simulator is natively batched, for example GPU-based simulators that evolve a whole batch of states at once.

The vectorized interface

Such environments extend the VectorizedEnvironment interface instead of Environment. The constructor takes the usual MDPInfo together with the number of parallel copies n_envs, and instead of the single-environment reset, step and render methods you implement their batched counterparts:

reset_all(env_mask, state=None): reset the selected environments to their initial state, returning the batched initial states and a list of episode-info dictionaries;
step_all(env_mask, action): apply a batch of actions to the selected environments, returning the batched next states, rewards, absorbing flags and a list of step-info dictionaries;
render_all(env_mask, record=False): render the selected environments.

The recurring argument is the env_mask: a boolean array of length n_envs that selects which copies the operation applies to. This is what makes parallel collection efficient — the copies run independent episodes that terminate at different times, so the Core only resets the ones that have just finished while the others keep stepping, rather than restarting the whole batch in lockstep.
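The role of the mask can be illustrated with a toy batched environment. ToyVectorEnv is hypothetical and does not extend VectorizedEnvironment. Its methods use simplified signatures that omit rewards and info dictionaries, so it only shows how a boolean mask selects the copies to reset or step.

```python
import numpy as np


class ToyVectorEnv:
    """Hypothetical batched counter environment (not a MushroomRL class)."""

    def __init__(self, n_envs, goal):
        self._state = np.zeros(n_envs, dtype=int)
        self._goal = goal

    def reset_all(self, env_mask):
        self._state[env_mask] = 0  # reset only the selected copies
        return self._state.copy()

    def step_all(self, env_mask, action):
        self._state[env_mask] += action[env_mask]  # step only the selected copies
        absorbing = self._state >= self._goal
        return self._state.copy(), absorbing


env = ToyVectorEnv(n_envs=3, goal=2)
everyone = np.ones(3, dtype=bool)
env.reset_all(everyone)

# Copies advance at different speeds, so they finish at different times
state, absorbing = env.step_all(everyone, np.array([2, 1, 1]))

# Reset only the copies that just finished; the others keep their state
state = env.reset_all(absorbing)
print(state, absorbing)
```

Here only the first copy reaches the goal after one step, so only that copy is reset while the other two continue their episodes.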

A VectorizedEnvironment is also a valid single environment: the base class implements reset, step and render by forwarding to the batched methods on a single default copy, which you can select with set_default_env. This is mostly useful for debugging or for rendering one copy of the batch.

Parallelizing a standard environment

You do not need a natively batched simulator to benefit from this: any standard environment can be parallelized across processes with MultiprocessEnvironment, which wraps several copies of it into a single VectorizedEnvironment:

from mushroom_rl.core import MultiprocessEnvironment
from mushroom_rl.environments import Gymnasium

env = MultiprocessEnvironment(Gymnasium, 'Pendulum-v1', horizon=200, gamma=.99, n_envs=15)

MultiprocessEnvironment takes the environment class followed by the arguments of its constructor, plus the number of parallel copies n_envs.

Running the experiment

You do not need to handle the batching yourself when running experiments: the Core recognizes a vectorized environment and runs the appropriate parallel collection loop internally. Your experiment script is unchanged — you build and use the Core exactly as before:

core = Core(agent, env)
core.learn(n_steps=30000, n_steps_per_fit=3000)