A Case for a More General Reinforcement Learning Environment API

Brandyn Kusenda
Jan 23, 2021

The OpenAI Gym API was designed with only single-agent interaction in mind. That wouldn't be a problem in and of itself, since wrappers can be used to modify an environment's API. The trouble is that many RL algorithms have, in turn, been designed around the single-agent interface, perhaps with support for vectorized environments. This is unfortunate, because a single-agent environment, with or without vectorization, is just a special case of a multi-agent environment. Even vectorized multi-agent environments can be modeled as a single multi-agent environment.
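
To make the "special case" point concrete, here is a minimal sketch (the wrapper name and details are my own, not any library's API, and it assumes the classic pre-0.26 Gym reset/step signatures) that exposes a single-agent Gym environment through a dictionary-keyed, multi-agent-style interface:

```python
import gym


class SingleAgentAsMultiAgent:
    """Exposes a single-agent gym.Env through a dict-keyed interface."""

    def __init__(self, env, agent_id="agent_0"):
        self.env = env
        self.agent_id = agent_id
        self.action_space = env.action_space
        self.observation_space = env.observation_space

    def reset(self):
        # One agent, so the observation dict has exactly one entry.
        return {self.agent_id: self.env.reset()}

    def step(self, action_dict):
        obs, reward, done, info = self.env.step(action_dict[self.agent_id])
        return (
            {self.agent_id: obs},
            {self.agent_id: reward},
            {self.agent_id: done},
            {self.agent_id: info},
        )


# Usage: the wrapped CartPole now looks like a one-agent multi-agent env.
env = SingleAgentAsMultiAgent(gym.make("CartPole-v1"))
obs = env.reset()
obs, rewards, dones, infos = env.step({"agent_0": env.action_space.sample()})
```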

PettingZoo and OpenSpiel both have very flexible multi-agent environment APIs, but neither is ideal for vectorized or parallel environments. RLlib, on the other hand, uses dictionaries to map observation and action information to agents. With this interface we can handle single-agent, turn-based multi-agent, or parallel multi-agent interactions. Agent ordering can be handled in several ways; the simplest contract is that only agents that received an observation take an action, and it is the algorithm writer's responsibility to follow it. One obvious advantage of dictionaries over lists or tensors is that information can be omitted for some agents and whatever remains still maps back to the correct agent when tracking state.
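
As an illustration, here is a hypothetical two-player turn-based environment written against that dictionary interface (the class and method names are invented for this sketch, not RLlib's implementation). The contract is that an agent appears in the action dictionary only if it received an observation on the previous step:

```python
class TurnBasedSketchEnv:
    """A two-player turn-based environment using dict-keyed step I/O."""

    def __init__(self):
        self.agents = ["player_0", "player_1"]
        self.turn = 0

    def reset(self):
        self.turn = 0
        # Only player_0 observes first, so only player_0 acts first.
        return {"player_0": self._observe("player_0")}

    def step(self, action_dict):
        acting = self.agents[self.turn % 2]
        # The algorithm should send actions only for agents that observed.
        assert set(action_dict) == {acting}

        self.turn += 1
        nxt = self.agents[self.turn % 2]
        obs = {nxt: self._observe(nxt)}   # only the next agent observes
        rewards = {acting: 0.0}           # reward for the agent that acted
        dones = {acting: False}           # per-agent termination flags
        infos = {}
        return obs, rewards, dones, infos

    def _observe(self, agent):
        return 0  # placeholder observation; real game state would go here
```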

This leads one to wonder: how are environment resets handled? RLlib handles this by providing a special key in the "dones" dictionary, "__all__", that tells the RL algorithm the episode is complete; it is then up to the algorithm to call reset(). This works fine for multi-agent environments but is not optimal for vectorized environments. My thought is: why not just automatically reset inside the environment? We would still keep the reset() function as a way to trigger a full reset, but from the algorithm implementer's perspective, the "dones" dictionary already provides all the information needed to know when an episode is complete.
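
Here is a rough sketch of that idea, written as an assumed wrapper rather than existing library code: when the "__all__" flag is set, the environment resets itself and returns the first observations of the new episode, so the algorithm never needs to call reset() between episodes.

```python
class AutoResetWrapper:
    """Wraps a dict-based multi-agent env and resets it automatically."""

    def __init__(self, multi_agent_env):
        self.env = multi_agent_env

    def reset(self):
        # Explicit reset is still available when a full restart is wanted.
        return self.env.reset()

    def step(self, action_dict):
        obs, rewards, dones, infos = self.env.step(action_dict)
        if dones.get("__all__", False):
            # The dones dict already tells the algorithm the episode ended;
            # hand back the first observations of the next episode.
            obs = self.env.reset()
        return obs, rewards, dones, infos
```

The algorithm still sees the final rewards and "dones" of the old episode; only the returned observations already belong to the new one.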

A downside to this approach is that it adds some complexity for both the environment developer and the algorithm developer. There may also be a performance hit when switching from vectorized environments to dictionaries. Beyond that, I'm not aware of any other major limitations.

The RLlib MultiAgentEnv API is the closest to what I've described, but it has a limitation on the number of observation/action spaces and doesn't auto-reset: https://docs.ray.io/en/master/_modules/ray/rllib/env/multi_agent_env.html

[edit] PettingZoo also includes a Parallel version of its API, which is similar to RLlib's and to what I've described. The library also provides wrappers for easy conversion between the AEC API and the Parallel API. For more details, see: https://www.pettingzoo.ml/api
