Reinforcement Learning

As we provide standard RL-like interfaces, one can implement any RL algorithm on top of it in a neat manner. Before algorithm implementation, it should be noted what problem is to be solved. To formulate desired problems, we follow the Formulation as Wrappers principle by encoding appropriate wrappers.

Tabular RL

Single-Agent Q Learning

Given a particular agent and the multi-agent environment, a single-agent RL problem can be induced. One can simply assume the others are not moving and compute an optimal policy for that particular agent.

from marp.rl import SingleAgentLearningWrapper, Qlearning

env.reset()
ctrl_agent = 'robot_1'
training_env = SingleAgentLearningWrapper(env, ctrl_agent)
policy = Qlearning(training_env)

observations, infos = env.reset()
while env.agents:
    a = policy[str(observations[ctrl_agent])] if str(observations[ctrl_agent]) in policy else 0
    actions = {
        agent: a if agent == ctrl_agent else 0
        for agent in env.agents
    }
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.render()

The above code has basically done the following things:

  1. Formulate a single-agent RL problem from robot_1’s perspective by the wrapper SingleAgentLearningWrapper.

  2. Call Qlearning() to compute a policy.

  3. Execute the policy.

One will probably get similar visual results as follows. Note that we simple assume all other agents stay put.

_images/satql.gif

Multi-Agent Joint Q Learning

One can also view all agents as a joint one, to find an optimal policy for the joint agent is to find a policy for each agent as if they are all under centralized control.

from marp.rl import MultiAgentJointLearningWrapper, Qlearning

env.reset()
training_env = MultiAgentJointLearningWrapper(env)
ja = training_env.joint_actions
policy = Qlearning(training_env, num_it=3e4, alpha=0.8)

observations, infos = env.reset()
while env.agents:
    a = policy[str(observations[env.agents[0]])] if str(observations[env.agents[0]]) in policy else 0
    actions = dict(zip(env.agents, ja[a]))
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.render()

The above code has basically done the following things:

  1. Formulate a multi-agent joint learning problem by the wrapper MultiAgentJointLearningWrapper.

  2. Again, call Qlearning() to compute a (joint) policy, since we are still dealing with one (joint) agent.

  3. Execute the policy.

One will probably get similar visual results as follows. Note that this time Qlearning() operates over joint state space and joint action space with certain aggregated rewards, and therefore, may not perform well if it is not trained for enough iterations.

_images/majql.gif

Multi-Agent Individual Q Learning

To achive centralized control, one can alternatively view each agent as an autonomous entity interacting with others. Each of them learns am individual policy even if the interaction is non-stationary.

from marp.rl import individualQlearning

env.reset()
policies = individualQlearning(env, num_it=5e4, epsilon=0.5, alpha=0.3)

observations, infos = env.reset()
while env.agents:
    actions = {}
    for agent in env.agents:
        if str(observations[agent]) not in policies[agent]:
            actions[agent] = 0
        else:
            actions[agent] = policies[agent][str(observations[agent])]
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.render()

The above code has basically done the following things:

  1. Just take the raw multi-agent environment.

  2. Instead, call individualQlearning() to compute individual policies for each agent.

  3. Execute the policy profile.

Note that for each agent, her policy is a mapping from possible joint states to her own actions. One will probably get similar visual results as follows. As shown, agent 2 took greedy moves at first but then make a detour to avoid collision with agent 1.

_images/maiql.gif

Deep RL

In addition to conventional tabular RL implementations, one can also take one step further by integrating existing deep RL algorithms, e.g., DQN, A2C, and PPO.

We hereby also provide a tutorial by using Stable Baselines3 as a pool of Deep RL algorithms, and SuperSuit as useful environment wrappers for parallel training.

Single-Agent DRL

import time

import numpy as np
import supersuit as ss
from stable_baselines3 import DQN

from marp.rl import SingleAgentLearningWrapper

alg = DQN
policy_kwargs = {
    'net_arch': [8, 2],
}
agent = 'robot_0'

env.reset()
training_env = SingleAgentLearningWrapper(env, agent)
training_env = ss.stable_baselines3_vec_env_v0(training_env, num_envs=8)
training_env.reset()

model = alg("MlpPolicy", training_env,
            verbose=1,
            tau=0.5,
            exploration_fraction=0.5,
            batch_size=256,
            policy_kwargs=policy_kwargs,
            tensorboard_log="runs")
model.learn(total_timesteps=int(2.5e6),
            tb_log_name=f"{time.strftime('%Y-%m-%d-%H%M%S', time.localtime())}")
model.save(f"pretrained/singleDQN_{agent}")

policy = alg.load(f"pretrained/singleDQN_{agent}.zip")
observations, infos = env.reset()
while env.agents:
    actions = {
        'robot_0': policy.predict(observations['robot_0'], deterministic=True)[0]
    }
    actions['robot_1'] = 0
    actions['robot_2'] = 0
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.render()

Multi-Agent Joint Learning

import time

import numpy as np
import supersuit as ss
from stable_baselines3 import PPO

from marp.rl import MultiAgentJointLearningWrapper

alg = PPO
policy_kwargs = {
    'net_arch': dict(pi=[16, 6], vf=[16, 6]),
}

training_env = MultiAgentJointLearningWrapper(env)
ja = training_env.joint_actions
training_env = ss.stable_baselines3_vec_env_v0(training_env, num_envs=8)

training_env.reset()
model = alg("MlpPolicy", training_env,
            verbose=1,
            batch_size=128,
            policy_kwargs=policy_kwargs,
            tensorboard_log="runs")
model.learn(total_timesteps=int(10e6),
            tb_log_name=f"{time.strftime('%Y-%m-%d-%H%M%S', time.localtime())}")
model.save("pretrained/jointPPO")

model = alg.load("pretrained/jointPPO.zip")
observations, infos = env.reset()
while env.agents:
    a, _ = model.predict(observations[env.agents[0]])
    actions = ja[a]
    actions = dict(zip(env.agents, actions))
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.render()

Multi-Agent Individual Learning

import time

import supersuit as ss
from stable_baselines3 import A2C

alg = A2C

training_env = ss.pettingzoo_env_to_vec_env_v1(env)
training_env = ss.concat_vec_envs_v1(training_env, 8, num_cpus=1, base_class="stable_baselines3")

training_env.reset()
model = alg("MlpPolicy", training_env,
            verbose=1,
            tensorboard_log="runs")
model.learn(total_timesteps=int(8e6),
            tb_log_name=f"{time.strftime('%Y-%m-%d-%H%M%S', time.localtime())}")
model.save("pretrained/indiA2C")

model = alg.load("pretrained/indiA2C.zip")
observations, infos = env.reset()
while env.agents:
    actions = {
        agent: model.predict(observations[agent], deterministic=True)[0]
        for agent in env.agents
    }
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.render()

Detailed Usage

Functions

marp.rl.Qlearning(env, num_it=1000.0, epsilon=0.5, alpha=0.3, gamma=0.9)

Tabular Q learning

Parameters:
  • env (RLEnv) – a single/joint-agent RL environment

  • num_it (int or float) – the number of learning iterations

  • epsilon (float) – the initial exploration rate, will linearly decay to 0.1 in the first half of iterations

  • alpha (float) – learning rate

  • gamma (float) – discount factor

Returns:

policy (dict) – the learned policy

marp.rl.individualQlearning(env, num_it=1000.0, epsilon=0.5, alpha=0.3, gamma=0.9)

Tabular Q learning for multiple individual learners

Parameters:
  • env (MARLEnv) – a multi-agent RL environment

  • num_it (int or float) – the number of learning iterations

  • epsilon (float) – the initial exploration rate, will linearly decay to 0.1 in the first half of iterations

  • alpha (float) – learning rate

  • gamma (float) – discount factor

Returns:

policies (dict[str, dict]) – a policy profile

Classes

class marp.rl.SingleAgentLearningWrapper(ma_env, agent)

Formulate a single agent learning problem

Parameters:
  • ma_env (MARP) – an intialized multi-agent environment

  • agent (str) – the agent the the problem is induced for

reset(seed=None, options=None)

Reset the location of the agent

step(action)

Proceed to the next step by the given action

Parameters:

action (Action) – the next action

Returns:
  • obs (dict) – local observation

  • reward (float) – reward

  • termination (bool) – whether the episode terminates

  • truncation (bool) – whether the maximum number of steps is exceeded

  • info (dict) – auxiliary infomation including collision situations and action masks

class marp.rl.MultiAgentJointLearningWrapper(ma_env)

Formulate a multi-agent joint learning problem

Parameters:

ma_env (MARP) – an intialized multi-agent environment

reset(seed=None, options=None)

Reset the locations of agents

step(action)

Proceed to the next step by the given joint action

Parameters:

action (Action) – the next joint action

Returns:
  • obs (dict) – joint observations

  • reward (float) – aggregated reward by simple summation

  • termination (bool) – whether the episode terminates for all agents

  • truncation (bool) – whether the maximum number of steps is exceeded

  • info (dict) – auxiliary infomation including collision situations and action masks