Reinforcement Learning¶

Because we provide standard RL-style interfaces, any RL algorithm can be implemented on top of them cleanly. Before implementing an algorithm, one should first decide which problem is to be solved. To formulate the desired problem, we follow the Formulation as Wrappers principle and encode it with the appropriate wrappers.

Tabular RL¶

Single-Agent Q Learning¶

Given a particular agent in the multi-agent environment, a single-agent RL problem can be induced: one simply assumes that the other agents do not move and computes an optimal policy for that particular agent.

from marp.rl import SingleAgentLearningWrapper, Qlearning

env.reset()
ctrl_agent = 'robot_1'
training_env = SingleAgentLearningWrapper(env, ctrl_agent)
policy = Qlearning(training_env)

# Roll out the learned policy; every other agent takes action 0 (stays put).
observations, infos = env.reset()
while env.agents:
    # Fall back to action 0 for states that were never visited during training.
    a = policy.get(str(observations[ctrl_agent]), 0)
    actions = {
        agent: a if agent == ctrl_agent else 0
        for agent in env.agents
    }
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.render()

The code above does the following:

Formulate a single-agent RL problem from robot_1’s perspective by the wrapper SingleAgentLearningWrapper.
Call Qlearning() to compute a policy.
Execute the policy.
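
To make the "compute a policy, then execute it" steps concrete, here is a minimal sketch of tabular Q-learning on a toy corridor. It is not the implementation behind Qlearning(); the names q_learning and corridor_step, the corridor itself, and the action encoding (index 0 means "stay") are assumptions made for illustration. The only thing it shares with the code above is the shape of the result: a dict that maps str(observation) to an action.

```python
import random

def corridor_step(pos, move, goal=4):
    """Hypothetical toy dynamics: walk along cells 0..goal, reward on arrival."""
    nxt = min(max(pos + move, 0), goal)
    done = nxt == goal
    return nxt, (1.0 if done else -0.1), done

def q_learning(step_fn, start, moves, num_it=2000, epsilon=0.5,
               alpha=0.3, gamma=0.9, max_steps=50, seed=0):
    """Tabular Q-learning; returns {str(state): index of the greedy action}."""
    rng = random.Random(seed)
    q = {}  # str(state) -> list with one value per action
    for _ in range(int(num_it)):  # num_it may be given as a float such as 3e4
        state = start
        for _ in range(max_steps):
            values = q.setdefault(str(state), [0.0] * len(moves))
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = rng.randrange(len(moves))
            else:
                a = max(range(len(moves)), key=values.__getitem__)
            nxt, reward, done = step_fn(state, moves[a])
            nxt_values = q.setdefault(str(nxt), [0.0] * len(moves))
            # One-step temporal-difference update.
            target = reward if done else reward + gamma * max(nxt_values)
            values[a] += alpha * (target - values[a])
            state = nxt
            if done:
                break
    return {k: max(range(len(moves)), key=v.__getitem__) for k, v in q.items()}

# Moves: index 0 stays, index 1 steps left, index 2 steps right.
policy = q_learning(corridor_step, start=0, moves=[0, -1, 1])
action = policy.get(str(3), 0)  # same lookup-with-fallback as in the rollout
```

Keying the table by str(state) keeps arbitrary observations hashable, which is why the rollout converts each observation to a string before the lookup.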

One will probably get visual results similar to the following. Note that we simply assume all other agents stay put.

Multi-Agent Joint Q Learning¶

One can also view all agents as a single joint agent. Finding an optimal policy for the joint agent then amounts to finding a policy for each agent as if they were all under centralized control.

from marp.rl import MultiAgentJointLearningWrapper, Qlearning

env.reset()
training_env = MultiAgentJointLearningWrapper(env)
ja = training_env.joint_actions
policy = Qlearning(training_env, num_it=3e4, alpha=0.8)

observations, infos = env.reset()
while env.agents:
    # Look up the joint action index; default to index 0 for unseen states.
    a = policy.get(str(observations[env.agents[0]]), 0)
    actions = dict(zip(env.agents, ja[a]))
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.render()

The code above does the following:

Formulate a multi-agent joint learning problem by the wrapper MultiAgentJointLearningWrapper.
Again, call Qlearning() to compute a (joint) policy, since we are still dealing with one (joint) agent.
Execute the policy.

One will probably get similar visual results as follows. Note that this time Qlearning() operates over joint state space and joint action space with certain aggregated rewards, and therefore, may not perform well if it is not trained for enough iterations.
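
The size of the joint action space is what makes this formulation expensive. The sketch below assumes that a joint action is simply one tuple holding one individual action per agent, enumerated as a Cartesian product; the agent names, the count of five actions per agent, and the ordering are illustrative assumptions and may differ from what training_env.joint_actions actually contains.

```python
from itertools import product

agents = ['robot_0', 'robot_1', 'robot_2']  # hypothetical agent names
num_actions = 5                             # assumed number of actions per agent

# One tuple per joint action, holding one individual action per agent.
joint_actions = list(product(range(num_actions), repeat=len(agents)))

def decode(joint_index):
    """Map a joint action index back to a per-agent action dict."""
    return dict(zip(agents, joint_actions[joint_index]))

# The joint space grows exponentially with the number of agents.
num_joint_actions = len(joint_actions)  # num_actions ** len(agents)
example = decode(7)
```

With three agents and five actions each there are already 125 joint actions per joint state, which is one reason the joint learner needs more iterations than the single-agent one.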

Multi-Agent Individual Q Learning¶

As an alternative to centralized control, one can view each agent as an autonomous entity interacting with the others. Each agent learns an individual policy, even though the interaction is non-stationary from its own point of view.

from marp.rl import individualQlearning

env.reset()
policies = individualQlearning(env, num_it=5e4, epsilon=0.5, alpha=0.3)

observations, infos = env.reset()
while env.agents:
    # Each agent follows its own policy; unseen states fall back to action 0.
    actions = {
        agent: policies[agent].get(str(observations[agent]), 0)
        for agent in env.agents
    }
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.render()

The code above does the following:

Just take the raw multi-agent environment.
Instead, call individualQlearning() to compute individual policies for each agent.
Execute the policy profile.

Note that for each agent, her policy is a mapping from possible joint states to her own actions. One will probably get visual results similar to the following. As shown, agent 2 took greedy moves at first but then made a detour to avoid a collision with agent 1.
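
The idea of individual learners over joint states can be sketched as follows. This is not the code behind individualQlearning(); the two-agent crossing, the reward values, and the names step and individual_q_learning are assumptions chosen to keep the example small. Each agent keeps its own Q-table keyed by the joint state and updates it from its own reward only, so the other agent's changing behavior is what makes the problem non-stationary.

```python
import random

AGENTS = ['robot_0', 'robot_1']  # hypothetical agent names
N_ACTIONS = 2                    # 0: stay, 1: advance
CROSSING, GOAL = 1, 2            # each agent walks cells 0 -> 1 -> 2

def step(state, actions):
    """Both agents advance on their own track; cell 1 is a shared crossing."""
    nxt = tuple(min(pos + actions[ag], GOAL) for pos, ag in zip(state, AGENTS))
    collided = all(pos == CROSSING for pos in nxt)
    rewards = {}
    for pos, prev, ag in zip(nxt, state, AGENTS):
        if collided:
            rewards[ag] = -1.0
        elif pos == GOAL and prev != GOAL:
            rewards[ag] = 1.0
        else:
            rewards[ag] = -0.1
    return nxt, rewards, all(pos == GOAL for pos in nxt)

def individual_q_learning(num_it=3000, epsilon=0.5, alpha=0.3, gamma=0.9, seed=0):
    """One Q-table per agent, each keyed by the joint state."""
    rng = random.Random(seed)
    q = {ag: {} for ag in AGENTS}
    for _ in range(int(num_it)):
        state = (0, 0)
        for _ in range(20):
            key, actions = str(state), {}
            for ag in AGENTS:
                values = q[ag].setdefault(key, [0.0] * N_ACTIONS)
                if rng.random() < epsilon:
                    actions[ag] = rng.randrange(N_ACTIONS)
                else:
                    actions[ag] = max(range(N_ACTIONS), key=values.__getitem__)
            nxt, rewards, done = step(state, actions)
            for ag in AGENTS:
                nxt_values = q[ag].setdefault(str(nxt), [0.0] * N_ACTIONS)
                # Each agent updates from its own reward only.
                target = rewards[ag] + (0.0 if done else gamma * max(nxt_values))
                q[ag][key][actions[ag]] += alpha * (target - q[ag][key][actions[ag]])
            state = nxt
            if done:
                break
    return {ag: {k: max(range(N_ACTIONS), key=v.__getitem__)
                 for k, v in table.items()} for ag, table in q.items()}

policies = individual_q_learning()
```

The returned value has the same shape as a policy profile: one dict per agent, each mapping a stringified joint state to that agent's own action.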

Deep RL¶

In addition to conventional tabular RL implementations, one can go one step further and integrate existing deep RL algorithms, e.g., DQN, A2C, and PPO.

We also provide a tutorial that uses Stable Baselines3 as a pool of deep RL algorithms and SuperSuit for environment wrappers that enable parallel training.

Single-Agent DRL¶

import time

import numpy as np
import supersuit as ss
from stable_baselines3 import DQN

from marp.rl import SingleAgentLearningWrapper

alg = DQN
policy_kwargs = {
    'net_arch': [8, 2],
}
agent = 'robot_0'

env.reset()
training_env = SingleAgentLearningWrapper(env, agent)
training_env = ss.stable_baselines3_vec_env_v0(training_env, num_envs=8)
training_env.reset()

model = alg("MlpPolicy", training_env,
            verbose=1,
            tau=0.5,
            exploration_fraction=0.5,
            batch_size=256,
            policy_kwargs=policy_kwargs,
            tensorboard_log="runs")
model.learn(total_timesteps=int(2.5e6),
            tb_log_name=f"{time.strftime('%Y-%m-%d-%H%M%S', time.localtime())}")
model.save(f"pretrained/singleDQN_{agent}")

policy = alg.load(f"pretrained/singleDQN_{agent}.zip")
observations, infos = env.reset()
while env.agents:
    # All agents default to action 0; only the trained agent follows the model.
    actions = {name: 0 for name in ('robot_0', 'robot_1', 'robot_2')}
    actions[agent] = policy.predict(observations[agent], deterministic=True)[0]
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.render()

Multi-Agent Joint Learning¶

import time

import numpy as np
import supersuit as ss
from stable_baselines3 import PPO

from marp.rl import MultiAgentJointLearningWrapper

alg = PPO
policy_kwargs = {
    'net_arch': dict(pi=[16, 6], vf=[16, 6]),
}

training_env = MultiAgentJointLearningWrapper(env)
ja = training_env.joint_actions
training_env = ss.stable_baselines3_vec_env_v0(training_env, num_envs=8)

training_env.reset()
model = alg("MlpPolicy", training_env,
            verbose=1,
            batch_size=128,
            policy_kwargs=policy_kwargs,
            tensorboard_log="runs")
model.learn(total_timesteps=int(10e6),
            tb_log_name=f"{time.strftime('%Y-%m-%d-%H%M%S', time.localtime())}")
model.save("pretrained/jointPPO")

model = alg.load("pretrained/jointPPO.zip")
observations, infos = env.reset()
while env.agents:
    a, _ = model.predict(observations[env.agents[0]])
    actions = ja[a]
    actions = dict(zip(env.agents, actions))
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.render()

Multi-Agent Individual Learning¶

import time

import supersuit as ss
from stable_baselines3 import A2C

alg = A2C

training_env = ss.pettingzoo_env_to_vec_env_v1(env)
training_env = ss.concat_vec_envs_v1(training_env, 8, num_cpus=1, base_class="stable_baselines3")

training_env.reset()
model = alg("MlpPolicy", training_env,
            verbose=1,
            tensorboard_log="runs")
model.learn(total_timesteps=int(8e6),
            tb_log_name=f"{time.strftime('%Y-%m-%d-%H%M%S', time.localtime())}")
model.save("pretrained/indiA2C")

model = alg.load("pretrained/indiA2C.zip")
observations, infos = env.reset()
while env.agents:
    actions = {
        agent: model.predict(observations[agent], deterministic=True)[0]
        for agent in env.agents
    }
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.render()

Detailed Usage¶

Functions¶

marp.rl.Qlearning(env, num_it=1000.0, epsilon=0.5, alpha=0.3, gamma=0.9)¶

Tabular Q learning

Parameters:

env (RLEnv) – a single/joint-agent RL environment
num_it (int or float) – the number of learning iterations
epsilon (float) – the initial exploration rate, which decays linearly to 0.1 over the first half of the iterations
alpha (float) – learning rate
gamma (float) – discount factor

Returns:

policy (dict) – the learned policy
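
The exploration schedule described for epsilon can be sketched as below. The function name epsilon_at and the floor parameter are hypothetical, and the exact formula used inside Qlearning() is an assumption; the sketch only reproduces the documented behavior of a linear decay to 0.1 over the first half of the iterations.

```python
def epsilon_at(it, num_it=1000.0, epsilon=0.5, floor=0.1):
    """Exploration rate at iteration `it` under a linear decay to `floor`."""
    half = num_it / 2
    if it >= half:
        # Second half of training: exploration stays at the floor.
        return floor
    # First half: interpolate linearly from `epsilon` down to `floor`.
    return epsilon + (floor - epsilon) * (it / half)

schedule = [epsilon_at(it) for it in (0, 250, 500, 999)]
```

Because num_it defaults to a float and is often passed in scientific notation (for example 3e4), any loop over iterations would need int(num_it).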

marp.rl.individualQlearning(env, num_it=1000.0, epsilon=0.5, alpha=0.3, gamma=0.9)¶

Tabular Q learning for multiple individual learners

Parameters:

env (MARLEnv) – a multi-agent RL environment
num_it (int or float) – the number of learning iterations
epsilon (float) – the initial exploration rate, which decays linearly to 0.1 over the first half of the iterations
alpha (float) – learning rate
gamma (float) – discount factor

Returns:

policies (dict[str, dict]) – a policy profile

Classes¶

class marp.rl.SingleAgentLearningWrapper(ma_env, agent)¶

Formulate a single agent learning problem

Parameters:

ma_env (MARP) – an initialized multi-agent environment
agent (str) – the agent that the problem is induced for

reset(seed=None, options=None)¶

Reset the location of the agent

step(action)¶

Proceed to the next step by the given action

Parameters:

action (Action) – the next action

Returns:

obs (dict) – local observation
reward (float) – reward
termination (bool) – whether the episode terminates
truncation (bool) – whether the maximum number of steps is exceeded
info (dict) – auxiliary information including collision situations and action masks
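
The Formulation as Wrappers idea behind this class can be illustrated with a small stand-in. ToyMultiAgentEnv and SingleAgentView below are hypothetical and are not part of marp; the sketch assumes that action 0 leaves an agent in place and that the wrapped environment returns per-agent dicts from reset() and step(). The toy observation is a plain integer, whereas the real wrapper returns a dict.

```python
class ToyMultiAgentEnv:
    """Hypothetical stand-in for a parallel multi-agent environment."""

    def __init__(self):
        self.agents = ['robot_0', 'robot_1']
        self.pos = {}

    def reset(self, seed=None, options=None):
        self.pos = {ag: 0 for ag in self.agents}
        return dict(self.pos), {ag: {} for ag in self.agents}

    def step(self, actions):
        for ag, a in actions.items():
            self.pos[ag] += a  # action 0 leaves the agent in place
        reached = {ag: self.pos[ag] >= 3 for ag in self.agents}
        rewards = {ag: float(reached[ag]) for ag in self.agents}
        truncations = {ag: False for ag in self.agents}
        infos = {ag: {} for ag in self.agents}
        return dict(self.pos), rewards, reached, truncations, infos


class SingleAgentView:
    """Expose one agent of a multi-agent env through a single-agent interface."""

    def __init__(self, ma_env, agent, noop=0):
        self.ma_env, self.agent, self.noop = ma_env, agent, noop

    def reset(self, seed=None, options=None):
        obs, infos = self.ma_env.reset(seed=seed, options=options)
        return obs[self.agent], infos[self.agent]

    def step(self, action):
        # Every other agent receives the no-op action.
        actions = {ag: self.noop for ag in self.ma_env.agents}
        actions[self.agent] = action
        obs, rewards, terms, truncs, infos = self.ma_env.step(actions)
        ag = self.agent
        return obs[ag], rewards[ag], terms[ag], truncs[ag], infos[ag]


view = SingleAgentView(ToyMultiAgentEnv(), 'robot_0')
obs, info = view.reset()
obs, reward, termination, truncation, info = view.step(1)
```

Keeping the problem formulation in a wrapper means the same learning routine can be reused unchanged when the formulation changes.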

class marp.rl.MultiAgentJointLearningWrapper(ma_env)¶

Formulate a multi-agent joint learning problem

Parameters:

ma_env (MARP) – an initialized multi-agent environment

reset(seed=None, options=None)¶

Reset the locations of agents

step(action)¶

Proceed to the next step by the given joint action

Parameters:

action (Action) – the next joint action

Returns:

obs (dict) – joint observations
reward (float) – aggregated reward by simple summation
termination (bool) – whether the episode terminates for all agents
truncation (bool) – whether the maximum number of steps is exceeded
info (dict) – auxiliary information including collision situations and action masks
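
The aggregation of per-agent signals into joint ones can be sketched as follows. The helper name aggregate is hypothetical. Summing rewards and requiring termination for all agents follow the descriptions above, while treating the episode as truncated when any agent is truncated is an assumption of this sketch.

```python
def aggregate(rewards, terminations, truncations):
    """Collapse per-agent signals into joint reward, termination, truncation."""
    joint_reward = float(sum(rewards.values()))       # simple summation
    joint_termination = all(terminations.values())    # ends only when all agents end
    joint_truncation = any(truncations.values())      # assumption: shared step limit
    return joint_reward, joint_termination, joint_truncation

joint = aggregate(
    rewards={'robot_0': 1.0, 'robot_1': -0.5},
    terminations={'robot_0': True, 'robot_1': False},
    truncations={'robot_0': False, 'robot_1': False},
)
```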