marl-centralized-critic

Tags: Other · epymarl · rigorous codebase

Description

Cooperative MARL: Centralized Critic Architecture for MAPPO

Objective

Improve cooperative multi-agent reinforcement learning by designing a better centralized critic architecture for MAPPO (Multi-Agent PPO). You can modify the CustomCritic class (lines 13-69) and add custom imports (lines 7-8) in custom_critic.py.

Background

In cooperative MARL with partial observability, each agent sees only a local observation, but the team shares a common reward. Centralized-Training-with-Decentralized-Execution (CTDE) methods train a centralized value function during training (one that can see the global state and all agents' information) and use it to reduce variance when computing advantages for each agent's decentralized policy-gradient update. The architecture of this centralized critic (what it conditions on and how it mixes per-agent features) directly determines the bias-variance tradeoff of the advantage estimates, and therefore how well MAPPO scales to hard multi-agent cooperation tasks.
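The variance-reduction role of the centralized value function can be sketched with generalized advantage estimation (GAE) over a shared team reward. This is a toy sketch with illustrative shapes and random inputs, not EPyMARL's actual ppo_learner code:

```python
import torch

# Illustrative shapes: B episodes, T timesteps, N agents (hypothetical values).
B, T, N = 4, 10, 3
gamma, lam = 0.99, 0.95

values = torch.randn(B, T, N)                           # centralized V(s_t) per agent
rewards = torch.randn(B, T - 1, 1).expand(B, T - 1, N)  # common team reward
dones = torch.zeros(B, T - 1, N)                        # episode-termination mask

# One-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
deltas = rewards + gamma * values[:, 1:] * (1 - dones) - values[:, :-1]

# GAE accumulates the TD errors backwards in time.
adv = torch.zeros_like(deltas)
gae = torch.zeros(B, N)
for t in reversed(range(T - 1)):
    gae = deltas[:, t] + gamma * lam * (1 - dones[:, t]) * gae
    adv[:, t] = gae

print(adv.shape)  # torch.Size([4, 9, 3])
```

A lower-variance V (one that exploits the global state) tightens these per-agent advantage estimates, at the cost of whatever bias the critic architecture introduces.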

The training uses EPyMARL's ppo_learner with the MAPPO default hyperparameters on three SMAC maps via smaclite (a pure-Python reimplementation of the StarCraft Multi-Agent Challenge benchmark — no StarCraft II binary required):

  • mmm — 1 Medivac + 2 Marauders + 7 Marines (team of 10) vs mirror; heterogeneous cooperation requiring heal micro (≈5M env steps).
  • 2s3z — 2 Stalkers + 3 Zealots (team of 5) vs mirror; medium heterogeneous team (≈5M env steps).
  • 3s5z — 3 Stalkers + 5 Zealots (team of 8) vs mirror; larger team, harder (≈5M env steps).

The default critic is a simple (state ⊕ agent-one-hot) → 3-layer MLP → V that ignores per-agent observations. It is a working baseline but leaves room for smarter architectures that integrate per-agent features, attention, or state conditioning.

Interface

Your CustomCritic must:

  • Inherit from nn.Module.
  • Accept (scheme, args) in __init__, where:
    • scheme["state"]["vshape"] — global state dim
    • scheme["obs"]["vshape"] — per-agent observation dim
    • args.n_agents, args.n_actions, args.hidden_dim, args.obs_last_action, args.obs_individual_obs
  • Set self.output_type = "v" in __init__.
  • Implement forward(self, batch, t=None) where:
    • batch["state"] — shape (B, T, state_dim)
    • batch["obs"] — shape (B, T, n_agents, obs_dim)
    • batch.batch_size, batch.max_seq_length, batch.device
    • t=None means "whole sequence"; otherwise t is an integer
    • Returns q with shape (B, T, n_agents, 1) — the learner later does .squeeze(3), so the trailing singleton is mandatory.
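An interface-conforming critic can be sketched as follows. This is essentially the default centralV baseline (state ⊕ agent one-hot → MLP → V), shown as a starting point rather than a tuned solution; the dict standing in for EPyMARL's EpisodeBatch and all shape values are hypothetical:

```python
import torch as th
import torch.nn as nn
from types import SimpleNamespace


class CustomCritic(nn.Module):
    """Sketch of a centralized critic: (state ⊕ agent one-hot) -> MLP -> V."""

    def __init__(self, scheme, args):
        super().__init__()
        self.n_agents = args.n_agents
        input_dim = scheme["state"]["vshape"] + self.n_agents  # state + agent id
        self.net = nn.Sequential(
            nn.Linear(input_dim, args.hidden_dim), nn.ReLU(),
            nn.Linear(args.hidden_dim, args.hidden_dim), nn.ReLU(),
            nn.Linear(args.hidden_dim, 1),
        )
        self.output_type = "v"  # required by the learner

    def forward(self, batch, t=None):
        ts = slice(None) if t is None else slice(t, t + 1)
        state = batch["state"][:, ts]                   # (B, T', state_dim)
        bs, max_t = state.shape[:2]
        # Broadcast the global state to every agent, append a one-hot agent id.
        state = state.unsqueeze(2).expand(-1, -1, self.n_agents, -1)
        ids = th.eye(self.n_agents, device=state.device).view(1, 1, self.n_agents, -1)
        ids = ids.expand(bs, max_t, -1, -1)
        return self.net(th.cat([state, ids], dim=-1))   # (B, T', n_agents, 1)


# Hypothetical smoke test with a plain dict standing in for the EpisodeBatch:
args = SimpleNamespace(n_agents=3, hidden_dim=64)
scheme = {"state": {"vshape": 48}, "obs": {"vshape": 30}}
critic = CustomCritic(scheme, args)
batch = {"state": th.randn(4, 20, 48)}
print(critic(batch).shape)       # torch.Size([4, 20, 3, 1])
print(critic(batch, t=5).shape)  # torch.Size([4, 1, 3, 1])
```

Note that the trailing singleton dimension is preserved in both the whole-sequence and single-timestep cases, as the learner's `.squeeze(3)` requires.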

Reference Implementations

  • IPPO critic (ippo_critic.edit.py): per-agent MLP over batch["obs"] ⊕ agent-one-hot; no centralization. Floor baseline from Yu et al. 2022's IPPO ablation. Also see epymarl/src/modules/critics/ac.py.
  • MAPPO critic (mappo_critic.edit.py): shared MLP over (batch["state"] ⊕ agent-one-hot). Standard MAPPO central V from Yu et al. 2022. Also see epymarl/src/modules/critics/centralV.py.
  • MAT-style attention critic (mat_critic.edit.py): projects per-agent features (obs ⊕ state broadcast) into tokens, then a single TransformerEncoder layer with self-attention across the agent axis produces a per-agent value. Adapted from Wen et al. 2022 "Multi-Agent Transformer" (arXiv 2205.14953) — critic-only form; the MAPPO actor is kept unchanged.
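The attention-based variant can be sketched roughly as below: per-agent tokens built from (obs ⊕ broadcast state), one self-attention layer across the agent axis, then a per-agent value head. This is an illustrative reconstruction of the idea, not the contents of mat_critic.edit.py; the class name, head count, and shapes are assumptions:

```python
import torch as th
import torch.nn as nn
from types import SimpleNamespace


class AttentionCritic(nn.Module):
    """Sketch of a MAT-style critic: self-attention mixes agents' features."""

    def __init__(self, scheme, args, n_heads=4):  # n_heads must divide hidden_dim
        super().__init__()
        self.n_agents = args.n_agents
        token_dim = scheme["obs"]["vshape"] + scheme["state"]["vshape"]
        self.embed = nn.Linear(token_dim, args.hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=args.hidden_dim, nhead=n_heads,
            dim_feedforward=4 * args.hidden_dim, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.v_head = nn.Linear(args.hidden_dim, 1)
        self.output_type = "v"

    def forward(self, batch, t=None):
        ts = slice(None) if t is None else slice(t, t + 1)
        obs = batch["obs"][:, ts]      # (B, T', n_agents, obs_dim)
        state = batch["state"][:, ts]  # (B, T', state_dim)
        bs, max_t = obs.shape[:2]
        # Token per agent: its own observation ⊕ the broadcast global state.
        state = state.unsqueeze(2).expand(-1, -1, self.n_agents, -1)
        tokens = self.embed(th.cat([obs, state], dim=-1))
        # Fold (B, T') into the batch axis so attention mixes agents only.
        tokens = tokens.reshape(bs * max_t, self.n_agents, -1)
        mixed = self.encoder(tokens)
        return self.v_head(mixed).view(bs, max_t, self.n_agents, 1)


# Hypothetical smoke test with a plain dict standing in for the EpisodeBatch:
args = SimpleNamespace(n_agents=5, hidden_dim=64)
scheme = {"state": {"vshape": 48}, "obs": {"vshape": 30}}
critic = AttentionCritic(scheme, args)
batch = {"state": th.randn(2, 10, 48), "obs": th.randn(2, 10, 5, 30)}
print(critic(batch).shape)  # torch.Size([2, 10, 5, 1])
```

Attention over the agent axis lets each agent's value estimate weight the teammates most relevant to it, which is the main architectural lever the task asks you to explore.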

Evaluation

Final performance is measured by test win rate (battle_won_mean) averaged over 32 test episodes with the greedy policy, evaluated separately on all three SMAC maps and recorded to the leaderboard under setup-specific metric keys:

  • Primary: test_battle_won_mean_<map>
  • Secondary: test_return_mean_<map>

Higher is better. A strong centralized critic should generalize across maps of varying difficulty.

Code

custom_critic.py
 1  import numpy as np
 2  import torch as th
 3  import torch.nn as nn
 4  import torch.nn.functional as F
 5
 6
 7  # ── Custom imports (editable) ────────────────────────────────────────────
 8
 9
10  # ======================================================================
11  # EDITABLE — Custom centralized critic for MAPPO
12  # ======================================================================
13  class CustomCritic(nn.Module):
14      """Centralized critic for MAPPO on SMAC (via smaclite).
15

Additional context files (read-only):

  • epymarl/src/modules/critics/centralV.py
  • epymarl/src/modules/critics/ac.py
  • epymarl/src/learners/ppo_learner.py

Results

All metrics are test-time means (greedy policy, 32 episodes); "-" marks a run with no data for that map.

| Model | Type | return mean (mmm) | return std (mmm) | battle won mean (mmm) | return mean (2s3z) | return std (2s3z) | battle won mean (2s3z) | return mean (3s5z) | return std (3s5z) | battle won mean (3s5z) | return mean (2s vs 1sc) | return std (2s vs 1sc) | battle won mean (2s vs 1sc) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ippo_critic | baseline | 21.725 | 4.144 | 0.635 | 15.749 | 2.885 | 0.448 | 18.188 | 2.603 | 0.635 | - | - | - |
| ippo_critic | baseline | - | - | - | 13.849 | 3.267 | 0.271 | 15.423 | 3.266 | 0.354 | 9.975 | 1.113 | 0.010 |
| ippo_critic | baseline | - | - | - | 15.749 | 2.885 | 0.448 | 18.188 | 2.603 | 0.635 | 9.975 | 1.113 | 0.010 |
| ippo_critic | baseline | - | - | - | 4.476 | 1.006 | 0.000 | 4.645 | 0.845 | 0.000 | - | - | - |
| ippo_critic | baseline | 21.725 | 4.144 | 0.635 | - | - | - | - | - | - | - | - | - |
| mappo_critic | baseline | 22.819 | 1.846 | 0.927 | 19.106 | 2.019 | 0.833 | 18.618 | 2.435 | 0.740 | - | - | - |
| mappo_critic | baseline | - | - | - | 19.068 | 2.212 | 0.833 | 17.242 | 2.721 | 0.469 | 10.033 | 0.036 | 0.000 |
| mappo_critic | baseline | - | - | - | 19.106 | 2.019 | 0.833 | 18.618 | 2.435 | 0.740 | 10.033 | 0.036 | 0.000 |
| mappo_critic | baseline | - | - | - | 4.461 | 1.116 | 0.000 | 4.595 | 0.883 | 0.000 | - | - | - |
| mappo_critic | baseline | 22.819 | 1.846 | 0.927 | - | - | - | - | - | - | - | - | - |
| mat_critic | baseline | 18.948 | 1.854 | 0.135 | 14.182 | 2.707 | 0.542 | 15.129 | 1.618 | 0.115 | - | - | - |
| mat_critic | baseline | - | - | - | 13.800 | 3.090 | 0.500 | 13.600 | 1.155 | 0.000 | 12.978 | 1.417 | 0.281 |
| mat_critic | baseline | - | - | - | 14.182 | 2.707 | 0.542 | 15.129 | 1.618 | 0.115 | 12.978 | 1.417 | 0.281 |
| mat_critic | baseline | - | - | - | 4.487 | 0.999 | 0.000 | 4.536 | 0.874 | 0.000 | - | - | - |
| mat_critic | baseline | 18.948 | 1.854 | 0.135 | - | - | - | - | - | - | - | - | - |
| claude-opus-4.6 | vanilla | 22.602 | 2.976 | 0.906 | 10.328 | 1.126 | 0.000 | 7.585 | 1.039 | 0.000 | - | - | - |
| deepseek-reasoner | vanilla | 22.736 | 1.910 | 0.938 | 18.024 | 3.028 | 0.688 | 18.918 | 2.207 | 0.781 | - | - | - |
| gemini-3.1-pro-preview | vanilla | 22.515 | 2.172 | 0.875 | 19.061 | 2.218 | 0.844 | 18.008 | 2.831 | 0.656 | - | - | - |
| qwen3.6-plus | vanilla | - | - | - | - | - | - | - | - | - | - | - | - |
| claude-opus-4.6 | agent | 23.414 | 1.298 | 0.969 | 19.178 | 2.895 | 0.906 | 17.099 | 3.255 | 0.500 | - | - | - |
| deepseek-reasoner | agent | 22.736 | 1.910 | 0.938 | 18.024 | 3.028 | 0.688 | 18.918 | 2.207 | 0.781 | - | - | - |
| gemini-3.1-pro-preview | agent | 23.093 | 1.880 | 0.969 | 19.518 | 1.658 | 0.906 | 16.782 | 2.813 | 0.375 | - | - | - |
| qwen3.6-plus | agent | 23.039 | 0.567 | 1.000 | 19.164 | 2.614 | 0.906 | 19.137 | 2.114 | 0.844 | - | - | - |

Agent Conversations