marl-mixing-network

Other · epymarl · rigorous codebase

Description

Cooperative MARL: Value Decomposition Mixing Network

Objective

Improve cooperative multi-agent reinforcement learning by designing a better value decomposition mixing network. You can modify the CustomMixer class (lines 13-49) and add custom imports (lines 7-8) in custom.py.

Background

In cooperative MARL, agents share a common reward but each agent has only a partial observation. Value decomposition methods learn individual agent Q-values and combine them into a joint Q_tot using a mixing network. The quality of this mixing network directly determines how well individual agents can coordinate.
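The decomposition described above can be written compactly as follows. The notation is an assumption introduced here for illustration and is not taken from the task files: \tau_i and u_i are agent i's action-observation history and action, s is the global state, and f_mix is the mixing network.

```latex
% Joint value as a (possibly state-conditioned) mixture of per-agent values
Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{u}; s)
  = f_{\mathrm{mix}}\big(Q_1(\tau_1, u_1), \dots, Q_n(\tau_n, u_n);\, s\big)

% Consistency between joint and per-agent greedy actions,
% which decentralised execution relies on
\arg\max_{\mathbf{u}} Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{u}; s)
  = \big(\arg\max_{u_1} Q_1(\tau_1, u_1), \dots, \arg\max_{u_n} Q_n(\tau_n, u_n)\big)

% A sufficient condition for that consistency: monotonic mixing
\frac{\partial Q_{\mathrm{tot}}}{\partial Q_i} \ge 0 \quad \text{for all } i
```

A plain sum (f_mix = \sum_i Q_i) satisfies the monotonicity condition trivially. A state-conditioned mixer can represent a larger class of joint value functions while still satisfying it.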

The training uses EPyMARL with Q-learning on three PettingZoo MPE cooperative tasks:

  • simple_spread: 3 agents must spread out to cover 3 landmarks while avoiding collisions.
  • simple_tag: 3 predator agents cooperate to catch a pretrained prey controlled by a fixed DDPG policy.
  • simple_speaker_listener: a speaker and listener must coordinate through communication to reach the correct target.

The default mixer is a simple learnable weighted sum that does not condition on the global state. Each setup trains for 2M environment timesteps with epsilon-greedy exploration.

Interface

Your CustomMixer must:

  • Inherit from nn.Module
  • Accept args in __init__ with attributes: n_agents, state_shape, mixing_embed_dim
  • Implement forward(self, agent_qs, states) where:
    • agent_qs: shape (batch, T, n_agents) — individual agent Q-values
    • states: shape (batch, T, state_dim) — global state information
    • Returns q_tot: shape (batch, T, 1) — joint action value
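As a rough illustration of this interface, the sketch below implements a state-agnostic learnable weighted sum, in the spirit of the default mixer described in the Background. It is not the shipped default. The class name `WeightedSumMixer`, the initialisation, and the sizes in the shape check (batch 4, T 10, state dimension 54) are assumptions made for the example. In custom.py the class would have to be named CustomMixer.

```python
from types import SimpleNamespace

import torch as th
import torch.nn as nn


class WeightedSumMixer(nn.Module):
    """Illustrative state-agnostic mixer: learnable weighted sum plus bias."""

    def __init__(self, args):
        super().__init__()
        self.n_agents = args.n_agents
        # One learnable weight per agent and a scalar bias. With this
        # initialisation the mixer starts out as a plain sum of agent Q-values.
        self.weights = nn.Parameter(th.ones(self.n_agents))
        self.bias = nn.Parameter(th.zeros(1))

    def forward(self, agent_qs, states):
        # agent_qs: (batch, T, n_agents). states is accepted to satisfy the
        # interface but is not used by this state-agnostic mixer.
        q_tot = (agent_qs * self.weights).sum(dim=-1, keepdim=True) + self.bias
        return q_tot  # (batch, T, 1)


# Hypothetical sizes, chosen only for a quick shape check.
args = SimpleNamespace(n_agents=3, state_shape=54, mixing_embed_dim=32)
mixer = WeightedSumMixer(args)
agent_qs = th.randn(4, 10, args.n_agents)
states = th.randn(4, 10, 54)
q_tot = mixer(agent_qs, states)
print(q_tot.shape)  # torch.Size([4, 10, 1])
```

Note that nothing in this sketch constrains the weights to be non-negative, so monotonicity in the agent Q-values is not guaranteed once training moves them away from their initial values.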

Reference Implementations

  • VDN (vdn.py): Simple sum, Q_tot = sum(Q_i). No parameters, no state conditioning.
  • LinearMixer: Learnable weighted sum with a bias term. State-agnostic but more flexible than VDN.
  • QMIX (qmix.py): Uses hypernetworks conditioned on the global state to generate the mixing weights. Enforces monotonicity of Q_tot in each agent's Q-value by taking the absolute value of the generated weights.
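To make the contrast between these references concrete, here is a minimal sketch of a VDN-style sum and a QMIX-style monotonic mixer. It follows the general hypernetwork idea only and is not a copy of vdn.py or qmix.py. The class names, layer sizes, and the example state dimension of 54 are assumptions made for illustration.

```python
from types import SimpleNamespace

import numpy as np
import torch as th
import torch.nn as nn
import torch.nn.functional as F


class VDNStyleMixer(nn.Module):
    """Parameter-free sum over the agent dimension."""

    def forward(self, agent_qs, states):
        return agent_qs.sum(dim=2, keepdim=True)


class MonotonicMixer(nn.Module):
    """QMIX-style mixer: state-conditioned, non-negative mixing weights."""

    def __init__(self, args):
        super().__init__()
        self.n_agents = args.n_agents
        self.state_dim = int(np.prod(args.state_shape))
        self.embed_dim = args.mixing_embed_dim
        # Hypernetworks map the global state to the mixing-layer parameters.
        self.hyper_w1 = nn.Linear(self.state_dim, self.n_agents * self.embed_dim)
        self.hyper_b1 = nn.Linear(self.state_dim, self.embed_dim)
        self.hyper_w2 = nn.Linear(self.state_dim, self.embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(self.state_dim, self.embed_dim),
            nn.ReLU(),
            nn.Linear(self.embed_dim, 1),
        )

    def forward(self, agent_qs, states):
        batch = agent_qs.size(0)
        states = states.reshape(-1, self.state_dim)
        qs = agent_qs.reshape(-1, 1, self.n_agents)
        # abs() keeps the weights non-negative, so q_tot is non-decreasing
        # in every agent's Q-value (ELU is itself monotonic).
        w1 = th.abs(self.hyper_w1(states)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(states).view(-1, 1, self.embed_dim)
        hidden = F.elu(th.bmm(qs, w1) + b1)
        w2 = th.abs(self.hyper_w2(states)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(states).view(-1, 1, 1)
        q_tot = th.bmm(hidden, w2) + b2
        return q_tot.view(batch, -1, 1)


# Hypothetical sizes, chosen only for a quick check.
args = SimpleNamespace(n_agents=3, state_shape=(54,), mixing_embed_dim=32)
mixer = MonotonicMixer(args)
agent_qs = th.randn(4, 10, 3)
states = th.randn(4, 10, 54)
q_tot = mixer(agent_qs, states)
vdn_q_tot = VDNStyleMixer()(agent_qs, states)
```

The biases are left unconstrained because they shift Q_tot without affecting its dependence on the agent Q-values. Only the weights need the absolute value for the monotonicity argument to hold.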

Evaluation

Final performance is measured by the mean episode return over 32 test episodes under a greedy policy. Each of the three setups is evaluated separately, and its result is recorded to the leaderboard under a setup-specific metric key.

Code

custom.py
 1  import numpy as np
 2  import torch as th
 3  import torch.nn as nn
 4  import torch.nn.functional as F
 5
 6
 7  # ── Custom imports (editable) ────────────────────────────────────────────
 8
 9
10  # ======================================================================
11  # EDITABLE — Custom mixing network
12  # ======================================================================
13  class CustomMixer(nn.Module):
14      """Custom mixing network for cooperative MARL value decomposition.
15

Additional context files (read-only):

  • epymarl/src/modules/mixers/vdn.py
  • epymarl/src/modules/mixers/qmix.py
  • epymarl/src/learners/q_learner.py

Results

No results yet.