rl-reward-learning
Description
Inverse RL: Reward Learning from Expert Demonstrations
Objective
Design and implement an inverse reinforcement learning (IRL) algorithm that learns a reward function from expert demonstrations. Your code goes in custom_irl.py, specifically the RewardNetwork and IRLAlgorithm classes. Three reference implementations (GAIL, AIRL, BC) from the imitation library are provided as read-only context.
Background
Inverse reinforcement learning recovers a reward function that explains observed expert behavior. The learned reward is then used to train a policy via standard RL (PPO in this benchmark). Key challenges include:
- Designing reward network architectures that capture the structure of expert behavior
- Balancing discriminator training with policy improvement
- Avoiding reward hacking where the policy exploits learned reward artifacts
- Ensuring the learned reward generalizes across different states visited during training
Different IRL approaches address these through adversarial training (GAIL), potential-based reward shaping (AIRL), or direct behavioral cloning (BC). Your goal is to design a novel reward network architecture or IRL training algorithm that outperforms these baselines.
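To make the potential-based shaping idea concrete: AIRL decomposes the learned reward as r(s, a, s') = g(s, a) + γ·h(s') − h(s), where the shaping terms from the state-only potential h cancel over trajectories. The module below is an illustrative sketch of that decomposition (the class name, layer sizes, and dimensions are assumptions, not part of the provided code):

```python
import torch
import torch.nn as nn


class ShapedReward(nn.Module):
    """AIRL-style shaped reward: r(s, a, s') = g(s, a) + gamma * h(s') - h(s).

    g is the reward proper; h is a state-only potential whose shaping
    contribution telescopes over a trajectory, which helps disentangle
    the recovered reward from the environment dynamics.
    """

    def __init__(self, obs_dim: int, act_dim: int, gamma: float = 0.99, hidden: int = 64):
        super().__init__()
        self.gamma = gamma
        self.g = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.h = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs, act, next_obs):
        # Reward term plus telescoping potential-based shaping.
        g = self.g(torch.cat([obs, act], dim=-1))
        return (g + self.gamma * self.h(next_obs) - self.h(obs)).squeeze(-1)
```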
Evaluation
Trained and evaluated on three MuJoCo locomotion environments using pre-generated expert demonstrations: HalfCheetah-v4, Hopper-v4, Walker2d-v4. Metric: mean episodic return over 10 evaluation episodes (higher is better). The policy is trained using PPO with the learned reward signal.
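The metric is simply the average of per-episode returns. A sketch of such an evaluation loop, assuming a Gymnasium-style `env` and a `policy(obs)` callable (both hypothetical names, not the benchmark's fixed harness):

```python
def evaluate(env, policy, n_episodes: int = 10) -> float:
    """Mean episodic return over n_episodes (the benchmark metric)."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, ep_return = False, 0.0
        while not done:
            # Gymnasium step API: (obs, reward, terminated, truncated, info)
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
    return sum(returns) / len(returns)
```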
Code
```python
# Custom IRL / Reward Learning algorithm for MLS-Bench
#
# EDITABLE section: RewardNetwork and IRLAlgorithm classes.
# FIXED sections: everything else (config, env, demo loading, PPO training, evaluation).
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```
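Within the editable section, a natural starting point for `RewardNetwork` is a small state-action MLP mapping (s, a) to a scalar reward. The architecture below is a hedged sketch under assumed constructor arguments, not the benchmark's reference implementation:

```python
import torch
import torch.nn as nn


class RewardNetwork(nn.Module):
    """Minimal state-action reward model: r_theta(s, a) -> scalar per sample."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Concatenate state and action, return a (batch,) reward tensor.
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)
```

Tanh activations keep the reward surface smooth, which tends to make the downstream PPO training signal less prone to exploitable sharp artifacts.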
Additional context files (read-only):
imitation/src/imitation/rewards/reward_nets.py
Results
| Model | Type | Eval return HalfCheetah-v4 ↑ | Eval return Hopper-v4 ↑ | Eval return Walker2d-v4 ↑ |
|---|---|---|---|---|
| airl | baseline | 1226.327 | 1331.710 | 1222.213 |
| bc | baseline | 2142.693 | 1386.700 | 1227.703 |
| gail | baseline | 1195.143 | 890.410 | 415.610 |
| claude-opus-4.6 | vanilla | 565.690 | - | - |
| deepseek-reasoner | vanilla | 2483.830 | - | - |
| gemini-3.1-pro-preview | vanilla | - | - | - |
| qwen3.6-plus | vanilla | -53.570 | - | - |
| claude-opus-4.6 | agent | 1358.230 | 488.900 | 334.720 |
| deepseek-reasoner | agent | 1066.740 | 165.300 | 687.970 |
| gemini-3.1-pro-preview | agent | 3014.000 | 483.470 | 1920.580 |
| qwen3.6-plus | agent | 3531.330 | 2469.950 | 1623.330 |