rl-reward-learning

Tags: Reinforcement Learning · imitation · rigorous codebase

Description

Inverse RL: Reward Learning from Expert Demonstrations

Objective

Design and implement an inverse reinforcement learning (IRL) algorithm that learns a reward function from expert demonstrations. Your code goes in custom_irl.py, specifically the RewardNetwork and IRLAlgorithm classes. Three reference implementations (GAIL, AIRL, BC) from the imitation library are provided as read-only context.

Background

Inverse reinforcement learning recovers a reward function that explains observed expert behavior. The learned reward is then used to train a policy via standard RL (PPO in this benchmark). Key challenges include:

  • Designing reward network architectures that capture the structure of expert behavior
  • Balancing discriminator training with policy improvement
  • Avoiding reward hacking where the policy exploits learned reward artifacts
  • Ensuring the learned reward generalizes across different states visited during training
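The pipeline described here, PPO trained on the learned reward rather than the environment's own, can be sketched as a thin wrapper that swaps the reward at each step. This is a minimal illustration, assuming `reward_net` is a callable `(obs, action) -> float`; the class name and interface are ours, not the benchmark's API:

```python
class LearnedRewardWrapper:
    """Replaces the environment's reward with a learned reward model's output.

    Illustrative sketch: `reward_net` is any callable (obs, action) -> float,
    e.g. a small neural network. The wrapped env keeps the Gymnasium-style
    (obs, reward, terminated, truncated, info) step contract.
    """

    def __init__(self, env, reward_net):
        self.env = env
        self.reward_net = reward_net

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        # Discard the true env reward; substitute the learned one.
        obs, _, terminated, truncated, info = self.env.step(action)
        return obs, float(self.reward_net(obs, action)), terminated, truncated, info
```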

Different IRL approaches address these through adversarial training (GAIL), potential-based reward shaping (AIRL), or direct behavioral cloning. Your goal is to design a novel reward network architecture or IRL training algorithm that outperforms these baselines.
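The potential-based shaping used by AIRL decomposes the reward as r(s, a, s') = g(s, a) + γ·h(s') − h(s), where shaping by the potential h leaves the optimal policy unchanged. A minimal sketch of that decomposition (function names and signatures are illustrative, not the imitation library's API):

```python
def shaped_reward(g, h, obs, act, next_obs, gamma=0.99):
    """AIRL-style potential-shaped reward: r = g(s,a) + gamma*h(s') - h(s).

    g and h are arbitrary callables (in practice, small MLPs). Because the
    h terms telescope over a trajectory, shaping does not change which
    policy is optimal, only how credit is distributed across steps.
    """
    return g(obs, act) + gamma * h(next_obs) - h(obs)
```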

Evaluation

Policies are trained and evaluated on three MuJoCo locomotion environments using pre-generated expert demonstrations: HalfCheetah-v4, Hopper-v4, and Walker2d-v4. The metric is mean episodic return over 10 evaluation episodes (higher is better). The policy is trained using PPO with the learned reward signal.
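The metric amounts to averaging undiscounted episode returns. A minimal sketch, assuming a Gymnasium-style env and a `policy` callable mapping observations to actions (both names are illustrative):

```python
def evaluate(env, policy, n_episodes=10):
    """Mean episodic return over n_episodes: the benchmark's metric."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, ep_ret = False, 0.0
        while not done:
            action = policy(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_ret += reward  # undiscounted sum of per-step rewards
            done = terminated or truncated
        returns.append(ep_ret)
    return sum(returns) / len(returns)
```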

Code

custom_irl.py
# Custom IRL / Reward Learning algorithm for MLS-Bench
#
# EDITABLE section: RewardNetwork and IRLAlgorithm classes.
# FIXED sections: everything else (config, env, demo loading, PPO training, evaluation).
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

Additional context files (read-only):

  • imitation/src/imitation/rewards/reward_nets.py

Results

| Model | Type | Eval return (HalfCheetah-v4) | Eval return (Hopper-v4) | Eval return (Walker2d-v4) |
|---|---|---|---|---|
| airl | baseline | 1226.327 | 1331.710 | 1222.213 |
| bc | baseline | 2142.693 | 1386.700 | 1227.703 |
| gail | baseline | 1195.143 | 890.410 | 415.610 |
| claude-opus-4.6 | vanilla | 565.690 | - | - |
| deepseek-reasoner | vanilla | 2483.830 | - | - |
| gemini-3.1-pro-preview | vanilla | - | - | - |
| qwen3.6-plus | vanilla | -53.570 | - | - |
| claude-opus-4.6 | agent | 1358.230 | 488.900 | 334.720 |
| deepseek-reasoner | agent | 1066.740 | 165.300 | 687.970 |
| gemini-3.1-pro-preview | agent | 3014.000 | 483.470 | 1920.580 |
| qwen3.6-plus | agent | 3531.330 | 2469.950 | 1623.330 |
