rl-reward-learning

Tags: Reinforcement Learning · imitation · rigorous codebase

Description

Inverse RL: Reward Learning from Expert Demonstrations

Objective

Design and implement an inverse reinforcement learning (IRL) algorithm that learns a reward function from expert demonstrations. Your code goes in custom_irl.py, specifically the RewardNetwork and IRLAlgorithm classes. Three reference implementations (GAIL, AIRL, BC) from the imitation library are provided as read-only context.

Background

Inverse reinforcement learning recovers a reward function that explains observed expert behavior. The learned reward is then used to train a policy via standard RL (PPO in this benchmark). Key challenges include:

  • Designing reward network architectures that capture the structure of expert behavior
  • Balancing discriminator training with policy improvement
  • Avoiding reward hacking where the policy exploits learned reward artifacts
  • Ensuring the learned reward generalizes across different states visited during training
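The pipeline described here, PPO trained on the learned reward rather than the environment's own, can be sketched as a thin wrapper that swaps the reward at each step. This is a minimal illustration, assuming `reward_net` is a callable `(obs, action) -> float`; the class name and interface are ours, not the benchmark's API:

```python
class LearnedRewardWrapper:
    """Replaces the environment's reward with a learned reward model's output.

    Illustrative sketch: `reward_net` is any callable (obs, action) -> float,
    e.g. a small neural network. The wrapped env keeps the Gymnasium-style
    (obs, reward, terminated, truncated, info) step contract.
    """

    def __init__(self, env, reward_net):
        self.env = env
        self.reward_net = reward_net

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        # Discard the true env reward; substitute the learned one.
        obs, _, terminated, truncated, info = self.env.step(action)
        return obs, float(self.reward_net(obs, action)), terminated, truncated, info
```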

Different IRL approaches address these through adversarial training (GAIL), potential-based reward shaping (AIRL), or direct behavioral cloning. Your goal is to design a novel reward network architecture or IRL training algorithm that outperforms these baselines.
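The potential-based shaping used by AIRL decomposes the reward as r(s, a, s') = g(s, a) + γ·h(s') − h(s), where shaping by the potential h leaves the optimal policy unchanged. A minimal sketch of that decomposition (function names and signatures are illustrative, not the imitation library's API):

```python
def shaped_reward(g, h, obs, act, next_obs, gamma=0.99):
    """AIRL-style potential-shaped reward: r = g(s,a) + gamma*h(s') - h(s).

    g and h are arbitrary callables (in practice, small MLPs). Because the
    h terms telescope over a trajectory, shaping does not change which
    policy is optimal, only how credit is distributed across steps.
    """
    return g(obs, act) + gamma * h(next_obs) - h(obs)
```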

Evaluation

Policies are trained and evaluated on three MuJoCo locomotion environments using pre-generated expert demonstrations: HalfCheetah-v4, Hopper-v4, and Walker2d-v4. The metric is mean episodic return over 10 evaluation episodes (higher is better). The policy is trained using PPO with the learned reward signal.
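The metric amounts to averaging undiscounted episode returns. A minimal sketch, assuming a Gymnasium-style env and a `policy` callable mapping observations to actions (both names are illustrative):

```python
def evaluate(env, policy, n_episodes=10):
    """Mean episodic return over n_episodes: the benchmark's metric."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, ep_ret = False, 0.0
        while not done:
            action = policy(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_ret += reward  # undiscounted sum of per-step rewards
            done = terminated or truncated
        returns.append(ep_ret)
    return sum(returns) / len(returns)
```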

Code

custom_irl.py
# Custom IRL / Reward Learning algorithm for MLS-Bench
#
# EDITABLE section: RewardNetwork and IRLAlgorithm classes.
# FIXED sections: everything else (config, env, demo loading, PPO training, evaluation).
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

Additional context files (read-only):

  • imitation/src/imitation/rewards/reward_nets.py

Results

| Model | Type | Eval return (HalfCheetah-v4) | Eval return (Hopper-v4) | Eval return (Walker2d-v4) |
|---|---|---|---|---|
| airl | baseline | 1226.327 | 1331.710 | 1222.213 |
| bc | baseline | 2142.693 | 1386.700 | 1227.703 |
| gail | baseline | 1195.143 | 890.410 | 415.610 |
| claude-opus-4.6 | vanilla | 565.690 | - | - |
| deepseek-reasoner | vanilla | 2483.830 | - | - |
| gemini-3.1-pro-preview | vanilla | - | - | - |
| qwen3.6-plus | vanilla | -53.570 | - | - |
| claude-opus-4.6 | agent | 1358.230 | 488.900 | 334.720 |
| deepseek-reasoner | agent | 1066.740 | 165.300 | 687.970 |
| gemini-3.1-pro-preview | agent | 3014.000 | 483.470 | 1920.580 |
| qwen3.6-plus | agent | 3531.330 | 2469.950 | 1623.330 |
