# rl-intrinsic-exploration

## Description

RL Intrinsic Exploration: Sparse-Reward Novelty Bonus Design

## Research Question
Design an intrinsic exploration mechanism that improves sparse-reward discovery in hard-exploration Atari environments.
## Background
In sparse-reward reinforcement learning, external rewards arrive too infrequently for vanilla policy optimization to learn efficiently. A common solution is to add an intrinsic reward that encourages novelty, surprise, or state-space coverage.
This task isolates that question. The PPO training loop, Atari preprocessing, policy/value architecture, and optimization setup are fixed. The only thing you should redesign is the intrinsic bonus module and how its signal is mixed into learning.
Reference algorithm families include:
- No bonus / vanilla PPO: learns only from clipped extrinsic reward
- RND: rewards prediction error against a fixed random target network
- ICM: rewards forward-dynamics prediction error in learned feature space
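To make the RND family concrete, here is a minimal NumPy sketch. The dimensions, learning rate, and linear "networks" are illustrative stand-ins, not the benchmark's actual architecture: the bonus is the predictor's squared error against a frozen random target, and training the predictor on visited states drives their bonus toward zero, so only novel states keep paying out.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, FEAT_DIM = 16, 8

# Frozen random "target network": a fixed projection the predictor must match.
W_target = rng.normal(size=(OBS_DIM, FEAT_DIM))
# Trainable predictor, initialized differently from the target.
W_pred = rng.normal(size=(OBS_DIM, FEAT_DIM))

def rnd_bonus(obs):
    """Intrinsic reward: per-state squared prediction error vs. the frozen target."""
    err = obs @ W_pred - obs @ W_target
    return (err ** 2).mean(axis=-1)

def rnd_update(obs, lr=1e-2):
    """One SGD step on the predictor over a batch of visited states."""
    global W_pred
    err = obs @ W_pred - obs @ W_target                      # (batch, FEAT_DIM)
    W_pred -= lr * (2.0 / (obs.shape[0] * FEAT_DIM)) * (obs.T @ err)

seen = rng.normal(size=(64, OBS_DIM))   # states the agent visits repeatedly
before = rnd_bonus(seen).mean()
for _ in range(200):
    rnd_update(seen)
after = rnd_bonus(seen).mean()
# after < before: familiar states stop paying out, nudging the policy elsewhere
```

ICM differs in that the target is a learned forward-dynamics model in a learned feature space rather than a fixed random projection, but the shape of the signal (prediction error as bonus) is the same.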
## Task

Modify the editable section of `custom_intrinsic_exploration.py`:

- `IntrinsicBonusModule`: define how intrinsic rewards are computed and trained
- `mix_advantages(...)`: define how extrinsic and intrinsic advantages are combined

The editable code must keep the public interface intact:

- `initialize(envs)`
- `trainable_parameters()`
- `update_batch_stats(batch_obs, batch_next_obs)`
- `compute_bonus(obs, next_obs, actions)`
- `normalize_rollout_rewards(rollout_intrinsic)`
- `loss(batch_obs, batch_next_obs, batch_actions)`
- `mix_advantages(ext_advantages, int_advantages, args)`
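A minimal skeleton satisfying this interface might look as follows. The bonus shown (a count-based novelty signal over hashed observations) and the attributes of `args` are illustrative assumptions, not the benchmark's reference implementation; a real submission would replace the bonus with RND/ICM-style machinery.

```python
import numpy as np

class IntrinsicBonusModule:
    """Sketch of the required interface with a toy count-based novelty bonus."""

    def initialize(self, envs):
        self.counts = {}          # visit counts keyed by hashed observation
        self.running_std = 1.0    # running scale for reward normalization

    def trainable_parameters(self):
        return []                 # a pure count bonus has no learned parameters

    def update_batch_stats(self, batch_obs, batch_next_obs):
        for ob in batch_next_obs:
            key = hash(np.asarray(ob).tobytes())
            self.counts[key] = self.counts.get(key, 0) + 1

    def compute_bonus(self, obs, next_obs, actions):
        # Rarely visited successor states earn a larger bonus.
        keys = [hash(np.asarray(ob).tobytes()) for ob in next_obs]
        return np.array([1.0 / np.sqrt(self.counts.get(k, 0) + 1) for k in keys])

    def normalize_rollout_rewards(self, rollout_intrinsic):
        r = np.asarray(rollout_intrinsic, dtype=np.float64)
        self.running_std = 0.99 * self.running_std + 0.01 * (r.std() + 1e-8)
        return r / self.running_std

    def loss(self, batch_obs, batch_next_obs, batch_actions):
        return 0.0                # nothing to train for a count-based bonus

def mix_advantages(ext_advantages, int_advantages, args):
    """Fixed-weight mixing; `args` is assumed to carry the two coefficients."""
    ext_coef = getattr(args, "ext_coef", 2.0)
    int_coef = getattr(args, "int_coef", 1.0)
    return ext_coef * np.asarray(ext_advantages) + int_coef * np.asarray(int_advantages)
```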
## Evaluation
The agent is trained with the same fixed PPO-style loop on three sparse-reward Atari environments:
- Tutankham-v5
- Frostbite-v5
- PrivateEye-v5
Tutankham-v5 and Frostbite-v5 are visible during development. PrivateEye-v5 is held out as the hidden evaluation environment.
Metrics:

- `eval_return`: mean evaluation episodic return at a fixed training budget
- `auc`: area under the evaluation-return curve across training
- `nonzero_rate`: fraction of evaluation episodes with non-zero episodic return
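The metrics can be computed from an evaluation log like so. The normalization of `auc` by the step budget and the sample numbers are illustrative assumptions; the benchmark's exact conventions may differ.

```python
import numpy as np

# Hypothetical evaluation log: (env step, mean eval return) checkpoints.
steps   = np.array([0, 1_000_000, 2_000_000, 3_000_000, 4_000_000], dtype=float)
returns = np.array([-300.0, -120.0, 0.0, 250.0, 400.0])

# auc: trapezoidal area under the return curve, normalized by the step budget,
# so it rewards reaching good returns early rather than only at the end.
auc = np.sum((returns[1:] + returns[:-1]) / 2.0 * np.diff(steps)) / steps[-1]

# nonzero_rate over a batch of evaluation episode returns: did the agent
# discover any reward at all in each episode?
episode_returns = np.array([0.0, 100.0, 0.0, 400.0, 0.0])
nonzero_rate = np.mean(episode_returns != 0.0)
```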
Evaluation uses deterministic rollouts with a fixed per-episode step cap so a non-terminating Atari behavior cannot stall the benchmark.
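The step cap can be sketched as below; the toy environment and function names are hypothetical, and serve only to show how the cap bounds an episode that never terminates.

```python
class ToyEnv:
    """Toy stand-in environment that never sets done=True, to exercise the cap."""
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        return self.t, 1.0, False      # obs, reward, done

def evaluate_episode(env, policy, max_steps=100):
    """Deterministic rollout with a hard per-episode step cap."""
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):          # a stalling game cannot hang evaluation
        obs, reward, done = env.step(policy(obs))
        total += reward
        if done:
            break
    return total

# The cap bounds the episode even though ToyEnv never terminates.
ret = evaluate_episode(ToyEnv(), policy=lambda obs: 0)
# ret == 100.0
```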
Improvements must transfer across multiple games; a method that helps only one environment is insufficient. Tutankham-v5 is included as a medium-difficulty visible environment so that baseline rankings are measurable at a modest training budget, before transfer is checked on the harder visible and hidden games.
## Code

```python
# Custom sparse-reward Atari exploration benchmark for MLS-Bench.
#
# FIXED sections: PPO loop, Atari preprocessing, policy/value architecture,
# evaluation, logging, and optimizer wiring.
# EDITABLE section: IntrinsicBonusModule + mix_advantages.

from __future__ import annotations

import os
import random
import time
from collections import deque
from dataclasses import dataclass

import envpool
```
Additional context files (read-only):

- `cleanrl/cleanrl/ppo_rnd_envpool.py`
- `cleanrl/cleanrl/ppo_atari_envpool.py`
## Results
| Model | Type | eval return private eye v5 ↑ | auc private eye v5 ↑ | nonzero rate private eye v5 ↑ | best eval return private eye v5 ↑ | eval return tutankham v5 ↑ | auc tutankham v5 ↑ | nonzero rate tutankham v5 ↑ | best eval return tutankham v5 ↑ |
|---|---|---|---|---|---|---|---|---|---|
| icm | baseline | 0.000 | -29.291 | 0.000 | 0.000 | 109.000 | 86.635 | 1.000 | 111.000 |
| icm | baseline | 0.000 | -29.291 | 0.000 | 0.000 | 109.000 | 86.635 | 1.000 | 111.000 |
| icm | baseline | - | - | - | - | - | - | - | - |
| ppo | baseline | -300.000 | -240.105 | 0.667 | 33.333 | 36.533 | 32.367 | 0.333 | 37.800 |
| ppo | baseline | -300.000 | -240.105 | 0.667 | 33.333 | 36.533 | 32.367 | 0.333 | 37.800 |
| ppo | baseline | - | - | - | - | - | - | - | - |
| rnd | baseline | 952.000 | -84.230 | 0.667 | 1400.000 | 100.333 | 68.382 | 0.933 | 109.133 |
| rnd | baseline | 952.000 | -84.230 | 0.667 | 1400.000 | 100.333 | 68.382 | 0.933 | 109.133 |
| rnd | baseline | - | - | - | - | - | - | - | - |
| claude-opus-4.6 | vanilla | - | - | - | - | - | - | - | - |
| deepseek-reasoner | vanilla | - | - | - | - | - | - | - | - |
| deepseek-reasoner | vanilla | - | - | - | - | - | - | - | - |
| deepseek-reasoner | vanilla | - | - | - | - | - | - | - | - |
| gemini-3.1-pro-preview | vanilla | - | - | - | - | - | - | - | - |
| qwen3.6-plus | vanilla | - | - | - | - | - | - | - | - |
| qwen3.6-plus | vanilla | - | - | - | - | - | - | - | - |
| qwen3.6-plus | vanilla | - | - | - | - | - | - | - | - |
| claude-opus-4.6 | agent | 0.000 | -292.920 | 0.000 | 100.000 | 104.600 | 31.850 | 1.000 | 105.000 |
| deepseek-reasoner | agent | - | - | - | - | - | - | - | - |
| deepseek-reasoner | agent | - | - | - | - | - | - | - | - |
| deepseek-reasoner | agent | -22.400 | -610.786 | 0.200 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| deepseek-reasoner | agent | - | - | - | - | - | - | - | - |
| gemini-3.1-pro-preview | agent | 100.000 | 16.419 | 1.000 | 1020.000 | 114.600 | 104.677 | 1.000 | 118.200 |
| qwen3.6-plus | agent | - | - | - | - | - | - | - | - |
| qwen3.6-plus | agent | - | - | - | - | - | - | - | - |
| qwen3.6-plus | agent | 100.000 | -137.716 | 1.000 | 100.000 | 107.200 | 92.811 | 1.000 | 114.000 |
| qwen3.6-plus | agent | - | - | - | - | - | - | - | - |