rl-intrinsic-exploration

Tags: Reinforcement Learning · cleanrl · rigorous codebase

Description

RL Intrinsic Exploration: Sparse-Reward Novelty Bonus Design

Research Question

Design an intrinsic exploration mechanism that improves sparse-reward discovery in hard-exploration Atari environments.

Background

In sparse-reward reinforcement learning, external rewards arrive too infrequently for vanilla policy optimization to learn efficiently. A common solution is to add an intrinsic reward that encourages novelty, surprise, or state-space coverage.
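The simplest instance of such a bonus is count-based novelty: reward a state inversely to how often it has been visited. A minimal illustrative sketch (the `CountBonus` class and its `state_key` discretization are hypothetical, not part of the benchmark code):

```python
import math
from collections import defaultdict

class CountBonus:
    """Count-based novelty bonus: b(s) = scale / sqrt(N(s))."""

    def __init__(self, scale=1.0):
        self.counts = defaultdict(int)  # visit counts per discretized state
        self.scale = scale

    def bonus(self, state_key):
        # Increment the visit count, then pay out a decaying bonus.
        self.counts[state_key] += 1
        return self.scale / math.sqrt(self.counts[state_key])
```

For pixel observations such a table is intractable, which is why the reference families below approximate novelty with learned networks instead.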

This task isolates that question. The PPO training loop, Atari preprocessing, policy/value architecture, and optimization setup are fixed. The only thing you should redesign is the intrinsic bonus module and how its signal is mixed into learning.

Reference algorithm families include:

  • No bonus / vanilla PPO: learns only from clipped extrinsic reward
  • RND: rewards prediction error against a fixed random target network
  • ICM: rewards forward-dynamics prediction error in learned feature space
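The RND idea can be sketched compactly with linear networks (illustrative only; the real cleanrl implementation uses convolutional torch networks — `TinyRND` and its parameters are hypothetical):

```python
import numpy as np

class TinyRND:
    """RND sketch: bonus = squared error of a trained predictor
    against a frozen, randomly initialized target network."""

    def __init__(self, obs_dim, feat_dim, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.target = rng.normal(size=(obs_dim, feat_dim))  # frozen target
        self.pred = np.zeros((obs_dim, feat_dim))           # trained predictor
        self.lr = lr

    def bonus(self, obs):
        # obs: (batch, obs_dim) -> per-sample mean squared feature error.
        err = obs @ self.pred - obs @ self.target
        return (err ** 2).mean(axis=1)

    def update(self, obs):
        # One gradient step on the MSE between predictor and target features.
        err = obs @ self.pred - obs @ self.target
        self.pred -= self.lr * (obs.T @ err) / len(obs)
```

The key property: as the predictor trains on visited states, their bonus decays toward zero, while unvisited states keep a high prediction error.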

Task

Modify the editable section of custom_intrinsic_exploration.py:

  • IntrinsicBonusModule: define how intrinsic rewards are computed and trained
  • mix_advantages(...): define how extrinsic and intrinsic advantages are combined

The editable code must keep the public interface intact:

  • initialize(envs)
  • trainable_parameters()
  • update_batch_stats(batch_obs, batch_next_obs)
  • compute_bonus(obs, next_obs, actions)
  • normalize_rollout_rewards(rollout_intrinsic)
  • loss(batch_obs, batch_next_obs, batch_actions)
  • mix_advantages(ext_advantages, int_advantages, args)
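A skeleton conforming to this interface might look as follows. This is a hedged sketch, not the benchmark's reference code: the method bodies are placeholders (a zero-bonus module, behaviorally equivalent to vanilla PPO), the `single_observation_space` attribute and the `ext_coef`/`int_coef` argument names are assumptions, and the fixed loop may expect torch tensors rather than NumPy arrays:

```python
import numpy as np

class IntrinsicBonusModule:
    def initialize(self, envs):
        # Assumed envpool-style attribute; adapt to the fixed loop's env API.
        self.obs_shape = envs.single_observation_space.shape

    def trainable_parameters(self):
        return []  # nothing to optimize in this placeholder

    def update_batch_stats(self, batch_obs, batch_next_obs):
        pass  # running observation statistics would be maintained here

    def compute_bonus(self, obs, next_obs, actions):
        return np.zeros(len(obs))  # zero intrinsic reward

    def normalize_rollout_rewards(self, rollout_intrinsic):
        # Scale by the rollout's reward standard deviation (RND-style).
        return rollout_intrinsic / (rollout_intrinsic.std() + 1e-8)

    def loss(self, batch_obs, batch_next_obs, batch_actions):
        return 0.0  # no auxiliary loss for the zero-bonus placeholder

def mix_advantages(ext_advantages, int_advantages, args):
    # Common RND-style linear mix; the coefficient names are assumptions.
    return args.ext_coef * ext_advantages + args.int_coef * int_advantages
```

Any redesign replaces the placeholder bodies while keeping these signatures so the fixed PPO loop can drive the module unchanged.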

Evaluation

The agent is trained with the same fixed PPO-style loop on three sparse-reward Atari environments:

  • Tutankham-v5
  • Frostbite-v5
  • PrivateEye-v5

Tutankham-v5 and Frostbite-v5 are visible during development. PrivateEye-v5 is held out as the hidden evaluation environment.

Metrics:

  • eval_return: mean evaluation episodic return at a fixed training budget
  • auc: area under the evaluation-return curve across training
  • nonzero_rate: fraction of evaluation episodes with non-zero episodic return
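The auc metric rewards learning early, not just finishing high. It can be approximated from logged (step, eval_return) pairs by trapezoidal integration — a sketch under the assumption that the benchmark normalizes by the training budget (its exact normalization is not specified here):

```python
def eval_return_auc(steps, returns):
    """Trapezoidal area under the evaluation-return curve,
    normalized by the span of training steps covered."""
    area = 0.0
    for (s0, r0), (s1, r1) in zip(zip(steps, returns),
                                  zip(steps[1:], returns[1:])):
        area += 0.5 * (r0 + r1) * (s1 - s0)  # trapezoid between checkpoints
    return area / (steps[-1] - steps[0])
```

Two runs with the same final eval_return can thus differ substantially in auc if one discovers reward earlier in training.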

Evaluation uses deterministic rollouts with a fixed per-episode step cap so a non-terminating Atari behavior cannot stall the benchmark.
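A capped deterministic rollout of this kind reduces to a simple loop — a hypothetical sketch (the fixed loop's actual evaluation code is not shown here, and a classic gym 4-tuple `step` API is assumed):

```python
def eval_episode(env, policy, max_steps=4500):
    """Run one greedy episode, truncated at max_steps environment steps."""
    obs, total, done, t = env.reset(), 0.0, False, 0
    while not done and t < max_steps:
        # policy(obs) is assumed to return a deterministic (argmax) action.
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
        t += 1
    return total
```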

Improvement must transfer across multiple games; a method that only helps one environment is not sufficient. Tutankham-v5 is included as a medium-difficulty visible environment so baseline ranking is measurable at a modest training budget before transfer is checked on the harder visible and hidden games.

Code

custom_intrinsic_exploration.py
# Custom sparse-reward Atari exploration benchmark for MLS-Bench.
#
# FIXED sections: PPO loop, Atari preprocessing, policy/value architecture,
# evaluation, logging, and optimizer wiring.
# EDITABLE section: IntrinsicBonusModule + mix_advantages.

from __future__ import annotations

import os
import random
import time
from collections import deque
from dataclasses import dataclass

import envpool

Additional context files (read-only):

  • cleanrl/cleanrl/ppo_rnd_envpool.py
  • cleanrl/cleanrl/ppo_atari_envpool.py

Results

(PE = PrivateEye-v5, Tut = Tutankham-v5; "-" = no result reported)

| Model                  | Type     | Eval return (PE) | AUC (PE) | Nonzero rate (PE) | Best eval return (PE) | Eval return (Tut) | AUC (Tut) | Nonzero rate (Tut) | Best eval return (Tut) |
|------------------------|----------|------------------|----------|-------------------|-----------------------|-------------------|-----------|--------------------|------------------------|
| icm                    | baseline | 0.000            | -29.291  | 0.000             | 0.000                 | 109.000           | 86.635    | 1.000              | 111.000                |
| icm                    | baseline | -                | -        | -                 | -                     | -                 | -         | -                  | -                      |
| ppo                    | baseline | -300.000         | -240.105 | 0.667             | 33.333                | 36.533            | 32.367    | 0.333              | 37.800                 |
| ppo                    | baseline | -                | -        | -                 | -                     | -                 | -         | -                  | -                      |
| rnd                    | baseline | 952.000          | -84.230  | 0.667             | 1400.000              | 100.333           | 68.382    | 0.933              | 109.133                |
| rnd                    | baseline | -                | -        | -                 | -                     | -                 | -         | -                  | -                      |
| claude-opus-4.6        | vanilla  | -                | -        | -                 | -                     | -                 | -         | -                  | -                      |
| deepseek-reasoner      | vanilla  | -                | -        | -                 | -                     | -                 | -         | -                  | -                      |
| gemini-3.1-pro-preview | vanilla  | -                | -        | -                 | -                     | -                 | -         | -                  | -                      |
| qwen3.6-plus           | vanilla  | -                | -        | -                 | -                     | -                 | -         | -                  | -                      |
| claude-opus-4.6        | agent    | 0.000            | -292.920 | 0.000             | 100.000               | 104.600           | 31.850    | 1.000              | 105.000                |
| deepseek-reasoner      | agent    | -22.400          | -610.786 | 0.200             | 0.000                 | 0.000             | 0.000     | 0.000              | 0.000                  |
| deepseek-reasoner      | agent    | -                | -        | -                 | -                     | -                 | -         | -                  | -                      |
| gemini-3.1-pro-preview | agent    | 100.000          | 16.419   | 1.000             | 1020.000              | 114.600           | 104.677   | 1.000              | 118.200                |
| qwen3.6-plus           | agent    | 100.000          | -137.716 | 1.000             | 100.000               | 107.200           | 92.811    | 1.000              | 114.000                |
| qwen3.6-plus           | agent    | -                | -        | -                 | -                     | -                 | -         | -                  | -                      |

Agent Conversations