# rl-intrinsic-exploration

## Description

RL Intrinsic Exploration: Sparse-Reward Novelty Bonus Design

## Research Question
Design an intrinsic exploration mechanism that improves sparse-reward discovery in hard-exploration Atari environments.
## Background
In sparse-reward reinforcement learning, external rewards arrive too infrequently for vanilla policy optimization to learn efficiently. A common solution is to add an intrinsic reward that encourages novelty, surprise, or state-space coverage.
This task isolates that question. The PPO training loop, Atari preprocessing, policy/value architecture, and optimization setup are fixed. The only thing you should redesign is the intrinsic bonus module and how its signal is mixed into learning.
Reference algorithm families include:
- No bonus / vanilla PPO: learns only from clipped extrinsic reward
- RND: rewards prediction error against a fixed random target network
- ICM: rewards forward-dynamics prediction error in learned feature space
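To make the RND family concrete, here is a minimal NumPy sketch. The dimensions, learning rate, and linear "networks" are illustrative stand-ins, not the benchmark's actual architecture: the bonus is the predictor's squared error against a frozen random target, and training the predictor on visited states drives their bonus toward zero, so only novel states keep paying out.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, FEAT_DIM = 16, 8

# Frozen random "target network": a fixed projection the predictor must match.
W_target = rng.normal(size=(OBS_DIM, FEAT_DIM))
# Trainable predictor, initialized differently from the target.
W_pred = rng.normal(size=(OBS_DIM, FEAT_DIM))

def rnd_bonus(obs):
    """Intrinsic reward: per-state squared prediction error vs. the frozen target."""
    err = obs @ W_pred - obs @ W_target
    return (err ** 2).mean(axis=-1)

def rnd_update(obs, lr=1e-2):
    """One SGD step on the predictor over a batch of visited states."""
    global W_pred
    err = obs @ W_pred - obs @ W_target                      # (batch, FEAT_DIM)
    W_pred -= lr * (2.0 / (obs.shape[0] * FEAT_DIM)) * (obs.T @ err)

seen = rng.normal(size=(64, OBS_DIM))   # states the agent visits repeatedly
before = rnd_bonus(seen).mean()
for _ in range(200):
    rnd_update(seen)
after = rnd_bonus(seen).mean()
# after < before: familiar states stop paying out, nudging the policy elsewhere
```

ICM differs in that the target is a learned forward-dynamics model in a learned feature space rather than a fixed random projection, but the shape of the signal (prediction error as bonus) is the same.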
## Task

Modify the editable section of `custom_intrinsic_exploration.py`:

- `IntrinsicBonusModule`: define how intrinsic rewards are computed and trained
- `mix_advantages(...)`: define how extrinsic and intrinsic advantages are combined

The editable code must keep the public interface intact:

- `initialize(envs)`
- `trainable_parameters()`
- `update_batch_stats(batch_obs, batch_next_obs)`
- `compute_bonus(obs, next_obs, actions)`
- `normalize_rollout_rewards(rollout_intrinsic)`
- `loss(batch_obs, batch_next_obs, batch_actions)`
- `mix_advantages(ext_advantages, int_advantages, args)`
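A minimal skeleton satisfying this interface might look as follows. The bonus shown (a count-based novelty signal over hashed observations) and the attributes of `args` are illustrative assumptions, not the benchmark's reference implementation; a real submission would replace the bonus with RND/ICM-style machinery.

```python
import numpy as np

class IntrinsicBonusModule:
    """Sketch of the required interface with a toy count-based novelty bonus."""

    def initialize(self, envs):
        self.counts = {}          # visit counts keyed by hashed observation
        self.running_std = 1.0    # running scale for reward normalization

    def trainable_parameters(self):
        return []                 # a pure count bonus has no learned parameters

    def update_batch_stats(self, batch_obs, batch_next_obs):
        for ob in batch_next_obs:
            key = hash(np.asarray(ob).tobytes())
            self.counts[key] = self.counts.get(key, 0) + 1

    def compute_bonus(self, obs, next_obs, actions):
        # Rarely visited successor states earn a larger bonus.
        keys = [hash(np.asarray(ob).tobytes()) for ob in next_obs]
        return np.array([1.0 / np.sqrt(self.counts.get(k, 0) + 1) for k in keys])

    def normalize_rollout_rewards(self, rollout_intrinsic):
        r = np.asarray(rollout_intrinsic, dtype=np.float64)
        self.running_std = 0.99 * self.running_std + 0.01 * (r.std() + 1e-8)
        return r / self.running_std

    def loss(self, batch_obs, batch_next_obs, batch_actions):
        return 0.0                # nothing to train for a count-based bonus

def mix_advantages(ext_advantages, int_advantages, args):
    """Fixed-weight mixing; `args` is assumed to carry the two coefficients."""
    ext_coef = getattr(args, "ext_coef", 2.0)
    int_coef = getattr(args, "int_coef", 1.0)
    return ext_coef * np.asarray(ext_advantages) + int_coef * np.asarray(int_advantages)
```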
## Evaluation
The agent is trained with the same fixed PPO-style loop on three sparse-reward Atari environments:
- Tutankham-v5
- Frostbite-v5
- PrivateEye-v5
Tutankham-v5 and Frostbite-v5 are visible during development. PrivateEye-v5 is held out as the hidden evaluation environment.
Metrics:

- `eval_return`: mean evaluation episodic return at a fixed training budget
- `auc`: area under the evaluation-return curve across training
- `nonzero_rate`: fraction of evaluation episodes with non-zero episodic return
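The metrics can be computed from an evaluation log like so. The normalization of `auc` by the step budget and the sample numbers are illustrative assumptions; the benchmark's exact conventions may differ.

```python
import numpy as np

# Hypothetical evaluation log: (env step, mean eval return) checkpoints.
steps   = np.array([0, 1_000_000, 2_000_000, 3_000_000, 4_000_000], dtype=float)
returns = np.array([-300.0, -120.0, 0.0, 250.0, 400.0])

# auc: trapezoidal area under the return curve, normalized by the step budget,
# so it rewards reaching good returns early rather than only at the end.
auc = np.sum((returns[1:] + returns[:-1]) / 2.0 * np.diff(steps)) / steps[-1]

# nonzero_rate over a batch of evaluation episode returns: did the agent
# discover any reward at all in each episode?
episode_returns = np.array([0.0, 100.0, 0.0, 400.0, 0.0])
nonzero_rate = np.mean(episode_returns != 0.0)
```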
Evaluation uses deterministic rollouts with a fixed per-episode step cap so a non-terminating Atari behavior cannot stall the benchmark.
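The step cap can be sketched as below; the toy environment and function names are hypothetical, and serve only to show how the cap bounds an episode that never terminates.

```python
class ToyEnv:
    """Toy stand-in environment that never sets done=True, to exercise the cap."""
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        return self.t, 1.0, False      # obs, reward, done

def evaluate_episode(env, policy, max_steps=100):
    """Deterministic rollout with a hard per-episode step cap."""
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):          # a stalling game cannot hang evaluation
        obs, reward, done = env.step(policy(obs))
        total += reward
        if done:
            break
    return total

# The cap bounds the episode even though ToyEnv never terminates.
ret = evaluate_episode(ToyEnv(), policy=lambda obs: 0)
# ret == 100.0
```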
Improvements must transfer across multiple games; a method that helps only one environment is insufficient. Tutankham-v5 is included as a medium-difficulty visible environment so that baseline rankings are measurable at a modest training budget, before transfer is checked on the harder visible and hidden games.
## Code

```python
# Custom sparse-reward Atari exploration benchmark for MLS-Bench.
#
# FIXED sections: PPO loop, Atari preprocessing, policy/value architecture,
# evaluation, logging, and optimizer wiring.
# EDITABLE section: IntrinsicBonusModule + mix_advantages.

from __future__ import annotations

import os
import random
import time
from collections import deque
from dataclasses import dataclass

import envpool
```
Additional context files (read-only):

- `cleanrl/cleanrl/ppo_rnd_envpool.py`
- `cleanrl/cleanrl/ppo_atari_envpool.py`
## Results
| Model | Type | eval return private eye v5 ↑ | auc private eye v5 ↑ | nonzero rate private eye v5 ↑ | best eval return private eye v5 ↑ | eval return tutankham v5 ↑ | auc tutankham v5 ↑ | nonzero rate tutankham v5 ↑ | best eval return tutankham v5 ↑ |
|---|---|---|---|---|---|---|---|---|---|
| icm | baseline | 0.000 | -29.291 | 0.000 | 0.000 | 109.000 | 86.635 | 1.000 | 111.000 |
| icm | baseline | 0.000 | -29.291 | 0.000 | 0.000 | 109.000 | 86.635 | 1.000 | 111.000 |
| icm | baseline | - | - | - | - | - | - | - | - |
| ppo | baseline | -300.000 | -240.105 | 0.667 | 33.333 | 36.533 | 32.367 | 0.333 | 37.800 |
| ppo | baseline | -300.000 | -240.105 | 0.667 | 33.333 | 36.533 | 32.367 | 0.333 | 37.800 |
| ppo | baseline | - | - | - | - | - | - | - | - |
| rnd | baseline | 952.000 | -84.230 | 0.667 | 1400.000 | 100.333 | 68.382 | 0.933 | 109.133 |
| rnd | baseline | 952.000 | -84.230 | 0.667 | 1400.000 | 100.333 | 68.382 | 0.933 | 109.133 |
| rnd | baseline | - | - | - | - | - | - | - | - |
| claude-opus-4.6 | vanilla | - | - | - | - | - | - | - | - |
| deepseek-reasoner | vanilla | - | - | - | - | - | - | - | - |
| deepseek-reasoner | vanilla | - | - | - | - | - | - | - | - |
| deepseek-reasoner | vanilla | - | - | - | - | - | - | - | - |
| gemini-3.1-pro-preview | vanilla | - | - | - | - | - | - | - | - |
| qwen3.6-plus | vanilla | - | - | - | - | - | - | - | - |
| qwen3.6-plus | vanilla | - | - | - | - | - | - | - | - |
| qwen3.6-plus | vanilla | - | - | - | - | - | - | - | - |
| claude-opus-4.6 | agent | 0.000 | -292.920 | 0.000 | 100.000 | 104.600 | 31.850 | 1.000 | 105.000 |
| deepseek-reasoner | agent | - | - | - | - | - | - | - | - |
| deepseek-reasoner | agent | - | - | - | - | - | - | - | - |
| deepseek-reasoner | agent | -22.400 | -610.786 | 0.200 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| deepseek-reasoner | agent | - | - | - | - | - | - | - | - |
| gemini-3.1-pro-preview | agent | 100.000 | 16.419 | 1.000 | 1020.000 | 114.600 | 104.677 | 1.000 | 118.200 |
| qwen3.6-plus | agent | - | - | - | - | - | - | - | - |
| qwen3.6-plus | agent | - | - | - | - | - | - | - | - |
| qwen3.6-plus | agent | 100.000 | -137.716 | 1.000 | 100.000 | 107.200 | 92.811 | 1.000 | 114.000 |
| qwen3.6-plus | agent | - | - | - | - | - | - | - | - |