Agent Conversation: claude-opus-4.6
Task: rl-intrinsic-exploration
RL Intrinsic Exploration: Sparse-Reward Novelty Bonus Design
Research Question
Design an intrinsic exploration mechanism that improves sparse-reward discovery in hard-exploration Atari environments.
Background
In sparse-reward reinforcement learning, external rewards arrive too infrequently for vanilla policy optimization to learn efficiently. A common solution is to add an intrinsic reward that encourages novelty, surprise, or state-space coverage.
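As a minimal illustration of the coverage idea (hypothetical names, not part of the benchmark's API), the simplest such bonus rewards a state inversely to how often it has been visited:

```python
from collections import defaultdict
import math

# Hypothetical count-based novelty bonus (illustrative only): rarely visited
# states receive a large bonus, frequently visited states a small one.
visit_counts = defaultdict(int)

def count_based_bonus(state_key) -> float:
    visit_counts[state_key] += 1
    return 1.0 / math.sqrt(visit_counts[state_key])

# Usage sketch: the learning signal mixes extrinsic reward with the bonus,
# e.g. total_reward = extrinsic_reward + beta * count_based_bonus(obs_hash)
```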
This task isolates that question. The PPO training loop, Atari preprocessing, policy/value architecture, and optimization setup are fixed. The only thing you should redesign is the intrinsic bonus module and how its signal is mixed into learning.
Reference algorithm families include:
- No bonus / vanilla PPO: learns only from clipped extrinsic reward
- RND: rewards prediction error against a fixed random target network (a minimal sketch follows this list)
- ICM: rewards forward-dynamics prediction error in learned feature space
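To make the RND family concrete, here is a minimal sketch. The shapes and architecture are illustrative assumptions, not the benchmark's fixed networks: a frozen, randomly initialized target defines features, and a trainable predictor chases them; prediction error stays high on rarely seen observations and serves as the novelty bonus.

```python
import torch
import torch.nn as nn

obs_dim, feat_dim = 64, 32  # illustrative sizes, not the benchmark's

target = nn.Linear(obs_dim, feat_dim)
for p in target.parameters():
    p.requires_grad = False  # the target network is never trained

predictor = nn.Linear(obs_dim, feat_dim)  # trained to imitate the target

def rnd_bonus(obs: torch.Tensor) -> torch.Tensor:
    # Per-sample squared prediction error; detached so it acts as a reward
    # signal rather than a differentiable loss term.
    return (target(obs) - predictor(obs)).pow(2).mean(dim=-1).detach()
```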
Task
Modify the editable section of custom_intrinsic_exploration.py:
- IntrinsicBonusModule: define how intrinsic rewards are computed and trained
- mix_advantages(...): define how extrinsic and intrinsic advantages are combined
The editable code must keep the public interface intact (a minimal skeleton is sketched after the list):
- initialize(envs)
- trainable_parameters()
- update_batch_stats(batch_obs, batch_next_obs)
- compute_bonus(obs, next_obs, actions)
- normalize_rollout_rewards(rollout_intrinsic)
- loss(batch_obs, batch_next_obs, batch_actions)
- mix_advantages(ext_advantages, int_advantages, args)
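The signatures above come from the task description; the skeleton below is illustrative only, with placeholder bodies showing where each piece of logic lives:

```python
import torch
import torch.nn as nn

class IntrinsicBonusModule(nn.Module):
    def initialize(self, envs) -> None:
        ...  # e.g. bootstrap observation-normalization statistics

    def trainable_parameters(self):
        ...  # parameters the fixed optimizer should update

    def update_batch_stats(self, batch_obs, batch_next_obs) -> None:
        ...  # refresh running statistics from a training batch

    def compute_bonus(self, obs, next_obs, actions) -> torch.Tensor:
        ...  # per-transition intrinsic reward

    def normalize_rollout_rewards(self, rollout_intrinsic) -> torch.Tensor:
        ...  # rescale a rollout of raw bonuses

    def loss(self, batch_obs, batch_next_obs, batch_actions) -> torch.Tensor:
        ...  # auxiliary loss that trains the bonus module

def mix_advantages(ext_advantages, int_advantages, args) -> torch.Tensor:
    ...  # combine extrinsic and intrinsic advantages
```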
Evaluation
The agent is trained with the same fixed PPO-style loop on three sparse-reward Atari environments:
- Tutankham-v5
- Frostbite-v5
- PrivateEye-v5
Tutankham-v5 and Frostbite-v5 are visible during development. PrivateEye-v5 is held out as the hidden evaluation environment.
Metrics:
- eval_return: mean evaluation episodic return at a fixed training budget
- auc: area under the evaluation-return curve across training (a plausible computation is sketched after this list)
- nonzero_rate: fraction of evaluation episodes with non-zero episodic return
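The benchmark's exact auc formula is not shown here; one plausible reading (an assumption, not confirmed by the task text) is the trapezoidal area under the eval-return curve, normalized by the training budget:

```python
import numpy as np

# Assumption: auc = trapezoidal area under the eval-return curve, divided by
# the step span so the result is on the same scale as a mean return.
def eval_auc(steps: np.ndarray, eval_returns: np.ndarray) -> float:
    return float(np.trapz(eval_returns, steps) / (steps[-1] - steps[0]))
```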
Evaluation uses deterministic rollouts with a fixed per-episode step cap so a non-terminating Atari behavior cannot stall the benchmark.
Improvement must transfer across multiple games; a method that only helps one environment is not sufficient. Tutankham-v5 is included as a medium-difficulty visible environment so baseline ranking is measurable at a modest training budget before transfer is checked on the harder visible and hidden games.
cleanrl/cleanrl/custom_intrinsic_exploration.py [EDITABLE — lines 179–219 only]
1: # Custom sparse-reward Atari exploration benchmark for MLS-Bench.
2: #
3: # FIXED sections: PPO loop, Atari preprocessing, policy/value architecture,
4: # evaluation, logging, and optimizer wiring.
5: # EDITABLE section: IntrinsicBonusModule + mix_advantages.
6:
7: from __fu… (file preview truncated)

Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_intrinsic_exploration.py",
4 "start_line": 179,
5 "end_line": 219,
6 "content": "# =====================================================================\nclass IntrinsicBonusModule(nn.Module):\n \"\"\"Dual-Signal Intrinsic Motivation: RND + Forward Dynamics hybrid.\n\n Combines state-level novelty (RND prediction error against a frozen random\n target) with transition-level novelty (action-conditioned forward dynamics\n prediction error in a learned feature space). The two signals are\n complementary: RND excels at long-horizon exploration (PrivateEye) while\n forward dynamics captures action-dependent surprises (Frostbite). Both\n bonuses are summed after shared reward normalization.\n \"\"\"\n\n def __init__(self, action_dim: int, device: torch.device, args: Args):\n super().__init__()\n self.action_dim = action_dim\n self.device = device\n self.args = args\n self.obs_rms = RunningMeanStd(shape=(1, 1, 84, 84))\n self.reward_rms = RunningMeanStd()\n self.discounted_reward = RewardForwardFilter(args.int_gamma)\n\n feature_output = 7 * 7 * 64\n\n # ---------- RND component ----------\n self.rnd_target = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.LeakyReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 512)),\n )\n for param in self.rnd_target.parameters():\n param.requires_grad = False\n\n self.rnd_predictor = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.LeakyReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 512)),\n )\n\n # ---------- Forward-dynamics component ----------\n # Shared encoder for obs/next_obs feature extraction\n self.fd_encoder = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.ReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.ReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.ReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 256)),\n nn.ReLU(),\n )\n # Inverse model regularises the encoder to capture controllable aspects\n self.inverse_model = nn.Sequential(\n layer_init(nn.Linear(512, 256)),\n nn.ReLU(),\n layer_init(nn.Linear(256, action_dim), std=0.01),\n )\n # Forward model predicts next-state features from (state_feat, action)\n self.forward_model = nn.Sequential(\n layer_init(nn.Linear(256 + action_dim, 256)),\n nn.ReLU(),\n layer_init(nn.Linear(256, 256)),\n )\n\n # Relative weight of forward-dynamics bonus vs RND bonus\n self.fd_weight = 0.5\n\n # ------------------------------------------------------------------\n def initialize(self, envs) -> None:\n bootstrap = []\n total_steps = self.args.num_steps * self.args.num_iterations_obs_norm_init\n for _ in range(total_steps):\n random_actions = np.random.randint(\n 0, envs.single_action_space.n, size=(self.args.num_envs,)\n )\n sampled_obs, _, _, _ = envs.step(random_actions)\n bootstrap.append(sampled_obs[:, 3:4, :, :])\n if len(bootstrap) >= self.args.num_steps:\n stacked = np.concatenate(bootstrap, axis=0)\n self.obs_rms.update(stacked)\n bootstrap.clear()\n\n def trainable_parameters(self):\n return (\n list(self.rnd_predictor.parameters())\n + list(self.fd_encoder.parameters())\n + list(self.inverse_model.parameters())\n + list(self.forward_model.parameters())\n )\n\n # 
------------------------------------------------------------------\n def _normalize_obs(self, obs: torch.Tensor) -> torch.Tensor:\n mean = torch.from_numpy(self.obs_rms.mean).to(self.device)\n var = torch.from_numpy(self.obs_rms.var).to(self.device)\n return ((last_frame(obs) - mean) / torch.sqrt(var)).clip(-5, 5).float()\n\n def _one_hot(self, actions: torch.Tensor) -> torch.Tensor:\n return F.one_hot(actions.long(), num_classes=self.action_dim).float()\n\n def update_batch_stats(self, batch_obs: torch.Tensor, batch_next_obs: torch.Tensor) -> None:\n self.obs_rms.update(last_frame(batch_next_obs).cpu().numpy())\n\n # ------------------------------------------------------------------\n def compute_bonus(\n self,\n obs: torch.Tensor,\n next_obs: torch.Tensor,\n actions: torch.Tensor,\n ) -> torch.Tensor:\n norm_next = self._normalize_obs(next_obs)\n norm_curr = self._normalize_obs(obs)\n\n # RND bonus: prediction error on next-state features\n target_feat = self.rnd_target(norm_next)\n pred_feat = self.rnd_predictor(norm_next)\n rnd_bonus = (target_feat - pred_feat).pow(2).sum(1) / 2\n\n # Forward-dynamics bonus: prediction error of next-state latent\n curr_z = self.fd_encoder(norm_curr)\n next_z = self.fd_encoder(norm_next)\n pred_next_z = self.forward_model(\n torch.cat([curr_z, self._one_hot(actions)], dim=1)\n )\n fd_bonus = 0.5 * (pred_next_z - next_z).pow(2).mean(dim=1)\n\n return (rnd_bonus + self.fd_weight * fd_bonus).detach()\n\n # ------------------------------------------------------------------\n def normalize_rollout_rewards(self, rollout_intrinsic: torch.Tensor) -> torch.Tensor:\n discounted = np.stack(\n [\n self.discounted_reward.update(rps)\n for rps in rollout_intrinsic.cpu().numpy()\n ],\n axis=0,\n )\n flat = discounted.reshape(-1)\n self.reward_rms.update_from_moments(\n float(flat.mean()), float(flat.var()), int(flat.size)\n )\n return rollout_intrinsic / float(np.sqrt(self.reward_rms.var + 1e-8))\n\n # ------------------------------------------------------------------\n def loss(\n self,\n batch_obs: torch.Tensor,\n batch_next_obs: torch.Tensor,\n batch_actions: torch.Tensor,\n ) -> torch.Tensor:\n norm_next = self._normalize_obs(batch_next_obs)\n norm_curr = self._normalize_obs(batch_obs)\n\n # RND loss (with subsampling mask, same as reference RND)\n pred_feat = self.rnd_predictor(norm_next)\n target_feat = self.rnd_target(norm_next).detach()\n rnd_per = F.mse_loss(pred_feat, target_feat, reduction=\"none\").mean(-1)\n mask = (\n torch.rand(len(rnd_per), device=self.device) < self.args.update_proportion\n ).float()\n rnd_loss = (rnd_per * mask).sum() / torch.clamp(mask.sum(), min=1.0)\n\n # Forward / inverse dynamics loss\n curr_z = self.fd_encoder(norm_curr)\n next_z = self.fd_encoder(norm_next)\n pred_next_z = self.forward_model(\n torch.cat([curr_z, self._one_hot(batch_actions)], dim=1)\n )\n forward_loss = 0.5 * (pred_next_z - next_z.detach()).pow(2).mean()\n pred_action = self.inverse_model(torch.cat([curr_z, next_z], dim=1))\n inverse_loss = F.cross_entropy(pred_action, batch_actions.long())\n\n return rnd_loss + inverse_loss + 0.2 * forward_loss\n\n\ndef mix_advantages(\n ext_advantages: torch.Tensor,\n int_advantages: torch.Tensor,\n args: Args,\n) -> torch.Tensor:\n return args.ext_coef * ext_advantages + args.int_coef * int_advantages\n\n"
7}

OK: Replaced lines 179..219 with 195 line(s) in cleanrl/cleanrl/custom_intrinsic_exploration.py. Editable range: 179–373.
[Current file: cleanrl/cleanrl/custom_intrinsic_exploration.py | editable: 179–373 | total: 646 lines]
179 # =====================================================================
180 class IntrinsicBonusModule(nn.Module):
181 """Dual-Signal Intrinsic Motivation: RND + Forward Dynamics hybrid.
... (189 more lines) ...
371 ) -> torch.Tensor:
372 return args.ext_coef * ext_advantages + args.int_coef * int_advantages
373

[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)

### tutankham-v5 (scripts/tutankham.sh)
Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=139.9000 avg_intrinsic=0.003319 eval_return=nan nonzero_rate=nan sps=1136
TRAIN_METRICS step=9986048 avg_return=139.9000 avg_intrinsic=0.003593 eval_return=nan nonzero_rate=nan sps=1136
TRAIN_METRICS step=9990144 avg_return=139.0000 avg_intrinsic=0.003790 eval_return=nan nonzero_rate=nan sps=1136
TRAIN_METRICS step=9994240 avg_return=140.9500 avg_intrinsic=0.003254 eval_return=nan nonzero_rate=nan sps=1136
TRAIN_METRICS step=9998336 avg_return=140.9500 avg_intrinsic=0.003459 eval_return=104.6000 nonzero_rate=1.0000 sps=1110
Final evaluation:
TEST_METRICS eval_return=104.6000 auc=31.850061 nonzero_rate=1.0000 best_eval_return=105.0000

### frostbite-v5 (scripts/frostbite.sh)
Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=3377.5000 avg_intrinsic=0.019090 eval_retur
Tool Input
1{
2 "content": "# =====================================================================\nclass IntrinsicBonusModule(nn.Module):\n \"\"\"Dual-Signal Intrinsic Motivation (DSIM-v2): RND + Forward Dynamics hybrid.\n\n Combines state-level novelty (RND) with transition-level novelty (forward\n dynamics). Key improvement over v1: each bonus signal is normalized\n independently before combining, preventing the forward-dynamics signal\n (which can be noisy early on) from drowning out the RND signal that drives\n long-horizon exploration. The fd_weight is kept modest so RND dominates in\n hard-exploration games like PrivateEye while the forward component provides\n an additive boost in games like Frostbite where action-dependent surprises\n matter.\n \"\"\"\n\n def __init__(self, action_dim: int, device: torch.device, args: Args):\n super().__init__()\n self.action_dim = action_dim\n self.device = device\n self.args = args\n self.obs_rms = RunningMeanStd(shape=(1, 1, 84, 84))\n # Separate reward normalization for each signal\n self.rnd_reward_rms = RunningMeanStd()\n self.fd_reward_rms = RunningMeanStd()\n self.rnd_discounted = RewardForwardFilter(args.int_gamma)\n self.fd_discounted = RewardForwardFilter(args.int_gamma)\n\n feature_output = 7 * 7 * 64\n\n # ---------- RND component ----------\n self.rnd_target = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.LeakyReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 512)),\n )\n for param in self.rnd_target.parameters():\n param.requires_grad = False\n\n self.rnd_predictor = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.LeakyReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 512)),\n )\n\n # ---------- Forward-dynamics component ----------\n self.fd_encoder = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.ReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.ReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.ReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 256)),\n nn.ReLU(),\n )\n self.inverse_model = nn.Sequential(\n layer_init(nn.Linear(512, 256)),\n nn.ReLU(),\n layer_init(nn.Linear(256, action_dim), std=0.01),\n )\n self.forward_model = nn.Sequential(\n layer_init(nn.Linear(256 + action_dim, 256)),\n nn.ReLU(),\n layer_init(nn.Linear(256, 256)),\n )\n\n # Relative weight: after each signal is independently normalised,\n # fd_weight controls how much forward-dynamics contributes.\n # Kept modest (0.3) so RND dominates in hard-exploration games.\n self.fd_weight = 0.3\n\n # Buffers to store per-step raw bonuses for separate normalization\n self._rnd_rollout = None\n self._fd_rollout = None\n\n # ------------------------------------------------------------------\n def initialize(self, envs) -> None:\n bootstrap = []\n total_steps = self.args.num_steps * self.args.num_iterations_obs_norm_init\n for _ in range(total_steps):\n random_actions = np.random.randint(\n 0, envs.single_action_space.n, size=(self.args.num_envs,)\n )\n sampled_obs, _, _, _ = envs.step(random_actions)\n bootstrap.append(sampled_obs[:, 3:4, :, :])\n if len(bootstrap) >= self.args.num_steps:\n stacked = np.concatenate(bootstrap, 
axis=0)\n self.obs_rms.update(stacked)\n bootstrap.clear()\n\n def trainable_parameters(self):\n return (\n list(self.rnd_predictor.parameters())\n + list(self.fd_encoder.parameters())\n + list(self.inverse_model.parameters())\n + list(self.forward_model.parameters())\n )\n\n # ------------------------------------------------------------------\n def _normalize_obs(self, obs: torch.Tensor) -> torch.Tensor:\n mean = torch.from_numpy(self.obs_rms.mean).to(self.device)\n var = torch.from_numpy(self.obs_rms.var).to(self.device)\n return ((last_frame(obs) - mean) / torch.sqrt(var)).clip(-5, 5).float()\n\n def _one_hot(self, actions: torch.Tensor) -> torch.Tensor:\n return F.one_hot(actions.long(), num_classes=self.action_dim).float()\n\n def update_batch_stats(self, batch_obs: torch.Tensor, batch_next_obs: torch.Tensor) -> None:\n self.obs_rms.update(last_frame(batch_next_obs).cpu().numpy())\n\n # ------------------------------------------------------------------\n def compute_bonus(\n self,\n obs: torch.Tensor,\n next_obs: torch.Tensor,\n actions: torch.Tensor,\n ) -> torch.Tensor:\n norm_next = self._normalize_obs(next_obs)\n norm_curr = self._normalize_obs(obs)\n\n # RND bonus\n target_feat = self.rnd_target(norm_next)\n pred_feat = self.rnd_predictor(norm_next)\n rnd_bonus = (target_feat - pred_feat).pow(2).sum(1) / 2\n\n # Forward-dynamics bonus\n curr_z = self.fd_encoder(norm_curr)\n next_z = self.fd_encoder(norm_next)\n pred_next_z = self.forward_model(\n torch.cat([curr_z, self._one_hot(actions)], dim=1)\n )\n fd_bonus = 0.5 * (pred_next_z - next_z).pow(2).mean(dim=1)\n\n # Return sum; we store components for separate normalisation later\n # The caller stores per-step results into int_rewards; we piggy-back\n # by returning combined raw bonus and keeping component buffers.\n # However, the training loop only calls compute_bonus once per step\n # and then normalize_rollout_rewards at the end of the rollout.\n # We cache per-step components.\n if self._rnd_rollout is None:\n self._rnd_rollout = []\n self._fd_rollout = []\n self._rnd_rollout.append(rnd_bonus.detach())\n self._fd_rollout.append(fd_bonus.detach())\n\n return (rnd_bonus + fd_bonus).detach() # raw combined (will be replaced in normalize)\n\n # ------------------------------------------------------------------\n def normalize_rollout_rewards(self, rollout_intrinsic: torch.Tensor) -> torch.Tensor:\n \"\"\"Independently normalise RND and FD bonuses then combine.\"\"\"\n num_steps = rollout_intrinsic.shape[0]\n num_envs = rollout_intrinsic.shape[1]\n\n # Reconstruct per-step tensors from cached components\n if self._rnd_rollout is not None and len(self._rnd_rollout) == num_steps:\n rnd_rollout = torch.stack(self._rnd_rollout, dim=0) # (T, E)\n fd_rollout = torch.stack(self._fd_rollout, dim=0)\n else:\n # Fallback: treat entire rollout_intrinsic as RND-only\n rnd_rollout = rollout_intrinsic\n fd_rollout = torch.zeros_like(rollout_intrinsic)\n\n # Clear cache for next rollout\n self._rnd_rollout = None\n self._fd_rollout = None\n\n # Normalize RND rewards\n rnd_disc = np.stack(\n [self.rnd_discounted.update(rps) for rps in rnd_rollout.cpu().numpy()],\n axis=0,\n )\n flat_rnd = rnd_disc.reshape(-1)\n self.rnd_reward_rms.update_from_moments(\n float(flat_rnd.mean()), float(flat_rnd.var()), int(flat_rnd.size)\n )\n rnd_normalized = rnd_rollout / float(np.sqrt(self.rnd_reward_rms.var + 1e-8))\n\n # Normalize FD rewards\n fd_disc = np.stack(\n [self.fd_discounted.update(rps) for rps in fd_rollout.cpu().numpy()],\n axis=0,\n )\n flat_fd = 
fd_disc.reshape(-1)\n self.fd_reward_rms.update_from_moments(\n float(flat_fd.mean()), float(flat_fd.var()), int(flat_fd.size)\n )\n fd_normalized = fd_rollout / float(np.sqrt(self.fd_reward_rms.var + 1e-8))\n\n # Combine with weighting: RND is primary, FD is supplementary\n return rnd_normalized + self.fd_weight * fd_normalized\n\n # ------------------------------------------------------------------\n def loss(\n self,\n batch_obs: torch.Tensor,\n batch_next_obs: torch.Tensor,\n batch_actions: torch.Tensor,\n ) -> torch.Tensor:\n norm_next = self._normalize_obs(batch_next_obs)\n norm_curr = self._normalize_obs(batch_obs)\n\n # RND loss (with subsampling mask)\n pred_feat = self.rnd_predictor(norm_next)\n target_feat = self.rnd_target(norm_next).detach()\n rnd_per = F.mse_loss(pred_feat, target_feat, reduction=\"none\").mean(-1)\n mask = (\n torch.rand(len(rnd_per), device=self.device) < self.args.update_proportion\n ).float()\n rnd_loss = (rnd_per * mask).sum() / torch.clamp(mask.sum(), min=1.0)\n\n # Forward / inverse dynamics loss\n curr_z = self.fd_encoder(norm_curr)\n next_z = self.fd_encoder(norm_next)\n pred_next_z = self.forward_model(\n torch.cat([curr_z, self._one_hot(batch_actions)], dim=1)\n )\n forward_loss = 0.5 * (pred_next_z - next_z.detach()).pow(2).mean()\n pred_action = self.inverse_model(torch.cat([curr_z, next_z], dim=1))\n inverse_loss = F.cross_entropy(pred_action, batch_actions.long())\n\n return rnd_loss + inverse_loss + 0.2 * forward_loss\n\n\ndef mix_advantages(\n ext_advantages: torch.Tensor,\n\n... (truncated)",
3 "end_line": 373,
4 "filename": "cleanrl/cleanrl/custom_intrinsic_exploration.py",
5 "op": "replace",
6 "start_line": 179
7}

OK: Replaced lines 179..373 with 244 line(s) in cleanrl/cleanrl/custom_intrinsic_exploration.py. Editable range: 179–422.
[Current file: cleanrl/cleanrl/custom_intrinsic_exploration.py | editable: 179–422 | total: 695 lines]
179 # =====================================================================
180 class IntrinsicBonusModule(nn.Module):
181 """Dual-Signal Intrinsic Motivation (DSIM-v2): RND + Forward Dynamics hybrid.
... (238 more lines) ...
420 ) -> torch.Tensor:
421 return args.ext_coef * ext_advantages + args.int_coef * int_advantages
422

[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final)

### tutankham-v5 (scripts/tutankham.sh)
Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=103.7500 avg_intrinsic=0.008161 eval_return=nan nonzero_rate=nan sps=1888
TRAIN_METRICS step=9986048 avg_return=103.8000 avg_intrinsic=0.008111 eval_return=nan nonzero_rate=nan sps=1888
TRAIN_METRICS step=9990144 avg_return=104.2000 avg_intrinsic=0.008478 eval_return=nan nonzero_rate=nan sps=1889
TRAIN_METRICS step=9994240 avg_return=104.0500 avg_intrinsic=0.008832 eval_return=nan nonzero_rate=nan sps=1889
TRAIN_METRICS step=9998336 avg_return=103.8000 avg_intrinsic=0.008943 eval_return=107.6000 nonzero_rate=1.0000 sps=1882
Final evaluation:
TEST_METRICS eval_return=107.6000 auc=97.740142 nonzero_rate=1.0000 best_eval_return=109.0000

### frostbite-v5 (scripts/frostbite.sh)
Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=288.5000 avg_intrinsic=0.020536 eval_return=
Tool Input
1{
2 "content": "# =====================================================================\nclass IntrinsicBonusModule(nn.Module):\n \"\"\"Dual-Signal Intrinsic Motivation: RND + Forward Dynamics hybrid.\n\n Combines state-level novelty (RND) with transition-level novelty (forward\n dynamics). RND drives long-horizon exploration (PrivateEye) while forward\n dynamics boosts action-dependent surprise (Frostbite/Tutankham).\n Single-stream reward normalization keeps the implementation simple and\n stable. The FD component uses a modest weight so it enhances but does not\n overwhelm the RND signal.\n \"\"\"\n\n def __init__(self, action_dim: int, device: torch.device, args: Args):\n super().__init__()\n self.action_dim = action_dim\n self.device = device\n self.args = args\n self.obs_rms = RunningMeanStd(shape=(1, 1, 84, 84))\n self.reward_rms = RunningMeanStd()\n self.discounted_reward = RewardForwardFilter(args.int_gamma)\n\n feature_output = 7 * 7 * 64\n\n # ---------- RND component ----------\n self.rnd_target = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.LeakyReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 512)),\n )\n for param in self.rnd_target.parameters():\n param.requires_grad = False\n\n self.rnd_predictor = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.LeakyReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 512)),\n )\n\n # ---------- Forward-dynamics component ----------\n self.fd_encoder = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.ReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.ReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.ReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 256)),\n nn.ReLU(),\n )\n self.inverse_model = nn.Sequential(\n layer_init(nn.Linear(512, 256)),\n nn.ReLU(),\n layer_init(nn.Linear(256, action_dim), std=0.01),\n )\n self.forward_model = nn.Sequential(\n layer_init(nn.Linear(256 + action_dim, 256)),\n nn.ReLU(),\n layer_init(nn.Linear(256, 256)),\n )\n\n # FD bonus weight relative to RND (kept small to avoid overwhelming RND)\n self.fd_weight = 0.3\n\n # ------------------------------------------------------------------\n def initialize(self, envs) -> None:\n bootstrap = []\n total_steps = self.args.num_steps * self.args.num_iterations_obs_norm_init\n for _ in range(total_steps):\n random_actions = np.random.randint(\n 0, envs.single_action_space.n, size=(self.args.num_envs,)\n )\n sampled_obs, _, _, _ = envs.step(random_actions)\n bootstrap.append(sampled_obs[:, 3:4, :, :])\n if len(bootstrap) >= self.args.num_steps:\n stacked = np.concatenate(bootstrap, axis=0)\n self.obs_rms.update(stacked)\n bootstrap.clear()\n\n def trainable_parameters(self):\n return (\n list(self.rnd_predictor.parameters())\n + list(self.fd_encoder.parameters())\n + list(self.inverse_model.parameters())\n + list(self.forward_model.parameters())\n )\n\n # ------------------------------------------------------------------\n def _normalize_obs(self, obs: torch.Tensor) -> torch.Tensor:\n mean = torch.from_numpy(self.obs_rms.mean).to(self.device)\n var = torch.from_numpy(self.obs_rms.var).to(self.device)\n return 
((last_frame(obs) - mean) / torch.sqrt(var)).clip(-5, 5).float()\n\n def _one_hot(self, actions: torch.Tensor) -> torch.Tensor:\n return F.one_hot(actions.long(), num_classes=self.action_dim).float()\n\n def update_batch_stats(self, batch_obs: torch.Tensor, batch_next_obs: torch.Tensor) -> None:\n self.obs_rms.update(last_frame(batch_next_obs).cpu().numpy())\n\n # ------------------------------------------------------------------\n def compute_bonus(\n self,\n obs: torch.Tensor,\n next_obs: torch.Tensor,\n actions: torch.Tensor,\n ) -> torch.Tensor:\n norm_next = self._normalize_obs(next_obs)\n norm_curr = self._normalize_obs(obs)\n\n # RND bonus: prediction error on next-state features\n target_feat = self.rnd_target(norm_next)\n pred_feat = self.rnd_predictor(norm_next)\n rnd_bonus = (target_feat - pred_feat).pow(2).sum(1) / 2\n\n # Forward-dynamics bonus: prediction error of next-state latent\n curr_z = self.fd_encoder(norm_curr)\n next_z = self.fd_encoder(norm_next)\n pred_next_z = self.forward_model(\n torch.cat([curr_z, self._one_hot(actions)], dim=1)\n )\n fd_bonus = 0.5 * (pred_next_z - next_z).pow(2).mean(dim=1)\n\n # Combine with weighting; single-stream normalization handles scale\n return (rnd_bonus + self.fd_weight * fd_bonus).detach()\n\n # ------------------------------------------------------------------\n def normalize_rollout_rewards(self, rollout_intrinsic: torch.Tensor) -> torch.Tensor:\n discounted = np.stack(\n [\n self.discounted_reward.update(rps)\n for rps in rollout_intrinsic.cpu().numpy()\n ],\n axis=0,\n )\n flat = discounted.reshape(-1)\n self.reward_rms.update_from_moments(\n float(flat.mean()), float(flat.var()), int(flat.size)\n )\n return rollout_intrinsic / float(np.sqrt(self.reward_rms.var + 1e-8))\n\n # ------------------------------------------------------------------\n def loss(\n self,\n batch_obs: torch.Tensor,\n batch_next_obs: torch.Tensor,\n batch_actions: torch.Tensor,\n ) -> torch.Tensor:\n norm_next = self._normalize_obs(batch_next_obs)\n norm_curr = self._normalize_obs(batch_obs)\n\n # RND loss (with subsampling mask)\n pred_feat = self.rnd_predictor(norm_next)\n target_feat = self.rnd_target(norm_next).detach()\n rnd_per = F.mse_loss(pred_feat, target_feat, reduction=\"none\").mean(-1)\n mask = (\n torch.rand(len(rnd_per), device=self.device) < self.args.update_proportion\n ).float()\n rnd_loss = (rnd_per * mask).sum() / torch.clamp(mask.sum(), min=1.0)\n\n # Forward / inverse dynamics loss\n curr_z = self.fd_encoder(norm_curr)\n next_z = self.fd_encoder(norm_next)\n pred_next_z = self.forward_model(\n torch.cat([curr_z, self._one_hot(batch_actions)], dim=1)\n )\n forward_loss = 0.5 * (pred_next_z - next_z.detach()).pow(2).mean()\n pred_action = self.inverse_model(torch.cat([curr_z, next_z], dim=1))\n inverse_loss = F.cross_entropy(pred_action, batch_actions.long())\n\n return rnd_loss + inverse_loss + 0.2 * forward_loss\n\n\ndef mix_advantages(\n ext_advantages: torch.Tensor,\n int_advantages: torch.Tensor,\n args: Args,\n) -> torch.Tensor:\n return args.ext_coef * ext_advantages + args.int_coef * int_advantages\n\n",
3 "end_line": 422,
4 "filename": "cleanrl/cleanrl/custom_intrinsic_exploration.py",
5 "op": "replace",
6 "start_line": 179
7}

OK: Replaced lines 179..422 with 193 line(s) in cleanrl/cleanrl/custom_intrinsic_exploration.py. Editable range: 179–371.
[Current file: cleanrl/cleanrl/custom_intrinsic_exploration.py | editable: 179–371 | total: 644 lines]
179 # =====================================================================
180 class IntrinsicBonusModule(nn.Module):
181 """Dual-Signal Intrinsic Motivation: RND + Forward Dynamics hybrid.
... (187 more lines) ...
369 ) -> torch.Tensor:
370 return args.ext_coef * ext_advantages + args.int_coef * int_advantages
371

[Test #3] (0 tests remaining; call submit(n=N) to choose which test result to submit as final)
[NOTE] This was your last test. You MUST now call submit(n=X) to choose which test result to submit as your final answer.

### tutankham-v5 (scripts/tutankham.sh)
Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=165.7000 avg_intrinsic=0.001882 eval_return=nan nonzero_rate=nan sps=1032
TRAIN_METRICS step=9986048 avg_return=165.7000 avg_intrinsic=0.002117 eval_return=nan nonzero_rate=nan sps=1032
TRAIN_METRICS step=9990144 avg_return=165.7000 avg_intrinsic=0.002471 eval_return=nan nonzero_rate=nan sps=1032
TRAIN_METRICS step=9994240 avg_return=165.9000 avg_intrinsic=0.002082 eval_return=nan nonzero_rate=nan sps=1032
TRAIN_METRICS step=9998336 avg_return=165.9500 avg_intrinsic=0.002078 eval_return=35.8000 nonzero_rate=1.0000 sps=1009
Final evaluation:
TEST_METRICS eval_return=35.8000 auc=43.127573 nonzero_rate=1.0000 best_eval_return=104.4000

### frostbite-v5 (scripts/frost
Tool Input
1{
2 "n": 1
3}

[submit] Submitting result from test #1 as final.

### tutankham-v5 (scripts/tutankham.sh)
Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=139.9000 avg_intrinsic=0.003319 eval_return=nan nonzero_rate=nan sps=1136
TRAIN_METRICS step=9986048 avg_return=139.9000 avg_intrinsic=0.003593 eval_return=nan nonzero_rate=nan sps=1136
TRAIN_METRICS step=9990144 avg_return=139.0000 avg_intrinsic=0.003790 eval_return=nan nonzero_rate=nan sps=1136
TRAIN_METRICS step=9994240 avg_return=140.9500 avg_intrinsic=0.003254 eval_return=nan nonzero_rate=nan sps=1136
TRAIN_METRICS step=9998336 avg_return=140.9500 avg_intrinsic=0.003459 eval_return=104.6000 nonzero_rate=1.0000 sps=1110
Final evaluation:
TEST_METRICS eval_return=104.6000 auc=31.850061 nonzero_rate=1.0000 best_eval_return=105.0000

### frostbite-v5 (scripts/frostbite.sh)
Training metrics (last steps):
TRAIN_METRICS