Agent Conversation: gemini-3.1-pro-preview
Task: rl-intrinsic-exploration
RL Intrinsic Exploration: Sparse-Reward Novelty Bonus Design
Research Question
Design an intrinsic exploration mechanism that improves sparse-reward discovery in hard-exploration Atari environments.
Background
In sparse-reward reinforcement learning, external rewards arrive too infrequently for vanilla policy optimization to learn efficiently. A common solution is to add an intrinsic reward that encourages novelty, surprise, or state-space coverage.
This task isolates that design choice. The PPO training loop, Atari preprocessing, policy/value architecture, and optimization setup are fixed. The only component you should redesign is the intrinsic bonus module and how its signal is mixed into learning.
Reference algorithm families include (a minimal RND-style sketch follows this list):
- No bonus / vanilla PPO: learns only from clipped extrinsic reward
- RND: rewards prediction error against a fixed random target network
- ICM: rewards forward-dynamics prediction error in learned feature space
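For orientation, here is a minimal sketch of the RND-family bonus. It assumes flattened 84x84 grayscale observations and a linear target/predictor pair purely for brevity; the networks actually submitted later in this log are convolutional.

```python
import torch
import torch.nn as nn

class RNDSketch(nn.Module):
    """Shape of an RND bonus; illustrative only, not the benchmark's code."""

    def __init__(self, obs_dim: int = 84 * 84, feat_dim: int = 512):
        super().__init__()
        # Fixed random target network: frozen, defines the prediction problem.
        self.target = nn.Sequential(nn.Flatten(), nn.Linear(obs_dim, feat_dim))
        for p in self.target.parameters():
            p.requires_grad = False
        # Trainable predictor: distilled toward the target on visited states.
        self.predictor = nn.Sequential(nn.Flatten(), nn.Linear(obs_dim, feat_dim))

    @torch.no_grad()
    def bonus(self, obs: torch.Tensor) -> torch.Tensor:
        # Prediction error stays high on rarely visited states -> novelty reward.
        return (self.target(obs) - self.predictor(obs)).pow(2).mean(dim=1)
```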
Task
Modify the editable section of custom_intrinsic_exploration.py:
- IntrinsicBonusModule: define how intrinsic rewards are computed and trained
- mix_advantages(...): define how extrinsic and intrinsic advantages are combined
The editable code must keep the public interface intact (a minimal no-op skeleton follows this list):
- initialize(envs)
- trainable_parameters()
- update_batch_stats(batch_obs, batch_next_obs)
- compute_bonus(obs, next_obs, actions)
- normalize_rollout_rewards(rollout_intrinsic)
- loss(batch_obs, batch_next_obs, batch_actions)
- mix_advantages(ext_advantages, int_advantages, args)
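A hedged sketch of a do-nothing module that keeps this surface intact while reproducing vanilla PPO. The class name `NullBonusModule` and the zero-bonus behavior are illustrative assumptions, not part of the benchmark.

```python
import torch
import torch.nn as nn

class NullBonusModule(nn.Module):
    """Hypothetical skeleton: satisfies the interface, contributes nothing."""

    def initialize(self, envs) -> None:
        pass  # e.g. warm up observation statistics with random-action rollouts

    def trainable_parameters(self):
        return []  # a zero bonus needs no optimizer state

    def update_batch_stats(self, batch_obs, batch_next_obs) -> None:
        pass  # running observation statistics would be updated here

    def compute_bonus(self, obs, next_obs, actions) -> torch.Tensor:
        return torch.zeros(obs.shape[0], device=obs.device)  # no intrinsic signal

    def normalize_rollout_rewards(self, rollout_intrinsic):
        return rollout_intrinsic  # nothing to rescale

    def loss(self, batch_obs, batch_next_obs, batch_actions) -> torch.Tensor:
        return torch.zeros((), requires_grad=True)  # no auxiliary objective

def mix_advantages(ext_advantages, int_advantages, args):
    return ext_advantages  # ignore the (all-zero) intrinsic stream
```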
Evaluation
The agent is trained with the same fixed PPO-style loop on three sparse-reward Atari environments:
- Tutankham-v5
- Frostbite-v5
- PrivateEye-v5
Tutankham-v5 and Frostbite-v5 are visible during development. PrivateEye-v5 is held out as the hidden evaluation environment.
Metrics:
- eval_return: mean evaluation episodic return at a fixed training budget
- auc: area under the evaluation-return curve across training
- nonzero_rate: fraction of evaluation episodes with non-zero episodic return
Evaluation uses deterministic rollouts with a fixed per-episode step cap so a non-terminating Atari behavior cannot stall the benchmark.
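A hedged sketch of how these three metrics could be computed from the logged evaluation points. The harness's exact aggregation (for instance, whether auc is normalized by the step range) is not visible in this transcript.

```python
import numpy as np

def summarize_eval(eval_steps, eval_curve, episode_returns):
    """eval_steps: training steps at which evaluation ran;
    eval_curve: mean episodic return at each of those steps;
    episode_returns: per-episode returns from the final evaluation."""
    eval_return = float(eval_curve[-1])            # return at the fixed budget
    auc = float(np.trapz(eval_curve, eval_steps))  # area under the return curve
    nonzero_rate = float(np.mean(np.asarray(episode_returns) != 0))
    return eval_return, auc, nonzero_rate
```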
Improvement must transfer across multiple games; a method that only helps one environment is not sufficient. Tutankham-v5 is included as a medium-difficulty visible environment so baseline ranking is measurable at a modest training budget before transfer is checked on the harder visible and hidden games.
cleanrl/cleanrl/custom_intrinsic_exploration.py [EDITABLE — lines 179–219 only]
1: # Custom sparse-reward Atari exploration benchmark for MLS-Bench.
2: #
3: # FIXED sections: PPO loop, Atari preprocessing, policy/value architecture,
4: # evaluation, logging, and optimizer wiring.
5: # EDITABLE section: IntrinsicBonusModule + mix_advantages.
6:
7: from __fu…

Tool Input
1{
2 "start_line": 180,
3 "content": "class IntrinsicBonusModule(nn.Module):\n \"\"\"Controllable Feature Distillation with Frontier Novelty (CFD-FN).\"\"\"\n\n def __init__(self, action_dim: int, device: torch.device, args: Args):\n super().__init__()\n self.action_dim = action_dim\n self.device = device\n self.args = args\n self.obs_rms = RunningMeanStd(shape=(1, 1, 84, 84))\n self.reward_rms = RunningMeanStd()\n self.discounted_reward = RewardForwardFilter(args.int_gamma)\n\n feature_output = 7 * 7 * 64\n self.encoder = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.ReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.ReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.ReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 256)),\n nn.LayerNorm(256)\n )\n self.inverse_model = nn.Sequential(\n layer_init(nn.Linear(512, 256)),\n nn.ReLU(),\n layer_init(nn.Linear(256, action_dim), std=0.01),\n )\n self.predictor = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.LeakyReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 256)),\n )\n\n def initialize(self, envs) -> None:\n bootstrap = []\n total_steps = self.args.num_steps * self.args.num_iterations_obs_norm_init\n for _ in range(total_steps):\n random_actions = np.random.randint(0, envs.single_action_space.n, size=(self.args.num_envs,))\n sampled_obs, _, _, _ = envs.step(random_actions)\n bootstrap.append(sampled_obs[:, 3:4, :, :])\n if len(bootstrap) >= self.args.num_steps:\n stacked = np.concatenate(bootstrap, axis=0)\n self.obs_rms.update(stacked)\n bootstrap.clear()\n\n def trainable_parameters(self):\n return list(self.parameters())\n\n def _normalize_obs(self, obs: torch.Tensor) -> torch.Tensor:\n mean = torch.from_numpy(self.obs_rms.mean).to(self.device)\n var = torch.from_numpy(self.obs_rms.var).to(self.device)\n return ((last_frame(obs) - mean) / torch.sqrt(var)).clip(-5, 5).float()\n\n def update_batch_stats(self, batch_obs: torch.Tensor, batch_next_obs: torch.Tensor) -> None:\n self.obs_rms.update(last_frame(batch_next_obs).cpu().numpy())\n\n def compute_bonus(\n self,\n obs: torch.Tensor,\n next_obs: torch.Tensor,\n actions: torch.Tensor,\n ) -> torch.Tensor:\n norm_obs = self._normalize_obs(obs)\n norm_next = self._normalize_obs(next_obs)\n \n with torch.no_grad():\n target_feat_obs = self.encoder(norm_obs)\n pred_feat_obs = self.predictor(norm_obs)\n error_obs = ((target_feat_obs - pred_feat_obs).pow(2).sum(1) / 2)\n \n target_feat_next = self.encoder(norm_next)\n pred_feat_next = self.predictor(norm_next)\n error_next = ((target_feat_next - pred_feat_next).pow(2).sum(1) / 2)\n \n bonus = torch.clamp(error_next - error_obs, min=0.0) + 0.5 * error_next\n return bonus.detach()\n\n def normalize_rollout_rewards(self, rollout_intrinsic: torch.Tensor) -> torch.Tensor:\n discounted = np.stack(\n [self.discounted_reward.update(reward_per_step) for reward_per_step in rollout_intrinsic.cpu().numpy()],\n axis=0,\n )\n flat_discounted = discounted.reshape(-1)\n self.reward_rms.update_from_moments(\n float(flat_discounted.mean()),\n float(flat_discounted.var()),\n int(flat_discounted.size),\n )\n return rollout_intrinsic / float(np.sqrt(self.reward_rms.var + 1e-8))\n\n def loss(\n self,\n batch_obs: torch.Tensor,\n batch_next_obs: torch.Tensor,\n batch_actions: torch.Tensor,\n ) 
-> torch.Tensor:\n norm_obs = self._normalize_obs(batch_obs)\n norm_next = self._normalize_obs(batch_next_obs)\n \n obs_feat = self.encoder(norm_obs)\n next_feat = self.encoder(norm_next)\n pred_action = self.inverse_model(torch.cat([obs_feat, next_feat], dim=1))\n inverse_loss = F.cross_entropy(pred_action, batch_actions.long())\n \n pred_next_feat = self.predictor(norm_next)\n distill_loss_unmasked = F.mse_loss(pred_next_feat, next_feat.detach(), reduction=\"none\").mean(-1)\n \n mask = (torch.rand(len(distill_loss_unmasked), device=self.device) < self.args.update_proportion).float()\n distill_loss = (distill_loss_unmasked * mask).sum() / torch.clamp(mask.sum(), min=1.0)\n \n return inverse_loss + distill_loss\n\n\ndef mix_advantages(ext_advantages: torch.Tensor, int_advantages: torch.Tensor, args: Args) -> torch.Tensor:\n return args.ext_coef * ext_advantages + args.int_coef * int_advantages",
4 "op": "replace",
5 "end_line": 219,
6 "filename": "cleanrl/cleanrl/custom_intrinsic_exploration.py"
7}

OK: Replaced lines 180..219 with 126 line(s) in cleanrl/cleanrl/custom_intrinsic_exploration.py. Editable range: 179–305.
[Current file: cleanrl/cleanrl/custom_intrinsic_exploration.py | editable: 179–305 | total: 578 lines]
179 # =====================================================================
180 class IntrinsicBonusModule(nn.Module):
181 """Controllable Feature Distillation with Frontier Novelty (CFD-FN)."""
... (121 more lines) ...
303
304 def mix_advantages(ext_advantages: torch.Tensor, int_advantages: torch.Tensor, args: Args) -> torch.Tensor:
305     return args.ext_coef * ext_advantages + args.int_coef * int_advantages

[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)

### tutankham-v5 (scripts/tutankham.sh)
Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=122.0500 avg_intrinsic=0.009762 eval_return=nan nonzero_rate=nan sps=1062
TRAIN_METRICS step=9986048 avg_return=122.1000 avg_intrinsic=0.010097 eval_return=nan nonzero_rate=nan sps=1062
TRAIN_METRICS step=9990144 avg_return=122.1000 avg_intrinsic=0.009254 eval_return=nan nonzero_rate=nan sps=1063
TRAIN_METRICS step=9994240 avg_return=123.2000 avg_intrinsic=0.008857 eval_return=nan nonzero_rate=nan sps=1063
TRAIN_METRICS step=9998336 avg_return=124.2000 avg_intrinsic=0.007928 eval_return=41.4000 nonzero_rate=1.0000 sps=1033

Final evaluation:
TEST_METRICS eval_return=41.4000 auc=33.268656 nonzero_rate=1.0000 best_eval_return=105.4000

### frostbite-v5 (scripts/frostbite.sh)
Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=2535.5000 avg_intrinsic=0.016350 eval_return=
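The CFD-FN module evaluated in test #1 treats per-state distillation error as a visit-count proxy and rewards the clipped increase in that error along a transition (a BeBold-style frontier term), plus a residual pure-novelty term. Condensed, with `err_obs`/`err_next` standing for the per-state errors computed in compute_bonus:

```python
import torch

def frontier_bonus(err_obs: torch.Tensor, err_next: torch.Tensor,
                   residual: float = 0.5) -> torch.Tensor:
    # Reward only transitions that move toward less-familiar states, then add
    # a residual novelty term so novel-but-stationary states still pay off.
    return torch.clamp(err_next - err_obs, min=0.0) + residual * err_next
```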
Tool Input
1{
2 "end_line": 305,
3 "content": "# =====================================================================\n# EDITABLE: intrinsic reward design\n# =====================================================================\nclass IntrinsicBonusModule(nn.Module):\n \"\"\"Hybrid Frontier Distillation (BeBOLD-RND + CFD).\"\"\"\n\n def __init__(self, action_dim: int, device: torch.device, args: Args):\n super().__init__()\n self.action_dim = action_dim\n self.device = device\n self.args = args\n self.obs_rms = RunningMeanStd(shape=(1, 1, 84, 84))\n self.reward_rms = RunningMeanStd()\n self.discounted_reward = RewardForwardFilter(args.int_gamma)\n\n feature_output = 7 * 7 * 64\n\n # ---------------- RND Components ----------------\n self.rnd_target = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.LeakyReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 512)),\n )\n for param in self.rnd_target.parameters():\n param.requires_grad = False\n\n self.rnd_predictor = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.LeakyReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 512)),\n )\n\n # ---------------- CFD Components ----------------\n self.cfd_encoder = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.ReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.ReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.ReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 256)),\n nn.LayerNorm(256)\n )\n self.cfd_inverse = nn.Sequential(\n layer_init(nn.Linear(512, 256)),\n nn.ReLU(),\n layer_init(nn.Linear(256, action_dim), std=0.01),\n )\n self.cfd_predictor = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.LeakyReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 256)),\n )\n\n def initialize(self, envs) -> None:\n bootstrap = []\n total_steps = self.args.num_steps * self.args.num_iterations_obs_norm_init\n for _ in range(total_steps):\n random_actions = np.random.randint(0, envs.single_action_space.n, size=(self.args.num_envs,))\n sampled_obs, _, _, _ = envs.step(random_actions)\n bootstrap.append(sampled_obs[:, 3:4, :, :])\n if len(bootstrap) >= self.args.num_steps:\n stacked = np.concatenate(bootstrap, axis=0)\n self.obs_rms.update(stacked)\n bootstrap.clear()\n\n def trainable_parameters(self):\n return list(self.rnd_predictor.parameters()) + \\\n list(self.cfd_encoder.parameters()) + \\\n list(self.cfd_inverse.parameters()) + \\\n list(self.cfd_predictor.parameters())\n\n def _normalize_obs(self, obs: torch.Tensor) -> torch.Tensor:\n mean = torch.from_numpy(self.obs_rms.mean).to(self.device)\n var = torch.from_numpy(self.obs_rms.var).to(self.device)\n return ((last_frame(obs) - mean) / torch.sqrt(var)).clip(-5, 5).float()\n\n def update_batch_stats(self, batch_obs: torch.Tensor, batch_next_obs: torch.Tensor) -> None:\n self.obs_rms.update(last_frame(batch_next_obs).cpu().numpy())\n\n def compute_bonus(self, obs: torch.Tensor, next_obs: 
torch.Tensor, actions: torch.Tensor) -> torch.Tensor:\n norm_obs = self._normalize_obs(obs)\n norm_next = self._normalize_obs(next_obs)\n \n with torch.no_grad():\n rnd_t_obs = self.rnd_target(norm_obs)\n rnd_p_obs = self.rnd_predictor(norm_obs)\n rnd_err_obs = ((rnd_t_obs - rnd_p_obs).pow(2).sum(1) / 2)\n \n rnd_t_next = self.rnd_target(norm_next)\n rnd_p_next = self.rnd_predictor(norm_next)\n rnd_err_next = ((rnd_t_next - rnd_p_next).pow(2).sum(1) / 2)\n \n cfd_t_obs = self.cfd_encoder(norm_obs)\n cfd_p_obs = self.cfd_predictor(norm_obs)\n cfd_err_obs = ((cfd_t_obs - cfd_p_obs).pow(2).sum(1) / 2)\n \n cfd_t_next = self.cfd_encoder(norm_next)\n cfd_p_next = self.cfd_predictor(norm_next)\n cfd_err_next = ((cfd_t_next - cfd_p_next).pow(2).sum(1) / 2)\n \n rnd_bonus = torch.clamp(rnd_err_next - rnd_err_obs, min=0.0) + 0.1 * rnd_err_next\n cfd_bonus = torch.clamp(cfd_err_next - cfd_err_obs, min=0.0) + 0.1 * cfd_err_next\n \n bonus = rnd_bonus + cfd_bonus\n return bonus.detach()\n\n def normalize_rollout_rewards(self, rollout_intrinsic: torch.Tensor) -> torch.Tensor:\n discounted = np.stack(\n [self.discounted_reward.update(reward_per_step) for reward_per_step in rollout_intrinsic.cpu().numpy()],\n axis=0,\n )\n flat_discounted = discounted.reshape(-1)\n self.reward_rms.update_from_moments(\n float(flat_discounted.mean()),\n float(flat_discounted.var()),\n int(flat_discounted.size),\n )\n return rollout_intrinsic / float(np.sqrt(self.reward_rms.var + 1e-8))\n\n def loss(self, batch_obs: torch.Tensor, batch_next_obs: torch.Tensor, batch_actions: torch.Tensor) -> torch.Tensor:\n norm_obs = self._normalize_obs(batch_obs)\n norm_next = self._normalize_obs(batch_next_obs)\n \n # 1. RND Loss\n rnd_p_next = self.rnd_predictor(norm_next)\n rnd_t_next = self.rnd_target(norm_next).detach()\n rnd_loss_unmasked = F.mse_loss(rnd_p_next, rnd_t_next, reduction=\"none\").mean(-1)\n \n # 2. CFD Loss\n cfd_e_obs = self.cfd_encoder(norm_obs)\n cfd_e_next = self.cfd_encoder(norm_next)\n \n cfd_inv_pred = self.cfd_inverse(torch.cat([cfd_e_obs, cfd_e_next], dim=1))\n cfd_inv_loss = F.cross_entropy(cfd_inv_pred, batch_actions.long())\n \n cfd_p_next = self.cfd_predictor(norm_next)\n cfd_distill_unmasked = F.mse_loss(cfd_p_next, cfd_e_next.detach(), reduction=\"none\").mean(-1)\n \n mask = (torch.rand(len(rnd_loss_unmasked), device=self.device) < self.args.update_proportion).float()\n \n rnd_distill_loss = (rnd_loss_unmasked * mask).sum() / torch.clamp(mask.sum(), min=1.0)\n cfd_distill_loss = (cfd_distill_unmasked * mask).sum() / torch.clamp(mask.sum(), min=1.0)\n \n return rnd_distill_loss + cfd_inv_loss + cfd_distill_loss\n\n\ndef mix_advantages(ext_advantages: torch.Tensor, int_advantages: torch.Tensor, args: Args) -> torch.Tensor:\n return args.ext_coef * ext_advantages + args.int_coef * int_advantages",
4 "op": "replace",
5 "start_line": 179,
6 "filename": "cleanrl/cleanrl/custom_intrinsic_exploration.py"
7}

OK: Replaced lines 179..305 with 173 line(s) in cleanrl/cleanrl/custom_intrinsic_exploration.py. Editable range: 179–351.
[Current file: cleanrl/cleanrl/custom_intrinsic_exploration.py | editable: 179–351 | total: 624 lines]
179 # =====================================================================
180 # EDITABLE: intrinsic reward design
181 # =====================================================================
... (167 more lines) ...
349
350 def mix_advantages(ext_advantages: torch.Tensor, int_advantages: torch.Tensor, args: Args) -> torch.Tensor:
351     return args.ext_coef * ext_advantages + args.int_coef * int_advantages

[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final)

### tutankham-v5 (scripts/tutankham.sh)
Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=162.8500 avg_intrinsic=0.004785 eval_return=nan nonzero_rate=nan sps=805
TRAIN_METRICS step=9986048 avg_return=162.8500 avg_intrinsic=0.004987 eval_return=nan nonzero_rate=nan sps=805
TRAIN_METRICS step=9990144 avg_return=162.1000 avg_intrinsic=0.004947 eval_return=nan nonzero_rate=nan sps=806
TRAIN_METRICS step=9994240 avg_return=161.8500 avg_intrinsic=0.004903 eval_return=nan nonzero_rate=nan sps=806
TRAIN_METRICS step=9998336 avg_return=161.3000 avg_intrinsic=0.004356 eval_return=105.2000 nonzero_rate=1.0000 sps=791

Final evaluation:
TEST_METRICS eval_return=105.2000 auc=47.660142 nonzero_rate=1.0000 best_eval_return=110.0000

### frostbite-v5 (scripts/frostbite.sh)
Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=196.5000 avg_intrinsic=0.016616 eval_return=nan n
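Each submitted version reuses the same intrinsic-reward normalization: a discounted running sum of bonuses feeds a RunningMeanStd, and raw bonuses are divided by the resulting standard deviation. A sketch of the forward filter behind normalize_rollout_rewards, assuming the standard CleanRL-style utility (internals inferred, not shown in this log):

```python
import numpy as np

class RewardForwardFilterSketch:
    """Discounted running return per environment, used only for reward scaling."""

    def __init__(self, gamma: float):
        self.gamma = gamma
        self.rewems = None  # running exponentially-discounted sum

    def update(self, rews: np.ndarray) -> np.ndarray:
        # First call seeds the accumulator; later calls discount and add.
        self.rewems = rews if self.rewems is None else self.rewems * self.gamma + rews
        return self.rewems
```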
Tool Input
1{
2 "end_line": 351,
3 "content": "# =====================================================================\n# EDITABLE: intrinsic reward design\n# =====================================================================\nclass IntrinsicBonusModule(nn.Module):\n \"\"\"Hybrid Normalized Exploration (BeBOLD-CFD + Pure RND).\"\"\"\n\n def __init__(self, action_dim: int, device: torch.device, args: Args):\n super().__init__()\n self.action_dim = action_dim\n self.device = device\n self.args = args\n self.obs_rms = RunningMeanStd(shape=(1, 1, 84, 84))\n self.reward_rms = RunningMeanStd()\n self.discounted_reward = RewardForwardFilter(args.int_gamma)\n\n feature_output = 7 * 7 * 64\n\n # ---------------- RND Components ----------------\n self.rnd_target = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.LeakyReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 512)),\n nn.LayerNorm(512)\n )\n for param in self.rnd_target.parameters():\n param.requires_grad = False\n\n self.rnd_predictor = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.LeakyReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 512)),\n )\n\n # ---------------- CFD Components ----------------\n self.cfd_encoder = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.ReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.ReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.ReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 256)),\n nn.LayerNorm(256)\n )\n self.cfd_inverse = nn.Sequential(\n layer_init(nn.Linear(512, 256)),\n nn.ReLU(),\n layer_init(nn.Linear(256, action_dim), std=0.01),\n )\n self.cfd_predictor = nn.Sequential(\n layer_init(nn.Conv2d(1, 32, 8, stride=4)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(32, 64, 4, stride=2)),\n nn.LeakyReLU(),\n layer_init(nn.Conv2d(64, 64, 3, stride=1)),\n nn.LeakyReLU(),\n nn.Flatten(),\n layer_init(nn.Linear(feature_output, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 512)),\n nn.ReLU(),\n layer_init(nn.Linear(512, 256)),\n )\n\n def initialize(self, envs) -> None:\n bootstrap = []\n total_steps = self.args.num_steps * self.args.num_iterations_obs_norm_init\n for _ in range(total_steps):\n random_actions = np.random.randint(0, envs.single_action_space.n, size=(self.args.num_envs,))\n sampled_obs, _, _, _ = envs.step(random_actions)\n bootstrap.append(sampled_obs[:, 3:4, :, :])\n if len(bootstrap) >= self.args.num_steps:\n stacked = np.concatenate(bootstrap, axis=0)\n self.obs_rms.update(stacked)\n bootstrap.clear()\n\n def trainable_parameters(self):\n return list(self.rnd_predictor.parameters()) + \\\n list(self.cfd_encoder.parameters()) + \\\n list(self.cfd_inverse.parameters()) + \\\n list(self.cfd_predictor.parameters())\n\n def _normalize_obs(self, obs: torch.Tensor) -> torch.Tensor:\n mean = torch.from_numpy(self.obs_rms.mean).to(self.device)\n var = torch.from_numpy(self.obs_rms.var).to(self.device)\n return ((last_frame(obs) - mean) / torch.sqrt(var)).clip(-5, 5).float()\n\n def update_batch_stats(self, batch_obs: torch.Tensor, batch_next_obs: torch.Tensor) -> None:\n self.obs_rms.update(last_frame(batch_next_obs).cpu().numpy())\n\n def compute_bonus(self, obs: 
torch.Tensor, next_obs: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:\n norm_obs = self._normalize_obs(obs)\n norm_next = self._normalize_obs(next_obs)\n \n with torch.no_grad():\n rnd_t_next = self.rnd_target(norm_next)\n rnd_p_next = self.rnd_predictor(norm_next)\n rnd_err_next = ((rnd_t_next - rnd_p_next).pow(2).sum(1) / 512.0)\n \n cfd_t_obs = self.cfd_encoder(norm_obs)\n cfd_p_obs = self.cfd_predictor(norm_obs)\n cfd_err_obs = ((cfd_t_obs - cfd_p_obs).pow(2).sum(1) / 256.0)\n \n cfd_t_next = self.cfd_encoder(norm_next)\n cfd_p_next = self.cfd_predictor(norm_next)\n cfd_err_next = ((cfd_t_next - cfd_p_next).pow(2).sum(1) / 256.0)\n \n cfd_bonus = torch.clamp(cfd_err_next - cfd_err_obs, min=0.0)\n rnd_bonus = rnd_err_next\n \n bonus = rnd_bonus + 2.0 * cfd_bonus\n return bonus.detach()\n\n def normalize_rollout_rewards(self, rollout_intrinsic: torch.Tensor) -> torch.Tensor:\n discounted = np.stack(\n [self.discounted_reward.update(reward_per_step) for reward_per_step in rollout_intrinsic.cpu().numpy()],\n axis=0,\n )\n flat_discounted = discounted.reshape(-1)\n self.reward_rms.update_from_moments(\n float(flat_discounted.mean()),\n float(flat_discounted.var()),\n int(flat_discounted.size),\n )\n return rollout_intrinsic / float(np.sqrt(self.reward_rms.var + 1e-8))\n\n def loss(self, batch_obs: torch.Tensor, batch_next_obs: torch.Tensor, batch_actions: torch.Tensor) -> torch.Tensor:\n norm_obs = self._normalize_obs(batch_obs)\n norm_next = self._normalize_obs(batch_next_obs)\n \n # RND Loss\n rnd_p_next = self.rnd_predictor(norm_next)\n rnd_t_next = self.rnd_target(norm_next).detach()\n rnd_loss_unmasked = F.mse_loss(rnd_p_next, rnd_t_next, reduction=\"none\").mean(-1)\n \n # CFD Loss\n cfd_e_obs = self.cfd_encoder(norm_obs)\n cfd_e_next = self.cfd_encoder(norm_next)\n \n cfd_inv_pred = self.cfd_inverse(torch.cat([cfd_e_obs, cfd_e_next], dim=1))\n cfd_inv_loss = F.cross_entropy(cfd_inv_pred, batch_actions.long())\n \n cfd_p_next = self.cfd_predictor(norm_next)\n cfd_distill_unmasked = F.mse_loss(cfd_p_next, cfd_e_next.detach(), reduction=\"none\").mean(-1)\n \n mask = (torch.rand(len(rnd_loss_unmasked), device=self.device) < self.args.update_proportion).float()\n \n rnd_distill_loss = (rnd_loss_unmasked * mask).sum() / torch.clamp(mask.sum(), min=1.0)\n cfd_distill_loss = (cfd_distill_unmasked * mask).sum() / torch.clamp(mask.sum(), min=1.0)\n \n return rnd_distill_loss + cfd_inv_loss + cfd_distill_loss\n\n\ndef mix_advantages(ext_advantages: torch.Tensor, int_advantages: torch.Tensor, args: Args) -> torch.Tensor:\n return args.ext_coef * ext_advantages + args.int_coef * int_advantages",
4 "start_line": 179,
5 "filename": "cleanrl/cleanrl/custom_intrinsic_exploration.py",
6 "op": "replace"
7}

OK: Replaced lines 179..351 with 170 line(s) in cleanrl/cleanrl/custom_intrinsic_exploration.py. Editable range: 179–348.
[Current file: cleanrl/cleanrl/custom_intrinsic_exploration.py | editable: 179–348 | total: 621 lines]
179 # =====================================================================
180 # EDITABLE: intrinsic reward design
181 # =====================================================================
... (164 more lines) ...
346
347 def mix_advantages(ext_advantages: torch.Tensor, int_advantages: torch.Tensor, args: Args) -> torch.Tensor:
348     return args.ext_coef * ext_advantages + args.int_coef * int_advantages

[Test #3] (0 tests remaining; call submit(n=N) to choose which test result to submit as final)
[NOTE] This was your last test. You MUST now call submit(n=X) to choose which test result to submit as your final answer.

### tutankham-v5 (scripts/tutankham.sh)
Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=114.2500 avg_intrinsic=0.001868 eval_return=nan nonzero_rate=nan sps=1224
TRAIN_METRICS step=9986048 avg_return=114.2500 avg_intrinsic=0.001857 eval_return=nan nonzero_rate=nan sps=1224
TRAIN_METRICS step=9990144 avg_return=114.6500 avg_intrinsic=0.001951 eval_return=nan nonzero_rate=nan sps=1224
TRAIN_METRICS step=9994240 avg_return=115.0500 avg_intrinsic=0.001845 eval_return=nan nonzero_rate=nan sps=1225
TRAIN_METRICS step=9998336 avg_return=114.8500 avg_intrinsic=0.001756 eval_return=114.6000 nonzero_rate=1.0000 sps=1223

Final evaluation:
TEST_METRICS eval_return=114.6000 auc=104.676900 nonzero_rate=1.0000 best_eval_return=118.2000

### frostbite-v5 (scripts/fr
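The version evaluated in test #3 differs from test #2 in two ways: prediction errors are averaged per feature dimension (the sum is divided by 512 for RND and 256 for CFD, with a LayerNorm on the frozen RND target) so the two streams share a scale, and the clipped frontier term is up-weighted before mixing. The resulting rule, condensed, with the CFD errors taken as already per-dimension-normalized:

```python
import torch

def hybrid_bonus(rnd_target_feat, rnd_pred_feat,
                 cfd_err_obs, cfd_err_next, cfd_weight: float = 2.0):
    # Per-dimension RND error: raw novelty of the next state.
    rnd_err = (rnd_target_feat - rnd_pred_feat).pow(2).sum(1) / rnd_target_feat.shape[1]
    # Clipped CFD frontier term: increase in error along the transition.
    frontier = torch.clamp(cfd_err_next - cfd_err_obs, min=0.0)
    return rnd_err + cfd_weight * frontier
```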
Tool Input
1{
2 "n": 3
3}

[submit] Submitting result from test #3 as final.

### tutankham-v5 (scripts/tutankham.sh)
Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=114.2500 avg_intrinsic=0.001868 eval_return=nan nonzero_rate=nan sps=1224
TRAIN_METRICS step=9986048 avg_return=114.2500 avg_intrinsic=0.001857 eval_return=nan nonzero_rate=nan sps=1224
TRAIN_METRICS step=9990144 avg_return=114.6500 avg_intrinsic=0.001951 eval_return=nan nonzero_rate=nan sps=1224
TRAIN_METRICS step=9994240 avg_return=115.0500 avg_intrinsic=0.001845 eval_return=nan nonzero_rate=nan sps=1225
TRAIN_METRICS step=9998336 avg_return=114.8500 avg_intrinsic=0.001756 eval_return=114.6000 nonzero_rate=1.0000 sps=1223

Final evaluation:
TEST_METRICS eval_return=114.6000 auc=104.676900 nonzero_rate=1.0000 best_eval_return=118.2000

### frostbite-v5 (scripts/frostbite.sh)
Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=2558.5000 avg_intrinsic=0.019990 eval_return=nan nonzero_rate=nan sps=1290
TRAIN_METRIC