Agent Conversation: deepseek-reasoner
Task: rl-intrinsic-exploration
RL Intrinsic Exploration: Sparse-Reward Novelty Bonus Design
Research Question
Design an intrinsic exploration mechanism that improves sparse-reward discovery in hard-exploration Atari environments.
Background
In sparse-reward reinforcement learning, external rewards arrive too infrequently for vanilla policy optimization to learn efficiently. A common solution is to add an intrinsic reward that encourages novelty, surprise, or state-space coverage.
This task isolates that question. The PPO training loop, Atari preprocessing, policy/value architecture, and optimization setup are fixed. The only thing you should redesign is the intrinsic bonus module and how its signal is mixed into learning.
Reference algorithm families include:
- No bonus / vanilla PPO: learns only from clipped extrinsic reward
- RND: rewards prediction error against a fixed random target network (see the sketch after this list)
- ICM: rewards forward-dynamics prediction error in learned feature space
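To make the RND family concrete, here is a minimal sketch of the bonus computation; the layer sizes and names are illustrative assumptions, not the benchmark's fixed architecture:

import torch
import torch.nn as nn

# RND sketch: the bonus is the prediction error between a trainable
# predictor and a frozen, randomly initialized target network.
# All dimensions here are illustrative.
obs_dim, feat_dim = 84 * 84, 128
target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
for p in target.parameters():
    p.requires_grad = False  # the target never trains

def rnd_bonus(obs: torch.Tensor) -> torch.Tensor:
    # Novel states are predicted poorly, so the error (the bonus) is high;
    # as the predictor catches up on familiar states, their bonus decays.
    return (target(obs) - predictor(obs)).pow(2).mean(dim=-1).detach()

ICM replaces the random target with a learned forward-dynamics model, so the error measures surprise under the agent's own dynamics model rather than plain unfamiliarity.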
Task
Modify the editable section of custom_intrinsic_exploration.py:
- IntrinsicBonusModule: define how intrinsic rewards are computed and trained
- mix_advantages(...): define how extrinsic and intrinsic advantages are combined
The editable code must keep the public interface intact; a do-nothing skeleton satisfying it is sketched after this list:
- initialize(envs)
- trainable_parameters()
- update_batch_stats(batch_obs, batch_next_obs)
- compute_bonus(obs, next_obs, actions)
- normalize_rollout_rewards(rollout_intrinsic)
- loss(batch_obs, batch_next_obs, batch_actions)
- mix_advantages(ext_advantages, int_advantages, args)
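As an orientation aid, a do-nothing module satisfying that interface could look like the sketch below. The constructor signature follows the attempts later in this log; everything else is a placeholder assumption, equivalent to the no-bonus baseline:

import torch
import torch.nn as nn

class ZeroBonusModule(nn.Module):
    """Interface skeleton: vanilla PPO behavior, zero intrinsic reward."""

    def __init__(self, action_dim, device, args):
        super().__init__()
        self.action_dim, self.device, self.args = action_dim, device, args

    def initialize(self, envs) -> None:
        pass  # real modules warm up observation statistics with random steps

    def trainable_parameters(self):
        return []  # nothing extra for the optimizer to update

    def update_batch_stats(self, batch_obs, batch_next_obs) -> None:
        pass  # real modules update running normalization statistics here

    def compute_bonus(self, obs, next_obs, actions) -> torch.Tensor:
        return torch.zeros(obs.shape[0])  # no exploration signal

    def normalize_rollout_rewards(self, rollout_intrinsic: torch.Tensor) -> torch.Tensor:
        return rollout_intrinsic  # identity; real modules rescale by a running std

    def loss(self, batch_obs, batch_next_obs, batch_actions) -> torch.Tensor:
        return torch.zeros(())  # no auxiliary model to train

def mix_advantages(ext_advantages, int_advantages, args):
    # Default linear mix, used unchanged in every attempt below.
    return args.ext_coef * ext_advantages + args.int_coef * int_advantages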
Evaluation
The agent is trained with the same fixed PPO-style loop on three sparse-reward Atari environments:
- Tutankham-v5
- Frostbite-v5
- PrivateEye-v5
Tutankham-v5 and Frostbite-v5 are visible during development. PrivateEye-v5 is held out as the hidden evaluation environment.
Metrics:
- eval_return: mean evaluation episodic return at a fixed training budget
- auc: area under the evaluation-return curve across training
- nonzero_rate: fraction of evaluation episodes with non-zero episodic return
Evaluation uses deterministic rollouts with a fixed per-episode step cap so a non-terminating Atari behavior cannot stall the benchmark.
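For concreteness, the aggregate metrics can be computed from an evaluation log roughly as follows. The data values are illustrative, and normalizing auc by the step span is an assumption here, since the benchmark's exact scoring code is not shown:

import numpy as np

# Illustrative evaluation log: steps at which eval ran, and the mean return there.
eval_steps = np.array([2.5e6, 5.0e6, 7.5e6, 1.0e7])
eval_returns = np.array([0.0, 40.0, 80.0, 104.4])

# auc: area under the evaluation-return curve; dividing by the step span
# (an assumption) turns it into an average return over training.
auc = np.trapz(eval_returns, eval_steps) / (eval_steps[-1] - eval_steps[0])
eval_return = eval_returns[-1]  # return at the final training budget

# nonzero_rate: fraction of evaluation episodes that scored at all.
episode_returns = np.array([0.0, 112.0, 98.0, 0.0, 107.0])
nonzero_rate = float(np.mean(episode_returns != 0.0))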
Improvement must transfer across multiple games; a method that only helps one environment is not sufficient. Tutankham-v5 is included as a medium-difficulty visible environment so baseline ranking is measurable at a modest training budget before transfer is checked on the harder visible and hidden games.
cleanrl/cleanrl/custom_intrinsic_exploration.py [EDITABLE — lines 179–219 only]
1: # Custom sparse-reward Atari exploration benchmark for MLS-Bench.
2: #
3: # FIXED sections: PPO loop, Atari preprocessing, policy/value architecture,
4: # evaluation, logging, and optimizer wiring.
5: # EDITABLE section: IntrinsicBonusModule + mix_advantages.
6:
7: from __fu…

Tool Input
{
  "op": "replace",
  "filename": "cleanrl/cleanrl/custom_intrinsic_exploration.py",
  "start_line": 179,
  "end_line": 219,
  "content": (Python source, rendered below)
}

# =====================================================================
# EDITABLE: intrinsic reward design
# =====================================================================
class IntrinsicBonusModule(nn.Module):
    """Ensemble Dynamics Disagreement intrinsic bonus."""

    def __init__(self, action_dim: int, device: torch.device, args: Args):
        super().__init__()
        self.action_dim = action_dim
        self.device = device
        self.args = args
        self.obs_rms = RunningMeanStd(shape=(1, 1, 84, 84))
        self.reward_rms = RunningMeanStd()
        self.discounted_reward = RewardForwardFilter(args.int_gamma)

        # Hyperparameters
        self.ensemble_size = 5
        self.feature_dim = 256
        self.hidden_dim = 256

        # Shared encoder for observations
        feature_output = 7 * 7 * 64
        self.encoder = nn.Sequential(
            layer_init(nn.Conv2d(1, 32, 8, stride=4)),
            nn.ReLU(),
            layer_init(nn.Conv2d(32, 64, 4, stride=2)),
            nn.ReLU(),
            layer_init(nn.Conv2d(64, 64, 3, stride=1)),
            nn.ReLU(),
            nn.Flatten(),
            layer_init(nn.Linear(feature_output, self.feature_dim)),
            nn.ReLU(),
        )

        # Ensemble of forward dynamics predictors
        self.forward_models = nn.ModuleList()
        for _ in range(self.ensemble_size):
            model = nn.Sequential(
                layer_init(nn.Linear(self.feature_dim + action_dim, self.hidden_dim)),
                nn.ReLU(),
                layer_init(nn.Linear(self.hidden_dim, self.hidden_dim)),
                nn.ReLU(),
                layer_init(nn.Linear(self.hidden_dim, self.feature_dim)),
            )
            self.forward_models.append(model)

    def initialize(self, envs) -> None:
        bootstrap = []
        total_steps = self.args.num_steps * self.args.num_iterations_obs_norm_init
        for _ in range(total_steps):
            random_actions = np.random.randint(0, envs.single_action_space.n, size=(self.args.num_envs,))
            sampled_obs, _, _, _ = envs.step(random_actions)
            bootstrap.append(sampled_obs[:, 3:4, :, :])
            if len(bootstrap) >= self.args.num_steps:
                stacked = np.concatenate(bootstrap, axis=0)
                self.obs_rms.update(stacked)
                bootstrap.clear()

    def trainable_parameters(self):
        return list(self.parameters())

    def _normalize_obs(self, obs: torch.Tensor) -> torch.Tensor:
        mean = torch.from_numpy(self.obs_rms.mean).to(self.device)
        var = torch.from_numpy(self.obs_rms.var).to(self.device)
        return ((last_frame(obs) - mean) / torch.sqrt(var)).clip(-5, 5).float()

    def _one_hot(self, actions: torch.Tensor) -> torch.Tensor:
        return F.one_hot(actions.long(), num_classes=self.action_dim).float()

    def update_batch_stats(self, batch_obs: torch.Tensor, batch_next_obs: torch.Tensor) -> None:
        self.obs_rms.update(last_frame(batch_next_obs).cpu().numpy())

    def compute_bonus(
        self,
        obs: torch.Tensor,
        next_obs: torch.Tensor,
        actions: torch.Tensor,
    ) -> torch.Tensor:
        # Encode current and next observations
        obs_feat = self.encoder(self._normalize_obs(obs))
        next_feat = self.encoder(self._normalize_obs(next_obs))

        # Get predictions from each ensemble member
        predictions = []
        for model in self.forward_models:
            pred = model(torch.cat([obs_feat, self._one_hot(actions)], dim=1))
            predictions.append(pred)

        # Stack predictions: [ensemble_size, batch_size, feature_dim]
        pred_stack = torch.stack(predictions, dim=0)

        # Compute variance across ensemble members
        variance = torch.var(pred_stack, dim=0)  # [batch_size, feature_dim]

        # Intrinsic reward = mean variance across feature dimensions
        bonus = variance.mean(dim=1)  # [batch_size]

        return bonus.detach()

    def normalize_rollout_rewards(self, rollout_intrinsic: torch.Tensor) -> torch.Tensor:
        discounted = np.stack(
            [self.discounted_reward.update(reward_per_step) for reward_per_step in rollout_intrinsic.cpu().numpy()],
            axis=0,
        )
        flat_discounted = discounted.reshape(-1)
        self.reward_rms.update_from_moments(
            float(flat_discounted.mean()),
            float(flat_discounted.var()),
            int(flat_discounted.size),
        )
        return rollout_intrinsic / float(np.sqrt(self.reward_rms.var + 1e-8))

    def loss(
        self,
        batch_obs: torch.Tensor,
        batch_next_obs: torch.Tensor,
        batch_actions: torch.Tensor,
    ) -> torch.Tensor:
        # Encode observations
        obs_feat = self.encoder(self._normalize_obs(batch_obs))
        next_feat = self.encoder(self._normalize_obs(batch_next_obs)).detach()

        # Compute prediction loss for each ensemble member
        total_loss = 0.0
        for model in self.forward_models:
            pred = model(torch.cat([obs_feat, self._one_hot(batch_actions)], dim=1))
            # MSE loss between predicted and target features
            loss = F.mse_loss(pred, next_feat, reduction='mean')
            total_loss += loss

        # Average loss across ensemble
        return total_loss / self.ensemble_size


def mix_advantages(ext_advantages: torch.Tensor, int_advantages: torch.Tensor, args: Args) -> torch.Tensor:
    return args.ext_coef * ext_advantages + args.int_coef * int_advantages

OK: Replaced lines 179..219 with 136 line(s) in cleanrl/cleanrl/custom_intrinsic_exploration.py. Editable range: 179–314.
[Current file: cleanrl/cleanrl/custom_intrinsic_exploration.py | editable: 179–314 | total: 587 lines]
179 # =====================================================================
180 # EDITABLE: intrinsic reward design
181 # =====================================================================
... (130 more lines) ...
312
313 def mix_advantages(ext_advantages: torch.Tensor, int_advantages: torch.Tensor, args: Args) -> torch.Tensor:
314     return args.ext_coef * ext_advantages + args.int_coef * int_advantages

[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)

### tutankham-v5 (scripts/tutankham.sh)

Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=97.0500 avg_intrinsic=0.000000 eval_return=nan nonzero_rate=nan sps=1371
TRAIN_METRICS step=9986048 avg_return=97.0500 avg_intrinsic=0.000000 eval_return=nan nonzero_rate=nan sps=1372
TRAIN_METRICS step=9990144 avg_return=97.0500 avg_intrinsic=0.000000 eval_return=nan nonzero_rate=nan sps=1372
TRAIN_METRICS step=9994240 avg_return=97.0500 avg_intrinsic=0.000000 eval_return=nan nonzero_rate=nan sps=1372
TRAIN_METRICS step=9998336 avg_return=97.0500 avg_intrinsic=0.000000 eval_return=0.0000 nonzero_rate=0.0000 sps=1337

Final evaluation:
TEST_METRICS eval_return=0.0000 auc=0.000000 nonzero_rate=0.0000 best_eval_return=0.0000

### frostbite-v5 (scripts/frostbite.sh)

Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=260.0000 avg_intrinsic=0.000000 eval_return=nan nonzero
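For reference, the intrinsic-reward normalization in the attempt above (reused unchanged by the later ones) calls two helpers defined in the fixed section. In CleanRL-style RND code they conventionally look like the following; this is a reconstruction under that assumption, not the benchmark's verbatim source:

import numpy as np

class RewardForwardFilter:
    """Running discounted sum of intrinsic rewards; used only to estimate their scale."""

    def __init__(self, gamma):
        self.rewems = None
        self.gamma = gamma

    def update(self, rews):
        self.rewems = rews if self.rewems is None else self.rewems * self.gamma + rews
        return self.rewems

class RunningMeanStd:
    """Streaming mean/variance via the parallel (Chan et al.) moment update."""

    def __init__(self, epsilon=1e-4, shape=()):
        self.mean = np.zeros(shape, np.float64)
        self.var = np.ones(shape, np.float64)
        self.count = epsilon

    def update(self, x):
        self.update_from_moments(x.mean(axis=0), x.var(axis=0), x.shape[0])

    def update_from_moments(self, batch_mean, batch_var, batch_count):
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m2 = (self.var * self.count + batch_var * batch_count
              + np.square(delta) * self.count * batch_count / total)
        self.mean, self.var, self.count = new_mean, m2 / total, total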
Tool Input
{
  "op": "replace",
  "filename": "cleanrl/cleanrl/custom_intrinsic_exploration.py",
  "start_line": 179,
  "end_line": 314,
  "content": (Python source, rendered below)
}

# =====================================================================
# EDITABLE: intrinsic reward design
# =====================================================================
class IntrinsicBonusModule(nn.Module):
    """RND with Episodic Memory intrinsic bonus."""

    def __init__(self, action_dim: int, device: torch.device, args: Args):
        super().__init__()
        self.action_dim = action_dim
        self.device = device
        self.args = args
        self.obs_rms = RunningMeanStd(shape=(1, 1, 84, 84))
        self.reward_rms = RunningMeanStd()
        self.discounted_reward = RewardForwardFilter(args.int_gamma)

        # Hyperparameters
        self.memory_size = 5000
        self.k_neighbors = 10
        self.episodic_coef = 0.1
        self.rnd_coef = 1.0

        # Episodic memory: store random features of visited states
        self.memory = deque(maxlen=self.memory_size)

        feature_output = 7 * 7 * 64
        # RND predictor network (trainable)
        self.predictor = nn.Sequential(
            layer_init(nn.Conv2d(1, 32, 8, stride=4)),
            nn.LeakyReLU(),
            layer_init(nn.Conv2d(32, 64, 4, stride=2)),
            nn.LeakyReLU(),
            layer_init(nn.Conv2d(64, 64, 3, stride=1)),
            nn.LeakyReLU(),
            nn.Flatten(),
            layer_init(nn.Linear(feature_output, 512)),
            nn.ReLU(),
            layer_init(nn.Linear(512, 512)),
            nn.ReLU(),
            layer_init(nn.Linear(512, 512)),
        )
        # RND target network (fixed random)
        self.target = nn.Sequential(
            layer_init(nn.Conv2d(1, 32, 8, stride=4)),
            nn.LeakyReLU(),
            layer_init(nn.Conv2d(32, 64, 4, stride=2)),
            nn.LeakyReLU(),
            layer_init(nn.Conv2d(64, 64, 3, stride=1)),
            nn.LeakyReLU(),
            nn.Flatten(),
            layer_init(nn.Linear(feature_output, 512)),
        )
        for param in self.target.parameters():
            param.requires_grad = False

    def initialize(self, envs) -> None:
        bootstrap = []
        total_steps = self.args.num_steps * self.args.num_iterations_obs_norm_init
        for _ in range(total_steps):
            random_actions = np.random.randint(0, envs.single_action_space.n, size=(self.args.num_envs,))
            sampled_obs, _, _, _ = envs.step(random_actions)
            bootstrap.append(sampled_obs[:, 3:4, :, :])
            if len(bootstrap) >= self.args.num_steps:
                stacked = np.concatenate(bootstrap, axis=0)
                self.obs_rms.update(stacked)
                bootstrap.clear()

    def trainable_parameters(self):
        return list(self.predictor.parameters())

    def _normalize_obs(self, obs: torch.Tensor) -> torch.Tensor:
        mean = torch.from_numpy(self.obs_rms.mean).to(self.device)
        var = torch.from_numpy(self.obs_rms.var).to(self.device)
        return ((last_frame(obs) - mean) / torch.sqrt(var)).clip(-5, 5).float()

    def update_batch_stats(self, batch_obs: torch.Tensor, batch_next_obs: torch.Tensor) -> None:
        self.obs_rms.update(last_frame(batch_next_obs).cpu().numpy())

    def compute_bonus(
        self,
        obs: torch.Tensor,
        next_obs: torch.Tensor,
        actions: torch.Tensor,
    ) -> torch.Tensor:
        norm_next = self._normalize_obs(next_obs)

        # RND bonus: prediction error
        target_feature = self.target(norm_next)
        predict_feature = self.predictor(norm_next)
        rnd_bonus = ((target_feature - predict_feature).pow(2).sum(1) / 2).detach()

        # Episodic bonus: distance to nearest neighbors in memory
        with torch.no_grad():
            target_feature_np = target_feature.cpu().numpy()
            if len(self.memory) > self.k_neighbors:
                # Convert memory to array
                mem_array = np.array(list(self.memory))
                # Compute distances (batch_size, memory_size)
                dists = np.sum((target_feature_np[:, None, :] - mem_array[None, :, :]) ** 2, axis=2)
                # Find k nearest distances
                knn_dists = np.partition(dists, kth=self.k_neighbors-1, axis=1)[:, :self.k_neighbors]
                episodic_bonus = np.mean(knn_dists, axis=1)
            else:
                episodic_bonus = np.ones(target_feature_np.shape[0])

            episodic_bonus = torch.from_numpy(episodic_bonus).float().to(self.device)

            # Update memory with target features
            for feat in target_feature_np:
                self.memory.append(feat)

        # Combined bonus
        bonus = self.rnd_coef * rnd_bonus + self.episodic_coef * episodic_bonus
        return bonus

    def normalize_rollout_rewards(self, rollout_intrinsic: torch.Tensor) -> torch.Tensor:
        discounted = np.stack(
            [self.discounted_reward.update(reward_per_step) for reward_per_step in rollout_intrinsic.cpu().numpy()],
            axis=0,
        )
        flat_discounted = discounted.reshape(-1)
        self.reward_rms.update_from_moments(
            float(flat_discounted.mean()),
            float(flat_discounted.var()),
            int(flat_discounted.size),
        )
        return rollout_intrinsic / float(np.sqrt(self.reward_rms.var + 1e-8))

    def loss(
        self,
        batch_obs: torch.Tensor,
        batch_next_obs: torch.Tensor,
        batch_actions: torch.Tensor,
    ) -> torch.Tensor:
        norm_next = self._normalize_obs(batch_next_obs)
        predict_feature = self.predictor(norm_next)
        target_feature = self.target(norm_next).detach()
        forward_loss = F.mse_loss(predict_feature, target_feature, reduction="none").mean(-1)
        mask = (torch.rand(len(forward_loss), device=self.device) < self.args.update_proportion).float()
        return (forward_loss * mask).sum() / torch.clamp(mask.sum(), min=1.0)


def mix_advantages(ext_advantages: torch.Tensor, int_advantages: torch.Tensor, args: Args) -> torch.Tensor:
    return args.ext_coef * ext_advantages + args.int_coef * int_advantages

OK: Replaced lines 179..314 with 143 line(s) in cleanrl/cleanrl/custom_intrinsic_exploration.py. Editable range: 179–321.
[Current file: cleanrl/cleanrl/custom_intrinsic_exploration.py | editable: 179–321 | total: 594 lines]
179 # =====================================================================
180 # EDITABLE: intrinsic reward design
181 # =====================================================================
... (137 more lines) ...
319
320 def mix_advantages(ext_advantages: torch.Tensor, int_advantages: torch.Tensor, args: Args) -> torch.Tensor:
321     return args.ext_coef * ext_advantages + args.int_coef * int_advantages

[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final)

### tutankham-v5 (scripts/tutankham.sh)

Training metrics (last steps):
TRAIN_METRICS step=2715648 avg_return=109.1000 avg_intrinsic=0.001810 eval_return=nan nonzero_rate=nan sps=95
TRAIN_METRICS step=2719744 avg_return=107.8500 avg_intrinsic=0.001500 eval_return=nan nonzero_rate=nan sps=95
TRAIN_METRICS step=2723840 avg_return=107.9000 avg_intrinsic=0.002252 eval_return=nan nonzero_rate=nan sps=95
TRAIN_METRICS step=2727936 avg_return=106.5500 avg_intrinsic=0.001654 eval_return=nan nonzero_rate=nan sps=95
TRAIN_METRICS step=2732032 avg_return=106.4000 avg_intrinsic=0.001586 eval_return=nan nonzero_rate=nan sps=95

### frostbite-v5 (scripts/frostbite.sh)

Training metrics (last steps):
TRAIN_METRICS step=2682880 avg_return=244.5000 avg_intrinsic=0.014342 eval_return=nan nonzero_rate=nan sps=93
TRAIN_METRICS step=2686976 avg_return=244.5000 avg_intrinsic=0.016734 eval_return=nan nonzero_rate=nan
Tool Input
{
  "op": "replace",
  "filename": "cleanrl/cleanrl/custom_intrinsic_exploration.py",
  "start_line": 179,
  "end_line": 321,
  "content": (Python source, rendered below)
}

# =====================================================================
# EDITABLE: intrinsic reward design
# =====================================================================
class IntrinsicBonusModule(nn.Module):
    """Hashed Random Network Distillation with count-based scaling."""

    def __init__(self, action_dim: int, device: torch.device, args: Args):
        super().__init__()
        self.action_dim = action_dim
        self.device = device
        self.args = args
        self.obs_rms = RunningMeanStd(shape=(1, 1, 84, 84))
        self.reward_rms = RunningMeanStd()
        self.discounted_reward = RewardForwardFilter(args.int_gamma)

        # Hyperparameters
        self.projection_dim = 16
        self.bins_per_dim = 10
        self.count_clip = 100

        # Count dictionary: map hash string to count
        self.counts = {}
        self.total_visits = 0

        feature_output = 7 * 7 * 64
        # RND predictor network (trainable)
        self.predictor = nn.Sequential(
            layer_init(nn.Conv2d(1, 32, 8, stride=4)),
            nn.LeakyReLU(),
            layer_init(nn.Conv2d(32, 64, 4, stride=2)),
            nn.LeakyReLU(),
            layer_init(nn.Conv2d(64, 64, 3, stride=1)),
            nn.LeakyReLU(),
            nn.Flatten(),
            layer_init(nn.Linear(feature_output, 512)),
            nn.ReLU(),
            layer_init(nn.Linear(512, 512)),
            nn.ReLU(),
            layer_init(nn.Linear(512, 512)),
        )
        # RND target network (fixed random)
        self.target = nn.Sequential(
            layer_init(nn.Conv2d(1, 32, 8, stride=4)),
            nn.LeakyReLU(),
            layer_init(nn.Conv2d(32, 64, 4, stride=2)),
            nn.LeakyReLU(),
            layer_init(nn.Conv2d(64, 64, 3, stride=1)),
            nn.LeakyReLU(),
            nn.Flatten(),
            layer_init(nn.Linear(feature_output, 512)),
        )
        for param in self.target.parameters():
            param.requires_grad = False

        # Random projection matrix: project 512-dim feature to projection_dim
        self.register_buffer('proj_matrix', torch.randn(512, self.projection_dim) * 0.1)
        # Bin boundaries: assume features are roughly zero-mean, unit variance
        self.bin_boundaries = torch.linspace(-2.0, 2.0, steps=self.bins_per_dim + 1, device=device)

    def initialize(self, envs) -> None:
        bootstrap = []
        total_steps = self.args.num_steps * self.args.num_iterations_obs_norm_init
        for _ in range(total_steps):
            random_actions = np.random.randint(0, envs.single_action_space.n, size=(self.args.num_envs,))
            sampled_obs, _, _, _ = envs.step(random_actions)
            bootstrap.append(sampled_obs[:, 3:4, :, :])
            if len(bootstrap) >= self.args.num_steps:
                stacked = np.concatenate(bootstrap, axis=0)
                self.obs_rms.update(stacked)
                bootstrap.clear()

    def trainable_parameters(self):
        return list(self.predictor.parameters())

    def _normalize_obs(self, obs: torch.Tensor) -> torch.Tensor:
        mean = torch.from_numpy(self.obs_rms.mean).to(self.device)
        var = torch.from_numpy(self.obs_rms.var).to(self.device)
        return ((last_frame(obs) - mean) / torch.sqrt(var)).clip(-5, 5).float()

    def update_batch_stats(self, batch_obs: torch.Tensor, batch_next_obs: torch.Tensor) -> None:
        self.obs_rms.update(last_frame(batch_next_obs).cpu().numpy())

    def _compute_hash(self, features: torch.Tensor) -> list:
        """Convert batch of features to list of hash strings."""
        # Project to lower dimension
        projected = torch.mm(features, self.proj_matrix)  # [batch, projection_dim]
        # Discretize each dimension
        bins = torch.bucketize(projected, self.bin_boundaries)  # [batch, projection_dim]
        # Convert to hash strings
        hash_list = []
        for i in range(bins.shape[0]):
            key = tuple(bins[i].cpu().numpy())
            hash_list.append(str(key))
        return hash_list

    def compute_bonus(
        self,
        obs: torch.Tensor,
        next_obs: torch.Tensor,
        actions: torch.Tensor,
    ) -> torch.Tensor:
        norm_next = self._normalize_obs(next_obs)

        # RND bonus: prediction error
        target_feature = self.target(norm_next)  # [batch, 512]
        predict_feature = self.predictor(norm_next)
        rnd_bonus = ((target_feature - predict_feature).pow(2).sum(1) / 2).detach()  # [batch]

        # Count-based scaling: compute hash of target feature
        with torch.no_grad():
            hash_list = self._compute_hash(target_feature)
            count_bonus = []
            for h in hash_list:
                cnt = self.counts.get(h, 0)
                # Update count (clipped)
                self.counts[h] = min(cnt + 1, self.count_clip)
                # Scaling factor: inverse square root of count
                scale = 1.0 / (np.sqrt(cnt + 1))
                count_bonus.append(scale)
            count_bonus = torch.tensor(count_bonus, device=self.device, dtype=torch.float32)

        # Combined bonus: RND error weighted by novelty
        bonus = rnd_bonus * count_bonus
        return bonus

    def normalize_rollout_rewards(self, rollout_intrinsic: torch.Tensor) -> torch.Tensor:
        discounted = np.stack(
            [self.discounted_reward.update(reward_per_step) for reward_per_step in rollout_intrinsic.cpu().numpy()],
            axis=0,
        )
        flat_discounted = discounted.reshape(-1)
        self.reward_rms.update_from_moments(
            float(flat_discounted.mean()),
            float(flat_discounted.var()),
            int(flat_discounted.size),
        )
        return rollout_intrinsic / float(np.sqrt(self.reward_rms.var + 1e-8))

    def loss(
        self,
        batch_obs: torch.Tensor,
        batch_next_obs: torch.Tensor,
        batch_actions: torch.Tensor,
    ) -> torch.Tensor:
        norm_next = self._normalize_obs(batch_next_obs)
        predict_feature = self.predictor(norm_next)
        target_feature = self.target(norm_next).detach()
        forward_loss = F.mse_loss(predict_feature, target_feature, reduction="none").mean(-1)
        mask = (torch.rand(len(forward_loss), device=self.device) < self.args.update_proportion).float()
        return (forward_loss * mask).sum() / torch.clamp(mask.sum(), min=1.0)


def mix_advantages(ext_advantages: torch.Tensor, int_advantages: torch.Tensor, args: Args) -> torch.Tensor:
    return args.ext_coef * ext_advantages + args.int_coef * int_advantages

OK: Replaced lines 179..321 with 154 line(s) in cleanrl/cleanrl/custom_intrinsic_exploration.py. Editable range: 179–332.
[Current file: cleanrl/cleanrl/custom_intrinsic_exploration.py | editable: 179–332 | total: 605 lines]
179 # =====================================================================
180 # EDITABLE: intrinsic reward design
181 # =====================================================================
... (148 more lines) ...
330
331 def mix_advantages(ext_advantages: torch.Tensor, int_advantages: torch.Tensor, args: Args) -> torch.Tensor:
332     return args.ext_coef * ext_advantages + args.int_coef * int_advantages

[Test #3] (0 tests remaining; call submit(n=N) to choose which test result to submit as final)

[NOTE] This was your last test. You MUST now call submit(n=X) to choose which test result to submit as your final answer.

### tutankham-v5 (scripts/tutankham.sh)

Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=130.4500 avg_intrinsic=0.000943 eval_return=nan nonzero_rate=nan sps=881
TRAIN_METRICS step=9986048 avg_return=130.4500 avg_intrinsic=0.001109 eval_return=nan nonzero_rate=nan sps=881
TRAIN_METRICS step=9990144 avg_return=130.4500 avg_intrinsic=0.001515 eval_return=nan nonzero_rate=nan sps=881
TRAIN_METRICS step=9994240 avg_return=131.2500 avg_intrinsic=0.001152 eval_return=nan nonzero_rate=nan sps=882
TRAIN_METRICS step=9998336 avg_return=132.0500 avg_intrinsic=0.000747 eval_return=104.4000 nonzero_rate=1.0000 sps=859

Final evaluation:
TEST_METRICS eval_return=104.4000 auc=49.609785 nonzero_rate=1.0000 best_eval_return=106.2000

### frostbite-v5 (scripts/frostbit
Tool Input
{
  "n": 1
}

[submit] Submitting result from test #1 as final.

### tutankham-v5 (scripts/tutankham.sh)

Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=97.0500 avg_intrinsic=0.000000 eval_return=nan nonzero_rate=nan sps=1371
TRAIN_METRICS step=9986048 avg_return=97.0500 avg_intrinsic=0.000000 eval_return=nan nonzero_rate=nan sps=1372
TRAIN_METRICS step=9990144 avg_return=97.0500 avg_intrinsic=0.000000 eval_return=nan nonzero_rate=nan sps=1372
TRAIN_METRICS step=9994240 avg_return=97.0500 avg_intrinsic=0.000000 eval_return=nan nonzero_rate=nan sps=1372
TRAIN_METRICS step=9998336 avg_return=97.0500 avg_intrinsic=0.000000 eval_return=0.0000 nonzero_rate=0.0000 sps=1337

Final evaluation:
TEST_METRICS eval_return=0.0000 auc=0.000000 nonzero_rate=0.0000 best_eval_return=0.0000

### frostbite-v5 (scripts/frostbite.sh)

Training metrics (last steps):
TRAIN_METRICS step=9981952 avg_return=260.0000 avg_intrinsic=0.000000 eval_return=nan nonzero_rate=nan sps=2096
TRAIN_METRICS step=9986048