Agent Conversation: gpt-5.4-pro
Task: rl-value-atari
Online RL: Value-Based Methods for Visual Control (Atari)
Objective
Design and implement a value-based RL algorithm for visual/Atari environments using CNN feature extraction. Your code goes in custom_value_atari.py. Three reference implementations (DQN, DoubleDQN, C51) are provided as read-only.
Background
Atari games require learning from raw pixel observations (84x84 grayscale, 4 stacked frames). Value-based methods must learn effective visual representations alongside Q-value estimation. Key challenges include high-dimensional observations, sparse rewards, and memory-efficient experience replay. Different approaches address these through distributional value functions, frame stacking, or architecture innovations.
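The fixed encoder itself is not reproduced in this excerpt, but the standard Nature-DQN convolutional stack gives a concrete sense of how (4, 84, 84) stacked frames are reduced to a feature vector. The sketch below assumes that architecture; the class name `NatureCNNEncoder` and the 512-dimensional output are illustrative rather than taken from the provided code.

```python
import torch
import torch.nn as nn


class NatureCNNEncoder(nn.Module):
    """Nature-DQN-style encoder: (4, 84, 84) stacked frames -> a flat feature vector."""

    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # (4, 84, 84) -> (32, 20, 20)
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> (64, 9, 9)
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> (64, 7, 7)
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, feature_dim),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Atari pixels arrive as uint8 in [0, 255]; scale before the conv stack.
        return self.net(x.float() / 255.0)


if __name__ == "__main__":
    frames = torch.randint(0, 256, (2, 4, 84, 84), dtype=torch.uint8)
    print(NatureCNNEncoder()(frames).shape)  # torch.Size([2, 512])
```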
Constraints
- Network architecture dimensions are FIXED and cannot be modified
- Total parameter count is enforced at runtime (a quick checking sketch follows this list)
- Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
- Do NOT simply copy a reference implementation with minor changes
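As a rough illustration of the parameter-count constraint above, a budget can be checked with a one-liner over `model.parameters()`. The budget value and the quantile-head shape below are made up for the example; the actual limit is enforced by the evaluation harness.

```python
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Total number of parameters in a module (trainable and frozen)."""
    return sum(p.numel() for p in model.parameters())


# Hypothetical check: fail fast if a new head exceeds an assumed budget.
PARAM_BUDGET = 5_000_000  # illustrative; the real limit is enforced at runtime
head = nn.Linear(512, 4 * 200)  # e.g. a quantile head: 4 actions x 200 quantiles
assert count_parameters(head) <= PARAM_BUDGET, "parameter budget exceeded"
print(count_parameters(head))  # 410400 = 512 * 800 weights + 800 biases
```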
Evaluation
Agents are trained and evaluated on Breakout, Pong, and BeamRider. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).
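The fixed evaluation loop is not shown in this excerpt; the following sketch only illustrates the reported metric, assuming a Gymnasium-style environment and a greedy `select_action` callable (both names are placeholders).

```python
import numpy as np


def evaluate(env, select_action, n_episodes: int = 10, seed: int = 0) -> float:
    """Mean episodic return over greedy rollouts (the metric reported below)."""
    returns = []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, ep_return = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(select_action(obs))
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
    return float(np.mean(returns))


# Usage sketch: evaluate(wrapped_env, policy) once a trained policy (obs -> action) exists.
```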
cleanrl/cleanrl/custom_value_atari.py [EDITABLE — lines 186–249 only]
1: # Custom value-based RL algorithm for Atari -- MLS-Bench
2: #
3: # EDITABLE section: QNetwork head and ValueAlgorithm classes.
4: # FIXED sections: everything else (config, env, buffer, encoder, eval, training loop).
5: import os
6: import random
7: import time
8: from dataclasses import dataclass
9:
10: import gymnasium as gym
11: import numpy as np
12: import torch
13: import torch.nn as nn
14: import torch.nn.functional as F
15: import torch.optim as optim
16: import tyro
17:
18: from cleanrl_utils.atari_wrappers import (
19:     ClipRewardEnv,
20:     EpisodicLifeEnv,
21:     FireResetEnv,
22:     MaxAndSkipEnv,
23:     NoopResetEnv,
24: )
25: from cleanrl_utils.buffers import ReplayBuffer
26:
27:
28: # =====================================================================
29: # FIXED: Configuration
30: # =====================================================================
31: @dataclass
32: class Args:
33:     exp_name: str = os.path.basename(__file__)[: -len(".py")]
34:     """the name of this experiment"""
35:     seed: int = 1
36:     """seed of the experiment"""
37:     torch_deterministic: bool = True
38:     """if toggled, `torch.backends.cudnn.deterministic=False`"""
39:     cuda: bool = True
40:     """if toggled, cuda will be enabled by default"""
41:
42:     # Algorithm specific arguments
43:     env_id: str = "BreakoutNoFrameskip-v4"
44: """the id of the environment""Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_value_atari.py",
4 "content": "class QNetwork(nn.Module):\n \"\"\"Dueling quantile Q-network with a fixed Nature-DQN encoder.\"\"\"\n\n def __init__(self, envs, n_quantiles=101):\n super().__init__()\n self.n = envs.single_action_space.n\n self.n_quantiles = n_quantiles\n self.encoder = NatureDQNEncoder()\n self.adv_head = nn.Linear(ENCODER_FEATURE_DIM, self.n * n_quantiles)\n self.val_head = nn.Linear(ENCODER_FEATURE_DIM, n_quantiles)\n\n def get_quantiles(self, x):\n features = self.encoder(x)\n adv = self.adv_head(features).view(len(x), self.n, self.n_quantiles)\n val = self.val_head(features).view(len(x), 1, self.n_quantiles)\n return val + adv - adv.mean(dim=1, keepdim=True)\n\n def forward(self, x):\n return self.get_quantiles(x).mean(dim=2)\n\n\nclass ValueAlgorithm:\n \"\"\"Entropy-adaptive Munchausen Double QR-DQN with dueling quantiles.\"\"\"\n\n def __init__(self, envs, device, args):\n self.device = device\n self.gamma = args.gamma\n self.sync_tau = args.tau\n self.target_network_frequency = args.target_network_frequency\n self.n_actions = envs.single_action_space.n\n self.n_quantiles = 101\n self.kappa = 1.0\n self.entropy_temp = 0.03\n self.m_alpha = 0.9\n self.log_pi_min = -1.0\n\n self.q_network = QNetwork(envs, n_quantiles=self.n_quantiles).to(device)\n self.target_network = QNetwork(envs, n_quantiles=self.n_quantiles).to(device)\n self.target_network.load_state_dict(self.q_network.state_dict())\n self.optimizer = optim.Adam(self.q_network.parameters(), lr=args.learning_rate, eps=0.01 / args.batch_size)\n\n tau = torch.arange(1, self.n_quantiles + 1, dtype=torch.float32, device=device)\n self.quantile_tau = (2 * tau - 1) / (2 * self.n_quantiles)\n\n def _policy(self, q_values):\n logits = q_values / self.entropy_temp\n logits = logits - logits.max(dim=1, keepdim=True).values\n log_pi = F.log_softmax(logits, dim=1)\n return log_pi.exp(), log_pi\n\n def select_action(self, obs, epsilon):\n del epsilon\n q_values = self.q_network(torch.as_tensor(obs, device=self.device))\n return torch.argmax(q_values, dim=1).cpu().numpy()\n\n def update(self, batch, global_step):\n actions = batch.actions.long().flatten()\n rewards = batch.rewards.flatten()\n dones = batch.dones.flatten()\n batch_idx = torch.arange(actions.shape[0], device=self.device)\n\n with torch.no_grad():\n next_q_online = self.q_network(batch.next_observations)\n next_pi, next_log_pi = self._policy(next_q_online)\n next_quantiles_all = self.target_network.get_quantiles(batch.next_observations)\n next_actions = next_q_online.argmax(dim=1)\n greedy_quantiles = next_quantiles_all[batch_idx, next_actions]\n soft_quantiles = (next_pi.unsqueeze(-1) * (next_quantiles_all - self.entropy_temp * next_log_pi.unsqueeze(-1))).sum(dim=1)\n entropy = -(next_pi * next_log_pi).sum(dim=1, keepdim=True) / np.log(self.n_actions)\n mix = (1.0 - entropy).clamp(0.0, 1.0)\n target_quantiles = mix * greedy_quantiles + (1.0 - mix) * soft_quantiles\n\n curr_q = self.q_network(batch.observations)\n _, curr_log_pi = self._policy(curr_q)\n munchausen = curr_log_pi.gather(1, actions.unsqueeze(1)).squeeze(1).clamp(min=self.log_pi_min, max=0.0)\n target_quantiles = (rewards + self.m_alpha * self.entropy_temp * munchausen).unsqueeze(1) + self.gamma * (1.0 - dones).unsqueeze(1) * target_quantiles\n\n current_quantiles = self.q_network.get_quantiles(batch.observations)[batch_idx, actions]\n td_errors = target_quantiles.unsqueeze(1) - current_quantiles.unsqueeze(2)\n abs_td = td_errors.abs()\n huber = torch.where(abs_td <= self.kappa, 0.5 * td_errors.pow(2), self.kappa * 
(abs_td - 0.5 * self.kappa))\n tau = self.quantile_tau.view(1, self.n_quantiles, 1)\n loss = (torch.abs(tau - (td_errors.detach() < 0).float()) * huber).mean()\n\n self.optimizer.zero_grad()\n loss.backward()\n nn.utils.clip_grad_norm_(self.q_network.parameters(), 10.0)\n self.optimizer.step()\n\n if global_step % self.target_network_frequency == 0:\n for target_param, q_param in zip(self.target_network.parameters(), self.q_network.parameters()):\n target_param.data.copy_(self.sync_tau * q_param.data + (1.0 - self.sync_tau) * target_param.data)\n\n return {\n \"td_loss\": loss.item(),\n \"q_values\": current_quantiles.mean().item(),\n \"target_mix\": mix.mean().item(),\n \"entropy\": entropy.mean().item(),\n }\n",
5 "after_line": 0,
6 "start_line": 186,
7 "end_line": 249
8}
OK: Replaced lines 186..249 with 99 line(s) in cleanrl/cleanrl/custom_value_atari.py. Editable range: 186–284.
[Current file: cleanrl/cleanrl/custom_value_atari.py | editable: 186–284 | total: 394 lines]
183| # =====================================================================
184| # EDITABLE: QNetwork head and ValueAlgorithm
185| # =====================================================================
186 class QNetwork(nn.Module):
187     """Dueling quantile Q-network with a fixed Nature-DQN encoder."""
188
189     def __init__(self, envs, n_quantiles=101):
190         super().__init__()
191         self.n = envs.single_action_space.n
192         self.n_quantiles = n_quantiles
193         self.encoder = NatureDQNEncoder()
194         self.adv_head = nn.Linear(ENCODER_FEATURE_DIM, self.n * n_quantiles)
195         self.val_head = nn.Linear(ENCODER_FEATURE_DIM, n_quantiles)
196
197     def get_quantiles(self, x):
198
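Since the algorithm above arrives as an escaped string inside the tool call, here is a minimal standalone sketch of the quantile-regression Huber loss that this update (and the later attempts) builds on. The pairwise broadcasting and the `kappa` default mirror the tool-call code; the function name and random example tensors are illustrative only.

```python
import torch


def quantile_huber_loss(pred: torch.Tensor, target: torch.Tensor, kappa: float = 1.0) -> torch.Tensor:
    """QR-DQN loss between predicted quantiles (B, N) and target quantiles (B, M).

    Each pairwise TD error is weighted asymmetrically by the predicted quantile
    level tau and smoothed with a Huber kernel of width kappa.
    """
    n = pred.shape[1]
    tau = (2 * torch.arange(1, n + 1, device=pred.device, dtype=pred.dtype) - 1) / (2 * n)
    td = target.unsqueeze(1) - pred.unsqueeze(2)                      # (B, N, M)
    huber = torch.where(td.abs() <= kappa, 0.5 * td.pow(2), kappa * (td.abs() - 0.5 * kappa))
    weight = torch.abs(tau.view(1, n, 1) - (td.detach() < 0).float())
    return (weight * huber).mean()


pred = torch.randn(32, 101)    # predicted quantiles for the taken actions
target = torch.randn(32, 101)  # bootstrapped target quantiles
print(quantile_huber_loss(pred, target))
```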
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)

## Seed 42

### breakout-v4 (scripts/breakout.sh)
Training metrics (last steps):
TRAIN_METRICS step=3795000 td_loss=0.0032 q_values=1.3641 target_mix=0.3461 entropy=0.6539
TRAIN_METRICS step=3796000 td_loss=0.0025 q_values=1.3234 target_mix=0.3908 entropy=0.6092
TRAIN_METRICS step=3797000 td_loss=0.0037 q_values=1.1965 target_mix=0.4326 entropy=0.5674
TRAIN_METRICS step=3798000 td_loss=0.0031 q_values=1.3204 target_mix=0.4054 entropy=0.5946
TRAIN_METRICS step=3799000 td_loss=0.0031 q_values=1.3250 target_mix=0.3515 entropy=0.6485
Evaluation (breakout-v4):
Eval episodic_return: 32.90
Eval episodic_return: 15.40
Eval episodic_return: 4.40
Final eval return: 4.40

### seaquest-v4 (scripts/seaquest.sh)
Training metrics (last steps):
TRAIN_METRICS step=4995000 td_loss=0.0033 q_values=6.1287 target_mix=0.0443 entropy=0.9557
TRAIN_METRICS step=4996000 td_loss=0.0015 q_values=5.8585 target_mix=0.0460 entropy=0.9540
TRAIN_M
Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_value_atari.py",
4 "content": "class QNetwork(nn.Module):\n \"\"\"Quantile Q-network with a tail-trimmed action-value readout.\"\"\"\n\n def __init__(self, envs, n_quantiles=200, trim_frac=0.10):\n super().__init__()\n self.n = envs.single_action_space.n\n self.n_quantiles = n_quantiles\n self.trim = int(n_quantiles * trim_frac)\n self.encoder = NatureDQNEncoder()\n self.head = nn.Linear(ENCODER_FEATURE_DIM, self.n * n_quantiles)\n\n def get_quantiles(self, x):\n features = self.encoder(x)\n return self.head(features).view(len(x), self.n, self.n_quantiles)\n\n def aggregate_quantiles(self, quantiles):\n sorted_quantiles = torch.sort(quantiles, dim=2).values\n mean_q = sorted_quantiles.mean(dim=2)\n if self.trim > 0 and 2 * self.trim < self.n_quantiles:\n trimmed_q = sorted_quantiles[:, :, self.trim : self.n_quantiles - self.trim].mean(dim=2)\n return 0.5 * mean_q + 0.5 * trimmed_q\n return mean_q\n\n def forward(self, x):\n return self.aggregate_quantiles(self.get_quantiles(x))\n\n\nclass ValueAlgorithm:\n \"\"\"Ordered Double QR-DQN with tail-trimmed action values.\n\n Novelty:\n 1) action selection uses a robust tail-trimmed quantile aggregation instead of a plain mean;\n 2) quantile predictions are trained with an ordering penalty to reduce quantile crossing.\n \"\"\"\n\n def __init__(self, envs, device, args):\n self.device = device\n self.gamma = args.gamma\n self.tau = args.tau\n self.target_network_frequency = args.target_network_frequency\n self.n_quantiles = 200\n self.kappa = 1.0\n self.order_weight = 1e-3\n\n self.q_network = QNetwork(envs, n_quantiles=self.n_quantiles, trim_frac=0.10).to(device)\n self.target_network = QNetwork(envs, n_quantiles=self.n_quantiles, trim_frac=0.10).to(device)\n self.target_network.load_state_dict(self.q_network.state_dict())\n self.optimizer = optim.Adam(self.q_network.parameters(), lr=args.learning_rate, eps=0.01 / args.batch_size)\n\n tau = torch.arange(1, self.n_quantiles + 1, dtype=torch.float32, device=device)\n self.quantile_tau = (2 * tau - 1) / (2 * self.n_quantiles)\n\n def select_action(self, obs, epsilon):\n del epsilon\n q_values = self.q_network(torch.as_tensor(obs, device=self.device))\n return torch.argmax(q_values, dim=1).cpu().numpy()\n\n def update(self, batch, global_step):\n batch_size = batch.actions.shape[0]\n batch_idx = torch.arange(batch_size, device=self.device)\n actions = batch.actions.long().flatten()\n rewards = batch.rewards.flatten()\n dones = batch.dones.flatten()\n\n current_quantiles_all = self.q_network.get_quantiles(batch.observations)\n current_quantiles = current_quantiles_all[batch_idx, actions]\n\n with torch.no_grad():\n next_quantiles_online = self.q_network.get_quantiles(batch.next_observations)\n next_q_online = self.q_network.aggregate_quantiles(next_quantiles_online)\n next_actions = next_q_online.argmax(dim=1)\n\n next_quantiles_target = self.target_network.get_quantiles(batch.next_observations)\n target_quantiles = next_quantiles_target[batch_idx, next_actions]\n target_quantiles = rewards.unsqueeze(1) + self.gamma * (1.0 - dones).unsqueeze(1) * target_quantiles\n\n td_errors = target_quantiles.unsqueeze(1) - current_quantiles.unsqueeze(2)\n abs_td = td_errors.abs()\n huber = torch.where(\n abs_td <= self.kappa,\n 0.5 * td_errors.pow(2),\n self.kappa * (abs_td - 0.5 * self.kappa),\n )\n tau = self.quantile_tau.view(1, self.n_quantiles, 1)\n qr_loss = (torch.abs(tau - (td_errors.detach() < 0).float()) * huber).mean()\n\n order_violation = F.relu(current_quantiles_all[:, :, :-1] - current_quantiles_all[:, :, 1:])\n 
order_loss = order_violation.pow(2).mean()\n loss = qr_loss + self.order_weight * order_loss\n\n self.optimizer.zero_grad()\n loss.backward()\n nn.utils.clip_grad_norm_(self.q_network.parameters(), 10.0)\n self.optimizer.step()\n\n if global_step % self.target_network_frequency == 0:\n for target_param, q_param in zip(self.target_network.parameters(), self.q_network.parameters()):\n target_param.data.copy_(self.tau * q_param.data + (1.0 - self.tau) * target_param.data)\n\n q_values = self.q_network.aggregate_quantiles(current_quantiles_all)\n chosen_q = q_values[batch_idx, actions]\n return {\n \"td_loss\": loss.item(),\n \"qr_loss\": qr_loss.item(),\n \"order_loss\": order_loss.item(),\n \"q_values\": chosen_q.mean().item(),\n }\n",
5 "after_line": 0,
6 "start_line": 186,
7 "end_line": 284
8}
OK: Replaced lines 186..284 with 107 line(s) in cleanrl/cleanrl/custom_value_atari.py. Editable range: 186–292.
[Current file: cleanrl/cleanrl/custom_value_atari.py | editable: 186–292 | total: 402 lines]
183| # =====================================================================
184| # EDITABLE: QNetwork head and ValueAlgorithm
185| # =====================================================================
186 class QNetwork(nn.Module):
187     """Quantile Q-network with a tail-trimmed action-value readout."""
188
189     def __init__(self, envs, n_quantiles=200, trim_frac=0.10):
190         super().__init__()
191         self.n = envs.single_action_space.n
192         self.n_quantiles = n_quantiles
193         self.trim = int(n_quantiles * trim_frac)
194         self.encoder = NatureDQNEncoder()
195         self.head = nn.Linear(ENCODER_FEATURE_DIM, self.n * n_quantiles)
196
197     def get_quantiles(self, x):
198
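The two claimed novelties of this attempt, tail-trimmed action-value aggregation and a quantile-ordering penalty, are easier to see outside the escaped string. The sketch below restates them as standalone functions; the 0.5/0.5 blend and the ReLU-squared crossing penalty follow the tool-call code, while the function names and the example tensor are illustrative.

```python
import torch
import torch.nn.functional as F


def trimmed_mean_q(quantiles: torch.Tensor, trim: int) -> torch.Tensor:
    """Blend the plain mean with a tail-trimmed mean of sorted quantiles (B, A, N) -> (B, A)."""
    sorted_q = torch.sort(quantiles, dim=2).values
    mean_q = sorted_q.mean(dim=2)
    if trim > 0 and 2 * trim < quantiles.shape[2]:
        trimmed_q = sorted_q[:, :, trim:-trim].mean(dim=2)
        return 0.5 * mean_q + 0.5 * trimmed_q
    return mean_q


def ordering_penalty(quantiles: torch.Tensor) -> torch.Tensor:
    """Penalise quantile crossing: adjacent quantile predictions should be non-decreasing."""
    violation = F.relu(quantiles[:, :, :-1] - quantiles[:, :, 1:])
    return violation.pow(2).mean()


q = torch.randn(8, 4, 200)  # (batch, actions, quantiles), illustrative; trim=20 matches trim_frac=0.10
print(trimmed_mean_q(q, trim=20).shape)  # torch.Size([8, 4])
print(ordering_penalty(q))
```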
[Test #2] (1 test remaining; call submit(n=2) to submit this result)

### breakout-v4 (scripts/breakout.sh)
Training metrics (last steps):
TRAIN_METRICS step=3163000 td_loss=0.0047 qr_loss=0.0047 order_loss=0.0004 q_values=1.1888
TRAIN_METRICS step=3164000 td_loss=0.0054 qr_loss=0.0054 order_loss=0.0003 q_values=1.1429
TRAIN_METRICS step=3165000 td_loss=0.0068 qr_loss=0.0068 order_loss=0.0003 q_values=1.1546
TRAIN_METRICS step=3166000 td_loss=0.0049 qr_loss=0.0049 order_loss=0.0003 q_values=1.0769
TRAIN_METRICS step=3167000 td_loss=0.0044 qr_loss=0.0044 order_loss=0.0003 q_values=1.1700
Evaluation (breakout-v4):
Eval episodic_return: 7.30
Eval episodic_return: 3.00
Eval episodic_return: 6.00
Final eval return: 6.00

### seaquest-v4 (scripts/seaquest.sh)
Training metrics (last steps):
TRAIN_METRICS step=4995000 td_loss=0.0031 qr_loss=0.0031 order_loss=0.0002 q_values=2.8647
TRAIN_METRICS step=4996000 td_loss=0.0038 qr_loss=0.0038 order_loss=0.0002 q_values=2.8166
TRAIN_METRICS step=4997
Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_value_atari.py",
4 "content": "class QNetwork(nn.Module):\n \"\"\"Quantile Q-network: fixed Nature-DQN encoder + quantile head.\"\"\"\n\n def __init__(self, envs, n_quantiles=200):\n super().__init__()\n self.n_quantiles = n_quantiles\n self.n = envs.single_action_space.n\n self.encoder = NatureDQNEncoder()\n self.head = nn.Linear(ENCODER_FEATURE_DIM, self.n * n_quantiles)\n\n def forward(self, x):\n quantiles = self.get_quantiles(x)\n return quantiles.mean(dim=2)\n\n def get_quantiles(self, x):\n features = self.encoder(x)\n return self.head(features).view(len(x), self.n, self.n_quantiles)\n\n\nclass ValueAlgorithm:\n \"\"\"Double QR-DQN.\n\n Novelty vs. the reference QR-DQN: the online network selects the bootstrap\n action while the target network evaluates its quantile distribution. This\n brings Double-Q bias reduction into distributional quantile regression.\n \"\"\"\n\n def __init__(self, envs, device, args):\n self.device = device\n self.gamma = args.gamma\n self.tau = args.tau\n self.target_network_frequency = args.target_network_frequency\n self.n_quantiles = 200\n self.kappa = 1.0\n\n self.q_network = QNetwork(envs, n_quantiles=self.n_quantiles).to(device)\n self.target_network = QNetwork(envs, n_quantiles=self.n_quantiles).to(device)\n self.target_network.load_state_dict(self.q_network.state_dict())\n self.optimizer = optim.Adam(self.q_network.parameters(), lr=args.learning_rate, eps=0.01 / args.batch_size)\n\n tau = torch.arange(1, self.n_quantiles + 1, dtype=torch.float32, device=device)\n self.quantile_tau = (2 * tau - 1) / (2 * self.n_quantiles)\n\n def select_action(self, obs, epsilon):\n del epsilon\n q_values = self.q_network(torch.as_tensor(obs, device=self.device))\n return torch.argmax(q_values, dim=1).cpu().numpy()\n\n def update(self, batch, global_step):\n batch_size = batch.observations.shape[0]\n batch_idx = torch.arange(batch_size, device=self.device)\n actions = batch.actions.long().flatten()\n rewards = batch.rewards.flatten()\n dones = batch.dones.flatten()\n\n with torch.no_grad():\n next_actions = self.q_network(batch.next_observations).argmax(dim=1)\n next_quantiles = self.target_network.get_quantiles(batch.next_observations)\n next_quantiles_best = next_quantiles[batch_idx, next_actions]\n target_quantiles = rewards.unsqueeze(1) + self.gamma * (1.0 - dones).unsqueeze(1) * next_quantiles_best\n\n current_quantiles_all = self.q_network.get_quantiles(batch.observations)\n current_quantiles = current_quantiles_all[batch_idx, actions]\n\n td_errors = target_quantiles.unsqueeze(1) - current_quantiles.unsqueeze(2)\n abs_td_errors = td_errors.abs()\n huber_loss = torch.where(\n abs_td_errors <= self.kappa,\n 0.5 * td_errors.pow(2),\n self.kappa * (abs_td_errors - 0.5 * self.kappa),\n )\n tau = self.quantile_tau.view(1, self.n_quantiles, 1)\n loss = (torch.abs(tau - (td_errors.detach() < 0).float()) * huber_loss).mean()\n\n self.optimizer.zero_grad()\n loss.backward()\n nn.utils.clip_grad_norm_(self.q_network.parameters(), 10.0)\n self.optimizer.step()\n\n if global_step % self.target_network_frequency == 0:\n for target_param, q_param in zip(self.target_network.parameters(), self.q_network.parameters()):\n target_param.data.copy_(self.tau * q_param.data + (1.0 - self.tau) * target_param.data)\n\n return {\n \"td_loss\": loss.item(),\n \"q_values\": current_quantiles.mean().item(),\n }\n",
5 "after_line": 0,
6 "start_line": 186,
7 "end_line": 292
8}
OK: Replaced lines 186..292 with 87 line(s) in cleanrl/cleanrl/custom_value_atari.py. Editable range: 186–272.
[Current file: cleanrl/cleanrl/custom_value_atari.py | editable: 186–272 | total: 382 lines]
183| # =====================================================================
184| # EDITABLE: QNetwork head and ValueAlgorithm
185| # =====================================================================
186 class QNetwork(nn.Module):
187     """Quantile Q-network: fixed Nature-DQN encoder + quantile head."""
188
189     def __init__(self, envs, n_quantiles=200):
190         super().__init__()
191         self.n_quantiles = n_quantiles
192         self.n = envs.single_action_space.n
193         self.encoder = NatureDQNEncoder()
194         self.head = nn.Linear(ENCODER_FEATURE_DIM, self.n * n_quantiles)
195
196     def forward(self, x):
197         quantiles = self.get_quantiles(x)
198         return quantiles.mean(dim=2)
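The final attempt's core idea, a Double-Q bootstrap over quantile distributions, extracted from the escaped content above into a runnable sketch. `TinyQuantileNet` is a stand-in for the real QNetwork (which uses the fixed NatureDQNEncoder); only the target computation mirrors the tool-call code.

```python
import torch
import torch.nn as nn


class TinyQuantileNet(nn.Module):
    """Stand-in for the QNetwork above: flat features -> (actions, quantiles)."""

    def __init__(self, obs_dim: int = 16, n_actions: int = 4, n_quantiles: int = 200):
        super().__init__()
        self.n_actions, self.n_quantiles = n_actions, n_quantiles
        self.head = nn.Linear(obs_dim, n_actions * n_quantiles)

    def get_quantiles(self, x):
        return self.head(x).view(len(x), self.n_actions, self.n_quantiles)

    def forward(self, x):
        return self.get_quantiles(x).mean(dim=2)


@torch.no_grad()
def double_qr_target(online, target, next_obs, rewards, dones, gamma: float = 0.99):
    """Online net picks the bootstrap action; target net evaluates its quantiles."""
    idx = torch.arange(len(next_obs))
    next_actions = online(next_obs).argmax(dim=1)                        # (B,)
    next_quantiles = target.get_quantiles(next_obs)[idx, next_actions]   # (B, N)
    return rewards.unsqueeze(1) + gamma * (1.0 - dones).unsqueeze(1) * next_quantiles


online, target = TinyQuantileNet(), TinyQuantileNet()
obs = torch.randn(8, 16)
print(double_qr_target(online, target, obs, torch.zeros(8), torch.zeros(8)).shape)  # torch.Size([8, 200])
```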
[Test #3] (0 tests remaining)

## Seed 42

### breakout-v4 (scripts/breakout.sh)
Training metrics (last steps):
TRAIN_METRICS step=3595000 td_loss=0.0089 q_values=1.3381
TRAIN_METRICS step=3596000 td_loss=0.0058 q_values=1.5519
TRAIN_METRICS step=3597000 td_loss=0.0051 q_values=1.3865
TRAIN_METRICS step=3598000 td_loss=0.0054 q_values=1.4923
TRAIN_METRICS step=3599000 td_loss=0.0058 q_values=1.4492
Evaluation (breakout-v4):
Eval episodic_return: 4.20
Eval episodic_return: 8.20
Eval episodic_return: 2.90
Final eval return: 2.90

### seaquest-v4 (scripts/seaquest.sh)
Training metrics (last steps):
TRAIN_METRICS step=4995000 td_loss=0.0027 q_values=2.5915
TRAIN_METRICS step=4996000 td_loss=0.0024 q_values=2.5367
TRAIN_METRICS step=4997000 td_loss=0.0026 q_values=2.3737
TRAIN_METRICS step=4998000 td_loss=0.0027 q_values=2.3972
TRAIN_METRICS step=4999000 td_loss=0.0030 q_values=2.5836
Evaluation (seaquest-v4):
Eval episodic_return: 804.00
Eval episodic_return: 792.00
Eval episodic_return:
[Test #3] (0 tests remaining)

## Seed 42

### breakout-v4 (scripts/breakout.sh)
Training metrics (last steps):
TRAIN_METRICS step=4077000 td_loss=0.0028 q_values=1.4801
TRAIN_METRICS step=4078000 td_loss=0.0042 q_values=1.5353
TRAIN_METRICS step=4079000 td_loss=0.0059 q_values=1.3247
TRAIN_METRICS step=4080000 td_loss=0.0060 q_values=1.4272
TRAIN_METRICS step=4081000 td_loss=0.0079 q_values=1.5004
Evaluation (breakout-v4):
Eval episodic_return: 11.10
Eval episodic_return: 5.00
Eval episodic_return: 3.20
Final eval return: 3.20

### seaquest-v4 (scripts/seaquest.sh)
Training metrics (last steps):
TRAIN_METRICS step=4995000 td_loss=0.0027 q_values=2.5915
TRAIN_METRICS step=4996000 td_loss=0.0024 q_values=2.5367
TRAIN_METRICS step=4997000 td_loss=0.0026 q_values=2.3737
TRAIN_METRICS step=4998000 td_loss=0.0027 q_values=2.3972
TRAIN_METRICS step=4999000 td_loss=0.0030 q_values=2.5836
Evaluation (seaquest-v4):
Eval episodic_return: 804.00
Eval episodic_return: 792.00
Eval episodic_return: