Agent Conversation: gemini-3.1-pro-preview
Task: llm-pretrain-residual
LLM Pretraining: Residual Connection Strategy
Research Question
Improve the residual connection strategy in a GPT-style language model. The current architecture uses standard Pre-LN residual connections (x + sublayer(x)) in each transformer block. Your task is to redesign how information flows through the residual stream across layers to achieve lower validation loss.
Background
Standard Residual Connections
The default GPT architecture uses simple additive residual connections in each block:
x = x + self.attn(self.ln_1(x)) # attention sublayer
x = x + self.mlp(self.ln_2(x)) # MLP sublayer
While effective, this fixed accumulation pattern may not be optimal for deep networks. The residual stream is the primary information highway through the model, and its design critically affects gradient flow, feature reuse, and training dynamics.
Research Directions
Several recent works have proposed improvements to residual connections:
- Per-layer residual scaling: Learnable scalars that modulate the residual stream at each layer (inspired by modded-nanogpt, ReZero, SkipInit).
- Initial embedding blending: Blending the initial token embedding back at each layer to preserve token identity (x0 residual connections).
- Hyper-Connections: Maintaining m parallel residual streams with learned transition matrices for richer information flow across layers (Zhu et al., 2025).
- Attention Residuals: Using softmax attention over all previous layer outputs to dynamically select which representations to combine (Kimi Team, 2026).
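The first two directions can be sketched on a single vector in plain Python. This is a minimal illustration under stated assumptions, not the formulation of any of the cited works: the function name `scaled_residual_step`, the scalar `alpha` on the sublayer branch, and the blend weight `lam` for the initial embedding `x0` are all hypothetical choices made for the example.

```python
from typing import Callable, List

Vector = List[float]


def scaled_residual_step(
    x: Vector,
    x0: Vector,
    sublayer: Callable[[Vector], Vector],
    alpha: float,
    lam: float,
) -> Vector:
    """One residual update with branch scaling and x0 blending (illustrative).

    alpha scales the sublayer branch; alpha = 0 makes the block an identity map.
    lam blends the initial embedding x0 back into the stream before the sublayer.
    Both would be learnable parameters in a real model; here they are plain floats.
    """
    # Blend the initial embedding into the current stream.
    mixed = [(1.0 - lam) * xi + lam * x0i for xi, x0i in zip(x, x0)]
    # Apply the sublayer and add its scaled output back.
    update = sublayer(mixed)
    return [mi + alpha * ui for mi, ui in zip(mixed, update)]


def toy_sublayer(v: Vector) -> Vector:
    """Stand-in for attention or MLP: doubles every component."""
    return [2.0 * vi for vi in v]


x0 = [1.0, -1.0]
x = [0.5, 0.25]
identity = scaled_residual_step(x, x0, toy_sublayer, alpha=0.0, lam=0.0)
standard = scaled_residual_step(x, x0, toy_sublayer, alpha=1.0, lam=0.0)
```

With `alpha=1.0` and `lam=0.0` the step reduces to the standard `x + sublayer(x)` update described in the background section.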
What You Can Modify
Block Class (lines 88-99)
The Block class defines per-block residual behavior. You can change how attention and MLP outputs are combined with the residual stream within each block.
Residual Stream Parameters (lines 128-130)
Add custom parameters to GPT.__init__ for your residual connection strategy (e.g., per-layer scalars, transition matrices, query vectors).
Block Loop in GPT.forward (lines 162-164)
The main loop that iterates through transformer blocks. You can modify how blocks are called and how their outputs are accumulated (e.g., multi-stream processing, attention over layer outputs).
Optimizer Configuration (lines 175-192)
The configure_optimizers method. If you add new parameters, you may want to assign them to appropriate optimizer groups with custom learning rates and weight decay.
Training Hyperparameters (line 251)
The CONFIG_OVERRIDES dictionary for adjusting learning rate, weight decay, etc.
Note: The CausalSelfAttention, MLP, LayerNorm, and GPTConfig classes are fixed. Block.forward must accept x and return a tensor of the same shape. GPT.forward must accept (idx, targets=None) and return (logits, loss).
Evaluation
- Primary metric: Validation loss (val_loss, lower is better)
- Secondary metrics: Perplexity on WikiText-2 and LAM
Tool Input
{
  "op": "replace",
  "after_line": 128,
  "end_line": 130,
  "filename": "nanoGPT/custom_pretrain.py",
  "content": " # ── Token-wise Dynamic Attention Residuals (TDAR) ──\n self.attnres_queries = nn.Parameter(torch.zeros(config.n_layer, config.n_embd))\n self.attnres_bias = nn.Parameter(torch.zeros(config.n_layer, config.n_layer))\n self.attnres_query_out = nn.Parameter(torch.zeros(config.n_embd))\n self.attnres_bias_out = nn.Parameter(torch.zeros(config.n_layer + 1))\n with torch.no_grad():\n for i in range(config.n_layer):\n self.attnres_bias[i, i] = 5.0\n self.attnres_bias_out[-1] = 5.0\n nn.init.normal_(self.attnres_queries, std=0.02)\n nn.init.normal_(self.attnres_query_out, std=0.02)",
  "start_line": 128
}
OK: Replaced lines 128..130 with 11 line(s) in nanoGPT/custom_pretrain.py. Editable range: 88–99, 128–138, 170–172, 183–200, 259–259.
[Current file: nanoGPT/custom_pretrain.py | editable: 88–99, 128–138, 170–172, 183–200, 259–259 | total: 449 lines]
88 class Block(nn.Module):
89 def __init__(self, config):
90 super().__init__()
... (6 more lines) ...
97 x = x + self.attn(self.ln_1(x))
98 x = x + self.mlp(self.ln_2(x))
99 return x
...
128 # ── Token-wise Dynamic Attention Residuals (TDAR) ──
129 self.attnres_queries = nn.Parameter(torch.zeros(config.n_layer, config.n_embd))
130 self.attnres_bias = nn.Parameter(torch.zeros(config.n_layer, config.n_layer))
... (5 more lines) ...
136 self.attnres_bias_out[-1] = 5.0
137 nn.init.normal_(self.attnres_queries, std=0.02)
138 nn.init.normal_(self.attnres_query_out, std=0.02)
...
1
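The initialization in this edit places a bias of 5.0 on the diagonal of attnres_bias and on the last entry of attnres_bias_out. A rough way to see the effect is to compute the softmax weight that the biased entry receives when every other logit is 0. That is an assumption made for the estimate: the query term is small at initialization but not exactly zero, so the numbers below are approximate. The helper name `initial_self_weight` is hypothetical.

```python
import math


def initial_self_weight(num_candidates: int, diag_bias: float = 5.0) -> float:
    """Softmax weight on one entry with logit diag_bias when the rest are 0.

    num_candidates is the number of history entries in the softmax
    (i + 1 for block i in the forward pass shown in this log).
    """
    biased = math.exp(diag_bias)
    return biased / (biased + (num_candidates - 1))


# Weight on the biased entry for a few illustrative history lengths.
for n in (1, 2, 12, 24):
    print(n, round(initial_self_weight(n), 4))
```

Under this approximation the biased entry dominates for short histories and its share shrinks as more entries join the softmax, so the model starts close to a plain sequential stack and deeper blocks begin with slightly more mixing.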
Tool Input
{
  "end_line": 172,
  "after_line": 170,
  "content": " # ── TDAR: dynamic token-wise routing over layer history ──\n H = [x]\n fast_rms = lambda t: t * torch.rsqrt(t.pow(2).mean(-1, keepdim=True) + 1e-5)\n H_norm = [fast_rms(x)]\n for i, block in enumerate(self.transformer.h):\n if i > 0:\n stacked_norms = torch.stack(H_norm, dim=0)\n logits = torch.einsum('d, n b t d -> n b t', self.attnres_queries[i], stacked_norms)\n logits = logits + self.attnres_bias[i, :i+1].view(-1, 1, 1)\n weights = logits.softmax(dim=0)\n stacked_H = torch.stack(H, dim=0)\n x_in = torch.einsum('n b t, n b t d -> b t d', weights, stacked_H)\n else:\n x_in = H[0]\n x_out = block(x_in)\n H.append(x_out)\n H_norm.append(fast_rms(x_out))\n \n stacked_norms = torch.stack(H_norm, dim=0)\n logits = torch.einsum('d, n b t d -> n b t', self.attnres_query_out, stacked_norms)\n logits = logits + self.attnres_bias_out.view(-1, 1, 1)\n weights = logits.softmax(dim=0)\n x = torch.einsum('n b t, n b t d -> b t d', weights, torch.stack(H, dim=0))",
  "filename": "nanoGPT/custom_pretrain.py",
  "start_line": 170,
  "op": "replace"
}
OK: Replaced lines 170..172 with 23 line(s) in nanoGPT/custom_pretrain.py. Editable range: 88–99, 128–138, 170–192, 203–220, 279–279.
[Current file: nanoGPT/custom_pretrain.py | editable: 88–99, 128–138, 170–192, 203–220, 279–279 | total: 469 lines]
88 class Block(nn.Module):
89 def __init__(self, config):
90 super().__init__()
... (6 more lines) ...
97 x = x + self.attn(self.ln_1(x))
98 x = x + self.mlp(self.ln_2(x))
99 return x
...
128 # ── Token-wise Dynamic Attention Residuals (TDAR) ──
129 self.attnres_queries = nn.Parameter(torch.zeros(config.n_layer, config.n_embd))
130 self.attnres_bias = nn.Parameter(torch.zeros(config.n_layer, config.n_layer))
... (5 more lines) ...
136 self.attnres_bias_out[-1] = 5.0
137 nn.init.normal_(self.attnres_queries, std=0.02)
138 nn.init.normal_(self.attnres_query_out, std=0.02)
...
1
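The routing step in this edit can be traced for a single token position in plain Python: each history entry is RMS-normalized, scored against a query vector plus a per-entry bias, and the softmax of those scores weights a sum over the unnormalized entries. The sketch below is dependency-free and uses hypothetical names (`rms_normalize`, `route_over_history`); it mirrors the structure of the tensor code but is not a drop-in replacement for it.

```python
import math
from typing import List, Sequence, Tuple


def rms_normalize(v: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Scale v so that its root-mean-square is approximately 1."""
    mean_sq = sum(x * x for x in v) / len(v)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale for x in v]


def route_over_history(
    history: Sequence[Sequence[float]],
    query: Sequence[float],
    bias: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return (softmax weights, weighted sum of history) for one token."""
    # Score each history entry: query dot normalized entry, plus its bias.
    logits = [
        sum(q * h for q, h in zip(query, rms_normalize(entry))) + b
        for entry, b in zip(history, bias)
    ]
    # Numerically stable softmax over the history axis.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Mix the raw (unnormalized) history entries with those weights.
    dim = len(history[0])
    mixed = [sum(w * entry[k] for w, entry in zip(weights, history)) for k in range(dim)]
    return weights, mixed


history = [[1.0, 2.0], [3.0, 4.0]]
uniform_weights, uniform_mix = route_over_history(history, [0.0, 0.0], [0.0, 0.0])
```

With a zero query and zero bias the weights are uniform and the output is the mean of the history; a large bias on one entry makes the output approach that entry.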
Tool Input
{
  "start_line": 203,
  "content": " def configure_optimizers(self, weight_decay, learning_rate, betas, device_type):\n tdar_params = [self.attnres_queries, self.attnres_bias, self.attnres_query_out, self.attnres_bias_out]\n tdar_ids = {id(p) for p in tdar_params}\n param_dict = {pn: p for pn, p in self.named_parameters() if p.requires_grad}\n decay_params = [p for n, p in param_dict.items() if p.dim() >= 2 and id(p) not in tdar_ids]\n nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2 and id(p) not in tdar_ids]\n optim_groups = [\n {'params': decay_params, 'weight_decay': weight_decay},\n {'params': nodecay_params, 'weight_decay': 0.0},\n {'params': tdar_params, 'lr': learning_rate * 0.1, 'weight_decay': 0.0},\n ]\n num_decay_params = sum(p.numel() for p in decay_params)\n num_nodecay_params = sum(p.numel() for p in nodecay_params)\n num_tdar_params = sum(p.numel() for p in tdar_params)\n print(f\"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters\")\n print(f\"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters\")\n print(f\"num TDAR parameter tensors: {len(tdar_params)}, with {num_tdar_params:,} parameters\")\n fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters\n use_fused = fused_available and device_type == 'cuda'\n extra_args = dict(fused=True) if use_fused else dict()\n optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)\n print(f\"using fused AdamW: {use_fused}\")\n return optimizer",
  "after_line": 203,
  "filename": "nanoGPT/custom_pretrain.py",
  "op": "replace",
  "end_line": 220
}
OK: Replaced lines 203..220 with 23 line(s) in nanoGPT/custom_pretrain.py. Editable range: 88–99, 128–138, 170–192, 203–225, 284–284.
[Current file: nanoGPT/custom_pretrain.py | editable: 88–99, 128–138, 170–192, 203–225, 284–284 | total: 474 lines]
88 class Block(nn.Module):
89 def __init__(self, config):
90 super().__init__()
... (6 more lines) ...
97 x = x + self.attn(self.ln_1(x))
98 x = x + self.mlp(self.ln_2(x))
99 return x
...
128 # ── Token-wise Dynamic Attention Residuals (TDAR) ──
129 self.attnres_queries = nn.Parameter(torch.zeros(config.n_layer, config.n_embd))
130 self.attnres_bias = nn.Parameter(torch.zeros(config.n_layer, config.n_layer))
... (5 more lines) ...
136 self.attnres_bias_out[-1] = 5.0
137 nn.init.normal_(self.attnres_queries, std=0.02)
138 nn.init.normal_(self.attnres_query_out, std=0.02)
...
1
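The optimizer edit gives the TDAR group its own 'lr' of learning_rate * 0.1. Whether that multiplier survives training depends on the training loop, which is not shown in this log. If the loop assigns the scheduled learning rate to every parameter group on each step (upstream nanoGPT's train.py does this), the per-group value would be overwritten. One possible way to keep a relative rate is to store a multiplier in the group and apply it when the schedule is written back. The sketch below uses plain dictionaries in place of optimizer parameter groups; the `lr_scale` key and the `apply_scheduled_lr` helper are hypothetical conventions, not part of the file being edited.

```python
from typing import Dict, List


def apply_scheduled_lr(param_groups: List[Dict], scheduled_lr: float) -> None:
    """Write the scheduled LR into each group, honoring an optional multiplier.

    Groups without an 'lr_scale' entry are treated as having a multiplier of 1.0.
    """
    for group in param_groups:
        group["lr"] = scheduled_lr * group.get("lr_scale", 1.0)


# Plain dicts standing in for optimizer.param_groups (illustrative values).
groups = [
    {"name": "decay", "lr": 6e-4},
    {"name": "tdar", "lr": 6e-5, "lr_scale": 0.1},
]
apply_scheduled_lr(groups, scheduled_lr=3e-4)
```

After the call, the "tdar" group stays at one tenth of the scheduled rate instead of being reset to the same value as the other groups.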
[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)
### gpt-345m (scripts/gpt_345m.sh)
Training metrics (last steps):
TRAIN_METRICS: step=0, train_loss=11.0388, val_loss=11.0242
### lm-eval-345m (scripts/lm_eval_345m.sh)
[COMMAND FAILED exit=1]
ERROR: Checkpoint not found: /scratch/gpfs/CHIJ/bohan/MLS-Bench/.saves/llm-pretrain-residual/gemini-3.1-pro-preview_20260413_072534/seed_42/ckpt_gpt-345m.pt
Tool Input
{
  "op": "replace",
  "after_line": 170,
  "start_line": 171,
  "end_line": 192,
  "content": " # ── TDAR: dynamic token-wise routing over layer history ──\n H = [x]\n H_norm = [F.rms_norm(x, (x.size(-1),))]\n for i, block in enumerate(self.transformer.h):\n if i > 0:\n q = self.attnres_queries[i]\n logits_list = []\n for j in range(i + 1):\n logit = (H_norm[j] * q).sum(dim=-1) + self.attnres_bias[i, j]\n logits_list.append(logit)\n logits = torch.stack(logits_list, dim=0)\n weights = logits.softmax(dim=0)\n \n x_in = weights[0].unsqueeze(-1) * H[0]\n for j in range(1, i + 1):\n x_in = x_in + weights[j].unsqueeze(-1) * H[j]\n else:\n x_in = H[0]\n \n x_out = block(x_in)\n H.append(x_out)\n H_norm.append(F.rms_norm(x_out, (x_out.size(-1),)))\n \n q_out = self.attnres_query_out\n logits_list = []\n for j in range(len(H)):\n logit = (H_norm[j] * q_out).sum(dim=-1) + self.attnres_bias_out[j]\n logits_list.append(logit)\n logits = torch.stack(logits_list, dim=0)\n weights = logits.softmax(dim=0)\n \n x = weights[0].unsqueeze(-1) * H[0]\n for j in range(1, len(H)):\n x = x + weights[j].unsqueeze(-1) * H[j]",
  "filename": "nanoGPT/custom_pretrain.py"
}
OK: Replaced lines 171..192 with 34 line(s) in nanoGPT/custom_pretrain.py. Editable range: 88–99, 128–138, 170–204, 215–237, 296–296.
[Current file: nanoGPT/custom_pretrain.py | editable: 88–99, 128–138, 170–204, 215–237, 296–296 | total: 486 lines]
88 class Block(nn.Module):
89 def __init__(self, config):
90 super().__init__()
... (6 more lines) ...
97 x = x + self.attn(self.ln_1(x))
98 x = x + self.mlp(self.ln_2(x))
99 return x
...
128 # ── Token-wise Dynamic Attention Residuals (TDAR) ──
129 self.attnres_queries = nn.Parameter(torch.zeros(config.n_layer, config.n_embd))
130 self.attnres_bias = nn.Parameter(torch.zeros(config.n_layer, config.n_layer))
... (5 more lines) ...
136 self.attnres_bias_out[-1] = 5.0
137 nn.init.normal_(self.attnres_queries, std=0.02)
138 nn.init.normal_(self.attnres_query_out, std=0.02)
...
1
Tool Input
{
  "content": " x = self.transformer.ln_f(x)\n if targets is not None:\n logits = self.lm_head(x)\n loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)\n if self.training:\n dummy = self.attnres_queries.sum() + self.attnres_bias.sum() + self.attnres_query_out.sum() + self.attnres_bias_out.sum()\n loss = loss + 0.0 * dummy\n else:\n logits = self.lm_head(x[:, [-1], :])\n loss = None\n return logits, loss",
  "filename": "nanoGPT/custom_pretrain.py",
  "op": "replace",
  "end_line": 212,
  "after_line": 204,
  "start_line": 205
}
ERROR: Lines 205..212 exceed the editable range. You may only edit lines 88–99, 128–138, 170–204, 215–237, 296–296.
[Current file: nanoGPT/custom_pretrain.py | editable: 88–99, 128–138, 170–204, 215–237, 296–296 | total: 486 lines]
88 class Block(nn.Module):
89 def __init__(self, config):
90 super().__init__()
... (6 more lines) ...
97 x = x + self.attn(self.ln_1(x))
98 x = x + self.mlp(self.ln_2(x))
99 return x
...
128 # ── Token-wise Dynamic Attention Residuals (TDAR) ──
129 self.attnres_queries = nn.Parameter(torch.zeros(config.n_layer, config.n_embd))
130 self.attnres_bias = nn.Parameter(torch.zeros(config.n_layer, config.n_layer))
... (5 more lines) ...
136 self.attnres_bias_out[-1] = 5.0
137 nn.init.normal_(self.attnres_queries, std=0.02)
138 nn.init.normal_(self.attnres_query_out, std=0.02)
...
170 # ── T
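The editable ranges reported after each replace in this log follow a simple pattern: the range containing the edit grows or shrinks by the change in line count, and every later range shifts by the same amount. The sketch below reproduces that bookkeeping. The rule is inferred from the reported ranges, not from the tool's source, and the helper names `shift_ranges` and `is_editable` are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (first_line, last_line)


def shift_ranges(ranges: List[Range], start: int, end: int, new_len: int) -> List[Range]:
    """Update editable ranges after replacing lines start..end with new_len lines."""
    delta = new_len - (end - start + 1)
    shifted = []
    for lo, hi in ranges:
        if hi < start:
            shifted.append((lo, hi))                   # entirely before the edit
        elif lo > end:
            shifted.append((lo + delta, hi + delta))   # entirely after the edit
        else:
            shifted.append((lo, hi + delta))           # range that contains the edit
    return shifted


def is_editable(ranges: List[Range], start: int, end: int) -> bool:
    """True if lines start..end fall inside a single editable range."""
    return any(lo <= start and end <= hi for lo, hi in ranges)


# Ranges from the task description, then the first four successful replaces.
ranges = [(88, 99), (128, 130), (162, 164), (175, 192), (251, 251)]
ranges = shift_ranges(ranges, 128, 130, 11)
ranges = shift_ranges(ranges, 170, 172, 23)
ranges = shift_ranges(ranges, 203, 220, 23)
ranges = shift_ranges(ranges, 171, 192, 34)
rejected_edit_allowed = is_editable(ranges, 205, 212)
```

At this point lines 205 to 214 sit between two editable ranges, which is consistent with the rejected edit to lines 205..212.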
Tool Input
{
  "filename": "nanoGPT/custom_pretrain.py",
  "after_line": 201,
  "end_line": 204,
  "op": "replace",
  "content": " x = weights[0].unsqueeze(-1) * H[0]\n for j in range(1, len(H)):\n x = x + weights[j].unsqueeze(-1) * H[j]\n \n dummy = self.attnres_queries.sum() + self.attnres_bias.sum() + self.attnres_query_out.sum() + self.attnres_bias_out.sum()\n x = x + 0.0 * dummy",
  "start_line": 202
}
OK: Replaced lines 202..204 with 6 line(s) in nanoGPT/custom_pretrain.py. Editable range: 88–99, 128–138, 170–207, 218–240, 299–299.
[Current file: nanoGPT/custom_pretrain.py | editable: 88–99, 128–138, 170–207, 218–240, 299–299 | total: 489 lines]
88 class Block(nn.Module):
89 def __init__(self, config):
90 super().__init__()
... (6 more lines) ...
97 x = x + self.attn(self.ln_1(x))
98 x = x + self.mlp(self.ln_2(x))
99 return x
...
128 # ── Token-wise Dynamic Attention Residuals (TDAR) ──
129 self.attnres_queries = nn.Parameter(torch.zeros(config.n_layer, config.n_embd))
130 self.attnres_bias = nn.Parameter(torch.zeros(config.n_layer, config.n_layer))
... (5 more lines) ...
136 self.attnres_bias_out[-1] = 5.0
137 nn.init.normal_(self.attnres_queries, std=0.02)
138 nn.init.normal_(self.attnres_query_out, std=0.02)
...
17
Tool Input
{
  "after_line": 170,
  "end_line": 171,
  "filename": "nanoGPT/custom_pretrain.py",
  "op": "replace",
  "start_line": 170,
  "content": " # ── TDAR: dynamic token-wise routing over layer history ──"
}
OK: Replaced lines 170..171 with 1 line(s) in nanoGPT/custom_pretrain.py. Editable range: 88–99, 128–138, 170–206, 217–239, 298–298.
[Current file: nanoGPT/custom_pretrain.py | editable: 88–99, 128–138, 170–206, 217–239, 298–298 | total: 488 lines]
88 class Block(nn.Module):
89 def __init__(self, config):
90 super().__init__()
... (6 more lines) ...
97 x = x + self.attn(self.ln_1(x))
98 x = x + self.mlp(self.ln_2(x))
99 return x
...
128 # ── Token-wise Dynamic Attention Residuals (TDAR) ──
129 self.attnres_queries = nn.Parameter(torch.zeros(config.n_layer, config.n_embd))
130 self.attnres_bias = nn.Parameter(torch.zeros(config.n_layer, config.n_layer))
... (5 more lines) ...
136 self.attnres_bias_out[-1] = 5.0
137 nn.init.normal_(self.attnres_queries, std=0.02)
138 nn.init.normal_(self.attnres_query_out, std=0.02)
...
17