Agent Conversation: gpt-5.4

Model: gpt-5.4
Total steps: 5
Messages: 12
User (Step 0)

Task: llm-pretrain-loss

LLM Pretraining: Loss Function Optimization

Research Question

Design an improved loss function for GPT-2 language model pretraining. Your modifications should reduce validation loss compared to standard cross-entropy.

What You Can Modify

The compute_loss function (lines 189-191) in custom_pretrain.py:

  • Loss function formulation (default: standard cross-entropy)
  • Logit processing (e.g., softcapping, temperature scaling)
  • Regularization terms (e.g., z-loss, entropy penalties)
  • Label distribution modifications (e.g., label smoothing)

Note: The function signature compute_loss(logits, targets) must be preserved. logits has shape (B, T, V) and targets has shape (B, T). The function is called inside the model's forward pass during training.
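For orientation, here is a minimal sketch of a standard cross-entropy baseline matching this signature. This is an illustration only, not the actual contents of the editable lines, and it assumes nanoGPT's convention of using -1 as the ignore index for positions without a target:

    import torch.nn.functional as F

    def compute_loss(logits, targets):
        # logits: (B, T, V) raw scores; targets: (B, T) token ids, -1 at ignored positions
        return F.cross_entropy(
            logits.view(-1, logits.size(-1)),  # flatten to (B*T, V)
            targets.view(-1),                  # flatten to (B*T,)
            ignore_index=-1,
        )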

Evaluation

  • Metric: Validation loss (cross-entropy, lower is better), plus perplexity (WikiText-2, LAMBADA) and downstream accuracy (ARC-Easy, HellaSwag, PIQA, WinoGrande)
  • Model: GPT-2 Medium (24L/16H/1024D, ~355M params)
  • Dataset: FineWeb 10B (GPT-2 tokenizer), ~7.1B tokens (D=20N Chinchilla-optimal; see the check after this list)
  • Training: 13535 iterations, BSZ=64, GA=8, 2-GPU DDP
  • Hardware: H200 GPU
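A quick consistency check on the D=20N figure (assuming a 1024-token context and reading BSZ=64 with GA=8 as the effective global batch in sequences per optimizer step): 13535 iters × 64 × 8 seqs × 1024 tokens ≈ 7.1B tokens ≈ 20 × 0.355B parameters.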

nanoGPT/custom_pretrain.py [EDITABLE — lines 188–191, lines 247–249 only]

     1: """Custom GPT-2 Pretraining Script
     2: Based on Andrej Karpathy's nanoGPT, evaluated on FineWeb dataset.
     3: """
     4: 
     5: import math
     6: import inspect
     7: import os
     8: import time
     9: from contextlib import nullcontext
    10: from dataclasses import dataclass
    11: 
    12: import numpy as np
    13: import torch
    14: import torch.nn as nn
    15: from torch.nn import functional as F
    16: 
    17: # ============================================================================
    18: # Model Components
    19: # ============================================================================
    20: 
    21: # ── Normalization ──────────────────────────────────────────────────────────
    22: class LayerNorm(nn.Module):
    23:     """LayerNorm but with an optional bias."""
    24:     def __init__(self, ndim, bias):
    25:         super().__init__()
    26:         self.weight = nn.Parameter(torch.ones(ndim))
    27:         self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
    28: 
    29:     def forward(self, input):
    30:         return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
    31: 
    32: # ── Self-Attention ─────────────────────────────────────────────────────────
    33: class CausalSelfAttention(nn.Module):
    34:     def __init__(self, config):
    35:         super().__init__()
    36:         assert config.n_embd % config.n_head == 0
    37:         self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
    38:         self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
    39:         self.attn_dropout = nn.Dropout(config.dropout)
        ... (listing truncated; file continues) ...
Assistant (Step 1, edit)
Tool Input
{
  "op": "replace",
  "filename": "nanoGPT/custom_pretrain.py",
  "start_line": 188,
  "end_line": 191,
  "content": "def compute_loss(logits, targets):\n    \"\"\"Trust-calibrated CE: training-only adaptive smoothing + confidence penalty + z-loss.\n\n    Idea:\n    1) Estimate per-token confidence from the model's detached top-1 probability.\n    2) Apply more label smoothing to overconfident predictions and less to uncertain ones.\n    3) Add a small confidence penalty (negative entropy) and z-loss to control logit scale.\n    4) During evaluation, fall back to plain CE so val_loss stays directly comparable.\n    \"\"\"\n    flat_logits = logits.view(-1, logits.size(-1)).float()\n    flat_targets = targets.view(-1)\n    if not torch.is_grad_enabled():\n        return F.cross_entropy(flat_logits, flat_targets, ignore_index=-1)\n\n    mask = flat_targets != -1\n    valid_logits = flat_logits[mask]\n    valid_targets = flat_targets[mask]\n\n    with torch.no_grad():\n        probs_detached = F.softmax(valid_logits, dim=-1)\n        top1_conf = probs_detached.max(dim=-1).values\n        # Adaptive smoothing: stronger on confident/easy tokens, weaker on uncertain ones.\n        eps = 0.02 + 0.06 * top1_conf\n\n    log_probs = F.log_softmax(valid_logits, dim=-1)\n    nll = -log_probs.gather(1, valid_targets.unsqueeze(1)).squeeze(1)\n    smooth_loss = -log_probs.mean(dim=-1)\n    ce = ((1.0 - eps) * nll + eps * smooth_loss).mean()\n\n    probs = log_probs.exp()\n    entropy = -(probs * log_probs).sum(dim=-1).mean()\n    log_z = torch.logsumexp(valid_logits, dim=-1)\n    z_reg = (log_z ** 2).mean()\n\n    return ce - 1e-3 * entropy + 5e-5 * z_reg"
}
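Unescaped for readability, the function submitted in this edit (reproduced verbatim from the "content" string above; it relies on the module-level torch and F = torch.nn.functional imports already in the file) is:

    def compute_loss(logits, targets):
        """Trust-calibrated CE: training-only adaptive smoothing + confidence penalty + z-loss.

        Idea:
        1) Estimate per-token confidence from the model's detached top-1 probability.
        2) Apply more label smoothing to overconfident predictions and less to uncertain ones.
        3) Add a small confidence penalty (negative entropy) and z-loss to control logit scale.
        4) During evaluation, fall back to plain CE so val_loss stays directly comparable.
        """
        flat_logits = logits.view(-1, logits.size(-1)).float()
        flat_targets = targets.view(-1)
        if not torch.is_grad_enabled():
            return F.cross_entropy(flat_logits, flat_targets, ignore_index=-1)

        mask = flat_targets != -1
        valid_logits = flat_logits[mask]
        valid_targets = flat_targets[mask]

        with torch.no_grad():
            probs_detached = F.softmax(valid_logits, dim=-1)
            top1_conf = probs_detached.max(dim=-1).values
            # Adaptive smoothing: stronger on confident/easy tokens, weaker on uncertain ones.
            eps = 0.02 + 0.06 * top1_conf

        log_probs = F.log_softmax(valid_logits, dim=-1)
        nll = -log_probs.gather(1, valid_targets.unsqueeze(1)).squeeze(1)
        smooth_loss = -log_probs.mean(dim=-1)
        ce = ((1.0 - eps) * nll + eps * smooth_loss).mean()

        probs = log_probs.exp()
        entropy = -(probs * log_probs).sum(dim=-1).mean()
        log_z = torch.logsumexp(valid_logits, dim=-1)
        z_reg = (log_z ** 2).mean()

        return ce - 1e-3 * entropy + 5e-5 * z_reg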
Tool Result (Step 1)
OK: Replaced lines 188..191 with 35 line(s) in nanoGPT/custom_pretrain.py. Editable range: 188–222, 278–280.

[Current file: nanoGPT/custom_pretrain.py | editable: 188–222, 278–280 | total: 470 lines]
   188  def compute_loss(logits, targets):
   189      """Trust-calibrated CE: training-only adaptive smoothing + confidence penalty + z-loss.
   190  
       ... (29 more lines) ...
   220      z_reg = (log_z ** 2).mean()
   221  
   222      return ce - 1e-3 * entropy + 5e-5 * z_reg
...
   278      # CONFIG_OVERRIDES: override training hyperparameters for your method.
   279      # Allowed keys: learning_rate, weight_decay, warmup_iters, min_lr, grad_clip.
   280      CONFIG_OVERRIDES = {}
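For reference, an override here would just populate that dict with any of the allowed keys; a purely hypothetical example (not used in this run, values illustrative only):

    CONFIG_OVERRIDES = {"learning_rate": 6e-4, "grad_clip": 1.0}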
Assistant (Step 2, test)
Tool Result (Step 2)
[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)

### gpt-345m (scripts/gpt_345m.sh)
Training metrics (last steps):
TRAIN_METRICS: step=9000, train_loss=2.6829, val_loss=2.4092
TRAIN_METRICS: step=10000, train_loss=2.6616, val_loss=2.3810
TRAIN_METRICS: step=11000, train_loss=2.6436, val_loss=2.3713
TRAIN_METRICS: step=12000, train_loss=2.6253, val_loss=2.3554
TRAIN_METRICS: step=13000, train_loss=2.6097, val_loss=2.3447
Final metrics (gpt-345m): val_loss_gpt-345m=2.3380, wikitext2_ppl_gpt-345m=47.2000, lambada_ppl_gpt-345m=71.0600

### lm-eval-345m (scripts/lm_eval_345m.sh)
Final metrics (lm-eval-345m): arc_easy_lm-eval-345m=55.1800, hellaswag_lm-eval-345m=33.2000, piqa_lm-eval-345m=64.2000, winogrande_lm-eval-345m=52.3300

[Leaderboard] Results saved: {'val_loss_gpt-345m': 2.338, 'wikitext2_ppl_gpt-345m': 47.2, 'lambada_ppl_gpt-345m': 71.06, 'elapsed_gpt-345m': 22940, 'arc_easy_lm-eval-345m': 55.18, 'hellaswag_lm-eval-345m': 33.2, 'piqa_l
Assistant (Step 3, edit)
Tool Input
{
  "op": "replace",
  "filename": "nanoGPT/custom_pretrain.py",
  "start_line": 188,
  "end_line": 222,
  "content": "def compute_loss(logits, targets):\n    \"\"\"Margin-aware self-distilled CE with training-only adaptive smoothing.\n\n    Novelty:\n    - Build a detached teacher distribution from the current logits using a softened\n      temperature and margin gating based on the gap between the gold token and the\n      strongest competing token.\n    - Interpolate hard labels with this teacher only on ambiguous tokens.\n    - Add a tiny z-loss for logit-scale control.\n    - Use plain CE at eval time for metric comparability.\n    \"\"\"\n    flat_logits = logits.view(-1, logits.size(-1)).float()\n    flat_targets = targets.view(-1)\n    if not torch.is_grad_enabled():\n        return F.cross_entropy(flat_logits, flat_targets, ignore_index=-1)\n\n    mask = flat_targets != -1\n    valid_logits = flat_logits[mask]\n    valid_targets = flat_targets[mask]\n    log_probs = F.log_softmax(valid_logits, dim=-1)\n    nll = -log_probs.gather(1, valid_targets.unsqueeze(1)).squeeze(1)\n\n    with torch.no_grad():\n        target_logits = valid_logits.gather(1, valid_targets.unsqueeze(1)).squeeze(1)\n        competitor_logits = valid_logits.clone()\n        competitor_logits.scatter_(1, valid_targets.unsqueeze(1), -1e9)\n        best_other = competitor_logits.max(dim=-1).values\n        margin = target_logits - best_other\n\n        # More smoothing/distillation when the gold-vs-competitor margin is small or negative.\n        gate = torch.sigmoid(-margin / 0.75)\n        eps = 0.01 + 0.07 * gate\n\n        teacher_probs = F.softmax(valid_logits / 1.35, dim=-1)\n        one_hot = F.one_hot(valid_targets, num_classes=valid_logits.size(-1)).to(valid_logits.dtype)\n        mixed_targets = (1.0 - eps).unsqueeze(1) * one_hot + eps.unsqueeze(1) * teacher_probs\n\n    soft_ce = -(mixed_targets * log_probs).sum(dim=-1).mean()\n    log_z = torch.logsumexp(valid_logits, dim=-1)\n    z_reg = (log_z ** 2).mean()\n\n    # Blend soft-target CE with a small hard-label anchor for stability.\n    return 0.85 * soft_ce + 0.15 * nll.mean() + 3e-5 * z_reg"
}
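Unescaped for readability, the revised function in this edit (reproduced verbatim from the "content" string above; same module-level torch and F imports assumed) is:

    def compute_loss(logits, targets):
        """Margin-aware self-distilled CE with training-only adaptive smoothing.

        Novelty:
        - Build a detached teacher distribution from the current logits using a softened
          temperature and margin gating based on the gap between the gold token and the
          strongest competing token.
        - Interpolate hard labels with this teacher only on ambiguous tokens.
        - Add a tiny z-loss for logit-scale control.
        - Use plain CE at eval time for metric comparability.
        """
        flat_logits = logits.view(-1, logits.size(-1)).float()
        flat_targets = targets.view(-1)
        if not torch.is_grad_enabled():
            return F.cross_entropy(flat_logits, flat_targets, ignore_index=-1)

        mask = flat_targets != -1
        valid_logits = flat_logits[mask]
        valid_targets = flat_targets[mask]
        log_probs = F.log_softmax(valid_logits, dim=-1)
        nll = -log_probs.gather(1, valid_targets.unsqueeze(1)).squeeze(1)

        with torch.no_grad():
            target_logits = valid_logits.gather(1, valid_targets.unsqueeze(1)).squeeze(1)
            competitor_logits = valid_logits.clone()
            competitor_logits.scatter_(1, valid_targets.unsqueeze(1), -1e9)
            best_other = competitor_logits.max(dim=-1).values
            margin = target_logits - best_other

            # More smoothing/distillation when the gold-vs-competitor margin is small or negative.
            gate = torch.sigmoid(-margin / 0.75)
            eps = 0.01 + 0.07 * gate

            teacher_probs = F.softmax(valid_logits / 1.35, dim=-1)
            one_hot = F.one_hot(valid_targets, num_classes=valid_logits.size(-1)).to(valid_logits.dtype)
            mixed_targets = (1.0 - eps).unsqueeze(1) * one_hot + eps.unsqueeze(1) * teacher_probs

        soft_ce = -(mixed_targets * log_probs).sum(dim=-1).mean()
        log_z = torch.logsumexp(valid_logits, dim=-1)
        z_reg = (log_z ** 2).mean()

        # Blend soft-target CE with a small hard-label anchor for stability.
        return 0.85 * soft_ce + 0.15 * nll.mean() + 3e-5 * z_reg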
Tool Result (Step 3)
OK: Replaced lines 188..222 with 43 line(s) in nanoGPT/custom_pretrain.py. Editable range: 188–230, 286–288.

[Current file: nanoGPT/custom_pretrain.py | editable: 188–230, 286–288 | total: 478 lines]
   188  def compute_loss(logits, targets):
   189      """Margin-aware self-distilled CE with training-only adaptive smoothing.
   190  
       ... (37 more lines) ...
   228  
   229      # Blend soft-target CE with a small hard-label anchor for stability.
   230      return 0.85 * soft_ce + 0.15 * nll.mean() + 3e-5 * z_reg
...
   286      # CONFIG_OVERRIDES: override training hyperparameters for your method.
   287      # Allowed keys: learning_rate, weight_decay, warmup_iters, min_lr, grad_clip.
   288      CONFIG_OVERRIDES = {}
Assistant (Step 4, test)
Tool Result (Step 4)
[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final)

### gpt-345m (scripts/gpt_345m.sh)
Training metrics (last steps):
TRAIN_METRICS: step=9000, train_loss=2.6380, val_loss=2.3605
TRAIN_METRICS: step=10000, train_loss=2.6159, val_loss=2.3330
TRAIN_METRICS: step=11000, train_loss=2.5969, val_loss=2.3227
TRAIN_METRICS: step=12000, train_loss=2.5779, val_loss=2.3056
TRAIN_METRICS: step=13000, train_loss=2.5625, val_loss=2.2946
Final metrics (gpt-345m): val_loss_gpt-345m=2.2883, wikitext2_ppl_gpt-345m=44.8400, lambada_ppl_gpt-345m=69.4200

### lm-eval-345m (scripts/lm_eval_345m.sh)
Final metrics (lm-eval-345m): arc_easy_lm-eval-345m=54.6700, hellaswag_lm-eval-345m=33.5300, piqa_lm-eval-345m=63.8200, winogrande_lm-eval-345m=51.7000

[Leaderboard] Results saved: {'val_loss_gpt-345m': 2.2883, 'wikitext2_ppl_gpt-345m': 44.84, 'lambada_ppl_gpt-345m': 69.42, 'elapsed_gpt-345m': 23980, 'arc_easy_lm-eval-345m': 54.67, 'hellaswag_lm-eval-345m': 33.53, 'piqa
Assistant (Step 5, submit)
Tool Input
{
  "n": 2
}
Tool Result (Step 5)
[submit] Submitting result from test #2 as final.

### gpt-345m (scripts/gpt_345m.sh)
Training metrics (last steps):
TRAIN_METRICS: step=9000, train_loss=2.6380, val_loss=2.3605
TRAIN_METRICS: step=10000, train_loss=2.6159, val_loss=2.3330
TRAIN_METRICS: step=11000, train_loss=2.5969, val_loss=2.3227
TRAIN_METRICS: step=12000, train_loss=2.5779, val_loss=2.3056
TRAIN_METRICS: step=13000, train_loss=2.5625, val_loss=2.2946
Final metrics (gpt-345m): val_loss_gpt-345m=2.2883, wikitext2_ppl_gpt-345m=44.8400, lambada_ppl_gpt-345m=69.4200

### lm-eval-345m (scripts/lm_eval_345m.sh)
Final metrics (lm-eval-345m): arc_easy_lm-eval-345m=54.6700, hellaswag_lm-eval-345m=33.5300, piqa_lm-eval-345m=63.8200, winogrande_lm-eval-345m=51.7000

[submit] Finalized test #2 as final.

[Leaderboard] Results saved: {'val_loss_gpt-345m': 2.2883, 'wikitext2_ppl_gpt-345m': 44.84, 'lambada_ppl_gpt-345m': 69.42, 'elapsed_gpt-345m': 23980, 'arc_easy_lm-eval-345m': 54.67, 'hellaswag_lm-eval-345m': 33.53, 'piqa_lm-ev