llm-pretrain-loss

Tags: Language Models · lm-evaluation-harness · nanoGPT · rigorous codebase

Description

LLM Pretraining: Loss Function Optimization

Research Question

Design an improved loss function for GPT-2 language model pretraining. Your modifications should reduce validation loss compared to standard cross-entropy.

What You Can Modify

The compute_loss function (lines 189-191) in custom_pretrain.py:

  • Loss function formulation (default: standard cross-entropy)
  • Logit processing (e.g., softcapping, temperature scaling)
  • Regularization terms (e.g., z-loss, entropy penalties)
  • Label distribution modifications (e.g., label smoothing)

Note: The function signature compute_loss(logits, targets) must be preserved. logits has shape (B, T, V) and targets has shape (B, T). The function is called inside the model's forward pass during training.
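The modifications above can be combined in a single `compute_loss` implementation. The sketch below keeps the required `(logits, targets)` signature and illustrates tanh softcapping, label smoothing, and a z-loss term; the specific constants (`softcap=30.0`, `z_loss_weight=1e-4`, `label_smoothing=0.1`) are illustrative assumptions, not values from this task.

```python
import torch
import torch.nn.functional as F


def compute_loss(logits, targets, softcap=30.0, z_loss_weight=1e-4,
                 label_smoothing=0.1):
    """Cross-entropy with optional logit softcapping, label smoothing,
    and a z-loss regularizer.

    Illustrative sketch only -- the hyperparameter defaults here are
    assumptions, not the task's settings.
    logits: (B, T, V) float tensor; targets: (B, T) long tensor.
    """
    B, T, V = logits.shape
    logits = logits.float()

    # Softcapping: smoothly squash logits into (-softcap, softcap).
    if softcap is not None:
        logits = softcap * torch.tanh(logits / softcap)

    logits_flat = logits.view(B * T, V)
    targets_flat = targets.view(B * T)

    # Cross-entropy with (optional) label smoothing.
    loss = F.cross_entropy(logits_flat, targets_flat,
                           label_smoothing=label_smoothing)

    # z-loss: penalize a large log-partition function log Z = logsumexp(logits),
    # which discourages logit drift without changing the softmax distribution much.
    if z_loss_weight:
        log_z = torch.logsumexp(logits_flat, dim=-1)
        loss = loss + z_loss_weight * (log_z ** 2).mean()

    return loss
```

With `softcap=None`, `z_loss_weight=0.0`, and `label_smoothing=0.0` this reduces to the standard cross-entropy baseline, which makes ablating each term straightforward.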

Evaluation

  • Metric: Validation loss (cross-entropy, lower is better), plus perplexity (WikiText-2, LAMBADA) and downstream accuracy (ARC-Easy, HellaSwag, PIQA, WinoGrande)
  • Model: GPT-2 Medium (24L/16H/1024D, ~355M params)
  • Dataset: FineWeb 10B (GPT-2 tokenizer), ~7.1B tokens (D=20N Chinchilla-optimal)
  • Training: 13,535 iterations, batch size 64, gradient accumulation 8, 2-GPU DDP
  • Hardware: 2× NVIDIA H200 GPUs
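The token budget is consistent with the Chinchilla-optimal target. The arithmetic below assumes a context length of 1024 (GPT-2's block size) and that the batch size of 64 is the global batch across both GPUs; neither assumption is stated explicitly in the task.

```python
# Sanity-check the stated ~7.1B-token budget against the training config.
# Assumptions (not stated in the task): seq len T = 1024, batch size is global.
tokens_per_iter = 64 * 8 * 1024        # batch size * grad accum * seq len
total_tokens = 13535 * tokens_per_iter  # tokens seen over the full run
chinchilla_tokens = 20 * 355e6          # D = 20N for N ~= 355M params
```

Under these assumptions the run covers about 7.10B tokens, matching the stated ~7.1B (D = 20N) budget.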

Code

custom_pretrain.py (editable):
"""Custom GPT-2 Pretraining Script
Based on Andrej Karpathy's nanoGPT, evaluated on FineWeb dataset.
"""

import math
import inspect
import os
import time
from contextlib import nullcontext
from dataclasses import dataclass

import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

Additional context files (read-only):

  • nanoGPT/model.py

Results

All metrics are from the gpt-345m / lm-eval-345m evaluation runs; validation loss and perplexity are lower-is-better, accuracies (%) are higher-is-better.

| Model | Type | Val loss | WikiText-2 PPL | LAMBADA PPL | ARC-Easy (%) | HellaSwag (%) |
|---|---|---|---|---|---|---|
| label_smoothing | baseline | 2.338 | 47.130 | 71.800 | 54.040 | 33.630 |
| softcap_ce | baseline | 2.270 | 43.460 | 67.100 | 56.480 | 31.820 |
| z_loss | baseline | 2.293 | 44.090 | 68.250 | 54.550 | 33.850 |
| claude-opus-4.6 | vanilla | 2.338 | 47.910 | 72.300 | 53.200 | 33.200 |
| deepseek-reasoner | vanilla | 2.341 | 46.950 | 71.320 | 55.180 | 33.350 |
| gemini-3.1-pro-preview | vanilla | 2.293 | 45.130 | 69.780 | 56.140 | 33.480 |
| gpt-5.4 | vanilla | 2.338 | 47.200 | 71.060 | 55.180 | 33.200 |
| qwen3.6-plus | vanilla | 3.491 | 125.290 | 178.800 | 52.570 | 33.250 |
| claude-opus-4.6 | agent | 2.338 | 47.910 | 72.300 | 53.200 | 33.200 |
| deepseek-reasoner | agent | 2.340 | 47.530 | 72.240 | 54.340 | 33.300 |
| gemini-3.1-pro-preview | agent | 2.293 | 45.130 | 69.780 | 56.140 | 33.480 |
| gpt-5.4 | agent | 2.288 | 44.840 | 69.420 | 54.670 | 33.530 |
| qwen3.6-plus | agent | 2.301 | 45.420 | 69.020 | 53.750 | 33.100 |
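For interpreting the loss column: perplexity is the exponential of the mean per-token cross-entropy, so a small loss gap compounds into a noticeable perplexity gap. (The WikiText-2 and LAMBADA columns are computed on different corpora than the validation set, so they are not simply `exp` of the val loss.)

```python
import math

# Perplexity = exp(mean per-token cross-entropy loss).
# Using the softcap_ce validation loss from the table above:
val_loss = 2.270
val_ppl = math.exp(val_loss)  # ~= 9.68 on the FineWeb validation set
```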

Agent Conversations