llm-pretrain-lr-schedule

Tags: Language Models · lm-evaluation-harness · nanoGPT · rigorous codebase

Description

LLM Pretraining: Learning Rate Schedule Optimization

Research Question

Design an improved learning rate schedule for GPT-2 language model pretraining. Your modifications should reduce validation loss compared to the standard cosine annealing schedule with linear warmup.

What You Can Modify

The get_lr function (lines 192-201) in custom_pretrain.py:

  • Schedule shape (default: cosine decay with linear warmup)
  • Warmup strategy and duration
  • Decay behavior (shape, rate, final LR)
  • Multi-phase scheduling (e.g., warmup-stable-decay)

Note: The function signature get_lr(it, warmup_iters, lr_decay_iters, learning_rate, min_lr) must be preserved. The training loop calls this function at every iteration to set the learning rate.
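For reference, the default this task compares against is the nanoGPT-style cosine schedule with linear warmup. The sketch below follows that convention; the exact body at lines 192-201 of custom_pretrain.py may differ in minor details:

```python
import math

def get_lr(it, warmup_iters, lr_decay_iters, learning_rate, min_lr):
    """Cosine decay to min_lr with linear warmup (nanoGPT-style default)."""
    # 1) linear warmup for the first warmup_iters steps
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    # 2) past the decay horizon, hold at the floor
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, cosine-anneal from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```

Any replacement schedule must return a float for every iteration in `[0, max_iters)` under this same signature.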

Evaluation

  • Metric: Validation loss (cross-entropy, lower is better), plus perplexity (WikiText-2, LAMBADA) and downstream accuracy (ARC-Easy, HellaSwag, PIQA, WinoGrande)
  • Model: GPT-2 Medium (24L/16H/1024D, ~355M params)
  • Dataset: FineWeb 10B (GPT-2 tokenizer), ~7.1B tokens (D=20N Chinchilla-optimal)
  • Training: 12,030 iterations, batch size (BSZ) 96, gradient accumulation (GA) 6, 2-GPU DDP
  • Hardware: H200 GPU
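As a quick sanity check on the figures above (a sketch assuming nanoGPT's default `block_size` of 1024, and that BSZ=96 with GA=6 is the global batch across GPUs; neither assumption is stated explicitly in the config):

```python
# Tokens processed = iterations * batch_size * grad_accum_steps * block_size
iters, bsz, grad_accum, block_size = 12030, 96, 6, 1024
tokens = iters * bsz * grad_accum * block_size
params = 355e6  # GPT-2 Medium, ~355M parameters

print(f"{tokens / 1e9:.2f}B tokens")   # ~7.10B, matching the stated ~7.1B
print(f"D/N = {tokens / params:.1f}")  # ~20.0, the Chinchilla-optimal ratio
```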

Code

custom_pretrain.py
"""Custom GPT-2 Pretraining Script
Based on Andrej Karpathy's nanoGPT, evaluated on the FineWeb dataset.
"""

import math
import inspect
import os
import time
from contextlib import nullcontext
from dataclasses import dataclass

import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

Additional context files (read-only):

  • nanoGPT/train.py
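The baselines in the results below include warmup-stable-decay (WSD) variants, one of the multi-phase shapes the task suggests. A minimal WSD sketch that preserves the required `get_lr` signature might look like the following (`decay_frac` is a hypothetical extra parameter with a default value, so the training loop's existing call still works; this version uses a linear cooldown, whereas the `wsd_sqrt` baseline presumably uses a square-root-shaped one):

```python
def get_lr_wsd(it, warmup_iters, lr_decay_iters, learning_rate, min_lr,
               decay_frac=0.2):
    """Warmup-stable-decay: linear warmup, long plateau, linear cooldown."""
    # cooldown occupies the final decay_frac of the schedule
    decay_start = int(lr_decay_iters * (1 - decay_frac))
    # 1) linear warmup
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    # 2) stable plateau at the peak learning rate
    if it < decay_start:
        return learning_rate
    # 3) past the horizon, hold at the floor
    if it > lr_decay_iters:
        return min_lr
    # 4) linear cooldown from learning_rate to min_lr
    frac = (it - decay_start) / (lr_decay_iters - decay_start)
    return learning_rate + frac * (min_lr - learning_rate)
```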

Results

| Model | Type | val loss (gpt-345m) | WikiText-2 ppl (gpt-345m) | LAMBADA ppl (gpt-345m) | ARC-Easy (lm-eval-345m) | HellaSwag (lm-eval-345m) |
|---|---|---|---|---|---|---|
| trapezoidal | baseline | 2.251 | 42.310 | 65.960 | 55.770 | 34.090 |
| wsd | baseline | 2.247 | 41.580 | 64.620 | 58.250 | 34.410 |
| wsd_sqrt | baseline | 2.245 | 41.890 | 64.990 | 57.280 | 34.370 |
| claude-opus-4.6 | vanilla | 2.269 | 42.470 | 66.620 | 55.220 | 33.400 |
| deepseek-reasoner | vanilla | 2.215 | 39.520 | 61.540 | 57.660 | 35.220 |
| gemini-3.1-pro-preview | vanilla | 2.260 | 42.410 | 66.620 | 56.270 | 34.070 |
| gpt-5.4 | vanilla | 2.257 | 42.450 | 66.090 | 56.480 | 34.180 |
| qwen3.6-plus | vanilla | 2.254 | 41.990 | 66.880 | 56.610 | 34.050 |
| claude-opus-4.6 | agent | 2.257 | 42.440 | 66.520 | 57.030 | 34.140 |
| deepseek-reasoner | agent | 2.278 | 43.450 | 67.360 | 55.770 | 33.290 |
| gemini-3.1-pro-preview | agent | 2.243 | 41.600 | 64.220 | 56.990 | 34.170 |
| gpt-5.4 | agent | 2.254 | 41.820 | 66.140 | 55.560 | 34.030 |
| qwen3.6-plus | agent | 2.247 | 41.950 | 65.220 | 55.770 | 34.320 |

Agent Conversations