llm-pretrain-lr-schedule
Tags: Language Models, lm-evaluation-harness, nanoGPT, rigorous codebase
Description
LLM Pretraining: Learning Rate Schedule Optimization
Research Question
Design an improved learning rate schedule for GPT-2 language model pretraining. Your modifications should reduce validation loss compared to the standard cosine annealing schedule with linear warmup.
What You Can Modify
The get_lr function (lines 192-201) in custom_pretrain.py:
- Schedule shape (default: cosine decay with linear warmup)
- Warmup strategy and duration
- Decay behavior (shape, rate, final LR)
- Multi-phase scheduling (e.g., warmup-stable-decay)
Note: the function signature get_lr(it, warmup_iters, lr_decay_iters, learning_rate, min_lr) must be preserved; the training loop calls it at every iteration to set the learning rate. The default cosine schedule is sketched below for reference.
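A minimal sketch of the default schedule, written in the style of nanoGPT's get_lr and adapted to the required signature (the exact body at lines 192-201 of custom_pretrain.py may differ in small details):

```python
import math

def get_lr(it, warmup_iters, lr_decay_iters, learning_rate, min_lr):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    # 2) after lr_decay_iters, hold at the floor
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```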
Evaluation
- Metrics: validation loss (cross-entropy, lower is better), plus perplexity (WikiText-2, LAMBADA) and downstream accuracy (ARC-Easy, HellaSwag, PIQA, WinoGrande)
- Model: GPT-2 Medium (24 layers, 16 heads, 1024-dim embeddings; ~355M params)
- Dataset: FineWeb 10B sample (GPT-2 tokenizer), ~7.1B training tokens (Chinchilla-optimal D = 20N)
- Training: 12,030 iterations, batch size 96, gradient accumulation 6, 2-GPU DDP
- Hardware: 2× NVIDIA H200 GPUs
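As a sanity check on the token budget: assuming batch size and gradient accumulation compose into the global batch and sequences use GPT-2's 1,024-token context (an assumption, not stated above), the numbers line up with the ~7.1B figure:

```python
# Total training tokens under the stated config (composition assumed).
iters, batch_size, grad_accum, seq_len = 12_030, 96, 6, 1024
tokens = iters * batch_size * grad_accum * seq_len
print(f"{tokens / 1e9:.2f}B tokens")  # ~7.10B ≈ 20 × 355M params (D = 20N)
```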
Code
custom_pretrain.py
1"""Custom GPT-2 Pretraining Script2Based on Andrej Karpathy's nanoGPT, evaluated on FineWeb dataset.3"""45import math6import inspect7import os8import time9from contextlib import nullcontext10from dataclasses import dataclass1112import numpy as np13import torch14import torch.nn as nn15from torch.nn import functional as F
Additional context files (read-only):
nanoGPT/train.py
Results
| Model | Type | Val loss ↓ | WikiText-2 PPL ↓ | LAMBADA PPL ↓ | ARC-Easy acc ↑ | HellaSwag acc ↑ |
|---|---|---|---|---|---|---|
| trapezoidal | baseline | 2.251 | 42.310 | 65.960 | 55.770 | 34.090 |
| wsd | baseline | 2.247 | 41.580 | 64.620 | 58.250 | 34.410 |
| wsd_sqrt | baseline | 2.245 | 41.890 | 64.990 | 57.280 | 34.370 |
| claude-opus-4.6 | vanilla | 2.269 | 42.470 | 66.620 | 55.220 | 33.400 |
| deepseek-reasoner | vanilla | 2.215 | 39.520 | 61.540 | 57.660 | 35.220 |
| gemini-3.1-pro-preview | vanilla | 2.260 | 42.410 | 66.620 | 56.270 | 34.070 |
| gpt-5.4 | vanilla | 2.257 | 42.450 | 66.090 | 56.480 | 34.180 |
| qwen3.6-plus | vanilla | 2.254 | 41.990 | 66.880 | 56.610 | 34.050 |
| claude-opus-4.6 | agent | 2.257 | 42.440 | 66.520 | 57.030 | 34.140 |
| deepseek-reasoner | agent | 2.278 | 43.450 | 67.360 | 55.770 | 33.290 |
| gemini-3.1-pro-preview | agent | 2.243 | 41.600 | 64.220 | 56.990 | 34.170 |
| gpt-5.4 | agent | 2.254 | 41.820 | 66.140 | 55.560 | 34.030 |
| qwen3.6-plus | agent | 2.247 | 41.950 | 65.220 | 55.770 | 34.320 |
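The trapezoidal, wsd, and wsd_sqrt baselines above are all warmup-stable-decay style schedules: linear warmup, a long plateau at the peak learning rate, then a short final decay (linear for trapezoidal/wsd, 1 − √x-shaped for wsd_sqrt). A minimal sketch under the required signature; decay_frac and the sqrt_decay switch are illustrative assumptions, not the baselines' actual hyperparameters:

```python
import math

def get_lr(it, warmup_iters, lr_decay_iters, learning_rate, min_lr,
           decay_frac=0.2, sqrt_decay=False):  # extra args: illustrative only
    """Warmup-stable-decay (WSD): warmup -> constant plateau -> final decay."""
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    # 2) stable phase: hold the peak LR until the decay window opens
    decay_start = int(lr_decay_iters * (1.0 - decay_frac))
    if it < decay_start:
        return learning_rate
    if it > lr_decay_iters:
        return min_lr
    # 3) final decay from peak to floor over the last decay_frac of training
    x = (it - decay_start) / (lr_decay_iters - decay_start)
    if sqrt_decay:
        x = math.sqrt(x)  # wsd_sqrt-style 1 - sqrt(x) shape
    return learning_rate - (learning_rate - min_lr) * x
```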