llm-pretrain-optimizer
Tags: Language Models, lm-evaluation-harness, nanoGPT, rigorous codebase
Description
LLM Pretraining: Optimizer & Learning Rate Schedule Optimization
Research Question
Design an improved optimizer and/or learning rate schedule for GPT-2 language model pretraining. Your modifications should reduce validation loss compared to the standard AdamW with cosine annealing schedule.
What You Can Modify
Two regions in custom_pretrain.py:
- configure_optimizers method (lines 172-189): Optimizer creation and parameter grouping
- get_lr function (lines 192-201): Learning rate schedule
You can modify:
- The optimization algorithm (default: AdamW with fused implementation)
- Parameter grouping strategy (default: weight decay for 2D params, no decay for 1D)
- Learning rate schedule shape (default: cosine with linear warmup)
- Any optimizer hyperparameters
Note: The training loop calls get_lr(it, warmup_iters, lr_decay_iters, learning_rate, min_lr) — keep this signature compatible. The optimizer returned by configure_optimizers must support .zero_grad(), .step(), and .param_groups.
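For reference, the baseline behavior described above (AdamW with weight decay on 2D parameters only, plus linear warmup followed by cosine decay) can be sketched as follows. This is an illustrative reconstruction of the two editable regions, not the exact code in custom_pretrain.py; the hyperparameter defaults shown are assumptions.

```python
import math
import torch

def configure_optimizers(model, weight_decay=0.1, learning_rate=6e-4, betas=(0.9, 0.95)):
    """Baseline sketch: AdamW with weight decay applied only to >=2D tensors
    (matmul weights, embeddings); 1D params (biases, norms) are undecayed."""
    params = [p for p in model.parameters() if p.requires_grad]
    optim_groups = [
        {"params": [p for p in params if p.dim() >= 2], "weight_decay": weight_decay},
        {"params": [p for p in params if p.dim() < 2], "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas)

def get_lr(it, warmup_iters, lr_decay_iters, learning_rate, min_lr):
    """Baseline sketch: linear warmup to learning_rate, then cosine decay to min_lr."""
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```

Any replacement must preserve this interface: the same five-argument `get_lr` call, and an optimizer object exposing `.zero_grad()`, `.step()`, and `.param_groups`.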
Evaluation
- Metric: Validation loss (cross-entropy, lower is better), plus perplexity (WikiText-2, LAMBADA) and downstream accuracy (ARC-Easy, HellaSwag, PIQA, WinoGrande)
- Model: GPT-2 Medium (24L/16H/1024D, ~355M params)
- Dataset: FineWeb 10B (GPT-2 tokenizer), ~7.1B tokens (D=20N Chinchilla-optimal)
- Training: 12030 iterations, BSZ=96, GA=6, 2-GPU DDP
- Hardware: H200 GPU
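The token budget above is internally consistent, which can be checked with a few lines of arithmetic. This assumes GPT-2's block size of 1024 tokens per sequence and that BSZ=96 is the global batch size across both GPUs; neither is stated explicitly above.

```python
# Total training tokens: iterations x global batch x grad accum x sequence length.
iters, batch_size, grad_accum, block_size = 12030, 96, 6, 1024
tokens = iters * batch_size * grad_accum * block_size
print(f"{tokens / 1e9:.2f}B tokens")  # ~7.10B, matching the ~7.1B figure

# Chinchilla-optimal check: D = 20N with N ~ 355M params.
n_params = 355e6
print(f"{20 * n_params / 1e9:.2f}B tokens")  # 7.10B
```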
Code
custom_pretrain.py
```python
"""Custom GPT-2 Pretraining Script
Based on Andrej Karpathy's nanoGPT, evaluated on FineWeb dataset.
"""

import math
import inspect
import os
import time
from contextlib import nullcontext
from dataclasses import dataclass

import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F
```
Additional context files (read-only):
- nanoGPT/model.py
- nanoGPT/train.py
Results
| Model | Type | val loss gpt-345m ↓ | wikitext2 ppl gpt-345m ↓ | lambada ppl gpt-345m ↓ | arc easy lm-eval-345m ↑ | hellaswag lm-eval-345m ↑ |
|---|---|---|---|---|---|---|
| adamw_nesterov | baseline | 2.323 | 46.960 | 71.820 | 55.180 | 32.750 |
| lion | baseline | 2.203 | 38.960 | 60.050 | 58.210 | 35.640 |
| muon | baseline | 2.200 | 37.980 | 60.080 | 60.190 | 36.850 |
| claude-opus-4.6 | vanilla | 2.200 | 37.630 | 59.710 | 60.140 | 36.880 |
| deepseek-reasoner | vanilla | 2.310 | 45.120 | 69.370 | 53.620 | 32.810 |
| gemini-3.1-pro-preview | vanilla | 2.222 | 39.110 | 62.340 | 60.100 | 35.410 |
| gpt-5.4 | vanilla | 2.255 | 42.420 | 67.940 | 57.910 | 34.050 |
| qwen3.6-plus | vanilla | 6.981 | 5585.830 | 4826.540 | 29.000 | 25.310 |
| claude-opus-4.6 | agent | 2.221 | 39.780 | 61.800 | 58.710 | 35.950 |
| deepseek-reasoner | agent | 2.310 | 45.120 | 69.370 | 53.620 | 32.810 |
| gemini-3.1-pro-preview | agent | 2.198 | 38.200 | 59.710 | 59.470 | 36.770 |
| gpt-5.4 | agent | 2.247 | 42.020 | 64.970 | 57.150 | 33.980 |
| qwen3.6-plus | agent | 2.173 | 37.140 | 59.580 | 59.640 | 37.010 |