llm-pretrain-normalization

Tags: Language Models · lm-evaluation-harness · nanoGPT · rigorous codebase

Description

LLM Pretraining: Normalization & Block Architecture Optimization

Research Question

Design an improved normalization scheme and/or transformer block architecture for GPT-2 language-model pretraining. Your modifications should reduce validation loss compared to the standard baseline of LayerNorm with a Pre-LN block structure.

What You Can Modify

Two regions in custom_pretrain.py:

  1. LayerNorm class (lines 23-31): The normalization implementation
  2. Block class (lines 89-100): How attention and MLP are composed with residual connections

You can modify:

  • The normalization algorithm (default: LayerNorm with bias)
  • Where normalization is applied (Pre-LN, Post-LN, or other placements)
  • The residual connection structure
  • How attention and MLP sublayers are combined (sequential vs parallel)
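As a concrete illustration of both levers, here is a sketch of an RMSNorm module (no mean subtraction, no bias, unlike the default LayerNorm) and a parallel GPT-J-style block in which attention and MLP read the same normalized input and share one residual sum. The class names and the `nn.Linear` stand-ins for the attention and MLP sublayers are illustrative only and do not match `custom_pretrain.py`:

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Rescale by the root-mean-square of the last dimension.

    Unlike LayerNorm it subtracts no mean and has no bias term,
    only a learned per-channel gain."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight


class ParallelBlock(nn.Module):
    """Parallel composition: x + attn(norm(x)) + mlp(norm(x)).

    Contrast with the sequential Pre-LN default:
    x = x + attn(norm1(x)); x = x + mlp(norm2(x))."""

    def __init__(self, dim: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.norm = RMSNorm(dim)  # single shared norm for both sublayers
        self.attn = attn
        self.mlp = mlp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        return x + self.attn(h) + self.mlp(h)
```

The parallel form saves one normalization per block and lets the two sublayers run from the same activations; whether that helps validation loss at this scale is exactly what the benchmark measures.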

Evaluation

  • Metric: Validation loss (cross-entropy, lower is better), plus perplexity (WikiText-2, LAMBADA) and downstream accuracy (ARC-Easy, HellaSwag, PIQA, WinoGrande)
  • Model: GPT-2 Medium (24L/16H/1024D, ~355M params)
  • Dataset: FineWeb 10B (GPT-2 tokenizer), ~7.1B tokens (D=20N Chinchilla-optimal)
  • Training: 12,030 iterations, batch size (BSZ) 96, gradient accumulation (GA) 6, 2-GPU DDP
  • Hardware: 2× NVIDIA H200 GPUs
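The training budget above can be sanity-checked with a little arithmetic, assuming the nanoGPT default sequence length of 1024 tokens and that BSZ=96 is the global batch across both GPUs (both are assumptions, not stated above). Perplexity is simply the exponential of the cross-entropy loss, which connects the val-loss and perplexity columns in the results:

```python
import math

# Assumptions: sequence length 1024 (nanoGPT default), BSZ=96 global.
seq_len = 1024
tokens_per_iter = 96 * 6 * seq_len       # batch * grad-accum * seq len
total_tokens = tokens_per_iter * 12030   # ~7.1B, matching D = 20N for N ~ 355M

# Perplexity = exp(cross-entropy), so the rmsnorm baseline val loss of
# 2.295 corresponds to a validation-set perplexity of roughly 9.9.
# (The WikiText-2 / LAMBADA perplexities are measured on other corpora.)
baseline_ppl = math.exp(2.295)
```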

Code

custom_pretrain.py
1"""Custom GPT-2 Pretraining Script
2Based on Andrej Karpathy's nanoGPT, evaluated on FineWeb dataset.
3"""
4
5import math
6import inspect
7import os
8import time
9from contextlib import nullcontext
10from dataclasses import dataclass
11
12import numpy as np
13import torch
14import torch.nn as nn
15from torch.nn import functional as F

Additional context files (read-only):

  • nanoGPT/model.py

Results

| Model | Type | Val loss (gpt-345m) | WikiText-2 PPL (gpt-345m) | LAMBADA PPL (gpt-345m) | ARC-Easy (lm-eval-345m) | HellaSwag (lm-eval-345m) |
|---|---|---|---|---|---|---|
| rmsnorm | baseline | 2.295 | 44.750 | 68.290 | 54.970 | 33.250 |
| rmsnorm_parallel | baseline | 2.311 | 45.980 | 70.960 | 54.760 | 32.930 |
| rmsnorm_post | baseline | 2.310 | 46.800 | 72.080 | 54.760 | 33.030 |
| claude-opus-4.6 | vanilla | 2.308 | 45.780 | 70.820 | 55.300 | 32.790 |
| deepseek-reasoner | vanilla | 2.259 | 41.780 | 64.480 | 56.860 | 34.400 |
| gemini-3.1-pro-preview | vanilla | 2.286 | 44.380 | 67.870 | 54.290 | 33.230 |
| gpt-5.4 | vanilla | - | - | - | 25.080 | 25.040 |
| qwen3.6-plus | vanilla | 2.305 | 45.910 | 70.510 | 54.460 | 33.000 |
| claude-opus-4.6 | agent | 2.308 | 45.780 | 70.820 | 55.300 | 32.790 |
| deepseek-reasoner | agent | 2.259 | 41.780 | 64.480 | 56.860 | 34.400 |
| gemini-3.1-pro-preview | agent | 2.280 | 44.120 | 67.070 | 56.190 | 33.790 |
| gpt-5.4 | agent | 2.322 | 46.000 | 70.250 | 53.490 | 32.820 |
| qwen3.6-plus | agent | 2.296 | 45.470 | 69.110 | 55.180 | 33.780 |

Agent Conversations