llm-pretrain-normalization
Language Models · lm-evaluation-harness · nanoGPT · rigorous codebase
Description
LLM Pretraining: Normalization & Block Architecture Optimization
Research Question
Design an improved normalization scheme and/or transformer block architecture for GPT-2 language-model pretraining. Your modifications should reduce validation loss relative to the standard LayerNorm with a Pre-LN block structure.
What You Can Modify
Two regions in custom_pretrain.py:
- LayerNorm class (lines 23-31): The normalization implementation
- Block class (lines 89-100): How attention and MLP are composed with residual connections
You can modify:
- The normalization algorithm (default: LayerNorm with bias)
- Where normalization is applied (Pre-LN, Post-LN, or other placements)
- The residual connection structure
- How attention and MLP sublayers are combined (sequentially vs. in parallel); a minimal sketch follows this list
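As one illustration of the kind of change in scope, here is a minimal sketch of an RMSNorm drop-in plus a parallel (GPT-J-style) block. It assumes the editable regions see the same `CausalSelfAttention`, `MLP`, and `config.n_embd` interfaces as nanoGPT's model.py; treat it as a sketch of the design space, not a reference solution:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: rescales by the root-mean-square of the features.
    Unlike the default LayerNorm there is no mean subtraction and no bias."""
    def __init__(self, ndim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class ParallelBlock(nn.Module):
    """GPT-J-style parallel block: attention and MLP both read one shared
    normalized input and are summed into a single residual add, replacing
    the default sequential Pre-LN composition."""
    def __init__(self, config):
        super().__init__()
        self.ln = RMSNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)  # assumed: defined elsewhere in the script
        self.mlp = MLP(config)                   # assumed: defined elsewhere in the script

    def forward(self, x):
        h = self.ln(x)
        return x + self.attn(h) + self.mlp(h)
```

The rmsnorm and rmsnorm_parallel baselines in the results table below correspond to these two ideas.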
Evaluation
- Metric: Validation loss (cross-entropy, lower is better), plus perplexity on WikiText-2 and LAMBADA and downstream accuracy on ARC-Easy, HellaSwag, PIQA, and WinoGrande
- Model: GPT-2 Medium (24 layers, 16 heads, 1024-dim embeddings; ~355M params)
- Dataset: FineWeb 10B sample (GPT-2 tokenizer); ~7.1B tokens trained (D = 20N, Chinchilla-optimal)
- Training: 12030 iterations, batch size 96 with 6 gradient-accumulation steps (BSZ=96, GA=6), 2-GPU DDP; see the token-budget check after this list
- Hardware: 2× NVIDIA H200 GPUs
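Two relationships tie these numbers together. The token budget implied by the training settings matches the quoted ~7.1B, under the assumptions that the block size is 1024 and that BSZ=96 counts sequences per iteration before gradient accumulation (one consistent reading of the config). And validation loss converts to perplexity via ppl = exp(loss). A quick check:

```python
import math

# Assumed reading of the training config (block_size=1024 is an assumption):
iters, bsz, grad_accum, block_size = 12030, 96, 6, 1024
tokens = iters * bsz * grad_accum * block_size
print(f"tokens trained:     {tokens:.3e}")      # ~7.10e9
print(f"Chinchilla D = 20N: {20 * 355e6:.3e}")  # ~7.10e9 for N ~ 355M

# Perplexity is exp(cross-entropy), e.g. for the rmsnorm baseline's val loss:
print(f"implied val ppl: {math.exp(2.295):.2f}")  # ~9.92
```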
Code
custom_pretrain.py
(editable)
1"""Custom GPT-2 Pretraining Script2Based on Andrej Karpathy's nanoGPT, evaluated on FineWeb dataset.3"""45import math6import inspect7import os8import time9from contextlib import nullcontext10from dataclasses import dataclass1112import numpy as np13import torch14import torch.nn as nn15from torch.nn import functional as F
Additional context files (read-only):
nanoGPT/model.py
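Since the script is based on nanoGPT, the two editable regions presumably default to the implementations in nanoGPT's model.py (the read-only context file above). A sketch of those Pre-LN defaults, i.e. the baseline the task asks you to beat, using the imports from the excerpt above:

```python
class LayerNorm(nn.Module):
    """LayerNorm with an optional bias (PyTorch's own doesn't support bias=False)."""
    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)

class Block(nn.Module):
    """Standard Pre-LN block: normalize, apply sublayer, add residual;
    attention and MLP run sequentially."""
    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```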
Results
All numbers are for models at the ~345M-parameter (GPT-2 Medium) scale.

| Model | Type | Val loss ↓ | WikiText-2 PPL ↓ | LAMBADA PPL ↓ | ARC-Easy acc (%) ↑ | HellaSwag acc (%) ↑ |
|---|---|---|---|---|---|---|
| rmsnorm | baseline | 2.295 | 44.750 | 68.290 | 54.970 | 33.250 |
| rmsnorm_parallel | baseline | 2.311 | 45.980 | 70.960 | 54.760 | 32.930 |
| rmsnorm_post | baseline | 2.310 | 46.800 | 72.080 | 54.760 | 33.030 |
| claude-opus-4.6 | vanilla | 2.308 | 45.780 | 70.820 | 55.300 | 32.790 |
| deepseek-reasoner | vanilla | 2.259 | 41.780 | 64.480 | 56.860 | 34.400 |
| gemini-3.1-pro-preview | vanilla | 2.286 | 44.380 | 67.870 | 54.290 | 33.230 |
| gpt-5.4 | vanilla | - | - | - | 25.080 | 25.040 |
| qwen3.6-plus | vanilla | 2.305 | 45.910 | 70.510 | 54.460 | 33.000 |
| claude-opus-4.6 | agent | 2.308 | 45.780 | 70.820 | 55.300 | 32.790 |
| deepseek-reasoner | agent | 2.259 | 41.780 | 64.480 | 56.860 | 34.400 |
| gemini-3.1-pro-preview | agent | 2.280 | 44.120 | 67.070 | 56.190 | 33.790 |
| gpt-5.4 | agent | 2.322 | 46.000 | 70.250 | 53.490 | 32.820 |
| qwen3.6-plus | agent | 2.296 | 45.470 | 69.110 | 55.180 | 33.780 |