llm-pretrain-attention
Tags: Language Models, lm-evaluation-harness, nanoGPT, rigorous codebase
Description
LLM Pretraining: Attention Mechanism Optimization
Research Question
Design an improved self-attention mechanism for GPT-2 language model pretraining. Your modifications should reduce validation loss compared to the standard multi-head attention with learned absolute position embeddings.
What You Can Modify
The CausalSelfAttention class (lines 34-70 in custom_pretrain.py), including:
- Position encoding scheme (the default uses learned absolute position embeddings via wpe)
- Query/Key/Value computation and projection
- Attention score computation and masking
- Any attention-related hyperparameters
Note: If your attention mechanism implements its own position encoding (replacing the learned wpe), set self.use_pos_emb = False in __init__ — the model will then skip adding position embeddings in the forward pass.
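For illustration, a minimal sketch of such a replacement is shown below: a causal self-attention module that applies rotary position embeddings (RoPE) to the queries and keys and sets self.use_pos_emb = False so the learned wpe table is skipped. The class name RotaryCausalSelfAttention is hypothetical, and the config fields (n_embd, n_head, block_size, dropout) are assumed to follow nanoGPT's GPTConfig; adjust them to whatever custom_pretrain.py actually exposes.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F


class RotaryCausalSelfAttention(nn.Module):
    """Sketch of a RoPE-based causal self-attention block (not the reference solution)."""

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_dim = config.n_embd // config.n_head
        self.dropout = config.dropout
        # combined Q/K/V projection and output projection, as in nanoGPT
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)
        # RoPE supplies position information, so tell the model to skip wpe
        self.use_pos_emb = False
        # precompute rotary angles for the maximum context length
        inv_freq = 1.0 / (10000 ** (torch.arange(0, self.head_dim, 2).float() / self.head_dim))
        t = torch.arange(config.block_size).float()
        freqs = torch.outer(t, inv_freq)  # (block_size, head_dim // 2)
        self.register_buffer("cos", freqs.cos(), persistent=False)
        self.register_buffer("sin", freqs.sin(), persistent=False)

    def _apply_rope(self, x, T):
        # x: (B, n_head, T, head_dim); rotate consecutive channel pairs
        cos = self.cos[:T].to(x.dtype).view(1, 1, T, -1)
        sin = self.sin[:T].to(x.dtype).view(1, 1, T, -1)
        x1, x2 = x[..., 0::2], x[..., 1::2]
        rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        return rotated.flatten(-2)

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)  # (B, nh, T, hd)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        q, k = self._apply_rope(q, T), self._apply_rope(k, T)
        # fused attention kernel handles scaling, causal masking, and dropout
        y = F.scaled_dot_product_attention(
            q, k, v,
            dropout_p=self.dropout if self.training else 0.0,
            is_causal=True,
        )
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```

Note that rope is already one of the baselines in the Results table, so a competitive submission will likely need to go beyond a plain RoPE swap.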
Evaluation
- Metric: Validation loss (cross-entropy, lower is better), plus perplexity (WikiText-2, LAMBADA) and downstream accuracy (ARC-Easy, HellaSwag, PIQA, WinoGrande)
- Model: GPT-2 Medium (24 layers, 16 heads, 1024-dim embeddings; ~355M params)
- Dataset: FineWeb 10B (GPT-2 tokenizer), ~7.1B training tokens (D = 20N, Chinchilla-optimal)
- Training: 13,535 iterations, batch size 64, gradient accumulation 8, 2-GPU DDP
- Hardware: 2× NVIDIA H200 GPUs
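As a quick check of the token budget above (not part of the training configuration), the D = 20N rule applied to a ~355M-parameter model gives roughly the quoted ~7.1B tokens:

```python
# Chinchilla-style compute-optimal data budget: D = 20 * N
n_params = 355e6                       # ~355M parameters (GPT-2 Medium)
tokens = 20 * n_params                 # target number of training tokens
print(f"~{tokens / 1e9:.1f}B tokens")  # -> ~7.1B tokens
```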
Code
custom_pretrain.py (editable)
1"""Custom GPT-2 Pretraining Script2Based on Andrej Karpathy's nanoGPT, evaluated on FineWeb dataset.3"""45import math6import inspect7import os8import time9from contextlib import nullcontext10from dataclasses import dataclass1112import numpy as np13import torch14import torch.nn as nn15from torch.nn import functional as F
Additional context files (read-only):
nanoGPT/model.py
Results
| Model | Type | Val loss ↓ | WikiText-2 PPL ↓ | LAMBADA PPL ↓ | ARC-Easy acc (%, lm-eval) ↑ | HellaSwag acc (%, lm-eval) ↑ |
|---|---|---|---|---|---|---|
| qk_norm | baseline | 2.288 | 43.650 | 69.990 | 55.640 | 33.410 |
| rope | baseline | 2.257 | 43.170 | 65.810 | 57.320 | 34.480 |
| rope_qk_norm | baseline | 2.259 | 43.440 | 67.200 | 57.830 | 34.240 |
| claude-opus-4.6 | vanilla | 2.246 | 42.220 | 66.130 | 58.380 | 34.600 |
| deepseek-reasoner | vanilla | 2.221 | 40.080 | 61.810 | 57.490 | 35.370 |
| gemini-3.1-pro-preview | vanilla | 2.260 | 43.060 | 65.370 | 56.570 | 34.570 |
| gpt-5.4 | vanilla | - | - | - | 25.080 | 25.040 |
| qwen3.6-plus | vanilla | 2.246 | 42.570 | 66.150 | 56.690 | 34.630 |
| claude-opus-4.6 | agent | 2.246 | 42.220 | 66.130 | 58.380 | 34.600 |
| deepseek-reasoner | agent | 2.221 | 40.080 | 61.810 | 57.490 | 35.370 |
| gemini-3.1-pro-preview | agent | 2.255 | 41.450 | 64.800 | 57.700 | 34.550 |
| gpt-5.4 | agent | - | - | - | 25.080 | 25.040 |
| qwen3.6-plus | agent | 2.246 | 42.570 | 66.150 | 56.690 | 34.630 |
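For reference, the qk_norm baseline rows above refer to normalizing queries and keys before the attention-score computation. A minimal sketch is given below; the exact normalization used by the baseline (LayerNorm vs. RMSNorm, with or without learned scales) is an assumption here, not taken from the source.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

def qk_norm_attention(q, k, v, q_norm: nn.Module, k_norm: nn.Module, dropout_p: float = 0.0):
    """QK-norm attention sketch.

    q, k, v have shape (batch, n_head, seq_len, head_dim); q_norm and k_norm
    normalize over head_dim (e.g. nn.LayerNorm(head_dim)). Normalizing q and k
    bounds the attention logits, which tends to stabilize pretraining.
    """
    q, k = q_norm(q), k_norm(k)
    return F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p, is_causal=True)

# Example usage with a per-head-dim LayerNorm (an illustrative choice):
# head_dim = 64
# q_norm, k_norm = nn.LayerNorm(head_dim), nn.LayerNorm(head_dim)
# y = qk_norm_attention(q, k, v, q_norm, k_norm)
```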