llm-pretrain-mlp

Language Models · lm-evaluation-harness · nanoGPT · rigorous codebase

Description

LLM Pretraining: Feed-Forward Network Optimization

Research Question

Design an improved feed-forward network (MLP) for GPT-2 language model pretraining. Your modifications should reduce validation loss compared to the standard GELU MLP.

What You Can Modify

The MLP class (lines 73-86 in custom_pretrain.py), including:

Activation function (default: GELU)
Network architecture (default: two linear layers with 4x expansion)
Gating mechanisms
Hidden dimension sizing

Constraint: The MLP must accept input of shape (B, T, n_embd) and return output of the same shape.
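As a rough illustration of the modification space listed above, the sketch below puts the default described here (two linear layers, 4x expansion, GELU) next to one gated alternative (SwiGLU). It is not the code from custom_pretrain.py: the class names, the `bias` flag, the 8/3 hidden-width rule, and the rounding to a multiple of 64 are all assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BaselineMLP(nn.Module):
    """Two linear layers with 4x expansion and GELU (hypothetical stand-in for the default)."""

    def __init__(self, n_embd: int, bias: bool = False):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd, bias=bias)
        self.c_proj = nn.Linear(4 * n_embd, n_embd, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.c_proj(F.gelu(self.c_fc(x)))


class SwiGLUMLP(nn.Module):
    """Gated variant: down(silu(gate(x)) * up(x))."""

    def __init__(self, n_embd: int, bias: bool = False, multiple_of: int = 64):
        super().__init__()
        # Assumption: use a hidden width of about 8/3 * n_embd so that three
        # weight matrices cost roughly as much as the two 4x matrices above.
        hidden = int(8 * n_embd / 3)
        # Round up to a multiple of `multiple_of` (a hardware-friendly size).
        hidden = multiple_of * ((hidden + multiple_of - 1) // multiple_of)
        self.w_gate = nn.Linear(n_embd, hidden, bias=bias)
        self.w_up = nn.Linear(n_embd, hidden, bias=bias)
        self.w_down = nn.Linear(hidden, n_embd, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


def count_params(module: nn.Module) -> int:
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())


# Check the shape constraint: (B, T, n_embd) in, (B, T, n_embd) out.
x = torch.randn(2, 8, 1024)
for mlp in (BaselineMLP(1024), SwiGLUMLP(1024)):
    y = mlp(x)
    assert y.shape == x.shape
    print(f"{type(mlp).__name__}: {count_params(mlp):,} parameters")
```

Keeping the parameter counts close is a design choice rather than a requirement of the task; it makes a loss difference easier to attribute to the gating itself instead of to extra capacity.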

Evaluation

Metric: Validation loss (cross-entropy, lower is better), plus perplexity (WikiText-2, LAMBADA) and downstream accuracy (ARC-Easy, HellaSwag, PIQA, WinoGrande)
Model: GPT-2 Medium (24L/16H/1024D, ~355M params)
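Dataset: FineWeb 10B (GPT-2 tokenizer), ~7.1B tokens (Chinchilla-optimal budget, D = 20N, where D is training tokens and N is parameter count)
Training: 12030 iterations, batch size (BSZ) 96, gradient accumulation (GA) 6, 2-GPU DDP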
Hardware: H200 GPU
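The token budget can be cross-checked against the training settings with a few lines of arithmetic. This sketch rests on two assumptions that the page does not state: a sequence length of 1024 tokens (the usual GPT-2 context), and that BSZ x GA already counts every sequence in one optimizer step across both GPUs.

```python
# Figures quoted in the Evaluation list above.
n_params = 355_000_000   # ~355M parameters (GPT-2 Medium)
iterations = 12030
batch_size = 96          # BSZ
grad_accum = 6           # GA

# Assumption (not stated in the source): 1024-token sequences.
block_size = 1024

# Assumption: batch_size * grad_accum is the global number of sequences
# per optimizer step, summed over both DDP ranks.
tokens_per_iter = batch_size * grad_accum * block_size
total_tokens = tokens_per_iter * iterations
chinchilla_tokens = 20 * n_params  # D = 20N

print(f"tokens per iteration: {tokens_per_iter:,}")    # 589,824
print(f"total tokens trained: {total_tokens:,}")       # 7,095,582,720
print(f"20N target:           {chinchilla_tokens:,}")  # 7,100,000,000
```

Under these assumptions the run covers about 7.1B tokens, which agrees with the stated D = 20N budget to within 0.1%.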

Code

custom_pretrain.py

"""Custom GPT-2 Pretraining Script
Based on Andrej Karpathy's nanoGPT, evaluated on FineWeb dataset.
"""

import math
import inspect
import os
import time
from contextlib import nullcontext
from dataclasses import dataclass

import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

Additional context files (read-only):

nanoGPT/model.py

Results

Model	Type	Val loss (gpt-345m) ↓	WikiText-2 PPL (gpt-345m) ↓	LAMBADA PPL (gpt-345m) ↓	ARC-Easy acc. (lm-eval-345m) ↑	HellaSwag acc. (lm-eval-345m) ↑
geglu	baseline	2.295	44.130	68.730	54.880	32.900
relu_squared	baseline	2.283	43.330	66.560	55.260	33.860
swiglu	baseline	2.292	44.330	66.810	54.710	33.400
claude-opus-4.6	vanilla	2.303	44.110	71.720	54.760	32.670
deepseek-reasoner	vanilla	2.313	44.300	68.150	52.650	33.320
gemini-3.1-pro-preview	vanilla	2.286	44.760	69.100	55.770	33.610
gpt-5.4	vanilla	2.284	43.230	67.220	52.860	33.190
qwen3.6-plus	vanilla	2.300	43.710	66.460	54.420	33.340
claude-opus-4.6	agent	2.299	43.970	68.070	54.120	33.620
deepseek-reasoner	agent	2.214	38.920	61.740	57.370	35.210
gemini-3.1-pro-preview	agent	2.292	43.350	66.230	54.420	33.340
gpt-5.4	agent	2.321	45.330	70.550	54.670	32.840
qwen3.6-plus	agent	2.300	43.710	66.460	54.420	33.340
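
To put the validation-loss column on a more intuitive scale, a cross-entropy loss can be converted to perplexity with an exponential. The sketch below assumes the loss is measured in nats per token, as is usual for PyTorch's cross-entropy. It uses a handful of values copied from the table, and the dictionary keys are labels chosen for this example. The resulting perplexities describe the validation split only and are not comparable to the WikiText-2 or LAMBADA columns.

```python
import math

# Validation losses copied from the results table above.
val_loss = {
    "relu_squared (baseline)": 2.283,
    "swiglu (baseline)": 2.292,
    "geglu (baseline)": 2.295,
    "deepseek-reasoner (agent)": 2.214,
}


def perplexity(loss_nats: float) -> float:
    """Perplexity for a cross-entropy loss given in nats per token (assumed unit)."""
    return math.exp(loss_nats)


# Report each entry relative to the lowest-loss baseline in the table.
reference = val_loss["relu_squared (baseline)"]
for name, loss in val_loss.items():
    print(
        f"{name:28s} loss={loss:.3f} "
        f"ppl={perplexity(loss):.2f} "
        f"delta={loss - reference:+.3f}"
    )
```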

Agent Conversations

deepseek-reasoner

7 steps

gemini-3.1-pro-preview