MLS-Bench

A holistic and rigorous assessment of AI systems on building better AI.

MLS-Bench overview. Left: comparison of Frontier-CS, MLE-Bench, and MLS-Bench tasks. Right: 20 representative MLS-Bench tasks drawn from the 140 tasks across 12 domains.
UC Berkeley, Princeton University, Tsinghua University, Purdue University, University of Washington, Harvard University, University of Pennsylvania, Shanghai Jiao Tong University, UC San Diego, Carnegie Mellon University
Why this benchmark

Methods that stood the test of time and scale.

Modern AI progress is built on a small set of reusable ideas — convolutions, residual connections, attention, normalization — that generalize across architectures and survive every order-of-magnitude jump in scale.

Convolution

Weight-shared receptive fields that scaled vision models.

Word2Vec

Distributed embeddings transferable across NLP tasks.

GAN

Adversarial generator–discriminator game for sample synthesis.

Adam

Adaptive moment estimation that became the default optimizer.

U-Net

Encoder–decoder with skip links — vision and diffusion staple.

PPO

Clipped policy ratio that made deep RL stable to scale.

RMSNorm

Mean-free normalization, faster and surprisingly sufficient.

RoPE

Rotary position encoding that scales with context length.

FlashAttention

IO-aware exact attention that scaled context length.

LSTM

Gated recurrence enabling long-range sequence learning.

Dropout

Random unit masking that became the standard regularizer.

BatchNorm

Normalizing activations across the batch to stabilize training.

ResNet

Residual connections enabling 100+ layer training.

Transformer

Self-attention as the universal sequence operator.

Mixup

Input–label interpolation that improved generalization.

DDPM

Denoising diffusion: learn to invert a noise process.

LoRA

Low-rank adapters for parameter-efficient finetuning.

MLS-Bench tests whether AI agents can invent the next ones.

Each task isolates a well-defined research question and asks the agent to propose a single modular improvement — a new loss, an attention variant, a sampler, a routing rule — then measures whether the change transfers across models, datasets, and seeds.
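The transfer check described above can be sketched as a loop over every (model, dataset, seed) setting. This is a minimal illustration rather than the benchmark's actual harness: `transfer_report`, the `run` callback, and the `"baseline"`/`"candidate"` labels are hypothetical names introduced here, and the sketch assumes a higher score is better.

```python
from itertools import product
from statistics import mean
from typing import Callable, Dict, Sequence


def transfer_report(
    run: Callable[[str, str, str, int], float],
    models: Sequence[str],
    datasets: Sequence[str],
    seeds: Sequence[int],
) -> Dict[str, float]:
    """Compare a candidate change against the baseline in every setting.

    `run(variant, model, dataset, seed)` is a hypothetical callback that
    trains and evaluates one configuration and returns a score where
    higher is better (an assumption of this sketch).
    """
    gains = []
    for model, dataset, seed in product(models, datasets, seeds):
        base = run("baseline", model, dataset, seed)
        cand = run("candidate", model, dataset, seed)
        gains.append(cand - base)

    return {
        # Average improvement over all settings.
        "mean_gain": mean(gains),
        # Worst single setting: negative means the change hurt somewhere.
        "worst_gain": min(gains),
        # Fraction of settings in which the candidate beat the baseline.
        "win_rate": sum(g > 0 for g in gains) / len(gains),
    }
```

Reporting the worst-case gain and the win rate alongside the mean makes it harder for a change that helps one setting a lot, but hurts others, to look like a general improvement.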

140 executable tasks across 12 domains, each built around a targeted ML component, a controlled edit surface, and multi-setting evidence for transfer.

Model Performance by Category

Each model's bar shows Vanilla as the darker lower portion and Agent as the lighter overlay, against a translucent grey Human SOTA reference computed from the reproduced human baselines. Scores use the paper's normalized task metric.

Models: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, DeepSeek-V3.2, Qwen 3.6 Plus
Legend: Vanilla (darker), Agent (lighter), Human SOTA

MLS-Bench-Lite Intelligence

Average normalized score on the 30-task Lite subset.

Lite is only a subset. We recommend evaluating your harness and model on the full 140-task benchmark. For our in-depth analysis of the five main models on all 140 tasks, see our blog or the arXiv paper.