About MLS-Bench

What is MLS-Bench?

MLS-Bench (Machine Learning Science Benchmark) evaluates whether LLM agents can make generalizable, atomic ML science contributions — the kind of discoveries researchers make daily by modifying model architectures, loss functions, optimization strategies, and training procedures.

ML Science vs ML Engineering

We draw a sharp distinction between ML science and ML engineering:

  • ML Engineering is holistic: combine many techniques (feature engineering, ensembling, hyperparameter tuning, data augmentation, ...) to maximize a metric on one specific dataset or competition. This is what MLE-Bench evaluates.
  • ML Science is atomic and generalizable: discover a single modular improvement — like replacing LayerNorm with RMSNorm, inventing a new activation function, designing a better learning rate schedule — that transfers across models, datasets, and tasks.
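The LayerNorm-to-RMSNorm swap mentioned above is a good illustration of an atomic contribution: a few lines change, yet the modification transfers to any architecture that normalizes activations. A minimal NumPy sketch of the two normalizers (not code from the benchmark itself, just an illustration of the distinction):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Standard LayerNorm: center by the mean, then scale by the variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-6):
    # RMSNorm drops mean-centering and rescales by the root-mean-square only.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.randn(2, 8)
print(layer_norm(x, np.ones(8), np.zeros(8)).shape)  # (2, 8)
print(rms_norm(x, np.ones(8)).shape)                 # (2, 8)
```

Because the change is confined to a single module with a fixed interface, its effect can be measured in isolation — exactly the kind of swap an MLS-Bench task asks the agent to discover.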

Benchmark Design

Each task isolates a well-defined research question: the agent must propose and implement a novel algorithmic component, then demonstrate its effectiveness under controlled evaluation. Tasks span multiple domains, including reinforcement learning, computer vision, language models, and AI for science.

Evaluation

Agents are evaluated on whether their proposed modifications improve upon established baselines. Each task provides:

  • A clear task description with the research question
  • Annotated source code with editable regions
  • Multiple baseline implementations for comparison
  • Automated evaluation across multiple random seeds
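The seed-averaged comparison in the last bullet can be sketched as follows. This is an illustrative harness, not MLS-Bench's actual evaluation code; `run_trial` and `toy_trial` are hypothetical names standing in for a real training-and-scoring run:

```python
import random
import statistics

def evaluate(run_trial, method, seeds):
    # Run one method across several random seeds and collect its scores.
    return [run_trial(method, seed) for seed in seeds]

def compare(run_trial, candidate, baseline, seeds=range(5)):
    # Mean-over-seeds comparison; a real harness would also report variance
    # and a significance test, not just the difference in means.
    cand = evaluate(run_trial, candidate, seeds)
    base = evaluate(run_trial, baseline, seeds)
    return statistics.mean(cand) - statistics.mean(base)

def toy_trial(method, seed):
    # Toy stand-in for a training run: seeded noise plus a fixed
    # bonus for the (hypothetical) candidate method.
    rng = random.Random(seed)
    bonus = 0.1 if method == "candidate" else 0.0
    return 0.5 + bonus + rng.gauss(0, 0.01)

delta = compare(toy_trial, "candidate", "baseline")
print(delta > 0)  # True: the candidate improves on the baseline
```

Averaging over seeds matters because a single run can favor either method by chance; pairing the same seeds across candidate and baseline (as above) further reduces that noise.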

Citation

If you use MLS-Bench in your research, please cite:

@article{mlsbench2026,
  title={MLS-Bench: A Benchmark for Machine Learning Science},
  author={...},
  year={2026}
}