MLS-Bench Paper Notes

Can AI Systems Build Better AI?

MLS-Bench asks whether an agent can make a reusable machine-learning science contribution: a new component, objective, optimizer, or training procedure that is scalable and generalizable.

Most agent benchmarks reward engineering around one fixed instance: clean the data, tune the pipeline, debug the run, select the model, and climb a leaderboard. Those skills matter, but they do not isolate the scientific step that changes how machine learning systems are built.

MLS-Bench targets that narrower and harder question. Each task fixes a research scaffold, gives the agent the relevant source code and strong baseline implementations, then asks for one algorithmic change inside a controlled edit surface. The submitted result has to generalize across settings, seeds, datasets, environments, or model scales.

The current result is clear: frontier agents can code and iterate, but they are still far from reliably surpassing strong human-designed methods. The gap is not just proposal quality. It is also scientific judgment: choosing what to test, spending limited compute, reading feedback, and knowing when a result supports a scalable claim.

What the Field Is Actually Working On

One direction in agent research is conspicuously crowded right now: self-evolve. Search over executable artifacts, score the outputs, keep the best ones, repeat. Almost every recent "AI discovers" headline comes out of some version of this loop. The shape is real and the wins are real — the question is what the wins have actually been wins on.
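
A minimal sketch of that shared loop, in Python. The propose and evaluate callables are hypothetical stand-ins for an LLM mutation step and an executable scorer; nothing here is taken from any particular system's code.

    import random

    def self_evolve(seed_program, propose, evaluate, budget=100, population_size=8):
        """Sketch of the shared loop: mutate executable artifacts, score them
        with a verifier, keep the best, repeat. `propose` and `evaluate` are
        hypothetical callables, not any system's real API."""
        population = [(evaluate(seed_program), seed_program)]
        for _ in range(budget):
            # Pick a surviving parent and ask the proposer for a variation.
            _, parent = random.choice(population)
            child = propose(parent)
            score = evaluate(child)  # executable, scalar feedback
            population.append((score, child))
            # Keep only the highest-scoring artifacts.
            population.sort(key=lambda p: p[0], reverse=True)
            population = population[:population_size]
        return population[0]  # best (score, program) found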

The Self-Evolve Wave Everyone Is Riding

A non-exhaustive snapshot of the systems in this cluster. Every one of them shares the same loop, and most of them cite each other in the same papers. Listed in no particular order.

AlphaEvolve

Programs · kernels · circuits · math

DeepMind's tree-search code evolution, demonstrated on GPU/TPU kernels, TPU multiplier circuits, and math frontier problems.

FunSearch

Programs for combinatorial search

LLM-guided program search; new lower bounds on cap-set and online bin-packing.

OpenEvolve

Open-source AlphaEvolve loop

Community implementation evolving code against an executable scorer.

ThetaEvolve

Test-time training runs

Adapts solvers during test time using executable feedback on open problems.

ShinkaEvolve

Programs · sample-efficient evolution

Self-improving program evolution focused on cheaper iteration steps.

AlphaActivation

Activation functions

Searches activation formulas and tests how well they transfer across model settings.

AVO

Execution-grounded operators

Evolves variation operators using executable feedback from candidate runs.

AI Scientist v2

Autonomous research loops

Sakana's loop spans ideation, experiments, writing, and self-review on small studies.

…And What Those Wins Are Actually On

Run down the headline results from this cluster and a pattern jumps out. Pack circles in a square. Save one multiplication on 2×2 matmul. Trim a constant in an online bin-packing heuristic. Find a slightly larger cap set. Nudge a kernel or a multiplier circuit on one specific chip. They are exactly the right shape for an executable verifier — sharp, scalar, low-dimensional — and exactly the wrong shape for a claim about ML science.

Circle Packing

AlphaEvolve · pack N in a square

Find a tighter packing of N equal circles in the unit square. Headline result: a slightly improved arrangement for N=11.

2×2 Matmul Scheme

AlphaEvolve · scalar multiplications

Multiply two 2×2 matrices in fewer than 8 elementary multiplications. Strassen reached 7 in 1969; the agent matched it (a worked check of the scheme follows this list).

Matmul GPU Kernel

AlphaEvolve · one tile, one accelerator

Discover a faster low-level kernel for one fixed tile size on one accelerator (e.g. a FlashAttention building block).

TPU Multiplier Circuit

AlphaEvolve · fixed bit-width

Search for a smaller arithmetic circuit on one chip at a fixed bit-width, scored by gate / transistor count.

Online Bin Packing

FunSearch · worst-case ratio

Place arriving items into the fewest bins; tighten a small constant in the worst-case ratio of an online heuristic.

Cap Sets in F₃ⁿ

FunSearch · combinatorial bound

Find a larger subset of {0,1,2}ⁿ with no 3-term arithmetic progression. New lower bound on a single combinatorial constant.
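
To make one of those targets concrete, the sketch below checks the 2×2 matmul card above: Strassen's 7-multiplication identities, the 1969 human result the agent matched. These are the standard published identities, not code from any of the systems listed.

    import random

    def strassen_2x2(A, B):
        """Multiply two 2x2 matrices (nested tuples) with 7 scalar
        multiplications instead of 8, using Strassen's 1969 identities."""
        (a11, a12), (a21, a22) = A
        (b11, b12), (b21, b22) = B
        m1 = (a11 + a22) * (b11 + b22)
        m2 = (a21 + a22) * b11
        m3 = a11 * (b12 - b22)
        m4 = a22 * (b21 - b11)
        m5 = (a11 + a12) * b22
        m6 = (a21 - a11) * (b11 + b12)
        m7 = (a12 - a22) * (b21 + b22)
        return ((m1 + m4 - m5 + m7, m3 + m5),
                (m2 + m4, m1 - m2 + m3 + m6))

    def naive_2x2(A, B):
        """Ordinary 8-multiplication product, used as the correctness check."""
        (a11, a12), (a21, a22) = A
        (b11, b12), (b21, b22) = B
        return ((a11 * b11 + a12 * b21, a11 * b12 + a12 * b22),
                (a21 * b11 + a22 * b21, a21 * b12 + a22 * b22))

    A = ((random.random(), random.random()), (random.random(), random.random()))
    B = ((random.random(), random.random()), (random.random(), random.random()))
    assert all(abs(s - n) < 1e-9
               for row_s, row_n in zip(strassen_2x2(A, B), naive_2x2(A, B))
               for s, n in zip(row_s, row_n))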

That is the gap MLS-Bench is built for. The executable, agentic shape these systems pioneered is genuinely powerful, but applying it to circle packing or 2×2 matmul schemes is not the same thing as discovering an ML method. MLS-Bench keeps the harness and moves the target: discover a real ML-science change — a model component, objective, optimizer, or training procedure — that survives across datasets, seeds, environments, and model scales.

What MLS-Bench Measures

A task is meant to feel like a small research iteration rather than a general coding assignment. The benchmark covers 140 tasks across 12 areas of ML — language models, reinforcement learning, robotics, vision and generation, ML systems, AI for science, optimization, time series, causal reasoning, trustworthy learning, classical learning, and deep learning.

The key design constraint is attribution. Agents can edit the target component, but they cannot rewrite the evaluation harness, change the shared training protocol, inflate model capacity, or probe hidden settings. Baselines are implemented in the same scaffold, so a score gain is measured against methods the task can already express.

Scores are normalized per task. The weakest reproduced baseline anchors the floor, the strongest reproduced baseline anchors the human-SOTA reference, and each method is aggregated across metrics and settings. That makes cross-area comparisons possible without hiding the fact that raw metrics differ by domain.
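
A minimal sketch of what that normalization implies, assuming a simple linear rescaling between the two anchors (the paper's exact formula and aggregation weights are not reproduced here):

    def normalize_task_score(raw, weakest, strongest, higher_is_better=True):
        """Map a raw metric onto a per-task scale where the weakest reproduced
        baseline sits at 0 and the strongest reproduced baseline (the human-SOTA
        reference) sits at 1. A sketch of the idea, not the paper's exact formula."""
        if not higher_is_better:  # e.g. perplexity or error rate
            raw, weakest, strongest = -raw, -weakest, -strongest
        span = strongest - weakest
        return (raw - weakest) / span if span else 0.0

    def aggregate(normalized_scores):
        """Aggregate normalized scores across metrics and settings (plain mean here)."""
        return sum(normalized_scores) / len(normalized_scores)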

Figure: compute profile, showing GPU versus CPU tasks and the per-task GPU-hour distribution. The full suite is intentionally broad; MLS-Bench-Lite gives a smaller 30-task track for faster iteration.

Main Result

The main experiments evaluate Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, DeepSeek-V3.2, and Qwen 3.6 Plus on the full benchmark. Agents receive baseline implementations and can make up to three test calls before submitting a final result. The paper reports both the first proposal, called Vanilla, and the final submitted Agent result.
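
A rough sketch of that protocol, under the assumption that each test call returns limited feedback the agent can revise against; run_agent, test_call, and final_eval are hypothetical hooks, not the paper's harness API:

    MAX_TEST_CALLS = 3

    def evaluate_on_task(task, run_agent, test_call, final_eval):
        """Sketch of the reported setup: the agent sees the baselines, may spend
        up to three intermediate test calls, then submits once. The first proposal
        is scored as Vanilla, the final submission as Agent."""
        vanilla = run_agent(task, feedback=None)  # first proposal
        submission = vanilla
        for _ in range(MAX_TEST_CALLS):
            feedback = test_call(task, submission)  # limited intermediate signal
            submission = run_agent(task, feedback=feedback)
        return {"Vanilla": final_eval(task, vanilla),
                "Agent": final_eval(task, submission)}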

Iteration helps, but mostly by narrowing the gap. The agents often improve over their first attempt, yet they do not reliably reach the best reproduced human method inside the same controlled scaffold. The benchmark is therefore unsaturated in the capability it is designed to measure.

Why the Controls Matter

The ablations separate method discovery from easier ways to improve a number. When prompts invite engineering optimization, weaker agents can gain by tuning, recombining known tricks, or polishing an existing implementation. That improvement is real engineering work, but it is not the same as discovering a new method.

Capacity checks are also necessary. If the task allows a model-size increase, some agents exploit the open surface and win by making the model larger rather than by improving the algorithm. The controlled benchmark closes that route so the remaining gain is tied to the intended research axis.

Multi-setting evaluation is the other half of the control. A useful method should carry from in-domain settings to out-of-distribution ones. The benchmark tracks whether iteration improves both, rather than only fitting the cases the agent can see.

Figure: innovation prompt versus engineering-optimization prompt (ablation).
Figure: capacity-budget control blocks model-size shortcuts (ablation).
Figure: in-domain versus out-of-distribution settings, testing whether improvements transfer (ablation).

More Compute Is Not Enough

Test-time scaling improves simpler tasks for a while. Sampling more first proposals or giving the agent more exploration rounds can raise the running-best score. But the gains saturate, and harder tasks keep a ceiling below human SOTA.
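
Both knobs, more sampled first proposals and more exploration rounds, reduce to tracking a running best over more evaluated candidates. A minimal sketch with hypothetical sample_proposal and score callables:

    def running_best_curve(sample_proposal, score, n_samples):
        """Best-of-n test-time scaling: draw independent proposals and track the
        running-best score. The curve is non-decreasing, and its flattening is
        the saturation described above."""
        best, curve = float("-inf"), []
        for _ in range(n_samples):
            best = max(best, score(sample_proposal()))
            curve.append(best)
        return curve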

Evolutionary search and test-time training expose a sharper failure mode under partial feedback. They can preserve or improve in-domain scores while damaging OOD settings. That is optimization of the observed verifier, not discovery of a method that transfers.

Figure: test-time scaling, comparing sampling, exploration, evolution, and training. Extra inference-time support helps in easier regimes but can overfit in-domain settings when feedback is incomplete.

The Hard Part Is Evidence

Real ML science is verifier-limited. Running a proxy experiment is not the same as establishing that a method scales. MLS-Bench simulates this pressure in an adaptive compute-allocation experiment for LLM pretraining tasks: agents choose model sizes, token budgets, and how to spend limited trials.
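
As an illustration of the kind of decision that protocol forces (not the paper's actual interface), a plan is a set of pretraining trials that has to fit a fixed compute budget; the sketch below uses the common FLOPs ≈ 6 × parameters × tokens estimate for dense transformer training:

    def training_flops(params, tokens):
        """Common rough estimate of dense-transformer pretraining compute."""
        return 6 * params * tokens

    def plan_fits_budget(trials, flop_budget):
        """`trials` is a hypothetical list of (model_params, token_budget) runs
        the agent wants; the adaptive protocol only lets it spend what fits."""
        return sum(training_flops(n, d) for n, d in trials) <= flop_budget

    # e.g. two small probes, a mid-size check, and one larger confirmation run
    trials = [(125e6, 2e9), (125e6, 2e9), (350e6, 4e9), (760e6, 8e9)]
    print(plan_fits_budget(trials, 5e19))  # True under a 5e19-FLOP budget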

More freedom does not automatically help. Most agents perform worse under the adaptive protocol than under the simpler fixed one. This points to a capability gap beyond method proposal: current systems struggle to plan informative experiments and convert feedback into reliable evidence.

Figure: adaptive compute allocation, showing exploration trajectories and final scores. Giving agents more experimental choices can lower performance when they lack the judgment to spend compute well.

Context Helps Less Than Expected

Extra context can help a strong model, but it does not remove the main bottleneck. Web search, detailed baseline explanations, and theory background produce modest gains, often comparable to ordinary iterative refinement.

The failure is not simply missing knowledge. The harder step is turning context into a hypothesis that is relevant to the scaffold, implementable in the allowed edit surface, and robust across settings.

Figure: context engineering, comparing web search, baseline context, and theory context. Additional information helps selectively, but knowledge access is not the limiting factor by itself.

What Agents Actually Do

Expert assessment finds a recurring pattern. Agents often recombine ingredients from the baseline implementations they are shown and present the recombination as a new method. Truly new mechanisms are rarer, and when they appear, the justification for why they should work is often thin.

Code-similarity analysis supports the same diagnosis. For weaker models, performance is strongly tied to resemblance to high-scoring baselines. That is useful signal: copying the shape of a strong method is a reasonable fallback, but it is not the capability MLS-Bench is meant to celebrate.
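
One simple way to reproduce the flavor of that analysis, using Python's standard difflib ratio as a stand-in for whatever similarity metric the paper actually uses:

    import difflib

    def code_similarity(a, b):
        """Rough textual similarity between two code strings, in [0, 1]."""
        return difflib.SequenceMatcher(None, a, b).ratio()

    def similarity_weighted_baseline_score(agent_code, baselines):
        """`baselines` maps baseline code -> its normalized score. Weight each
        baseline's score by how much the agent's code resembles it; if this value
        predicts the agent's own score well, the submission mostly tracks the
        strong baselines it resembles rather than introducing a new mechanism."""
        weights = {code: code_similarity(agent_code, code) for code in baselines}
        total = sum(weights.values()) or 1.0
        return sum(weights[code] * score for code, score in baselines.items()) / total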

Figure: innovation analysis, a scatter of baseline-similarity-weighted performance against agent score. Lower-performing agents are more likely to track the strong baselines their code resembles.

Where This Leaves the Benchmark

MLS-Bench is a measurement tool for a specific gap: turning an idea into a reusable machine-learning method. Stronger models, better agent harnesses, and self-improving discovery systems should move the leaderboard, but the target remains the same: improve the method, keep the evaluation fair, and show that the gain survives beyond the first setting.

The gap we measure is wider than just proposing methods. It extends to turning an idea into evidence: deciding what to test, how to spend limited trials, and when a result supports a scalable claim. Today's agents look even weaker at this evidence-building process than at method proposal itself — a sign that better search alone is not scientific discovery. Real discovery requires forming questions, learning from trials, allocating time and compute, and turning experiments into transferable claims. MLS-Bench is built to make that distinction measurable.

Limitations and What We're Building Next

MLS-Bench is only a first step. ML science is too broad and fast-moving for one benchmark to exhaust, and the trade-off between rigor and freedom is one we will keep tuning. We are actively expanding the task catalog toward more open-ended research questions, refining evaluation protocols so they stay rigorous as the freedom we give agents grows, and tightening the tooling around evidence-building so the benchmark continues to measure scientific judgment, not just method search.

If you would like to contribute — new tasks, evaluation ideas, baselines, or analyses — please reach out: Bohan Lyu (bohan@berkeley.edu).

Continue with the task catalog on the Tasks page, or open the GitHub repository for the executable benchmark.

Citation

@misc{lyu2026mlsbenchholisticrigorousassessment,
      title={MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI},
      author={Bohan Lyu and Yucheng Yang and Siqiao Huang and Jiaru Zhang and Qixin Xu and Xinghan Li and Xinyang Han and Yicheng Zhang and Huaqing Zhang and Runhan Huang and Kaicheng Yang and Zitao Chen and Wentao Guo and Junlin Yang and Xinyue Ai and Wenhao Chai and Yadi Cao and Ziran Yang and Kun Wang and Dapeng Jiang and Huan-ang Gao and Shange Tang and Chengshuai Shi and Simon S. Du and Max Simchowitz and Jiantao Jiao and Dawn Song and Chi Jin},
      year={2026},
      eprint={2605.08678},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.08678},
}