Leaderboard

MLS-Bench-Lite Score. The evaluation is based on Harbor with a 5-hour exploration budget for each agent.

#	Model	Open/Closed	Harness	Effort	Performance
1	Claude Fable 5	Closed	Claude Code	max (with fallback)	49.9
2	Kimi K3	Open	Kimi-Code	max	48.3
3	GPT 5.6 Sol	Closed	Codex	max	46.2
4	Claude Opus 4.8	Closed	Claude Code	max	42.8
5	GPT 5.6 Terra	Closed	Codex	max	40.8
6	GPT 5.6 Sol	Closed	Codex	xhigh	40.4
7	GLM 5.2	Open	Claude Code	max	40.4
8	GPT 5.6 Luna	Closed	Codex	max	39.1
9	GPT-5.5	Closed	Codex	xhigh	35.5
10	Kimi K2.7 Code	Open	Kimi-Code	-	35.1
11	GPT 5.6 Luna	Closed	Codex	xhigh	34.5
12	GPT 5.6 Terra	Closed	Codex	xhigh	33.6
13	Claude Sonnet 5	Closed	Claude Code	max	31.1
14	Kimi K2.6	Open	Kimi-Code	-	26.7
15	DeepSeek V4 Pro	Open	Claude Code	-	24.4