Global Leaderboard

Win rate: fraction of tasks where the agent matches or exceeds the best baseline on the primary metric.
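
The definition above can be sketched in a few lines of Python. This is only an illustration, not the leaderboard's actual scoring code: the names `is_win` and `win_rate`, the `higher_is_better` flag, and the example scores are all hypothetical. In particular, it assumes that for lower-is-better metrics (for example an error measure) "matches or exceeds" means the agent's value is less than or equal to the baseline's.

```python
def is_win(agent_score: float, best_baseline: float, higher_is_better: bool = True) -> bool:
    """Return True when the agent matches or exceeds the best baseline.

    Assumption: for lower-is-better metrics, "exceeds" means a smaller value.
    """
    if higher_is_better:
        return agent_score >= best_baseline
    return agent_score <= best_baseline


def win_rate(results) -> float:
    """Fraction of tasks won.

    `results` is a list of (agent_score, best_baseline, higher_is_better) tuples.
    An empty list yields 0.0 rather than a division error.
    """
    if not results:
        return 0.0
    wins = sum(is_win(agent, baseline, hib) for agent, baseline, hib in results)
    return wins / len(results)


# Made-up scores for four tasks (not taken from the leaderboard).
example = [
    (0.91, 0.90, True),   # higher is better, agent ahead: win
    (0.80, 0.80, True),   # tie with the baseline: counts as a win
    (1.35, 1.20, False),  # lower is better, agent worse: loss
    (0.40, 0.55, True),   # higher is better, agent worse: loss
]

print(f"{win_rate(example):.1%}")  # 50.0%
```

Counting a tie as a win follows the "matches or exceeds" wording; a stricter comparison would lower the reported rates.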

Model Performance by Category

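| Model | Tasks | Wins | Win Rate | CAUSAL | CV | DL | LLM | ML | OPT | OTHER | QUANT | RL | SECURITY | TS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vanilla:qwen/qwen3.6-plus:free | 4 | 2 | 50.0% | - | - | - | - | - | - | - | - | - | - | 2/4 |
| openai/gpt-5.4 | 22 | 10 | 45.5% | - | 3/5 | 2/6 | 1/1 | 4/10 | - | - | - | - | - | - |
| gpt-5.4 | 12 | 5 | 41.7% | - | - | - | 5/11 | - | - | - | - | 0/1 | - | - |
| claude-opus-4.6 | 18 | 6 | 33.3% | - | 0/3 | - | 5/11 | - | - | 1/1 | - | 0/3 | - | - |
| vanilla:gpt-5.4 | 12 | 4 | 33.3% | - | - | - | 4/11 | - | - | - | - | 0/1 | - | - |
| google/gemini-3.1-pro-preview | 79 | 26 | 32.9% | 2/7 | 2/12 | 4/6 | 0/1 | 8/15 | 3/13 | - | 1/3 | 1/7 | 4/9 | 1/6 |
| anthropic/claude-opus-4.6 | 78 | 25 | 32.1% | 1/5 | 3/12 | 2/6 | 0/1 | 8/16 | 3/13 | - | 1/3 | 1/8 | 4/8 | 2/6 |
| vanilla:openai/gpt-5.4-pro | 30 | 9 | 30.0% | - | - | - | - | 5/8 | 2/10 | - | - | 0/2 | 2/4 | 0/6 |
| vanilla:openai/gpt-5.4 | 21 | 6 | 28.6% | - | 2/5 | 1/6 | 0/1 | 3/9 | - | - | - | - | - | - |
| vanilla:anthropic/claude-opus-4.6 | 74 | 21 | 28.4% | 2/5 | 3/12 | 2/6 | 0/1 | 8/16 | 0/10 | - | 1/3 | 0/7 | 4/8 | 1/6 |
| vanilla:claude-opus-4.6 | 18 | 5 | 27.8% | - | 0/3 | - | 5/11 | - | - | 0/1 | - | 0/3 | - | - |
| openai/gpt-5.4-pro | 29 | 8 | 27.6% | - | - | - | - | 3/6 | 3/11 | - | - | 0/2 | 2/4 | 0/6 |
| vanilla:google/gemini-3.1-pro-preview | 72 | 17 | 23.6% | 2/5 | 2/12 | 3/6 | 0/1 | 5/14 | 1/10 | - | 0/3 | 1/7 | 2/8 | 1/6 |
| vanilla:qwen3.6-plus | 31 | 7 | 22.6% | - | 1/3 | - | 6/11 | 0/3 | - | 0/1 | - | 0/13 | - | - |
| gpt-5.4-pro | 23 | 5 | 21.7% | 1/6 | - | - | - | - | - | - | 0/3 | 1/7 | 3/5 | 0/2 |
| vanilla:gpt-5.4-pro | 19 | 4 | 21.1% | 0/5 | - | - | - | - | - | - | 0/3 | 1/6 | 3/4 | 0/1 |
| vanilla:deepseek-reasoner | 83 | 17 | 20.5% | - | 6/15 | 1/6 | 4/12 | 3/15 | 1/10 | 0/1 | - | 1/12 | 0/6 | 1/6 |
| qwen/qwen3.6-plus:free | 5 | 1 | 20.0% | - | - | - | - | - | - | - | - | - | - | 1/5 |
| deepseek-reasoner | 93 | 16 | 17.2% | - | 3/15 | 1/6 | 5/12 | 2/19 | 2/11 | 0/1 | - | 1/16 | 0/4 | 2/9 |
| vanilla:gemini-3.1-pro-preview | 18 | 3 | 16.7% | - | 1/3 | - | 2/11 | - | - | 0/1 | - | 0/3 | - | - |
| gemini-3.1-pro-preview | 18 | 3 | 16.7% | - | 0/3 | - | 1/11 | - | - | 1/1 | - | 1/3 | - | - |
| qwen3.6-plus | 38 | 6 | 15.8% | - | 0/3 | - | 3/11 | 1/5 | - | 1/1 | - | 1/18 | - | - |
| qwen/qwen3.6-plus | 32 | 5 | 15.6% | - | 2/12 | 1/6 | 0/1 | 2/12 | - | - | - | 0/1 | - | - |
| vanilla:qwen/qwen3.6-plus | 34 | 5 | 14.7% | - | 2/12 | 1/6 | 0/1 | 2/12 | - | - | - | 0/3 | - | - |
| vanilla:qwen3.6-plus:free | 14 | 2 | 14.3% | - | - | - | - | - | 1/10 | - | - | - | 1/4 | - |
| qwen3.6-plus:free | 17 | 2 | 11.8% | - | - | - | - | - | 2/13 | - | - | - | 0/4 | - |
| vanilla:maml | 1 | 0 | 0.0% | - | - | - | - | 0/1 | - | - | - | - | - | - |
| anil | 1 | 0 | 0.0% | - | - | - | - | 0/1 | - | - | - | - | - | - |
| maml | 1 | 0 | 0.0% | - | - | - | - | 0/1 | - | - | - | - | - | - |
| meta_sgd | 1 | 0 | 0.0% | - | - | - | - | 0/1 | - | - | - | - | - | - |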
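
The per-category cells appear to be wins/tasks pairs, with "-" meaning the model was not run in that category, and in the rows above they add up to the Tasks and Wins columns. As a rough cross-check, the openai/gpt-5.4 row can be re-aggregated as follows; `sum_cells` is a hypothetical helper written for this illustration, not part of the leaderboard.

```python
def sum_cells(cells):
    """Sum "wins/tasks" cells into (wins, tasks), skipping "-" placeholders."""
    wins = tasks = 0
    for cell in cells:
        if cell == "-":
            continue  # model has no results in this category
        won, attempted = cell.split("/")
        wins += int(won)
        tasks += int(attempted)
    return wins, tasks


# Category cells of the openai/gpt-5.4 row, in column order
# CAUSAL, CV, DL, LLM, ML, OPT, OTHER, QUANT, RL, SECURITY, TS.
row = ["-", "3/5", "2/6", "1/1", "4/10", "-", "-", "-", "-", "-", "-"]

wins, tasks = sum_cells(row)
print(wins, tasks, f"{wins / tasks:.1%}")  # 10 22 45.5%
```

The result matches the 22 tasks, 10 wins, and 45.5% win rate listed for that model.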