Global Leaderboard
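Win rate: the fraction of a model's evaluated tasks on which the agent matches or exceeds the best baseline on the primary metric. Each category column shows wins/tasks for that category; `-` means the model has no evaluated tasks there.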
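As a minimal sketch of the definition above: a task counts as a win when the agent's primary metric is at least as good as the best baseline's, and the win rate is wins divided by tasks. The scores, the `higher_is_better` flag, and the function names are hypothetical and used only for illustration; how the benchmark itself handles metric direction is not stated here.

```python
from typing import Sequence


def is_win(agent: float, baselines: Sequence[float], higher_is_better: bool = True) -> bool:
    """Return True when the agent matches or exceeds the best baseline."""
    if higher_is_better:
        return agent >= max(baselines)
    # For metrics such as loss or error, "best" means lowest (assumption).
    return agent <= min(baselines)


def win_rate(results: Sequence[bool]) -> float:
    """Fraction of tasks won; 0.0 when there are no tasks."""
    return sum(results) / len(results) if results else 0.0


# Hypothetical tasks: (agent score, baseline scores, higher_is_better)
tasks = [
    (0.91, [0.88, 0.90], True),   # agent beats the best baseline
    (0.42, [0.40, 0.45], False),  # lower is better; best baseline is 0.40
    (0.75, [0.75, 0.70], True),   # a tie with the best baseline counts as a win
]

wins = [is_win(agent, baselines, hib) for agent, baselines, hib in tasks]
rate = win_rate(wins)
print(f"{sum(wins)}/{len(wins)} = {rate:.1%}")
```

The `>=` and `<=` comparisons are what make a tie count as a win, matching "matches or exceeds" in the definition.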
Model Performance by Category
| Model | Tasks | Wins | Win Rate | CAUSAL | CV | DL | LLM | ML | OPT | OTHER | QUANT | RL | SECURITY | TS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vanilla:qwen/qwen3.6-plus:free | 4 | 2 | 50.0% | - | - | - | - | - | - | - | - | - | - | 2/4 |
| openai/gpt-5.4 | 22 | 10 | 45.5% | - | 3/5 | 2/6 | 1/1 | 4/10 | - | - | - | - | - | - |
| gpt-5.4 | 12 | 5 | 41.7% | - | - | - | 5/11 | - | - | - | - | 0/1 | - | - |
| claude-opus-4.6 | 18 | 6 | 33.3% | - | 0/3 | - | 5/11 | - | - | 1/1 | - | 0/3 | - | - |
| vanilla:gpt-5.4 | 12 | 4 | 33.3% | - | - | - | 4/11 | - | - | - | - | 0/1 | - | - |
| google/gemini-3.1-pro-preview | 79 | 26 | 32.9% | 2/7 | 2/12 | 4/6 | 0/1 | 8/15 | 3/13 | - | 1/3 | 1/7 | 4/9 | 1/6 |
| anthropic/claude-opus-4.6 | 78 | 25 | 32.1% | 1/5 | 3/12 | 2/6 | 0/1 | 8/16 | 3/13 | - | 1/3 | 1/8 | 4/8 | 2/6 |
| vanilla:openai/gpt-5.4-pro | 30 | 9 | 30.0% | - | - | - | - | 5/8 | 2/10 | - | - | 0/2 | 2/4 | 0/6 |
| vanilla:openai/gpt-5.4 | 21 | 6 | 28.6% | - | 2/5 | 1/6 | 0/1 | 3/9 | - | - | - | - | - | - |
| vanilla:anthropic/claude-opus-4.6 | 74 | 21 | 28.4% | 2/5 | 3/12 | 2/6 | 0/1 | 8/16 | 0/10 | - | 1/3 | 0/7 | 4/8 | 1/6 |
| vanilla:claude-opus-4.6 | 18 | 5 | 27.8% | - | 0/3 | - | 5/11 | - | - | 0/1 | - | 0/3 | - | - |
| openai/gpt-5.4-pro | 29 | 8 | 27.6% | - | - | - | - | 3/6 | 3/11 | - | - | 0/2 | 2/4 | 0/6 |
| vanilla:google/gemini-3.1-pro-preview | 72 | 17 | 23.6% | 2/5 | 2/12 | 3/6 | 0/1 | 5/14 | 1/10 | - | 0/3 | 1/7 | 2/8 | 1/6 |
| vanilla:qwen3.6-plus | 31 | 7 | 22.6% | - | 1/3 | - | 6/11 | 0/3 | - | 0/1 | - | 0/13 | - | - |
| gpt-5.4-pro | 23 | 5 | 21.7% | 1/6 | - | - | - | - | - | - | 0/3 | 1/7 | 3/5 | 0/2 |
| vanilla:gpt-5.4-pro | 19 | 4 | 21.1% | 0/5 | - | - | - | - | - | - | 0/3 | 1/6 | 3/4 | 0/1 |
| vanilla:deepseek-reasoner | 83 | 17 | 20.5% | - | 6/15 | 1/6 | 4/12 | 3/15 | 1/10 | 0/1 | - | 1/12 | 0/6 | 1/6 |
| qwen/qwen3.6-plus:free | 5 | 1 | 20.0% | - | - | - | - | - | - | - | - | - | - | 1/5 |
| deepseek-reasoner | 93 | 16 | 17.2% | - | 3/15 | 1/6 | 5/12 | 2/19 | 2/11 | 0/1 | - | 1/16 | 0/4 | 2/9 |
| vanilla:gemini-3.1-pro-preview | 18 | 3 | 16.7% | - | 1/3 | - | 2/11 | - | - | 0/1 | - | 0/3 | - | - |
| gemini-3.1-pro-preview | 18 | 3 | 16.7% | - | 0/3 | - | 1/11 | - | - | 1/1 | - | 1/3 | - | - |
| qwen3.6-plus | 38 | 6 | 15.8% | - | 0/3 | - | 3/11 | 1/5 | - | 1/1 | - | 1/18 | - | - |
| qwen/qwen3.6-plus | 32 | 5 | 15.6% | - | 2/12 | 1/6 | 0/1 | 2/12 | - | - | - | 0/1 | - | - |
| vanilla:qwen/qwen3.6-plus | 34 | 5 | 14.7% | - | 2/12 | 1/6 | 0/1 | 2/12 | - | - | - | 0/3 | - | - |
| vanilla:qwen3.6-plus:free | 14 | 2 | 14.3% | - | - | - | - | - | 1/10 | - | - | - | 1/4 | - |
| qwen3.6-plus:free | 17 | 2 | 11.8% | - | - | - | - | - | 2/13 | - | - | - | 0/4 | - |
| vanilla:maml | 1 | 0 | 0.0% | - | - | - | - | 0/1 | - | - | - | - | - | - |
| anil | 1 | 0 | 0.0% | - | - | - | - | 0/1 | - | - | - | - | - | - |
| maml | 1 | 0 | 0.0% | - | - | - | - | 0/1 | - | - | - | - | - | - |
| meta_sgd | 1 | 0 | 0.0% | - | - | - | - | 0/1 | - | - | - | - | - | - |
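The totals in each row can be cross-checked against the category cells. The sketch below parses one row copied from the table, sums the wins/tasks cells, and recomputes the win rate. It assumes every non-`-` category cell has the form `wins/tasks`; the function name `parse_row` is hypothetical.

```python
def parse_row(line: str) -> dict:
    """Parse one leaderboard row and sum its per-category wins/tasks cells."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    model, tasks, wins, rate = cells[0], int(cells[1]), int(cells[2]), cells[3]

    cat_wins = cat_tasks = 0
    for cell in cells[4:]:
        if cell == "-":  # no evaluated tasks in this category
            continue
        w, t = cell.split("/")
        cat_wins += int(w)
        cat_tasks += int(t)

    return {
        "model": model,
        "tasks": tasks,
        "wins": wins,
        "rate": rate,
        "cat_wins": cat_wins,
        "cat_tasks": cat_tasks,
    }


row = "| openai/gpt-5.4 | 22 | 10 | 45.5% | - | 3/5 | 2/6 | 1/1 | 4/10 | - | - | - | - | - | - |"
parsed = parse_row(row)

# Recompute the win rate from the summed category cells.
recomputed = f"{parsed['cat_wins'] / parsed['cat_tasks']:.1%}"
consistent = (
    parsed["cat_wins"] == parsed["wins"]
    and parsed["cat_tasks"] == parsed["tasks"]
    and recomputed == parsed["rate"]
)
print(parsed["model"], recomputed, "consistent" if consistent else "mismatch")
```

Because rows for the same underlying model appear under several identifiers (for example with and without a `vanilla:` prefix or a provider prefix), each row is best read as a separate run configuration rather than pooled.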