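Data source: the official SWE-bench Bash-Only leaderboard (mini-SWE-agent v2.0.0, 500 instances, single attempt). Data retrieved in February 2026. Other leaderboards, such as LMSYS Chatbot Arena, could not be retrieved at the time because of network restrictions.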

| Rank | Model | Vendor | SWE-bench (% resolved) | Type |
|---|---|---|---|---|
| 🥇 | Claude 4.5 Opus | Anthropic | 76.8% | Closed source |
| 🥈 | Gemini 3 Flash | Google DeepMind | 75.8% | Closed source |
| 🥉 | MiniMax M2.5 | MiniMax | 75.8% | Closed source |
| 4 | Claude Opus 4.6 | Anthropic | 75.6% | Closed source |
| 5 | Claude 4.5 Opus (medium) | Anthropic | 74.4% | Closed source |
| 6 | Gemini 3 Pro Preview | Google DeepMind | 74.2% | Closed source |
| 7 | GLM-5 | Z-AI | 72.8% | Closed source |
| 8 | GPT-5.2 | OpenAI | 72.8% | Closed source |
| 9 | Claude 4.5 Sonnet | Anthropic | 71.4% | Closed source |
| 10 | Kimi K2.5 | Moonshot AI | 70.8% | Closed source |
| 11 | DeepSeek V3.2 | DeepSeek | 70.0% | Open source |
| 12 | Gemini 3 Pro | Google DeepMind | 69.6% | Closed source |
| 13 | Claude 4 Opus | Anthropic | 67.6% | Closed source |
| 14 | Claude 4.5 Haiku | Anthropic | 66.6% | Closed source |
| 15 | GPT-5.1 | OpenAI | 66.0% | Closed source |
| 16 | GPT-5 | OpenAI | 65.0% | Closed source |
| 17 | Claude 4 Sonnet | Anthropic | 64.9% | Closed source |
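
Because every score is a share of the same 500 instances, a percentage can be read as an approximate count of resolved instances. The sketch below shows that arithmetic for a few rows copied from the table. The helper name `resolved_count` is hypothetical, and the results are rounded because the published percentages may themselves be rounded.

```python
TOTAL_INSTANCES = 500  # benchmark size stated in the data-source note


def resolved_count(score_percent: float, total: int = TOTAL_INSTANCES) -> float:
    """Convert a resolve rate in percent into a number of instances."""
    return score_percent * total / 100


# A few scores copied from the table; not the full leaderboard.
scores = {
    "Claude 4.5 Opus": 76.8,
    "Gemini 3 Flash": 75.8,
    "DeepSeek V3.2": 70.0,
}

for model, pct in scores.items():
    # Round to a whole instance, since the published score may be rounded.
    print(f"{model}: about {round(resolved_count(pct))} of {TOTAL_INSTANCES}")
```

Read this way, one percentage point corresponds to 5 instances, so the gap between first and second place in the table is roughly 5 resolved instances.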