DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole
A new benchmark from Datacurve called DeepSWE reveals significant differences in the performance of AI coding models, with OpenAI's GPT-5.5 emerging as the clear leader at a score of 70%, well ahead of its nearest competitor. The work also calls existing evaluation methods into question, pointing to a high error rate in widely used benchmarks such as SWE-Bench Pro that could mislead enterprises choosing AI coding tools.
Datacurve's DeepSWE benchmark exposes performance gaps between AI coding models that existing benchmarks such as SWE-Bench Pro had masked, with OpenAI's GPT-5.5 outperforming its competitors by a substantial margin. The finding suggests that enterprises should reassess their reliance on current benchmarks when evaluating AI models, because the high error rates and potential data contamination in those benchmarks could lead to misguided procurement decisions.