Leaderboard
TensorBench v1
Seven agents were evaluated on 199 tasks (194 feature, 5 refactor). A task counts as solved only when the full pytest suite passes after the patch is applied.
- SOTA: 64.8% (Claude Code Opus 4.7)
- Tasks: 199 (2026-04)
- Union solved: 84.4% (168/199)
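The solve criterion stated above (the full pytest suite passes after the patch) can be sketched as a check on pytest's exit code. This is a minimal sketch and not the benchmark's actual harness: the function names `suite_passes` and `is_solved`, the `repo_dir` argument, and the timeout value are all hypothetical, and it assumes the patch has already been applied to the checkout.

```python
import subprocess
import sys
from pathlib import Path


def suite_passes(exit_code: int) -> bool:
    """Interpret a pytest exit code.

    pytest exits with 0 when tests were collected and none failed.
    Non-zero codes cover failed tests (1), interruption (2), internal
    errors (3), usage errors (4), and no tests collected (5).
    """
    return exit_code == 0


def is_solved(repo_dir: Path, timeout_s: int = 1800) -> bool:
    """Run the whole suite in an already-patched checkout (hypothetical helper).

    A timeout is treated as a failure; the 1800 s default is an
    illustrative assumption, not a value taken from the benchmark.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "-q"],
            cwd=repo_dir,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return suite_passes(result.returncode)
```

Treating exit code 5 (no tests collected) as unsolved is a design choice in this sketch: a patch that deletes or breaks test collection should not count as a pass.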
7 entries, sorted by rank (ascending)
| Rank | Agent | Organization | Scaffold | Solved |
|---|---|---|---|---|
| 1 | Claude Code Opus 4.7 (SOTA) | Anthropic | Claude Code | 64.8% |
| 2 | GPT 5.5 Codex xhigh | OpenAI | Codex CLI | 58.8% |
| 3 | Claude Code Opus 4.6 | Anthropic | Claude Code | 42.7% |
| 4 | GPT 5.4 Codex xhigh | OpenAI | Codex CLI | 38.7% |
| 5 | GPT 5.3 Codex xhigh | OpenAI | Codex CLI | 36.2% |
| 6 | Gemini 3.1 Pro Preview | | Gemini CLI | 31.7% |
| 7 | Qwen3-Coder | Alibaba | OpenHands | 22.1% |
Pairwise
The frontier disagrees
Cohen's κ between the top two agents is 0.046, barely better than chance agreement. Their union solves 168/199 tasks (84.4%), leaving 31 tasks that neither of them solves.
- Both pass: 78
- Claude Code Opus 4.7 only: 51
- GPT 5.5 Codex xhigh only: 39
- Both fail: 31
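The κ and union figures can be recomputed from the four pairwise counts above. The sketch below uses the standard two-rater Cohen's κ formula, (p_o − p_e) / (1 − p_e); the function and variable names are illustrative rather than taken from the benchmark's code.

```python
def cohens_kappa(both_pass: int, a_only: int, b_only: int, both_fail: int) -> float:
    """Cohen's kappa for two agents with binary pass/fail outcomes."""
    n = both_pass + a_only + b_only + both_fail
    # Observed agreement: tasks where both agents have the same outcome.
    p_observed = (both_pass + both_fail) / n
    # Marginal pass rates of each agent.
    a_pass = (both_pass + a_only) / n
    b_pass = (both_pass + b_only) / n
    # Agreement expected if the two agents were independent.
    p_expected = a_pass * b_pass + (1 - a_pass) * (1 - b_pass)
    # Undefined (division by zero) if p_expected == 1, e.g. both agents always pass.
    return (p_observed - p_expected) / (1 - p_expected)


# Counts from the pairwise breakdown above.
both_pass, claude_only, gpt_only, both_fail = 78, 51, 39, 31
total = both_pass + claude_only + gpt_only + both_fail
union = both_pass + claude_only + gpt_only

kappa = cohens_kappa(both_pass, claude_only, gpt_only, both_fail)
print(f"kappa = {kappa:.3f}")                            # kappa = 0.046
print(f"union = {union}/{total} ({union / total:.1%})")  # union = 168/199 (84.4%)
```

Observed agreement is 109/199 (about 54.8%), while two independent agents with these pass rates would agree about 52.6% of the time, which is why κ lands so close to zero.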
Failure modes
How agents fail
Each row is one agent's failures, segmented by failure type.
- Claude Code Opus 4.7: 70 failures
- GPT 5.5 Codex xhigh: 82 failures
- Qwen3-Coder: 155 failures
Failure types: Some new tests fail; Broke existing + partial new; Broke existing, new pass; Broke existing + all new fail; Broke existing, no new tests; All new tests fail.
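One plausible reading of the six failure types is a split on two questions: did the patch break any pre-existing test, and how many of the task's new tests fail? The sketch below encodes that reading. It is an assumption about how the categories might be derived, not the leaderboard's documented definition, and `SuiteOutcome` and `failure_type` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SuiteOutcome:
    existing_failed: int  # pre-existing tests that fail after the patch
    new_total: int        # tests introduced by the task (0 if none)
    new_failed: int       # how many of those new tests fail


def failure_type(o: SuiteOutcome) -> Optional[str]:
    """Map a test outcome to a failure label, or None if the task is solved."""
    broke_existing = o.existing_failed > 0
    if o.new_total == 0:
        return "Broke existing, no new tests" if broke_existing else None
    if o.new_failed == 0:
        return "Broke existing, new pass" if broke_existing else None
    all_new_fail = o.new_failed == o.new_total
    if broke_existing:
        return ("Broke existing + all new fail" if all_new_fail
                else "Broke existing + partial new")
    return "All new tests fail" if all_new_fail else "Some new tests fail"


# Example: 2 regressions plus 3 of 10 new tests failing.
print(failure_type(SuiteOutcome(existing_failed=2, new_total=10, new_failed=3)))
# Broke existing + partial new
```

Under this reading the six labels are mutually exclusive and cover every unsolved outcome, so each agent's segments would sum to its failure count.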