TensorBench
Leaderboard

TensorBench v1

7 agents evaluated on 199 tasks (194 feature, 5 refactor). A task is solved only when the full pytest suite passes after the patch.
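
The solve criterion above can be sketched as a small check. This is a minimal sketch, not the benchmark's actual harness: the name `suite_passes`, the injectable `runner` parameter, and the assumption that the patch has already been applied inside `repo_dir` are all illustrative. It relies only on pytest's documented behavior of exiting with code 0 when every collected test passes.

```python
import subprocess
import sys
from pathlib import Path


def suite_passes(repo_dir, runner=subprocess.run):
    """Return True only if the full pytest suite passes in repo_dir.

    Assumption: the agent's patch is already applied to repo_dir.
    `runner` is a hypothetical hook so the check can be exercised
    without launching a real pytest process.
    """
    result = runner(
        [sys.executable, "-m", "pytest", "-q"],
        cwd=Path(repo_dir),
        capture_output=True,
        text=True,
    )
    # pytest exits with 0 only when all collected tests passed;
    # any failure, error, or empty collection gives a non-zero code.
    return result.returncode == 0
```

Under this reading a single failing test, old or new, marks the task as unsolved, which is why the criterion is strict.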

SOTA: 64.8% (Claude Code Opus 4.7)
Tasks: 199 (2026-04)
Union solved: 84.4% (168/199)
7 entries

Rank  Agent                    Solved
1     Claude Code Opus 4.7     64.8% (SOTA)
2     GPT 5.5 Codex xhigh      58.8%
3     Claude Code Opus 4.6     42.7%
4     GPT 5.4 Codex xhigh      38.7%
5     GPT 5.3 Codex xhigh      36.2%
6     Gemini 3.1 Pro Preview   31.7%
7     Qwen3-Coder              22.1%

Pairwise

The frontier disagrees

Cohen's κ between the top two agents is 0.046, so their agreement is barely better than chance. Their union solves 168/199 tasks (84.4%), leaving 31 tasks that neither of them solves.

Both pass: 78
Claude Code Opus 4.7 only: 51
GPT 5.5 Codex xhigh only: 39
Both fail: 31
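
The κ and union figures can be reproduced from the four pairwise counts above. The following is a minimal sketch using the standard two-rater Cohen's κ formula; the function name `cohens_kappa` and its argument names are illustrative and not taken from the benchmark's code.

```python
def cohens_kappa(both_pass, a_only, b_only, both_fail):
    """Cohen's kappa for two agents from a 2x2 pass/fail table."""
    n = both_pass + a_only + b_only + both_fail
    # Observed agreement: tasks where both agents have the same outcome.
    p_o = (both_pass + both_fail) / n
    # Marginal pass rates of each agent.
    p_a = (both_pass + a_only) / n
    p_b = (both_pass + b_only) / n
    # Agreement expected if the two agents passed independently.
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_o - p_e) / (1 - p_e)


# Counts from the pairwise table: Claude Code Opus 4.7 vs GPT 5.5 Codex xhigh.
kappa = cohens_kappa(both_pass=78, a_only=51, b_only=39, both_fail=31)
union_rate = (78 + 51 + 39) / 199

print(round(kappa, 3))             # 0.046
print(round(100 * union_rate, 1))  # 84.4
```

The observed agreement is 109/199 (about 54.8%) while independent agents with these pass rates would agree about 52.6% of the time, which is why κ lands so close to zero.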
Failure modes

How agents fail

Each row is one agent's failures, segmented by failure type.

Claude Code Opus 4.7: 70 failures
GPT 5.5 Codex xhigh: 82 failures
Qwen3-Coder: 155 failures

Failure types:
Some new tests fail
Broke existing + partial new
Broke existing, new pass
Broke existing + all new fail
Broke existing, no new tests
All new tests fail
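
The page does not define the six failure types, so the following is only one plausible reading of the labels, sketched under stated assumptions: each task has a set of pre-existing tests and (optionally) new tests, and a failure is bucketed by which of the two groups has failing tests. The function `failure_type` and its parameters are hypothetical.

```python
def failure_type(existing_failed, new_total, new_failed):
    """Bucket one task result into a failure type, or None if solved.

    Assumed inputs (not defined by the leaderboard):
      existing_failed: number of pre-existing tests that fail after the patch
      new_total:       number of new tests for the task
      new_failed:      number of those new tests that fail
    """
    broke_existing = existing_failed > 0
    if not broke_existing and new_failed == 0:
        return None  # full suite passes, so the task counts as solved

    if broke_existing:
        if new_total == 0:
            return "Broke existing, no new tests"
        if new_failed == 0:
            return "Broke existing, new pass"
        if new_failed == new_total:
            return "Broke existing + all new fail"
        return "Broke existing + partial new"

    # Existing tests are intact; only new tests fail.
    if new_failed == new_total:
        return "All new tests fail"
    return "Some new tests fail"
```

For example, `failure_type(0, 4, 1)` returns `"Some new tests fail"`, while `failure_type(2, 4, 0)` returns `"Broke existing, new pass"`.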