Leaderboard

TensorBench v1

7 agents evaluated on 199 tasks (194 feature, 5 refactor). A task is solved only when the full pytest suite passes after the patch.
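The solve criterion above can be sketched as a single exit-status check. This is a minimal sketch, not the benchmark's actual harness: the function name `suite_passes`, the default command, and the idea of running the suite in a checked-out repository directory are all assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional, Sequence


def suite_passes(repo_dir: Path, cmd: Optional[Sequence[str]] = None) -> bool:
    """Return True only if the test command exits with status 0.

    Hypothetical helper: the real harness is not described on this page.
    """
    if cmd is None:
        # Assumed default: run pytest with the current interpreter.
        cmd = [sys.executable, "-m", "pytest", "-q"]
    result = subprocess.run(list(cmd), cwd=repo_dir, capture_output=True, text=True)
    # pytest exits with 0 only when tests were collected and all of them passed;
    # any failure, error, or empty collection yields a non-zero status.
    return result.returncode == 0
```

Under this sketch a patch that fixes the new tests but breaks one existing test still counts as unsolved, because the whole suite must exit cleanly.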

SOTA: 64.8% (Claude Code Opus 4.7)
Tasks: 199 (2026-04)
Union solved: 84.4% (168/199)

Rank	Agent	Organization	Scaffold	Solve rate	Solved
1	Claude Code Opus 4.7 (SOTA)	Anthropic	Claude Code	64.8%	129/199	16.1%	808
2	GPT 5.5 Codex xhigh	OpenAI	Codex CLI	58.8%	117/199	23.6%	693
3	Claude Code Opus 4.6	Anthropic	Claude Code	42.7%	85/199	27.1%	800
4	GPT 5.4 Codex xhigh	OpenAI	Codex CLI	38.7%	77/199	36.7%	591
5	GPT 5.3 Codex xhigh	OpenAI	Codex CLI	36.2%	72/199	28.1%	485
6	Gemini 3.1 Pro Preview	Google	Gemini CLI	31.7%	63/199	44.7%	324
7	Qwen3-Coder	Alibaba	OpenHands	22.1%	44/199	45.2%	334
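The solve rates in the table follow directly from the solved counts, and the failure counts are the complement out of 199. A small sketch that recomputes both from the listed counts (the variable and function names are illustrative, not part of the benchmark):

```python
TOTAL_TASKS = 199

# (agent, tasks solved) as listed in the leaderboard table.
SOLVED = [
    ("Claude Code Opus 4.7", 129),
    ("GPT 5.5 Codex xhigh", 117),
    ("Claude Code Opus 4.6", 85),
    ("GPT 5.4 Codex xhigh", 77),
    ("GPT 5.3 Codex xhigh", 72),
    ("Gemini 3.1 Pro Preview", 63),
    ("Qwen3-Coder", 44),
]


def solve_rate(solved: int, total: int = TOTAL_TASKS) -> float:
    """Solve rate as a percentage, rounded to one decimal place."""
    return round(100 * solved / total, 1)


rates = {agent: solve_rate(n) for agent, n in SOLVED}
failures = {agent: TOTAL_TASKS - n for agent, n in SOLVED}

for agent, n in SOLVED:
    print(f"{agent}: {rates[agent]:.1f}% solved, {failures[agent]} failures")
```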

Pairwise

The frontier disagrees

Cohen's κ between the top two agents (Claude Code Opus 4.7 and GPT 5.5 Codex xhigh) is 0.046, barely better than chance agreement. Their union solves 168/199 (84.4%), leaving 31 tasks that neither frontier agent solves.

[Figure: pairwise outcomes for the top two agents. Segments: Both pass; Claude Code Opus 4.7 only; GPT 5.5 Codex xhigh only; Both fail.]
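The four pairwise cells and the κ value can be reconstructed from the published totals. The sketch below assumes, as the text states, that 168 is the union of the top two agents' solved sets; the function names are illustrative.

```python
def contingency(solved_a: int, solved_b: int, union: int, total: int):
    """Derive the 2x2 pass/fail cells by inclusion-exclusion."""
    both_pass = solved_a + solved_b - union
    only_a = solved_a - both_pass
    only_b = solved_b - both_pass
    both_fail = total - union
    return both_pass, only_a, only_b, both_fail


def cohens_kappa(both_pass: int, only_a: int, only_b: int, both_fail: int) -> float:
    """Cohen's kappa for two raters with binary pass/fail labels."""
    total = both_pass + only_a + only_b + both_fail
    p_observed = (both_pass + both_fail) / total
    p_a = (both_pass + only_a) / total  # pass rate of agent A
    p_b = (both_pass + only_b) / total  # pass rate of agent B
    # Agreement expected if the two agents passed tasks independently.
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_observed - p_expected) / (1 - p_expected)


# Claude Code Opus 4.7: 129 solved; GPT 5.5 Codex xhigh: 117 solved; union: 168.
cells = contingency(129, 117, 168, 199)
kappa = cohens_kappa(*cells)
print(cells)            # (78, 51, 39, 31)
print(round(kappa, 3))  # 0.046
```

The two agents agree on 109 of 199 tasks (78 both pass, 31 both fail), but independent agents with these pass rates would already agree on about 52.6% of tasks, which is why κ stays near zero.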

Failure modes

How agents fail

Each row is one agent's failures, segmented by failure type.

Claude Code Opus 4.7: 70 failures

GPT 5.5 Codex xhigh: 82 failures

Qwen3-Coder: 155 failures

Failure types:
Some new tests fail
Broke existing + partial new
Broke existing, new pass
Broke existing + all new fail
Broke existing, no new tests
All new tests fail
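One plausible reading of the six failure types is a two-way split: whether the patch broke previously passing tests, and how many of the task's new tests pass. The page does not define the categories, so the sketch below is an interpretation of the labels only; the function `classify_failure` and its three inputs are hypothetical.

```python
from typing import Optional


def classify_failure(broke_existing: bool, new_passed: int, new_total: int) -> Optional[str]:
    """Map a test outcome to a failure label, or None if the task is solved.

    Assumed inputs (not defined by the leaderboard):
      broke_existing: at least one previously passing test now fails
      new_passed / new_total: passing and total tests added for the task
    """
    if broke_existing:
        if new_total == 0:
            return "Broke existing, no new tests"
        if new_passed == new_total:
            return "Broke existing, new pass"
        if new_passed == 0:
            return "Broke existing + all new fail"
        return "Broke existing + partial new"
    if new_passed == new_total:
        return None  # nothing broken and every new test passes: solved
    if new_passed == 0:
        return "All new tests fail"
    return "Some new tests fail"


print(classify_failure(False, 3, 5))  # Some new tests fail
print(classify_failure(True, 5, 5))   # Broke existing, new pass
```

Under this reading, every label except the two without "Broke existing" represents a regression, including cases where the new feature itself works.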