v1.0 · 199 tasks · NeurIPS 2026 submission

A feature-addition benchmark for coding agents.

TensorBench asks an agent to add a feature to a real sparse-tensor compiler, then runs the full pytest suite inside Docker. A task is solved only when every test, original and agent-added, passes. Even the strongest agent solves just 64.8% of tasks, so frontier coding agents still fail on more than a third of the suite.

pip install tensorbench

Tasks: 199 (194 feature · 5 refactor)

SOTA pass rate: 64.8% (Claude Code Opus 4.7)

SOTA breaks tests: 16.1% of attempts regress the original suite

Median patch: 808 lines · 11 edits

Leaderboard

Top of the table

Pass rate across the 199-task suite. Each task is graded by running the full pytest suite inside Docker on the patched repo.

#	Agent	Vendor	Scaffold	Pass rate	Solved	Broke existing
1	Claude Code Opus 4.7 (SOTA)	Anthropic	Claude Code	64.8%	129/199	16.1%
2	GPT 5.5 Codex xhigh	OpenAI	Codex CLI	58.8%	117/199	23.6%
3	Claude Code Opus 4.6	Anthropic	Claude Code	42.7%	85/199	27.1%
4	GPT 5.4 Codex xhigh	OpenAI	Codex CLI	38.7%	77/199	36.7%
5	GPT 5.3 Codex xhigh	OpenAI	Codex CLI	36.2%	72/199	28.1%
6	Gemini 3.1 Pro Preview	Google	Gemini CLI	31.7%	63/199	44.7%
7	Qwen3-Coder	Alibaba	OpenHands	22.1%	44/199	45.2%
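
The Pass rate column can be reproduced from the Solved column. The sketch below assumes the pass rate is simply solved tasks divided by the 199-task total, rounded to one decimal place; the helper name `pass_rate` is hypothetical and not part of any TensorBench API.

```python
TOTAL_TASKS = 199  # size of the suite, from the leaderboard

def pass_rate(solved: int, total: int = TOTAL_TASKS) -> str:
    """Format solved/total as a percentage with one decimal place."""
    return f"{100 * solved / total:.1f}%"

# Solved counts copied from the leaderboard table above.
solved_counts = {
    "Claude Code Opus 4.7": 129,
    "GPT 5.5 Codex xhigh": 117,
    "Claude Code Opus 4.6": 85,
    "GPT 5.4 Codex xhigh": 77,
    "GPT 5.3 Codex xhigh": 72,
    "Gemini 3.1 Pro Preview": 63,
    "Qwen3-Coder": 44,
}

for agent, solved in solved_counts.items():
    print(f"{agent}: {solved}/{TOTAL_TASKS} = {pass_rate(solved)}")
```

Under that assumption, each computed value matches the published Pass rate column.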


The benchmark

Add a real feature, then watch the suite run

Tasks ask the agent to implement something the codebase doesn't yet support — a new operator, a new format, a new pass. The agent typically writes both the implementation and its tests; the task is solved only when every test passes.
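
The grading rule can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the image name, the mount path `/workspace`, and the function names are all assumptions. It relies only on two documented behaviors: `docker run` propagates the exit code of the container's command, and pytest exits with 0 only when tests were collected and all of them passed.

```python
import subprocess

def docker_pytest_command(image: str, repo_dir: str) -> list[str]:
    """Build a `docker run` command that runs pytest on the patched repo.

    `image` and `repo_dir` are hypothetical inputs; the real harness may
    use a different image layout, mount point, or pytest flags.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{repo_dir}:/workspace",  # mount the patched repo
        "-w", "/workspace",              # run from the repo root
        image,
        "python", "-m", "pytest", "-q",
    ]

def is_solved(pytest_exit_code: int) -> bool:
    """A task counts as solved only if the whole suite passes (exit code 0)."""
    return pytest_exit_code == 0

def grade(image: str, repo_dir: str) -> bool:
    """Run the suite in Docker and apply the all-tests-pass rule."""
    result = subprocess.run(docker_pytest_command(image, repo_dir))
    return is_solved(result.returncode)
```

Because the check is a single exit code, one failing original test or one failing agent-added test is enough to mark the task unsolved.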

feature_elementwise_mul · feature

Elementwise Mul

Implement element-wise multiplication (`__mul__`) on `STensor` through the full CIN compilation pipeline, following the same pattern as the existing `__add__` implementation in `stensor.

API/Element-wise/Binary arithmetic

feature_transpose · feature

Transpose

Add a `transpose()` method and `.

API/Shape & Layout/Transpose & permute

feature_sum_reduction · feature

Sum Reduction

Implement `sum(axis=None)` on `STensor` and as a standalone function in `ops.

API/Reductions & Scans/Aggregate

feature_sddmm · feature

SDDMM

Implement `sddmm(S, A, B)` as a first-class operation in `ops.

API/Linear Algebra/Matmul variants

feature_autograd · feature

Autograd

Add automatic differentiation support for scorch's core operations by implementing custom `torch.

API/ML Primitives/Autograd

feature_unary_ops · feature

Unary Ops

Add compiler-level support for unary operations on sparse tensors.

API/Element-wise/Unary math
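
For readers unfamiliar with the `sddmm(S, A, B)` task above: SDDMM (sampled dense-dense matrix multiplication) computes the product of two dense matrices only at the positions where the sparse matrix S has stored entries, and scales each result by the corresponding entry of S. The sketch below is a pure-Python reference of that definition, assuming S is given as a dictionary from `(row, col)` to value. It is not the repository's implementation, and the data layout is chosen purely for illustration.

```python
def sddmm(S, A, B):
    """Reference SDDMM: out[i, j] = S[i, j] * sum_k A[i][k] * B[k][j].

    S: dict mapping (i, j) -> value for the stored entries (assumed layout).
    A: dense m x k matrix as a list of rows.
    B: dense k x n matrix as a list of rows.
    Only positions stored in S are computed.
    """
    inner = len(B)  # shared dimension k
    out = {}
    for (i, j), s_val in S.items():
        dot = sum(A[i][k] * B[k][j] for k in range(inner))
        out[(i, j)] = s_val * dot
    return out

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
S = {(0, 0): 2, (1, 1): 1}  # two stored entries
print(sddmm(S, A, B))       # {(0, 0): 38, (1, 1): 50}
```

A reference like this is the kind of oracle an agent-written test could compare a compiled kernel against.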


Failure modes

How frontier agents fail

Even across its 70 failures, the leading agent rarely breaks existing tests; most of its failures are attempts where only some of the new tests pass. Weaker agents skew toward breaking the original suite without contributing working tests.

Claude Code Opus 4.7: 70 failures

GPT 5.5 Codex xhigh: 82 failures

Qwen3-Coder: 155 failures

Failure categories: Some new tests fail · Broke existing + partial new · Broke existing, new pass · Broke existing + all new fail · Broke existing, no new tests · All new tests fail
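
One way to read the six failure categories above is as a function of three observations per attempt: whether the original suite still passes, how many new tests the agent added, and how many of those pass. The sketch below encodes that reading. The function name, its inputs, and the exact boundaries between categories are assumptions made for illustration, not the benchmark's published classification logic.

```python
def classify_attempt(existing_ok: bool, new_total: int, new_passed: int) -> str:
    """Map one attempt to "Solved" or one of the six failure categories.

    existing_ok: True if every original test still passes.
    new_total:   number of agent-added tests.
    new_passed:  number of agent-added tests that pass.
    """
    if not 0 <= new_passed <= new_total:
        raise ValueError("new_passed must be between 0 and new_total")

    if existing_ok:
        if new_passed == new_total:
            # Every test passes (assumed to include the no-new-tests case).
            return "Solved"
        if new_passed == 0:
            return "All new tests fail"
        return "Some new tests fail"

    # From here on, the original suite regressed.
    if new_total == 0:
        return "Broke existing, no new tests"
    if new_passed == new_total:
        return "Broke existing, new pass"
    if new_passed == 0:
        return "Broke existing + all new fail"
    return "Broke existing + partial new"

print(classify_attempt(True, 10, 7))   # Some new tests fail
print(classify_attempt(False, 0, 0))   # Broke existing, no new tests
```

Under this reading, the first failure category to check is always whether the original suite regressed, since that separates attempts that did harm from attempts that were merely incomplete.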