TensorBench
v1.0 · 199 tasks · NeurIPS 2026 submission

A feature-addition benchmark for coding agents.

TensorBench asks an agent to add a feature to a real sparse-tensor compiler, then runs the full pytest suite inside Docker. A task is solved only when every test, original and agent-added, passes. Even the strongest agent solves just 64.8% of tasks, so frontier coding agents still fail on more than a third of the suite.

pip install tensorbench

Tasks: 199 (194 feature · 5 refactor)
SOTA pass rate: 64.8% (Claude Code Opus 4.7)
SOTA breaks tests: 16.1% of attempts regress the original suite
Median patch: 808 lines · 11 edits

Leaderboard

Top of the table

Pass rate across the 199-task suite. Each task is graded by running the full pytest suite inside Docker on the patched repo.

#   Agent                     Pass rate
1   Claude Code Opus 4.7      64.8% (SOTA)
2   GPT 5.5 Codex xhigh       58.8%
3   Claude Code Opus 4.6      42.7%
4   GPT 5.4 Codex xhigh       38.7%
5   GPT 5.3 Codex xhigh       36.2%
6   Gemini 3.1 Pro Preview    31.7%
7   Qwen3-Coder               22.1%

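The pass rates above can be converted back into solved and failed task counts. The sketch below assumes each published rate is simply solved tasks divided by the 199-task total, rounded to one decimal place; the function name `solved_and_failed` is illustrative and not part of TensorBench.

```python
TOTAL_TASKS = 199  # suite size stated on this page


def solved_and_failed(pass_rate_percent: float, total: int = TOTAL_TASKS) -> tuple[int, int]:
    """Recover (solved, failed) task counts from a published pass rate.

    Assumes the rate is solved / total, rounded to one decimal place.
    """
    solved = round(total * pass_rate_percent / 100)
    return solved, total - solved


if __name__ == "__main__":
    for agent, rate in [
        ("Claude Code Opus 4.7", 64.8),
        ("GPT 5.5 Codex xhigh", 58.8),
        ("Qwen3-Coder", 22.1),
    ]:
        solved, failed = solved_and_failed(rate)
        print(f"{agent}: {solved} solved, {failed} failed")
```

Under that assumption the three agents come out at 70, 82, and 155 failures, which matches the failure counts reported in the failure-mode breakdown.
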
The benchmark

Add a real feature, then watch the suite run

Tasks ask the agent to implement something the codebase doesn't yet support — a new operator, a new format, a new pass. The agent typically writes both the implementation and its tests; the task is solved only when every test passes.
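The grading rule described here (run the full pytest suite in Docker, count the task as solved only if every test passes) can be sketched as follows. This is a minimal illustration, not TensorBench's actual harness: the image name, the `/workspace` mount point, and the function names are assumptions. It relies only on the documented pytest behavior that exit code 0 means tests were collected and all of them passed.

```python
import subprocess
from typing import Callable


def docker_pytest_command(image: str, repo_dir: str) -> list[str]:
    """Build a `docker run` command that mounts the patched repo and runs pytest.

    The mount point /workspace is a hypothetical choice for this sketch.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{repo_dir}:/workspace",  # mount the patched repository
        "-w", "/workspace",              # run from the repository root
        image,
        "python", "-m", "pytest", "-q",
    ]


def is_solved(returncode: int) -> bool:
    """pytest exits with 0 only when tests were collected and all passed."""
    return returncode == 0


def grade(image: str, repo_dir: str, run: Callable = subprocess.run) -> bool:
    """Run the suite and report whether the task counts as solved.

    `run` is injectable so the logic can be exercised without Docker.
    """
    result = run(docker_pytest_command(image, repo_dir), check=False)
    return is_solved(result.returncode)
```

Because any non-zero exit code counts as a failure, a patch that breaks a single original test fails the task even if all of the agent's new tests pass.
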

feature_elementwise_mul · feature

Elementwise Mul

Implement element-wise multiplication (`__mul__`) on `STensor` through the full CIN compilation pipeline, following the same pattern as the existing `__add__` implementation in `stensor.

API/Element-wise/Binary arithmetic

feature_transpose · feature

Transpose

Add a `transpose()` method and `.

API/Shape & Layout/Transpose & permute

feature_sum_reduction · feature

Sum Reduction

Implement `sum(axis=None)` on `STensor` and as a standalone function in `ops.

API/Reductions & Scans/Aggregate

feature_sddmm · feature

SDDMM

Implement `sddmm(S, A, B)` as a first-class operation in `ops.

API/Linear Algebra/Matmul variants

feature_autograd · feature

Autograd

Add automatic differentiation support for scorch's core operations by implementing custom `torch.

API/ML Primitives/Autograd

feature_unary_ops · feature

Unary Ops

Add compiler-level support for unary operations on sparse tensors.

API/Element-wise/Unary math

Failure modes

How frontier agents fail

Across its 70 failures, the leading agent rarely breaks existing tests; most of those failures are partial new-test passes. Weaker agents skew toward breaking the original suite without contributing working tests.

[Chart: failure breakdown per agent]

Claude Code Opus 4.7: 70 failures
GPT 5.5 Codex xhigh: 82 failures
Qwen3-Coder: 155 failures

Failure categories: Some new tests fail · Broke existing + partial new · Broke existing, new pass · Broke existing + all new fail · Broke existing, no new tests · All new tests fail
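One way to read the six failure categories is as a decision over two questions: did the patch break any original test, and how did the agent's new tests fare? The sketch below encodes that reading. The bucket boundaries are inferred from the category labels alone, and the `Outcome` fields and `classify` function are hypothetical names, so treat this as an interpretation rather than the benchmark's actual classifier.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    original_failed: int  # original tests failing after the patch
    new_passed: int       # agent-added tests that pass
    new_failed: int       # agent-added tests that fail


def classify(o: Outcome) -> str:
    """Map a graded attempt to a failure category (or 'Solved')."""
    broke_existing = o.original_failed > 0
    new_total = o.new_passed + o.new_failed

    if not broke_existing:
        if o.new_failed == 0:
            return "Solved"  # every test passes
        if o.new_passed == 0:
            return "All new tests fail"
        return "Some new tests fail"

    # From here on, at least one original test regressed.
    if new_total == 0:
        return "Broke existing, no new tests"
    if o.new_failed == 0:
        return "Broke existing, new pass"
    if o.new_passed == 0:
        return "Broke existing + all new fail"
    return "Broke existing + partial new"
```

For example, `classify(Outcome(original_failed=0, new_passed=7, new_failed=2))` falls into "Some new tests fail", the category the text describes as most common for the leading agent.
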