Why TensorBench
Existing coding benchmarks reward closing GitHub issues or refactoring under exact-match preservation. Neither captures what we actually want from a coding agent: extending a real codebase with a real feature in a way that both the original test suite and new tests written for the new behavior can verify.
TensorBench is a Stanford research project. It targets a single codebase: a sparse-tensor compiler written in Python and C++. The 199 tasks ask agents to add new operators, new storage formats, new compilation passes, and new ML primitives. Each task is graded by running the full pytest suite inside Docker on the patched repo. A task counts as solved only when failed == 0: every original test still passes, and every new test the agent wrote passes too.
The grading rule is strict on purpose. It catches the most common failure mode of the current frontier: an agent that ships a partial implementation, writes tests that pass for the easy cases, and silently breaks an existing test elsewhere in the suite.
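The rule and the failure mode it targets can be sketched in a few lines. This is a minimal illustration, not the benchmark's grader: the function name `grade`, the dictionaries mapping test IDs to outcome strings, and the keys of the returned verdict are all hypothetical. It also does not handle tests that disappear after the patch.

```python
def grade(before: dict[str, str], after: dict[str, str]) -> dict:
    """Apply the failed == 0 rule to per-test outcomes (hypothetical shape).

    `before` and `after` map a test ID to "PASSED" or "FAILED", recorded
    on the clean checkout and on the patched repo respectively.
    """
    # Every test that fails on the patched repo counts against the task.
    failed = sorted(t for t, outcome in after.items() if outcome == "FAILED")

    # Original tests that passed before the patch and fail after it:
    # the silent breakage the strict rule is meant to catch.
    regressions = [t for t in failed if before.get(t) == "PASSED"]

    # Failing tests that did not exist before the patch, i.e. tests the
    # agent added for the new behavior.
    new_test_failures = [t for t in failed if t not in before]

    return {
        "solved": len(failed) == 0,
        "failed": len(failed),
        "regressions": regressions,
        "new_test_failures": new_test_failures,
    }


if __name__ == "__main__":
    before = {"tests/test_ops.py::test_add": "PASSED",
              "tests/test_ops.py::test_mul": "PASSED"}
    after = {"tests/test_ops.py::test_add": "PASSED",
             "tests/test_ops.py::test_mul": "FAILED",   # broken by the patch
             "tests/test_new.py::test_feature": "PASSED"}
    print(grade(before, after))
```

In this example the agent's own test passes, yet the task is unsolved because one previously passing test now fails.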
What is in the suite
- 194 feature tasks: implement a missing piece of functionality.
- 5 refactor tasks: reorganize a file or subsystem without changing observable behavior.
- Tasks span element-wise ops, reductions, linear algebra, shape/layout transforms, autograd, and compiler-level concerns (IR nodes, lowering passes, codegen).
Methodology
For each (agent, task) pair we generate a patch, apply it to a clean checkout at the task's base commit, build the C++ extension, and run the test suite. Verdicts are produced by a custom grading strategy that parses verbose pytest output and reports per-test pass/fail counts before and after the patch.
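The parsing step might look roughly like the sketch below. It assumes the per-test line format that `pytest -v` prints without plugins such as pytest-xdist (test ID, outcome word, progress percentage). The helper names `parse_verbose_output` and `count_outcomes` are hypothetical, and the project's actual grading strategy may differ.

```python
import re
from collections import Counter

# Matches verbose result lines such as:
#   tests/test_ops.py::test_add PASSED                      [ 50%]
# Assumption: test IDs contain no whitespace. Lines from the short summary
# ("FAILED tests/...::test - Error") begin with the outcome word and are
# therefore not matched.
_RESULT_LINE = re.compile(
    r"^(?P<node>\S+::\S+)\s+"
    r"(?P<outcome>PASSED|FAILED|ERROR|SKIPPED|XFAIL|XPASS)\b"
)


def parse_verbose_output(text: str) -> dict[str, str]:
    """Map each test ID found in verbose pytest output to its outcome."""
    outcomes: dict[str, str] = {}
    for line in text.splitlines():
        match = _RESULT_LINE.match(line)
        if match:
            # If a test is reported twice, the later line wins.
            outcomes[match.group("node")] = match.group("outcome")
    return outcomes


def count_outcomes(outcomes: dict[str, str]) -> Counter:
    """Count how many tests ended in each outcome."""
    return Counter(outcomes.values())


if __name__ == "__main__":
    sample = (
        "tests/test_ops.py::test_add PASSED                    [ 50%]\n"
        "tests/test_ops.py::test_mul FAILED                    [100%]\n"
        "FAILED tests/test_ops.py::test_mul - AssertionError\n"
    )
    parsed = parse_verbose_output(sample)
    print(parsed)
    print(count_outcomes(parsed))
```

Running the parser once on the clean checkout and once on the patched repo yields the two sets of outcomes that a before-and-after comparison needs.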
Citation
A short paper accompanies the benchmark; it is under double-blind review at NeurIPS 2026. A BibTeX entry will appear here once the paper is public.