TensorBench asks an agent to add a feature to a real sparse-tensor compiler, then runs the full pytest suite inside Docker. A task is solved only when every test, both original and agent-added, passes. Even the strongest agent solves just 64.8% of the 199 tasks, so frontier coding agents still fail on more than a third of the suite.
Pass rate across the 199-task suite. Each task is graded by running the full pytest suite inside Docker on the patched repo.
| # | Agent | Vendor | Scaffold | Pass rate | Solved | Broke existing |
|---|---|---|---|---|---|---|
| 1 | Claude Code Opus 4.7 (SOTA) | Anthropic | Claude Code | 64.8% | | |
| 2 | GPT 5.5 Codex xhigh | OpenAI | Codex CLI | 58.8% | | |
| 3 | Claude Code Opus 4.6 | Anthropic | Claude Code | 42.7% | | |
| 4 | GPT 5.4 Codex xhigh | OpenAI | Codex CLI | 38.7% | | |
| 5 | GPT 5.3 Codex xhigh | OpenAI | Codex CLI | 36.2% | | |
| 6 | Gemini 3.1 Pro Preview | Google | Gemini CLI | 31.7% | | |
| 7 | Qwen3-Coder | Alibaba | OpenHands | 22.1% | | |
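
As a quick check on the table, the pass rates can be converted back to solved-task counts. This sketch assumes each rate is the solved count divided by the 199 tasks, reported to one decimal place; the function name is illustrative and not part of the benchmark.

```python
TOTAL_TASKS = 199  # suite size stated in the table caption


def solved_count(pass_rate_percent, total=TOTAL_TASKS):
    """Recover the number of solved tasks from a rounded pass rate.

    Assumption: rate = solved / total, reported to one decimal place.
    """
    return round(pass_rate_percent / 100 * total)


# Pass rates copied from the table above.
rates = {
    "Claude Code Opus 4.7": 64.8,
    "GPT 5.5 Codex xhigh": 58.8,
    "Claude Code Opus 4.6": 42.7,
    "GPT 5.4 Codex xhigh": 38.7,
    "GPT 5.3 Codex xhigh": 36.2,
    "Gemini 3.1 Pro Preview": 31.7,
    "Qwen3-Coder": 22.1,
}

for agent, rate in rates.items():
    solved = solved_count(rate)
    print(f"{agent}: {solved} solved, {TOTAL_TASKS - solved} failed")
```

Under that assumption, the leading agent's 64.8% corresponds to 129 solved tasks and 70 failures.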
Tasks ask the agent to implement something the codebase doesn't yet support — a new operator, a new format, a new pass. The agent typically writes both the implementation and its tests; the task is solved only when every test passes.
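
The solve criterion can be sketched as a single predicate over per-test results. This is a minimal illustration, not the benchmark's grader: the `Outcome` record, its fields, and the test ids are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    test_id: str       # hypothetical pytest node id
    passed: bool
    agent_added: bool  # False for tests that existed before the patch


def is_solved(outcomes):
    """Solved only if every test in the suite passes, original or agent-added."""
    return all(o.passed for o in outcomes)


# Hypothetical run: the original test still passes, one agent-added test fails.
run = [
    Outcome("tests/test_existing.py::test_add", True, False),
    Outcome("tests/test_new.py::test_mul", False, True),
]
print(is_solved(run))  # False: a single failing test is enough to fail the task
```

The all-or-nothing rule means an agent gets no credit for a partially working feature, and none for a working feature that regresses an existing test.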
Implement element-wise multiplication (`__mul__`) on `STensor` through the full CIN compilation pipeline, following the same pattern as the existing `__add__` implementation in `stensor.
Add a `transpose()` method and `.
Implement `sum(axis=None)` on `STensor` and as a standalone function in `ops.
Implement `sddmm(S, A, B)` as a first-class operation in `ops.
Add automatic differentiation support for scorch's core operations by implementing custom `torch.
Add compiler-level support for unary operations on sparse tensors.
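
To make two of these tasks concrete, the sketch below gives reference semantics for element-wise multiplication and for `sddmm` on plain Python data. It does not use scorch's API or its CIN pipeline: the dictionary-of-coordinates storage and the function bodies are assumptions for illustration, and `sddmm` is assumed to follow the common convention of scaling each stored entry of `S` by the matching entry of `A @ B`.

```python
def sparse_mul(a, b):
    """Element-wise product of two sparse tensors stored as {coords: value}.

    A product can be nonzero only where both operands store a value, so the
    result is restricted to the intersection of the stored coordinates.
    """
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    return {idx: v * large[idx] for idx, v in small.items() if idx in large}


def sddmm(s, a, b):
    """Sampled dense-dense matmul (assumed convention: S * (A @ B)).

    s is a sparse matrix as {(i, j): value}; a and b are dense lists of rows.
    Only the coordinates stored in s are computed.
    """
    inner = len(b)  # shared dimension of a (columns) and b (rows)
    return {
        (i, j): v * sum(a[i][k] * b[k][j] for k in range(inner))
        for (i, j), v in s.items()
    }


x = {(0, 0): 2.0, (1, 2): 3.0}
y = {(1, 2): 4.0, (2, 2): 5.0}
print(sparse_mul(x, y))  # {(1, 2): 12.0}

s = {(0, 1): 2.0}
a = [[1.0, 2.0]]
b = [[3.0, 4.0], [5.0, 6.0]]
print(sddmm(s, a, b))  # {(0, 1): 32.0}
```

A reference like this is useful mainly as a test oracle: an agent-written test can compare the compiled operator's output against it on small inputs.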
Even on its 70 failures, the leading agent rarely breaks existing tests — most failures are partial new-test passes. Weaker agents skew toward breaking the original suite without contributing working tests.
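
The distinction between the two failure modes can be sketched as a small classifier over per-task test counts. The labels, function name, and sample numbers are hypothetical and are not taken from the benchmark's results.

```python
from collections import Counter


def classify(original_failed, new_failed):
    """Label one task attempt from its failing-test counts.

    original_failed: failing tests that existed before the agent's patch
    new_failed: failing tests that the agent added
    """
    if original_failed == 0 and new_failed == 0:
        return "solved"
    if original_failed > 0:
        return "broke_existing"
    # Original suite intact, but some agent-added tests still fail.
    return "new_tests_failing"


# Hypothetical attempts as (original_failed, new_failed) pairs.
attempts = [(0, 0), (0, 2), (3, 0), (0, 1), (1, 4)]
tally = Counter(classify(o, n) for o, n in attempts)
print(tally["solved"], tally["broke_existing"], tally["new_tests_failing"])  # 1 2 2
```

Checking the original suite first means any regression is reported as `broke_existing`, regardless of how many agent-added tests pass.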