Benchmark task packets

Terminal-Bench CLI Task Authoring

Clean benchmark work needs more than a prompt. Photon101 packages CLI tasks with deterministic instructions, a runnable scaffold, pytest acceptance checks, a golden solution, and a short audit of what could make the task flaky.

Inputs. A CLI skill area, target behaviour, permitted tools, expected files, hidden-test constraints, and any benchmark format rules.
Output. A task packet with spec, fixture files, tests, solution script, run notes, and a reproducibility score.
Best fit. AI labs, evaluation teams, hiring screens, model vendors, and benchmark maintainers who need small tasks that can actually be graded.

What You Get

Task instructions written for a model or candidate to execute without extra clarification.
Docker-ready or container-aware project scaffold with file layout notes.
Pytest acceptance checks that cover the required behaviour and reject common shortcuts.
Golden solution script so the task can be verified before it enters a benchmark pool.
Reproducibility audit covering network assumptions, strict shell handling, metadata, and ambiguous instructions.
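As an illustration of the reproducibility audit in the last item, here is a minimal Python sketch. It is not the audit logic from the starter repo. The function names, the regex patterns, the required metadata keys, and the scoring penalty are all hypothetical choices made for this example.

```python
import re

# Hypothetical rule set: these patterns and keys are illustrative assumptions,
# not the rules used by any real benchmark or by the starter repo.
NETWORK_PATTERN = re.compile(r"\b(curl|wget|pip install|apt-get|git clone)\b|https?://")
VAGUE_PATTERN = re.compile(
    r"\b(etc|as needed|appropriate|reasonable|some kind of)\b", re.IGNORECASE
)
REQUIRED_METADATA = ("id", "difficulty", "timeout_sec")


def audit_task(instructions, scripts, metadata):
    """Return a list of (category, message) findings for one task packet.

    instructions: task text shown to the model or candidate
    scripts: mapping of file name to file contents
    metadata: mapping of task metadata fields
    """
    findings = []
    for name, body in scripts.items():
        # Network assumptions: anything that fetches from outside the container.
        if NETWORK_PATTERN.search(body):
            findings.append(("network", f"{name} appears to need network access"))
        # Strict shell handling: shell scripts should fail fast on errors.
        if name.endswith(".sh") and "set -euo pipefail" not in body:
            findings.append(("shell", f"{name} does not enable strict mode"))
    for key in REQUIRED_METADATA:
        if key not in metadata:
            findings.append(("metadata", f"missing metadata key: {key}"))
    # Ambiguous instructions: wording that leaves the grader's intent open.
    for match in VAGUE_PATTERN.finditer(instructions):
        findings.append(("ambiguity", f"vague wording: {match.group(0)!r}"))
    return findings


def reproducibility_score(findings, penalty=20):
    """Map findings to a 0-100 score; the linear penalty is an arbitrary choice."""
    return max(0, 100 - penalty * len(findings))
```

A packet with no findings scores 100 under this scheme. Each finding names its category, so the author can see which assumption to fix before the task enters a pool.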

Proof

The public starter repo includes a sample CLI task, audit logic, Markdown and JSON output, and local checks. It has been validated with npm test, npm run demo, ShellCheck, and an end-to-end pytest run after applying the golden solution.
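To give a sense of what an end-to-end pytest run after applying a golden solution can look like, here is a small self-contained sketch. It is not taken from the starter repo. The sorting task, the file names input.txt and output.txt, the inline GOLDEN script, and the helper apply_solution are all hypothetical stand-ins.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical stand-in for a packet's golden solution:
# read input.txt, write its lines sorted to output.txt.
GOLDEN = (
    "from pathlib import Path\n"
    "lines = Path('input.txt').read_text().splitlines()\n"
    "Path('output.txt').write_text('\\n'.join(sorted(lines)) + '\\n')\n"
)


def apply_solution(workdir: Path, source: str = GOLDEN) -> None:
    """Run the solution inside the task directory; raise on a nonzero exit or timeout."""
    subprocess.run([sys.executable, "-c", source], cwd=workdir, check=True, timeout=30)


def test_output_is_sorted(tmp_path: Path) -> None:
    # The test writes its own fixture so the check does not depend on leftover state.
    (tmp_path / "input.txt").write_text("pear\napple\nfig\n")
    apply_solution(tmp_path)
    assert (tmp_path / "output.txt").read_text() == "apple\nfig\npear\n"


def test_second_fixture_rejects_hardcoded_output(tmp_path: Path) -> None:
    # A different input guards against a solution that hardcodes the first answer.
    (tmp_path / "input.txt").write_text("zebra\nant\n")
    apply_solution(tmp_path)
    assert (tmp_path / "output.txt").read_text() == "ant\nzebra\n"
```

Saved as a test file, this runs under pytest, which supplies tmp_path as a built-in fixture. Using two distinct fixtures is one simple way to reject a shortcut, since a hardcoded answer to the first cannot pass the second.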