Benchmark task packets

Terminal-Bench CLI Task Authoring

Clean benchmark work needs more than a prompt. Photon101 packages CLI tasks with deterministic instructions, a runnable scaffold, pytest acceptance checks, a golden solution, and a short audit of what could make the task flaky.

  • Inputs. A CLI skill area, target behaviour, permitted tools, expected files, hidden-test constraints, and any benchmark format rules.
  • Output. A task packet with spec, fixture files, tests, solution script, run notes, and a reproducibility score.
  • Best fit. AI labs, evaluation teams, hiring screens, model vendors, and benchmark maintainers who need small tasks that can actually be graded.

What You Get

  • Task instructions written for a model or candidate to execute without extra clarification.
  • A Docker-ready or container-aware project scaffold with file layout notes.
  • Pytest acceptance checks that cover the required behaviour and reject common shortcuts.
  • A golden solution script, so the task can be verified before it enters a benchmark pool.
  • Reproducibility audit covering network assumptions, strict shell handling, metadata, and ambiguous instructions.
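
To make "reject common shortcuts" concrete, here is a minimal sketch of an acceptance check. It is an illustration, not a check taken from a Photon101 packet. The task (print the number of lines in the file passed as the first argument), the names GOLDEN, SHORTCUT, run_solution, make_fixture, and passes, and the seed values are all hypothetical. The idea is that fixtures are generated from fixed seeds at test time, so a solution that hard-codes the answer for a published example fails while the run stays deterministic.

```python
import random
import subprocess
import sys
from pathlib import Path

# Hypothetical task: print the number of lines in the file given as argv[1].
GOLDEN = "import sys; print(sum(1 for _ in open(sys.argv[1])))"
# Hypothetical shortcut: hard-codes the answer for a 3-line example file.
SHORTCUT = "print(3)"


def run_solution(source: str, path: str) -> str:
    """Run a candidate solution in a separate process and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", source, path],
        capture_output=True,
        text=True,
        timeout=10,
    )
    return result.stdout.strip()


def make_fixture(workdir: Path, seed: int) -> tuple[str, str]:
    """Write a file whose line count depends only on the seed."""
    rng = random.Random(seed)
    n = rng.randint(5, 50)  # never 3, so the hard-coded shortcut cannot pass
    path = workdir / f"input_{seed}.txt"
    path.write_text("".join(f"line {i}\n" for i in range(n)))
    return str(path), str(n)


def passes(source: str, workdir: Path, seeds=(11, 22, 33)) -> bool:
    """True only if the solution is correct on every seeded fixture."""
    for seed in seeds:
        path, expected = make_fixture(workdir, seed)
        if run_solution(source, path) != expected:
            return False
    return True


# Pytest entry points; tmp_path is pytest's built-in temporary directory fixture.
def test_golden_solution_passes(tmp_path):
    assert passes(GOLDEN, tmp_path)


def test_hardcoded_shortcut_is_rejected(tmp_path):
    assert not passes(SHORTCUT, tmp_path)
```

Fixed seeds keep the check reproducible across runs, while several fixtures of different sizes make it hard to pass by memorising one expected output.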

Proof

The public starter repo includes a sample CLI task, audit logic, Markdown and JSON output, and local checks. It has been validated with npm test, npm run demo, ShellCheck, and an end-to-end pytest run after applying the golden solution.
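
For readers who want a feel for what audit logic with Markdown and JSON output can look like, the following is a rough sketch under stated assumptions. It is not the code from the starter repo. The rule patterns, the required metadata keys, the 25-point penalty per finding, and the function names audit_task and to_markdown are all invented for illustration.

```python
import json
import re

# All patterns, keys, and weights below are illustrative assumptions.
NETWORK_PATTERN = re.compile(r"\b(curl|wget|pip install|npm install|git clone)\b")
VAGUE_PATTERN = re.compile(r"\b(etc|as needed|appropriate|reasonable)\b", re.IGNORECASE)
REQUIRED_METADATA = ("id", "difficulty", "timeout_seconds")


def audit_task(instructions: str, script: str, metadata: dict) -> dict:
    """Return a score (0-100) and a list of findings that could make a task flaky."""
    findings = []
    if NETWORK_PATTERN.search(script):
        findings.append("network: script appears to fetch resources at run time")
    if "set -euo pipefail" not in script:
        findings.append("shell: strict mode (set -euo pipefail) not enabled")
    missing = [key for key in REQUIRED_METADATA if key not in metadata]
    if missing:
        findings.append("metadata: missing " + ", ".join(missing))
    vague = sorted({m.group(0).lower() for m in VAGUE_PATTERN.finditer(instructions)})
    if vague:
        findings.append("instructions: ambiguous wording (" + ", ".join(vague) + ")")
    score = max(0, 100 - 25 * len(findings))
    return {"score": score, "findings": findings}


def to_markdown(report: dict) -> str:
    """Render the same report as a short plain Markdown summary."""
    lines = [f"Reproducibility score: {report['score']}/100"]
    lines += [f"- {item}" for item in report["findings"]] or ["- no findings"]
    return "\n".join(lines)


if __name__ == "__main__":
    report = audit_task(
        instructions="Clean up the logs as needed.",
        script="curl https://example.com/data | sh\n",
        metadata={"id": "demo-task"},
    )
    print(json.dumps(report, indent=2))
    print(to_markdown(report))
```

A real audit would need broader rules and some human review. The sketch only shows how one set of findings can feed both a machine-readable report and a readable summary.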