AI code judgement
AI Coding Evaluation and PR Handoff
Photon101 compares agent-written patches against the evidence a maintainer or buyer actually needs: acceptance coverage, test results, scope control, review risk, verification commands, and handoff quality.
- Inputs. Candidate patches, model transcripts, CI logs, PR review threads, acceptance criteria, repo instructions, and verification commands.
- Output. A scored recommendation with concrete evidence, failure modes, residual risk, and maintainer-ready handoff text.
- Best fit. Agent-output comparisons, PR rescue decisions, hiring screens, and teams choosing which generated patch to trust.
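To make the input list concrete, here is a minimal sketch of how an evaluation bundle might be checked before scoring. The field names (`task`, `acceptanceCriteria`, `candidates`, `verificationCommands`) and the file names in the sample are illustrative assumptions, not the schema of any published fixture.

```javascript
// Hypothetical input shape: every field name below is an assumption
// chosen for illustration, not a documented schema.
const REQUIRED_FIELDS = ["task", "acceptanceCriteria", "candidates"];

// Returns a list of human-readable problems; an empty list means the
// bundle has the minimum evidence needed to start an evaluation.
function validateEvaluationInput(input) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    if (input[field] === undefined) problems.push(`missing field: ${field}`);
  }
  const candidates = Array.isArray(input.candidates) ? input.candidates : [];
  if (candidates.length < 1) {
    problems.push("need at least one candidate");
  }
  for (const candidate of candidates) {
    const commands = candidate.verificationCommands;
    if (!Array.isArray(commands) || commands.length === 0) {
      problems.push(`${candidate.id}: no verification commands recorded`);
    }
  }
  return problems;
}

const sampleInput = {
  task: "Fix flaky invoice-export CI failure",
  acceptanceCriteria: ["Invoice export test passes reliably"],
  candidates: [
    { id: "A", patch: "a.diff", ciLog: "a.log", verificationCommands: ["npm test"] },
    { id: "B", patch: "b.diff", ciLog: "b.log", verificationCommands: [] },
  ],
};

console.log(validateEvaluationInput(sampleInput));
// prints [ 'B: no verification commands recorded' ]
```

Returning a list of problems instead of throwing on the first one lets a reviewer see every evidence gap in a single pass.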
Sample Scorecard
Example task: compare two AI patches for a flaky invoice-export CI failure.
| Candidate | Evidence | Risk | Decision |
|---|---|---|---|
| Candidate A | Focused test passed, but full-suite evidence is missing and the handoff is thin. | Timezone regression risk is not called out; the reviewer has to infer next checks. | Do not merge yet. |
| Candidate B | Focused test, full suite, timezone-specific verification, and clean diff scope. | Residual risk is explicit and bounded to date parsing edge cases. | Recommended. |
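The scorecard above can be read as a simple gap check: a candidate is recommended only when no evidence gap remains. The sketch below assumes hypothetical boolean evidence flags and a strict "any gap blocks the merge" rule; a real rubric would likely weight these differently.

```javascript
// Hypothetical evidence flags and decision rule, for illustration only.
function decide(candidate) {
  const gaps = [];
  if (!candidate.focusedTestPassed) gaps.push("focused test did not pass");
  if (!candidate.fullSuitePassed) gaps.push("full-suite evidence missing");
  if (!candidate.residualRiskStated) gaps.push("residual risk not stated");
  if (!candidate.diffScopeClean) gaps.push("diff touches unrelated code");
  return {
    id: candidate.id,
    decision: gaps.length === 0 ? "Recommended" : "Do not merge yet",
    gaps,
  };
}

const candidateA = {
  id: "Candidate A",
  focusedTestPassed: true,
  fullSuitePassed: false,    // full-suite evidence is missing
  residualRiskStated: false, // timezone regression risk not called out
  diffScopeClean: true,      // not stated in the table; assumed clean here
};

const candidateB = {
  id: "Candidate B",
  focusedTestPassed: true,
  fullSuitePassed: true,
  residualRiskStated: true,
  diffScopeClean: true,
};

for (const candidate of [candidateA, candidateB]) {
  const result = decide(candidate);
  console.log(`${result.id}: ${result.decision}`, result.gaps);
}
```

Keeping the gaps alongside the decision means the "why" travels with the verdict into the handoff text.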
Deliverables
- Winner recommendation, with the reasons it is safer than the alternatives.
- Acceptance-criteria coverage mapped to concrete evidence.
- CI, test, lint, and typecheck summary with missing verification called out.
- Scope-control and risk review, including broad rewrites and unrelated churn.
- Maintainer-ready handoff notes that can be pasted into a PR or client update.
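As one possible illustration of mapping acceptance criteria to evidence, the sketch below builds a small Markdown table and lists criteria with no evidence at all. The criterion IDs, the evidence strings, and the `coverageReport` helper are hypothetical.

```javascript
// Hypothetical helper: maps each acceptance criterion to its recorded
// evidence and flags criteria that have none.
function coverageReport(criteria, evidence) {
  const lines = ["| Criterion | Evidence |", "|---|---|"];
  const missing = [];
  for (const criterion of criteria) {
    const items = evidence[criterion.id] ?? [];
    if (items.length === 0) missing.push(criterion.id);
    const cell = items.length > 0 ? items.join("; ") : "MISSING";
    lines.push(`| ${criterion.text} | ${cell} |`);
  }
  return { markdown: lines.join("\n"), missing };
}

// Illustrative data only; not taken from a real evaluation.
const criteria = [
  { id: "AC1", text: "Invoice export test passes" },
  { id: "AC2", text: "No timezone regression" },
];
const evidence = {
  AC1: ["focused test passed in CI"],
};

const report = coverageReport(criteria, evidence);
console.log(report.markdown);
console.log("Missing verification:", report.missing);
```

The Markdown output can be pasted into a PR description, while the `missing` list gives the reviewer the next checks to run.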
Proof
The public starter repo includes a dependency-free Node CLI, a sample fixture, JSON and Markdown output, and secret redaction for common token patterns. Run it with `npm test`, `npm run demo`, or `node bin/code-eval.mjs fixtures/sample-evaluation.json --format markdown`.
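For readers curious what redaction for common token patterns can look like, here is a minimal sketch. The pattern list and the `redact` function are assumptions for illustration; the starter repo's actual patterns and implementation may differ.

```javascript
// Illustrative patterns only; a real redactor would cover more formats.
const TOKEN_PATTERNS = [
  /ghp_[A-Za-z0-9]{36}/g,             // GitHub classic personal access token
  /AKIA[0-9A-Z]{16}/g,                // AWS access key ID
  /Bearer\s+[A-Za-z0-9._~+\/-]+=*/g,  // HTTP bearer credential
];

// Replaces every match of every pattern with a fixed placeholder.
function redact(text) {
  return TOKEN_PATTERNS.reduce(
    (output, pattern) => output.replace(pattern, "[REDACTED]"),
    text
  );
}

// Obviously fake values, built at runtime so no token-shaped literal
// sits in the source.
const fakeGithubToken = "ghp_" + "a".repeat(36);
const logLine = `token=${fakeGithubToken} key=AKIAABCDEFGHIJKLMNOP`;

console.log(redact(logLine));
// prints token=[REDACTED] key=[REDACTED]
```

Redacting before logs or transcripts reach the report keeps pasted handoff text safe to share, at the cost of occasional false positives on token-shaped strings.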