Eval platform governance for AI coding teams
A governance memo on eval platforms: the receipts behind scores, scoped harness access, and owners who stop Goodhart drift.

The merge train clogs while the eval dashboard glows green, and the retro repeats the same apology. That gap is what eval platform governance exists to close: scores climbed, skills drifted faster than code review could absorb, and nobody could say what the numbers were measuring. Eval platform governance is the set of receipts, scopes, and owners that keep agent quality scores tied to what actually ran. A score without receipts is a press release.
The dashboard that outlived its meaning
Trust does not scale when receipts stay in chat, and eval platforms inherit that law the day a score starts gating merges. After watching merge trains clog, we keep hearing the same apology in retro: the evals were green, so duplicated edits nobody reconciled kept flowing toward main.
Counter-thesis: An eval platform does not govern your agents; ungoverned, it hands the team a number to hide behind while skills drift.
The wrong path: We believed smaller tasks guaranteed safer autonomy, then let eval scores stand in for review. We repeated it until PR archaeology replaced architecture conversation.
Diagnosis: Goodhart's law. When a measure becomes a target it stops being a good measure, and an eval score becomes a target the moment it decides what merges.
Thesis: Receipts beat raw autonomy, and a score is not a receipt.
Governance the score depends on
Every fix here answers one question: what did the scored run actually do?
Claude scope fog. Rule language in a .mdc file sounds precise until reviewers argue about what it meant, and an eval over a run with mushy scope grades noise. Claude's agent docs describe the rule mechanism; the scope still needs writing.
Named fix: Scope ledger. Five lines per run: goal, allowed paths, forbidden paths, verification command, merge owner. The platform can now show what the scored run was allowed to do.
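A minimal sketch of what a scope ledger could look like as data, assuming glob-style path patterns. The `ScopeLedger` class, its field names, and the example paths are hypothetical illustrations, not a format any tool prescribes.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class ScopeLedger:
    """The five ledger lines for one run (field names are assumptions)."""
    goal: str
    allowed_paths: tuple[str, ...]    # glob patterns the run may touch
    forbidden_paths: tuple[str, ...]  # glob patterns that always lose
    verification_command: str
    merge_owner: str

    def out_of_scope(self, touched: list[str]) -> list[str]:
        """Return touched paths that are forbidden or match no allowed pattern."""
        violations = []
        for path in touched:
            forbidden = any(fnmatch(path, pat) for pat in self.forbidden_paths)
            allowed = any(fnmatch(path, pat) for pat in self.allowed_paths)
            if forbidden or not allowed:
                violations.append(path)
        return violations


# Hypothetical run: forbidden patterns win even inside an allowed folder.
ledger = ScopeLedger(
    goal="Fix flaky retry test in the payments client",
    allowed_paths=("src/payments/*", "tests/payments/*"),
    forbidden_paths=("src/payments/secrets/*",),
    verification_command="pytest tests/payments",
    merge_owner="payments-oncall",
)
violations = ledger.out_of_scope(
    ["src/payments/client.py", "src/payments/secrets/key.py", "infra/deploy.yaml"]
)
print(violations)
```

With the ledger stored next to the score, a reviewer can compare the paths a run touched against the paths it was allowed to touch without opening the chat log.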
Claude permission creep. Bash approvals become muscle memory on shared laptops, sessions invent policy mid-run, and the eval ends up grading a policy nobody wrote. Claude Code's getting started guide covers where that policy file lives.
Named fix: CLAUDE.md supremacy clause. Precedence written at the top: which hooks win, which folders require human eyes, where temporary overrides live. The graded behavior traces to a stated rule.
Codex replay gaps. A merged green from the Codex CLI, with a score attached but a transcript no reviewer ever saw, is exactly as trustworthy as it sounds.
Named fix: Replay sandwich. Intent line, command transcript, diff summary in the PR, mandated by AGENTS.md. The number gets a narrative behind it.
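One way to enforce the replay sandwich is a small check on the PR description. The sketch below assumes the three parts are marked with the labels `Intent:`, `Transcript:`, and `Diff summary:`; those labels and the function names are hypothetical, so adapt them to whatever your AGENTS.md actually mandates.

```python
# Hypothetical section labels; the memo names the three parts, not their markers.
REQUIRED_SECTIONS = ("Intent:", "Transcript:", "Diff summary:")


def parse_sections(pr_body: str) -> dict[str, str]:
    """Collect the text under each known label, on the label line or below it."""
    sections: dict[str, list[str]] = {}
    current = None
    for raw in pr_body.splitlines():
        line = raw.strip()
        label = next((l for l in REQUIRED_SECTIONS if line.startswith(l)), None)
        if label is not None:
            current = label
            sections[current] = [line[len(label):].strip()]
        elif current is not None:
            sections[current].append(line)
    return {label: "\n".join(body).strip() for label, body in sections.items()}


def missing_replay_sections(pr_body: str) -> list[str]:
    """Return the labels that are absent or present but empty."""
    sections = parse_sections(pr_body)
    return [label for label in REQUIRED_SECTIONS if not sections.get(label)]


pr_body = """Intent: make retry backoff deterministic in tests
Transcript:
$ pytest tests/payments
Diff summary:
"""
missing = missing_replay_sections(pr_body)
print(missing)
```

A check like this only proves the sections exist. Whether the transcript matches what ran is still a question for the reviewer.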
MCP blast radius. The eval harness is software with connectors of its own, and one of them will eventually touch data nobody listed on the diagram.
Named fix: Connector card. Allowed actions, forbidden actions, owner, rollback, one card per server, harness included. Incidents shrink because operators know what "off" looks like.
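A connector card can be kept as plain data and checked against the calls a run made. This is a sketch under assumptions: the `ConnectorCard` fields mirror the card described above, while the server names, action names, and the shape of the call log (a list of `(server, action)` pairs) are hypothetical and not part of any MCP API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectorCard:
    """One card per server, harness included (field names are assumptions)."""
    server: str
    allowed_actions: frozenset[str]
    forbidden_actions: frozenset[str]
    owner: str
    rollback: str  # what "off" looks like for this connector


def unexpected_calls(
    cards: list[ConnectorCard], calls: list[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Flag calls to servers without a card, or actions the card does not allow."""
    by_server = {card.server: card for card in cards}
    flagged = []
    for server, action in calls:
        card = by_server.get(server)
        if (
            card is None
            or action in card.forbidden_actions
            or action not in card.allowed_actions
        ):
            flagged.append((server, action))
    return flagged


# Hypothetical card for the eval harness itself.
cards = [
    ConnectorCard(
        server="eval-harness",
        allowed_actions=frozenset({"read_fixture", "write_score"}),
        forbidden_actions=frozenset({"read_customer_data"}),
        owner="platform-team",
        rollback="disable the server entry and rerun with fixtures only",
    )
]
calls = [
    ("eval-harness", "write_score"),
    ("eval-harness", "read_customer_data"),
    ("ticket-tracker", "create_issue"),
]
flagged = unexpected_calls(cards, calls)
print(flagged)
```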
---
description: Delegation boundary snapshot (adapt globs to your repo)
globs:
- "**/*"
alwaysApply: false
---
- Claude: keep scopes explicit in `.mdc`; forbid undeclared MCP domains.
- Claude Code: cite `CLAUDE.md` precedence before expanding bash scope.
- Codex: ensure `AGENTS.md` carries replay-friendly verification notes for CLI runs.
Use our methodology as the forcing function: Test proves behavior, Review proves the team can explain it, and a score is neither until receipts connect them. That is the standing argument of agentic coding governance, and it extends to continuous delivery in always-on AI code review governance.
Four gates before a score gates anything
A score earns gating power only after these four questions have boring answers.
| Gate | Question |
|---|---|
| Rules precedence | Which .mdc, SKILL.md, or CLAUDE.md governed behavior? |
| Connector truth | Which MCP servers fired, and were they expected? |
| Reviewer path | Can someone unfamiliar trace intent without chat replay? |
| Risk routing | Were red folders touched, and who approved? |
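The four gates can be read as a precondition on the score: it gates merges only when every question has a written answer. The sketch below assumes answers are free-text strings collected per run; the `GateAnswers` type and the `score_may_gate` function are hypothetical names for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GateAnswers:
    """One written answer per gate in the table (field names are assumptions)."""
    rules_precedence: str  # which .mdc, SKILL.md, or CLAUDE.md governed behavior
    connector_truth: str   # which MCP servers fired, and were they expected
    reviewer_path: str     # where an unfamiliar reviewer can trace intent
    risk_routing: str      # red folders touched, and who approved


def score_may_gate(answers: GateAnswers) -> tuple[bool, list[str]]:
    """Return whether the score may gate, plus the gates still unanswered."""
    unanswered = [name for name, value in asdict(answers).items() if not value.strip()]
    return (not unanswered, unanswered)


answers = GateAnswers(
    rules_precedence="CLAUDE.md at repo root",
    connector_truth="",
    reviewer_path="replay sandwich in the PR description",
    risk_routing="no red folders touched",
)
ok, unanswered = score_may_gate(answers)
print(ok, unanswered)
```

An empty answer blocks gating by design. A non-empty answer only shows that someone wrote something down, so the content still needs a human reader.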
Review strip
- Primary-doc links were smoke-checked after publishing edits.
- Any MCP connectors mentioned list their owners.
- Verification command output is pasted or linked.
- Forked agent work lists parent and child responsibilities.
Synthesis: An eval score is a thermometer, not a thermostat; retros keep rhyming until someone writes the boring stuff down.
Boundary note
If your repo cannot state boundaries plainly, agents will guess, and an eval will grade the guessing kindly. The OWASP Top 10 for LLM applications and the NIST AI Risk Management Framework belong next to any platform decision that touches risk.
Best ways to use this research
- Best for: engineering teams comparing Claude, Claude Code, and Codex operating habits before letting any eval score gate a merge.
- Best first artifact: a connector card for the eval harness itself, written before the next scored run.
- Best comparison angle: pick one green-scored run and try to reconstruct it from receipts alone; keep the platform setup that makes this take minutes.
Common questions
- What does eval platform governance actually cover?
  Eval platform governance covers the receipts behind the score: the scope ledger for what the run was allowed to do, the replay transcript behind the green number, and connector cards for the harness itself. A platform without those grades runs nobody can inspect.
- Can eval scores replace code review?
  No. A score reports an outcome; review explains a change. The moment a score gates merges it becomes a target, and Goodhart's law starts bending it away from quality. Keep scores as instruments and keep receipts as the thing review actually reads.
- How do you stop Goodhart drift in agent evals?
  Attach receipts to every scored run so the number stays auditable: ledger, transcript, diff summary. When a score moves, someone should be able to open the runs behind it and say why. Drift survives in dashboards; it dies in transcripts that reviewers actually open.
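To make "attach receipts to every scored run" checkable, a platform could list the runs behind a score that lack any of the three receipts. This is a rough sketch: the receipt keys, the run ids, and the idea of storing receipts as a dict of pointers per run are all assumptions for illustration.

```python
# Hypothetical receipt keys for the ledger, transcript, and diff summary.
RECEIPT_KINDS = ("ledger", "transcript", "diff_summary")


def unauditable_runs(runs: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each run id to the receipt kinds it lacks; complete runs are omitted."""
    gaps = {}
    for run_id, receipts in runs.items():
        missing = [kind for kind in RECEIPT_KINDS if not receipts.get(kind)]
        if missing:
            gaps[run_id] = missing
    return gaps


# Hypothetical runs behind one dashboard score.
runs = {
    "run-a": {
        "ledger": "ledgers/run-a.txt",
        "transcript": "logs/run-a.log",
        "diff_summary": "Retry backoff made deterministic in tests.",
    },
    "run-b": {"ledger": "ledgers/run-b.txt", "transcript": ""},
}
gaps = unauditable_runs(runs)
print(gaps)
```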
Next move
The governance model behind this memo, receipts and gates included, is laid out in our white paper for the colleague who owns the platform budget.