Eval platforms need governance
Eval platforms need versioned data, clear rubrics, and review gates to stay useful across teams.

The situation
An eval platform looks simple from the outside: run a test, score the output, compare models. In practice, the hard part is the agreement layer: what “good” means, who can change that definition, and how teams keep trusting the result when the model, prompt, or toolchain changes.
That is why evals become governance work as soon as more than one team depends on them. A single benchmark can be useful for one workflow. A platform has to support many workflows, many reviewers, and many failure modes. If the definitions drift, the data is stale, or the labels are inconsistent, the score becomes a number people stop believing.
This matters for agentic coding teams because the same pattern shows up in IDEs, CLIs, and shared automation. Once agents can read files, call tools, or open PRs, teams need a way to measure boundary safety, reviewability, and repeatability. That is the real product surface behind evals.
For a broader framing on team guardrails, see agentic coding governance.
Walkthrough
Start with one decision, not one dashboard.
Define the smallest question the eval must answer. Examples: “Did the agent preserve behavior?”, “Did it use the allowed tool boundary?”, or “Would a reviewer accept this change without edits?” If the question is vague, the labels will be vague too.
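One way to keep the question from going vague is to pin it down in code. The sketch below is hypothetical (the `EvalQuestion` type and the example question are illustrative, not from any particular platform), but it shows the idea: one question, a small closed label space, and validation against it.

```python
from dataclasses import dataclass

# Hypothetical sketch: pin the eval to one narrow question with a
# small closed label space, so labels cannot drift into free text.
@dataclass(frozen=True)
class EvalQuestion:
    id: str
    text: str              # the single question reviewers answer
    answers: tuple         # allowed labels; keep this space small

BEHAVIOR_PRESERVED = EvalQuestion(
    id="behavior-preserved",
    text="Did the agent preserve existing behavior?",
    answers=("yes", "no", "ambiguous"),
)

def label_is_valid(q: EvalQuestion, label: str) -> bool:
    # Reject any label outside the agreed answer space.
    return label in q.answers
```

Anything a labeler cannot express inside `answers` becomes a prompt to sharpen the question, not a new ad hoc label.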
Make the rubric explicit and versioned.
The platform should store the rubric beside the eval set, not in a slide deck or tribal memory. When the rubric changes, the score history should show that change. That is the difference between a trend and a comparison.
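The trend-versus-comparison distinction can be made mechanical by stamping every score with the rubric and dataset versions it was produced under. A minimal sketch, with hypothetical record names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScoreRecord:
    rubric_id: str
    rubric_version: int
    dataset_version: str
    score: float

def comparable(a: ScoreRecord, b: ScoreRecord) -> bool:
    # Two scores are directly comparable only under the same rubric
    # version and dataset version; otherwise they are separate trends.
    return (a.rubric_id, a.rubric_version, a.dataset_version) == \
           (b.rubric_id, b.rubric_version, b.dataset_version)
```

A dashboard built on records like this can refuse to overlay two runs whose rubric versions differ, instead of silently drawing them on one line.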
Treat data pipelines as part of the product.
Evals fail when inputs are copied by hand, labels are mixed across versions, or edge cases are silently added. Keep a clear path from raw examples to curated sets, and preserve provenance so teams can trace why a sample exists.
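Provenance can be as small as a few required fields on every curated sample. The field names below are assumptions for illustration; the point is that "why does this sample exist?" has a recorded answer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    id: str
    raw_source: str        # where the raw example came from
    added_by: str          # who promoted it into the curated set
    reason: str            # why it exists (e.g. a specific failure mode)
    dataset_version: str   # which versioned set it belongs to

def trace(sample: Sample) -> str:
    # One-line provenance answer a reviewer can read at a glance.
    return (f"{sample.id}: from {sample.raw_source}, "
            f"added by {sample.added_by} ({sample.reason})")
```

Making these fields required at ingestion time is what prevents hand-copied inputs and silently added edge cases.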
Put review into the loop.
A credible eval platform needs a human review path for disputed cases. That does not mean every sample needs manual scoring. It means reviewers can inspect examples, override labels, and explain why a case is ambiguous. Without that, the platform optimizes for speed over trust.
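The override path can stay simple: an automatic label plus an ordered list of human decisions, where the latest human decision wins and carries a rationale. A hypothetical sketch:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Label:
    sample_id: str
    value: str
    source: str                     # "auto" or a reviewer name
    rationale: Optional[str] = None  # required in spirit for human overrides

def resolve(auto: Label, overrides: list) -> Label:
    # Human overrides win; the most recent reviewer decision is
    # authoritative. With no overrides, the automatic label stands.
    return overrides[-1] if overrides else auto
```

Because every override keeps its rationale, disputed cases leave a record of why they were ambiguous instead of just a flipped label.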
Separate tool boundaries from model quality.
In agentic coding, a failure can come from the model, the prompt, the tool contract, or the permission model. If you collapse all of that into one score, you cannot tell what to fix. Keep boundary checks, task success, and reviewer acceptance as distinct signals.
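Keeping the signals distinct is what makes a failure actionable. The sketch below is illustrative (the field names and the mapping from signal to fix are assumptions), but it shows why a structured result beats one blended score:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalResult:
    boundary_ok: bool        # did the agent stay inside the tool contract?
    task_success: bool       # did the change do what the task asked?
    reviewer_accepted: bool  # would a reviewer merge it without edits?

def diagnose(r: EvalResult) -> str:
    # Distinct signals point at distinct fixes; a single collapsed
    # score could not tell these apart.
    if not r.boundary_ok:
        return "fix tool contract or permission model"
    if not r.task_success:
        return "fix model or prompt"
    if not r.reviewer_accepted:
        return "fix review rubric or change style"
    return "ok"
```

A collapsed 0.7 hides which of these three branches fired; the structured result does not.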
Keep the workflow usable day to day.
Teams adopt evals when they fit into normal engineering motion: before merge, after prompt changes, after tool changes, and during model upgrades. If the platform only works for one-off experiments, it will not survive production pressure.
A minimal rubric file can be as small as this:
---
id: code-change-safety-v1
owner: platform-eng
status: active
version: 3
---
# Code change safety
Score each sample on:
- correctness of the change
- adherence to tool boundaries
- reviewer confidence
- regressions introduced
Notes:
- compare only against versioned datasets
- record rubric changes in the changelog
- escalate ambiguous cases to human review
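A rubric file in this shape is also easy to read programmatically, which is what lets the platform attach `id` and `version` to every score automatically. A minimal parser sketch, assuming the simple "key: value" front matter shown above:

```python
def parse_front_matter(text: str) -> dict:
    # Minimal parser for the rubric header above. Assumes flat
    # "key: value" lines between two "---" delimiters; a real
    # implementation would use a YAML library.
    lines = text.strip().splitlines()
    assert lines[0] == "---", "rubric must start with front matter"
    end = lines.index("---", 1)
    meta = {}
    for line in lines[1:end]:
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta
```

With this, a score record can pull `meta["id"]` and `meta["version"]` straight from the file stored beside the eval set, rather than from tribal memory.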
And a team rule can stay equally small:
# AGENTS.md
- Agents may edit files in the current task scope only.
- Agents must not call external tools unless the task explicitly allows it.
- Any eval result used for release decisions must reference a versioned dataset and rubric.
- Disputed samples require human review before promotion.
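The first of these rules can be checked mechanically rather than on trust. A hypothetical boundary check, assuming the task scope is expressed as a directory prefix:

```python
from pathlib import PurePosixPath

def in_task_scope(path: str, scope: str) -> bool:
    # Illustrative check for "edit files in the current task scope
    # only": the edited path must equal the scope or sit beneath it.
    p = PurePosixPath(path)
    s = PurePosixPath(scope)
    return s == p or s in p.parents
```

A check like this can run in CI over an agent's diff and feed the boundary signal of the eval directly, instead of relying on reviewers to notice out-of-scope edits.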
A useful methodology habit here is the Document step: write the rubric and boundary rules down before you automate the score. That keeps the platform anchored to the actual engineering decision rather than just the measurement layer.
Tradeoffs and limits
Evals are only as good as the definitions behind them. If the task is subjective, the score will still be noisy even with perfect tooling. That is not a platform bug; it is a measurement limit.
Versioning helps, but it also creates overhead. Every rubric change, dataset update, and label correction adds maintenance work. Smaller teams often need to accept a narrower scope first: one workflow, one reviewer group, one release gate.
There is also a risk of overfitting to the eval. Once a team optimizes for the metric, the metric can drift away from real user value. The guardrail is to keep a live review sample and compare platform scores against actual reviewer decisions.
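That guardrail can be a single number: how often the platform's verdict matches the live reviewer's. A minimal sketch (function name and label values are illustrative):

```python
def agreement_rate(platform_labels, reviewer_labels):
    # Fraction of live-review samples where the platform's verdict
    # matches the human decision. A falling rate is the early warning
    # that the metric is drifting away from real reviewer judgment.
    pairs = list(zip(platform_labels, reviewer_labels))
    if not pairs:
        return 0.0
    return sum(p == r for p, r in pairs) / len(pairs)
```

Tracking this rate over time turns "are we overfitting to the eval?" from a debate into a trend line.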
Finally, tool boundaries are not static. As teams add MCP servers, custom skills, or shared automation, the boundary surface expands. That is why evals should measure the contract, not just the output. For examples of how tool- and task-specific capabilities are packaged, see the Anthropic skills repository, the OpenAI skills repository, and the Claude docs.
Further reading

- Trustworthy evals for coding teams: shared definitions, versioned data, and review gates make evals useful across teams.
- Fast Evals for Better Decisions: small, quick evals that fit the edit loop and support real coding decisions.
- Agent Boundaries for Teams: set clear read/write and tool limits for agentic coding across IDEs, CLIs, and shared tools.