
Fast Evals for Better Decisions

Small, quick evals that fit the edit loop and support real coding decisions.

Rogier Muller · April 9, 2026 · 6 min read

Evals only help when they fit the pace of the work. If they take too long, people skip them. If they are too vague, people ignore the result. The useful middle is a small set of checks that run quickly, answer one question, and change what you do next.

Teams often build evals that are too broad or too expensive. They measure everything, then learn nothing. Or they measure one narrow case and mistake it for coverage. The result is a dashboard that looks serious but does not improve decisions.

A better pattern is to treat evals as part of the edit loop. The goal is not a perfect score. The goal is to reduce uncertainty before you merge, ship, or hand off work to another agent.

Start with one decision

Before writing an eval, name the decision it should support. For example:

  • Is this prompt change better than the last one?
  • Did the agent stop making this class of mistake?
  • Is the new retrieval step actually helping?
  • Did latency improve without hurting output quality?

If you cannot name the decision, the eval is probably too broad. A good eval has a clear failure mode and a clear action attached to it. That makes it easier to keep the test small.

Many teams overbuild here. They create a benchmark suite before they have a stable question. A smaller, decision-linked eval is usually more useful than a large one that runs once a week.

Keep the loop short

Fast evals matter because they change behavior. When a check finishes in minutes, engineers are more likely to rerun it after each change. When it takes hours, it becomes a release-time ritual, not a development tool.

In one public example, Awni Hannun noted that GLM 4.6 ran quite fast on an M3 Ultra with mlx-lm, even at higher precision. I cannot independently verify that benchmark, but the broader point is simple: lower latency makes iteration easier when the model is part of the eval loop.

The practical takeaway is not to chase a specific model. It is to choose an eval setup you can afford to run repeatedly. If the check is slow, simplify the task, reduce the sample set, or move the expensive part out of the inner loop.
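One cheap way to shrink the inner loop is to run on a small, fixed subset of cases during development and save the full set for release time. A minimal sketch, assuming your cases are simple list items; the function name and seed are illustrative, not from the article:

```python
import random

def inner_loop_cases(cases, k=10, seed=0):
    """Pick a small, stable subset of eval cases for the fast inner loop.

    A fixed seed keeps the subset deterministic across reruns, so a
    score change reflects the code change, not a different sample.
    """
    if len(cases) <= k:
        return list(cases)
    rng = random.Random(seed)
    return rng.sample(list(cases), k)

cases = [f"case-{i}" for i in range(100)]
subset = inner_loop_cases(cases, k=10)
assert len(subset) == 10
assert inner_loop_cases(cases, k=10) == subset  # same seed, same subset
```

The deterministic seed matters: a subset that changes every run reintroduces noise into exactly the loop you were trying to speed up.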

Use a layered setup

A useful eval stack usually has three layers.

First, a tiny smoke check. This catches obvious regressions quickly. It should be cheap enough to run often.

Second, a focused task set. This is where you test the specific behavior you care about, such as tool use, patch quality, or instruction following on your codebase.

Third, a slower review set. This is for cases where human judgment still matters, such as ambiguous edits or user-facing behavior.

The layers should answer different questions. If every eval tries to do everything, the signal gets muddy. If the smoke check fails, you do not need the full suite. If the focused set is stable, you may not need to rerun the slow review every time.
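The short-circuit logic above can be sketched as a small runner that executes layers cheapest-first and stops at the first failure. This is a hypothetical structure, not an API from the article; each layer is just a named zero-argument check:

```python
def run_layers(layers):
    """Run eval layers cheapest-first; stop at the first failure.

    `layers` is a list of (name, check) pairs, where each check is a
    zero-argument callable returning True on pass.
    """
    for name, check in layers:
        if not check():
            return f"{name} failed"
    return "all layers passed"

def slow_review():
    # Stand-in for the expensive human-review layer; it should never
    # run when a cheaper layer has already failed.
    raise RuntimeError("slow review should not run when focused fails")

result = run_layers([
    ("smoke", lambda: True),
    ("focused", lambda: False),  # fails, so the slow review never runs
    ("review", slow_review),
])
assert result == "focused failed"
```

The point of the ordering is economic: a failing smoke check saves you the cost of the focused set, and a failing focused set saves you a human's time.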

Measure what changes decisions

A common mistake is to optimize for a single aggregate score. That can hide the cases that matter most. For agentic coding, the more useful measures are often operational:

  • pass rate on a narrow task class
  • number of retries before success
  • time to a reviewable patch
  • rate of tool misuse
  • frequency of silent failures

These are examples, not a universal list. Use metrics that connect to work. If a number does not change what the team does, it is probably not worth tracking at high resolution.
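Two of these metrics can be computed from plain run records. A minimal sketch, assuming a hypothetical `RunRecord` shape; the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    task_class: str
    passed: bool
    retries: int

def pass_rate(records, task_class):
    """Pass rate on one narrow task class, not an aggregate over everything."""
    relevant = [r for r in records if r.task_class == task_class]
    return sum(r.passed for r in relevant) / len(relevant) if relevant else None

def mean_retries(records):
    """Average retries before success, counted over passing runs only."""
    passed = [r for r in records if r.passed]
    return sum(r.retries for r in passed) / len(passed) if passed else None

records = [
    RunRecord("rename-symbol", True, 0),
    RunRecord("rename-symbol", False, 2),
    RunRecord("add-test", True, 1),
]
assert pass_rate(records, "rename-symbol") == 0.5
assert mean_retries(records) == 0.5
```

Keeping the metrics per task class is the important choice here; a single pooled number is exactly the aggregate score the section warns against.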

Keep a small set of representative failures too. When a model regresses, the fastest way to understand it is often to inspect a few concrete cases rather than a summary chart.

Expect tradeoffs

Fast evals are not free. The main tradeoff is fidelity. A smaller, cheaper eval can miss edge cases. A local setup can differ from production conditions. A narrow task set can overfit to the benchmark.

That does not make fast evals bad. It means they should be treated as a filter, not a final verdict. Use them to catch obvious problems early, then reserve slower or more realistic checks for the decisions that justify the cost.

Another tradeoff is maintenance. Once an eval becomes part of the workflow, it needs care. If the task distribution changes, the eval can drift. If the prompt or tool interface changes, the old cases may stop reflecting reality. A stale eval is worse than no eval, because it gives false confidence.

A practical setup

If you are building this into an agentic coding workflow, start small:

  1. Pick one recurring failure mode.
  2. Write 5 to 20 cases that expose it.
  3. Define one pass/fail rule that a teammate can apply consistently.
  4. Make the run cheap enough to repeat during development.
  5. Keep a short note on what the eval is meant to decide.

That last step matters. Without it, teams forget why the eval exists and start treating the score as the goal.
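The five steps above can fit in one small file. A hypothetical sketch, assuming the failure mode is "invented import paths"; the case data, rule, and stand-in model are all illustrative:

```python
# Step 5: the note on what this eval is meant to decide.
DECISION = "Did the prompt change reduce invented import paths?"

# Steps 1-2: one failure mode, a handful of cases that expose it.
CASES = [
    {"input": "add a date helper", "must_not_contain": "from utils.dates import"},
    {"input": "add a retry wrapper", "must_not_contain": "import retry_lib"},
]

def passes(output: str, case: dict) -> bool:
    # Step 3: one pass/fail rule a teammate could apply by hand.
    return case["must_not_contain"] not in output

def run(model, cases=CASES):
    # Step 4: cheap enough to rerun after every change.
    results = [passes(model(c["input"]), c) for c in cases]
    return sum(results) / len(results)

# Usage with a stand-in model (a real run would call your agent here):
fake_model = lambda prompt: f"def helper():\n    pass  # for: {prompt}"
assert run(fake_model) == 1.0
```

Even 5 to 20 cases with a rule this blunt will catch a regression in the named failure mode, which is all the inner loop needs to do.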

Review the result, not just the number

A score can tell you that something changed. It cannot always tell you why. When the result shifts, inspect the examples. Look for patterns in tool use, prompt wording, context size, or task shape. Often the fix is not a bigger benchmark. It is a better split between the cases you can automate and the cases that still need review.

That is also why the Review step matters here. A quick human pass on a few failures can be more informative than another round of aggregate scoring.
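A quick way to prepare that human pass is to bucket failures by a simple key before reading them. A sketch, assuming hypothetical failure records with `tool` and `context_tokens` fields:

```python
from collections import Counter

def failure_patterns(failures, key):
    """Count failures by a grouping key to surface patterns quickly."""
    return Counter(key(f) for f in failures)

# Hypothetical failure records from an eval run.
failures = [
    {"tool": "edit_file", "context_tokens": 9000},
    {"tool": "edit_file", "context_tokens": 12000},
    {"tool": "run_tests", "context_tokens": 3000},
]

by_tool = failure_patterns(failures, key=lambda f: f["tool"])
assert by_tool["edit_file"] == 2

# The same helper works for other keys, e.g. coarse context-size buckets:
by_size = failure_patterns(failures, key=lambda f: f["context_tokens"] // 5000)
```

Grouping first turns "read 40 transcripts" into "read the 3 that represent each cluster", which is what makes the human pass cheap enough to repeat.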

What to keep in mind

Fast evals work best when they are close to the work, cheap to rerun, and tied to a real decision. They are not a substitute for deeper validation. They are a way to keep the team moving without guessing.

If your evals are slowing down the loop, they are probably too heavy. If they are not changing behavior, they are probably too vague. The useful version is small, specific, and easy to repeat.
