
Stop Adding Bug Tests

One bug reproduction test is not enough. Turn incidents into invariant coverage.

Rogier Muller · April 18, 2026 · 5 min read

A common reflex after a production bug is to add a test that reproduces the failure. It feels right. The bug is fixed, the test passes, and the issue should not return.

As a long-term habit, though, this is weak. If every production bug becomes a one-off test, the suite fills up with narrow edge cases. Many of those tests lock in the exact bad input, broken path, or accidental state that caused the incident. That can help once. It is not a good default.

The problem is not bug-reproduction tests themselves. The problem is treating them as the final step. They catch one failure mode, but they rarely improve the system model. Over time, you get a large suite that is harder to maintain and still misses nearby failures.

Why the pattern breaks down

A bug-specific test usually comes from the incident itself: a payload, a call sequence, a UI path, or a data shape that failed in production. That makes it precise. It also makes it brittle.

If the implementation changes, the test may fail for reasons unrelated to the original bug. If the bug came from a family of inputs, the test only covers one point in that family. If the bug exposed a missing invariant, the test may pass while the real gap stays open.
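As a sketch of what such a brittle reproduction test looks like, assume a bug where a `None` email field crashed a normalizer; the function name and payload shape here are hypothetical:

```python
# Hypothetical example: a reproduction test pinned to the exact incident payload.
# normalize_user and the payload shape are assumptions for illustration.

def normalize_user(payload: dict) -> dict:
    """Normalize a raw user payload; the original bug crashed on a None email."""
    email = payload.get("email") or ""
    return {"id": payload["id"], "email": email.strip().lower()}

def test_incident_payload():
    # The exact payload from the incident: precise, but only one point
    # in the family of bad inputs (missing, empty, whitespace-only, ...).
    payload = {"id": 42, "email": None}
    assert normalize_user(payload)["email"] == ""
```

The test is accurate for that one payload, but it says nothing about empty strings, wrong types, or whitespace-only values in the same family.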

This shows up fast in agentic coding workflows, where tests are easy to generate. Speed helps, but it also makes overfitting easy. A reproduction test can be written in minutes. That does not mean it is the right protection.

A better sequence

Treat the production bug as a signal, not the final test case.

First, reproduce the issue in the smallest useful form. That may be a test, but it may also be a script, a fixture, or a manual check while you understand the failure.

Then ask what actually broke:

  • a missing invariant
  • an unhandled input class
  • a race or ordering assumption
  • a boundary condition
  • a contract mismatch between layers

Once you know that, write the test at the level of the invariant, not the incident. Protect the rule that should have prevented the bug, not just the path that exposed it.

For example, if a bug came from a null field in a payload, a narrow test might check only that one payload. A stronger test might cover absent, empty, and malformed values. If the bug came from a state transition happening too early, the better test is usually about allowed transitions, not the one timing sequence you saw in prod.
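The null-field case above can be sketched as a test over the input class rather than the single incident value; `normalize_email` and the case table are assumptions for illustration:

```python
# Hypothetical sketch: cover the input class (absent, empty, malformed),
# not just the one null value seen in production.

def normalize_email(raw) -> str:
    """Invariant: always return a stripped, lowercase string, never raise."""
    if not isinstance(raw, str):
        return ""
    return raw.strip().lower()

def test_email_normalization_invariant():
    cases = [
        (None, ""),                # the original incident
        ("", ""),                  # empty string
        ("   ", ""),               # whitespace only
        (123, ""),                 # wrong type
        (" A@B.COM ", "a@b.com"),  # happy path still holds
    ]
    for raw, expected in cases:
        assert normalize_email(raw) == expected
```

A table of cases like this survives implementation changes better than one pinned payload, because it states the rule directly.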

What to keep from the bug-specific test

You do not need to throw away the reproduction. Keep it if it is useful as a regression check or a debugging artifact. But do not stop there.

A practical rule is to keep three layers in mind:

  1. The exact reproduction, if it is cheap and stable.
  2. The invariant test, which covers the broader rule.
  3. The higher-level check, such as an integration or end-to-end test, if the bug crossed boundaries.

Not every bug needs all three. But if the only test you add is the exact reproduction, you are probably under-testing the real risk.
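A minimal sketch of the first two layers for one hypothetical bug, a crash on comma-grouped amounts; the function and names are illustrative, and the third layer is described in a comment because it depends on surrounding infrastructure:

```python
# Hypothetical sketch of the layered approach for one bug.

def parse_amount(raw: str) -> int:
    """Parse a money amount in whole units; the incident was a crash on '1,000'."""
    return int(raw.replace(",", ""))

# Layer 1: the exact reproduction, kept because it is cheap and stable.
def test_repro_comma_amount():
    assert parse_amount("1,000") == 1000

# Layer 2: the invariant -- any comma-grouped digit string parses correctly.
def test_grouping_invariant():
    for raw, expected in [("1,000", 1000), ("10,000", 10000), ("1,000,000", 1000000)]:
        assert parse_amount(raw) == expected

# Layer 3 would be an integration test exercising the real entry point
# (e.g. a checkout path) that feeds user input through parse_amount;
# it only earns its cost if the bug crossed that boundary.
```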

How this helps agentic teams

Agentic teams often optimize for throughput: generate code, generate tests, verify, move on. That works until the suite turns into a record of old incidents. Then every change has to satisfy a growing set of tests that each encode a local story.

A better workflow is to let the agent draft the reproduction, then ask: what is the invariant here? What class of inputs should this cover? What is the smallest test that would have failed before the bug, but still makes sense after the implementation changes?

This is a good place for a short review step. In our methodology, that maps to Review: check whether the test protects the rule or only the incident.

Practical implementation steps

When a bug lands in production, use this sequence:

  • Reproduce the failure in the smallest form you can.
  • Write down the invariant the system should have upheld.
  • Decide whether the right protection is unit, integration, or end-to-end.
  • Prefer a test over the whole input class, not a single input, when the bug came from a family of inputs.
  • Keep the exact reproduction only if it adds debugging value or guards a fragile path.
  • Remove or merge tests that duplicate the same invariant in slightly different clothing.
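The state-transition case mentioned earlier can be sketched as a test of allowed transitions rather than the one timing sequence from the incident; the `Order` class and `ALLOWED` table are assumptions for illustration:

```python
# Hypothetical sketch: protect the transition rule, not one observed timing.

ALLOWED = {
    "created": {"paid", "cancelled"},
    "paid": {"shipped", "refunded"},
    "shipped": {"delivered"},
}

class Order:
    def __init__(self):
        self.state = "created"

    def transition(self, new_state: str):
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

def test_shipping_before_payment_is_rejected():
    order = Order()
    try:
        order.transition("shipped")  # the incident: shipped before paid
        assert False, "expected ValueError"
    except ValueError:
        pass
    assert order.state == "created"  # state unchanged after rejected move
```

One table of allowed transitions covers every out-of-order sequence, including ones the incident never produced.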

This is not about fewer tests at any cost. It is about better-shaped tests.

Tradeoffs and limits

There are cases where a one-off regression test is the right move. If the bug is rare, costly, and tied to a very specific external condition, the exact reproduction may be the most practical guard you can add. In fast-moving systems, a narrow test can also be a useful stopgap while you design a broader fix.

The risk is accumulation. Once the suite contains many of these tests, it becomes hard to tell which ones protect important behavior and which ones only preserve old incidents. That makes refactoring harder and can slow down agent-driven changes, because the agent has to satisfy a lot of brittle historical detail.

So the rule is not “never add bug tests.” The rule is “do not stop at the bug test.” Use it as a starting point, then turn it into coverage that matches the real failure mode.

The practical takeaway

If a production bug becomes a single exact test, you may be preserving the symptom instead of the cause. Better teams turn incidents into invariants, then into tests that still make sense after the code changes.

That keeps the suite smaller, clearer, and more useful when agents are generating and modifying code quickly.
