Specs, Tests, Stable Stacks

Clear specs, good tests, and stable stacks make agentic coding more reliable.

Rogier Muller · April 7, 2026 · 5 min read

Agentic coding works better when the task is constrained. The practical point is simple: write detailed specs, add automated tests, and prefer older, well-understood technologies where possible.

That is not a universal rule. It is a workflow pattern. Agents do better when the target is explicit, the feedback loop is fast, and the codebase does not force them to infer too much from unstable conventions.

The basic idea is to reduce ambiguity before the agent starts, then reduce it again after it makes changes.

Why specs matter first

A coding agent is not a product manager. It will not reliably infer hidden requirements from a vague request. If the task is “improve checkout,” the agent has to guess what counts as done, what edge cases matter, and what tradeoffs are acceptable. That guesswork is where quality drops.

A useful spec does three things:

  • states the user-visible goal
  • lists constraints and non-goals
  • defines acceptance criteria in testable terms

For example, instead of asking for “better form validation,” specify which fields must validate, when errors should appear, what should happen on submit, and which cases are out of scope. The more concrete the spec, the less the agent has to invent.
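One way to make that concrete is to write the acceptance criteria down as tests before the agent starts. A minimal sketch, assuming a hypothetical `validate_email` helper that returns a list of error messages (the function name, messages, and scope are illustrative, not a prescribed API):

```python
# Acceptance criteria for "better form validation", encoded as tests.
# validate_email and its error messages are hypothetical, for illustration.

def validate_email(value: str) -> list[str]:
    """Return a list of error messages; an empty list means valid."""
    errors = []
    if not value:
        errors.append("Email is required.")
    elif "@" not in value:
        errors.append("Email must contain '@'.")
    return errors

# In scope: the required check and a basic format check.
def test_empty_email_is_rejected():
    assert validate_email("") == ["Email is required."]

def test_missing_at_sign_is_rejected():
    assert validate_email("user.example.com") == ["Email must contain '@'."]

def test_plain_valid_email_passes():
    assert validate_email("user@example.com") == []

# Explicit non-goal: full RFC 5322 parsing or DNS lookups are out of scope.
```

The tests double as the spec: each one states a criterion, and the comment on the last line records a non-goal so the agent does not over-build.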

This also helps human reviewers. A good spec becomes the shared reference point when the implementation drifts.

Why tests are the real guardrail

Automated tests are not just a quality check at the end. In agentic workflows, they are part of the control system.

Agents are good at making plausible edits. They are less good at knowing whether those edits preserve behavior across the full surface area of the app. Tests turn that uncertainty into a signal. If the agent changes code and the tests fail, the failure is immediate and specific. If the tests pass, you still need review, but the search space is smaller.

The strongest pattern is to add or update tests before or alongside the implementation. That gives the agent a target and gives you a way to verify the result. In practice, this often means:

  • unit tests for pure logic
  • integration tests for key flows
  • end-to-end tests for user-critical paths
  • regression tests for bugs you do not want to see again
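The last category is the cheapest to add and the easiest to forget. A minimal sketch of a regression test, assuming a hypothetical `paginate` helper that once had an off-by-one bug dropping the final partial page (both the helper and the bug are illustrative):

```python
# Regression test pinning a fixed bug: a hypothetical paginate() helper
# used to drop the final partial page. The names here are illustrative.

def paginate(items, page_size):
    """Split items into pages of at most page_size elements."""
    return [items[i:i + page_size] for i in range(0, len(items), page_size)]

def test_final_partial_page_is_not_dropped():
    # The old bug: 7 items with page_size=3 yielded 2 pages instead of 3.
    pages = paginate(list(range(7)), 3)
    assert len(pages) == 3
    assert pages[-1] == [6]
```

Once a test like this exists, an agent that reintroduces the bug fails fast, with a message that points at the exact behavior it broke.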

The limitation is obvious: tests only help if they are meaningful. A large suite of brittle or low-signal tests can slow the loop without improving confidence.

Why older tools can help

The common advice here points toward “old industry standard technology.” That should not be read as “avoid modern tools.” It means preferring stable, familiar, well-documented stacks when the goal is agent reliability.

Older tools often have clearer conventions, fewer moving parts, and more examples in the wild. That makes them easier for both humans and agents to reason about. A mature framework with predictable patterns can be easier to modify safely than a newer stack with shifting idioms and sparse edge-case coverage.

This is especially true when the task involves:

  • large refactors
  • multi-file edits
  • test-driven changes
  • code that must be maintained by a team, not just generated once

The tradeoff is that older tools may lag on ergonomics, performance, or ecosystem features. If the team already depends on newer tooling, forcing a rollback can create more friction than it removes. The point is not nostalgia. The point is predictability.

A practical workflow

A workable agentic process looks like this:

  1. Write a short spec with acceptance criteria.
  2. Add or update tests that capture the desired behavior.
  3. Ask the agent to implement against those constraints.
  4. Run the tests and inspect the diff.
  5. Tighten the spec if the result misses intent.
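The loop in steps 2-5 can be sketched as a small driver. This is a minimal sketch, not a real harness: `implement_with_agent` and `run_tests` are stand-ins you would wire to your actual agent and test runner, and the spec-tightening step is reduced to an annotation:

```python
# A minimal sketch of the spec -> implement -> verify loop.
# implement_with_agent() and run_tests() are hypothetical stand-ins.

def run_loop(spec, run_tests, implement_with_agent, max_rounds=3):
    """Drive implementation rounds until the tests pass or rounds run out."""
    for round_number in range(1, max_rounds + 1):
        implement_with_agent(spec)          # step 3: implement against the spec
        if run_tests():                     # step 4: run tests, inspect the diff
            return f"passed in round {round_number}"
        spec = spec + " (tightened after failing tests)"  # step 5
    return "needs human review"             # cap the loop; escalate to a person
```

The bounded `max_rounds` is the important design choice: the loop either converges quickly or hands the problem back to a human, rather than letting the agent churn.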

This loop is boring on purpose. Boring is useful when the system is making code changes on your behalf.

If you are using an editor-integrated assistant or a CLI agent, the pattern stays the same. The interface changes. The control points do not.

Where this breaks down

There are real limits to this approach.

First, some tasks are too exploratory for detailed specs. If you are still discovering the shape of the problem, over-specifying can freeze the wrong design.

Second, tests do not capture everything. Visual polish, performance nuance, and product judgment still need human review.

Third, older technologies can become a drag if the team treats stability as a reason to avoid necessary upgrades. Predictability is helpful; stagnation is not.

So the rule is not “always use old tools.” It is “prefer the least surprising stack that still solves the problem.”

A useful way to think about it

Agentic coding improves when the work is made legible. Specs make intent legible. Tests make behavior legible. Mature tools make the codebase legible.

That combination does not remove the need for judgment. It just gives the agent fewer chances to wander.

Methodology note

This is mostly a Test problem: if the agent can verify its own changes quickly, the workflow becomes much easier to trust. That is the same reason we keep test design close to implementation in our methodology.

Bottom line

If your agentic workflow feels flaky, do not start by blaming the model. Start by checking the inputs and the feedback loop.

Write the task down clearly. Encode the expected behavior in tests. Use a stack the team can reason about without guessing. Those three moves will not solve every problem, but they remove a lot of avoidable failure.
