Tools That Hold Up

A practical look at coding tools that stay useful after the demo.

Rogier Muller · April 11, 2026 · 5 min read

A tool can look strong in a short demo and still fail in daily use. That gap matters more in agentic coding than in most software categories. The real test is not whether a model can produce a convincing answer once. It is whether the tool keeps working when the task is messy, the repo is old, the tests are incomplete, and the team needs repeatable results.

The premise is simple: some coding tools are expensive, but price is not the main issue. The bigger question is whether the tool earns its place in a loop. If it saves time only when the task is clean, it is not durable. If it helps on partial context, ambiguous prompts, and multi-step edits, it is closer to production value.

What “holds up” means

For agentic coders, a tool holds up when it improves one of three things without adding friction: task completion, review quality, or iteration speed. Many tools only improve the first demo of a task. They produce a plausible patch, then fall apart on follow-up edits, cross-file changes, or test repair.

A practical definition is this: after the novelty wears off, can the tool still help a developer move from intent to merged change with less rework? If yes, it is useful. If no, it is a cost center with a good landing page.

The checks that matter

Teams do not need a huge benchmark suite to judge this. They need a few repeatable checks that reflect real work.

  • Can it handle a medium-sized change across multiple files?
  • Does it preserve local conventions, or rewrite code into a new style?
  • Can it recover after a failed step without losing the thread?
  • Does it make review easier, or create more cleanup?
  • Does it still help when the repo has weak tests or unclear boundaries?
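The checks above can be tracked as a small scoring rubric. This is a minimal sketch, assuming a simple pass/fail record per trial; names like `ToolTrial` and `CHECKS` are illustrative, not part of any tool's API.

```python
# Hypothetical rubric: score a tool by the fraction of trials
# passing each check, so systematic weaknesses stand out.
from dataclasses import dataclass, field

CHECKS = [
    "multi_file_change",       # handles a medium-sized change across files
    "preserves_conventions",   # keeps local style instead of rewriting it
    "recovers_from_failure",   # picks the thread back up after a failed step
    "eases_review",            # output is easier, not harder, to review
    "weak_test_tolerance",     # still helps when tests or boundaries are weak
]

@dataclass
class ToolTrial:
    task: str
    results: dict[str, bool] = field(default_factory=dict)

def score(trials: list[ToolTrial]) -> dict[str, float]:
    """Fraction of trials passing each check; missing checks count as fails."""
    totals = {c: 0 for c in CHECKS}
    for t in trials:
        for c in CHECKS:
            totals[c] += int(t.results.get(c, False))
    n = max(len(trials), 1)
    return {c: totals[c] / n for c in CHECKS}

trials = [
    ToolTrial("rename config module", {"multi_file_change": True, "eases_review": True}),
    ToolTrial("fix flaky test", {"recovers_from_failure": False, "weak_test_tolerance": True}),
]
print(score(trials))
```

Per-check fractions, rather than one aggregate number, keep the next paragraph's point visible: a tool can be strong on first-pass output and still fail on recovery or review.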

These checks are more useful than raw output quality because they expose the hidden cost of adoption. A tool that is slightly weaker on first-pass code but much better at staying on task may outperform a flashier system in practice.

Where expensive tools can still be worth it

A high price is not automatically a problem. It becomes a problem when the tool does not reduce enough labor to justify the spend. The cases where teams usually get value are narrow but real.

One is long, multi-step work where context retention matters. Another is tasks that require repeated edits and verification, not just one-shot generation. A third is situations where the tool helps a developer stay in flow by reducing manual search, file hopping, or repetitive patching.

That said, expensive tools often fail in predictable ways. They may be strong at synthesis but weak at precision. They may produce good-looking plans but shaky implementation. They may work well in a fresh repo and degrade in a legacy codebase. None of that is surprising. It just means the evaluation has to match the work.

A practical adoption pattern

If a team wants to test whether a tool holds up, start small and use real tasks.

  1. Pick three recent changes that were representative, not ideal.
  2. Run them through the tool with the same constraints the team normally uses.
  3. Measure how much human correction was needed.
  4. Check whether the output was easy to review.
  5. Repeat after a week, not just on day one.
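The steps above amount to a before/after comparison. Here is a minimal sketch, assuming each task run records the minutes of human correction (step 3) and a reviewer's verdict (step 4); all names are illustrative.

```python
# Hypothetical adoption check: summarize runs so that day-one results
# can be compared against the same tasks a week later (step 5).
from dataclasses import dataclass
from statistics import mean

@dataclass
class TaskRun:
    task: str
    correction_minutes: float   # step 3: how much human cleanup was needed
    easy_to_review: bool        # step 4: reviewer's verdict

def summarize(runs: list[TaskRun]) -> dict[str, float]:
    return {
        "avg_correction_minutes": mean(r.correction_minutes for r in runs),
        "review_pass_rate": sum(r.easy_to_review for r in runs) / len(runs),
    }

day_one = [
    TaskRun("api refactor", 25, True),
    TaskRun("bugfix", 10, True),
    TaskRun("migration", 40, False),
]
week_later = [
    TaskRun("api refactor", 20, True),
    TaskRun("bugfix", 12, True),
    TaskRun("migration", 55, False),
]

print(summarize(day_one))
print(summarize(week_later))
```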

That last step matters. Many tools look better in the first session because the team is still adapting its prompts and expectations. A week later, the question is whether the tool still fits the workflow when the novelty is gone.

Tradeoffs to expect

There is no free lunch here. Tools that are more autonomous can save time, but they can also create more uncertainty about what changed and why. Tools that are more constrained may be slower, but they can be easier to trust. Tools that are good at broad reasoning may still need strong guardrails for edits, tests, and file selection.

This is why teams should avoid judging tools on a single axis. Speed without reviewability is risky. Reviewability without speed may not justify adoption. The right balance depends on the task class.

A useful rule: if the tool cannot explain its own changes clearly enough for a teammate to review, the output is not ready for routine use. That does not mean the tool is bad. It means it is not fit for the part of the loop where teams spend the most time.

What to document while testing

Keep notes on the failures, not just the wins. Record where the tool drifted, where it needed manual steering, and where it saved time in a way the team could see. A short review note linked to our methodology is often enough to make the next comparison more honest.

The point is not to build bureaucracy. It is to avoid memory bias. Teams remember the impressive demo and forget the cleanup.
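One lightweight way to beat memory bias is to append one structured note per incident to a log file. This is a sketch under assumptions: the file name, the note fields, and the `kind` labels are all made up for illustration.

```python
# Hypothetical failure-first note keeping: one JSON Lines entry per
# incident (drift, manual steering, or visible time saved), so the next
# comparison rests on records instead of recollection.
import json
from pathlib import Path

NOTES = Path("tool-eval-notes.jsonl")

def log_note(tool: str, kind: str, detail: str) -> None:
    """kind: 'drift', 'manual_steering', or 'time_saved'."""
    with NOTES.open("a") as f:
        f.write(json.dumps({"tool": tool, "kind": kind, "detail": detail}) + "\n")

log_note("agent-x", "drift", "rewrote an untouched helper into a new style")
log_note("agent-x", "time_saved", "found the failing migration without manual search")
```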

Bottom line

AI coding tools are not judged well by their best case. They are judged by whether they keep helping after the first patch, the first failure, and the first review. The tools that hold up are usually not the loudest ones. They are the ones that stay useful when the work becomes ordinary, repetitive, and slightly messy.

That is the standard worth using. Not “can it do something impressive?” but “can it keep doing useful work when the repo stops being polite?”
