I write about software development, AI-assisted engineering workflows, architecture, automation, running, and the occasional side trail.

Stronger Models Help Most After the Problem Has Been Shaped

One of the more subtle failures in AI-assisted development is not that the model gives a bad answer. It is that the model gives a good-looking answer before the team has found the right question. That failure is especially easy to miss in architecture work, where a polished decomposition can make unresolved boundaries feel settled, a confident recommendation can make tradeoffs look evaluated, and a clean diagram can make a design look real before its responsibilities have survived allocation.

The risk is not simply that AI might be wrong. The more interesting risk is that its output can resemble judgment while quietly skipping the work that judgment requires. In earlier posts I wrote about stop conditions, multi-model review, and the way AI makes missing judgment more expensive. This article continues that thread by looking at a related pattern: stronger models can be extremely useful, but their usefulness depends heavily on when they enter the workflow and what role they are allowed to play.

That is why I am increasingly skeptical of automatically reaching for the strongest and most expensive AI model first for complex software design work. Model capability matters, but capability applied too early can produce persuasive structure around an under-shaped problem. Cost also matters, although not in the simplistic sense of always preferring the cheaper option. If a model is substantially more expensive to use, it should be brought in where its additional reasoning capacity is likely to change the outcome, not merely where it can produce the first plausible draft. At the same time, underinvesting in reasoning for decisions that will shape a long-lived system can be far more expensive than the model cost being avoided. In design work, the better question is how to match model capability, model cost, workflow maturity, and decision consequence.

AI Makes Missing Judgment More Expensive

AI has made it cheaper to produce work that looks complete. It has not made judgment cheaper. That difference explains much of the disappointment, cost, and confusion now appearing around AI in software development.

The problem is not that AI cannot produce useful work. I use it heavily, and it can be extremely valuable when it is constrained by a strong engineering process. The problem starts when generated output is treated as evidence that the engineering work has been completed. A system can now accumulate code, tests, documentation, plans, reviews, and diagrams faster than the organization can reliably evaluate them.

That is the larger pattern behind several of my recent posts on deterministic execution, browser verification, stop conditions, and review workflows using more than one model. Those posts described specific techniques. The broader reason those techniques matter is that AI has moved the bottleneck. The limiting factor is increasingly less about producing artifacts and more about knowing whether those artifacts are correct, coherent, maintainable, and worth keeping.

Why AI Review Needs More Than One Model

In earlier posts, I wrote about deterministic execution, browser verification, and stop conditions for AI-driven workflows. Those addressed specific failure modes around execution and validation. Over time, another category of problems started to appear during review.

Using one AI to review the output of another was not a recent change in my workflow. Before cross-ai-review became a released workflow, I was already routinely having one model review implementation plans, prompts, architecture notes, and generated artifacts produced by another model before execution continued. In many cases, I would send the same instructions through multiple systems first to see where the interpretations diverged.

What changed was the degree of structure around it. The workflows moved from ad hoc experimentation into something more repeatable and governed. Different models surfaced different classes of issues, fixes introduced regressions, and ambiguity propagated downstream into later artifacts. The released cross-ai-review workflow was simply a formalization of patterns that had emerged through repeated use.
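The idea of sending the same instructions through several systems and looking for divergence can be sketched in a few lines. This is only an illustration, not the released cross-ai-review workflow: the function names, the model labels, and the assumption that each model's interpretation has already been reduced to a list of planned steps are all hypothetical.

```python
from itertools import combinations


def normalize(step: str) -> str:
    """Lowercase and collapse whitespace so trivial wording differences are ignored."""
    return " ".join(step.lower().split())


def divergences(interpretations: dict[str, list[str]]) -> dict[tuple[str, str], set[str]]:
    """For each pair of models, return the steps that only one of the two listed.

    `interpretations` maps a (hypothetical) model label to the steps that model
    said it would take for the same instructions.
    """
    result: dict[tuple[str, str], set[str]] = {}
    for a, b in combinations(sorted(interpretations), 2):
        steps_a = {normalize(s) for s in interpretations[a]}
        steps_b = {normalize(s) for s in interpretations[b]}
        only_one_side = steps_a ^ steps_b  # symmetric difference
        if only_one_side:
            result[(a, b)] = only_one_side
    return result


# Hypothetical interpretations of the same instructions from two models.
interpretations = {
    "model_a": ["Add retry to the upload client", "Update unit tests"],
    "model_b": ["add retry to the upload client", "Update unit tests", "Change the public API"],
}

report = divergences(interpretations)
for pair, steps in report.items():
    print(pair, sorted(steps))
```

A non-empty report does not say which model is right. It only marks where the instructions left room for more than one reading, which is the point at which a person should tighten the instructions before execution continues.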

Running AI Where It Doesn’t Exist

The Problem

I was working on a Windows target where Claude simply would not run. This was not a degraded experience or a partial failure. It would not start at all due to a dependency issue. At the same time, the code I was building had to run in that environment, so avoiding the platform was not an option.

The obvious next step was to try to fix Claude on that system. I spent some time going down that path, but it soon became clear that there would be no quick fix. Even if I managed to get it working once, there was no guarantee it would continue working across updates or configuration changes. At that point, the problem started to look different.

Why AI Review Needs Stop Conditions

In the previous post, I described how structured manifests and browser verification make execution deterministic.

That solves execution. It does not solve review.

Deterministic execution without deterministic review is incomplete.