The cheapest model that passes the eval wins

Most teams come into the build with a model already chosen, usually the frontier model of whichever vendor sold them their last engagement. The eval does not care which vendor was in the room. The eval cares which model passes.

The matrix we run, not the matrix we read about

For every production path, we score a matrix of candidates against the eval. The matrix tends to include three frontier-class models from different vendors, two or three mid-tier models, one small fast model, and at least one open-weights option deployed on our own infrastructure. The score is not a single number. It is four columns: quality against the eval, p95 latency, dollars per million tokens, and behavior under malformed or adversarial input.

A model does not "win" by scoring highest on quality. A model wins by being the cheapest one that passes the thresholds the business wrote down.
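The selection rule can be sketched in a few lines of Python. This is a minimal illustration, not our harness: the Scorecard and Thresholds types, their field names, and every number in the example matrix are assumptions made up for the sketch. The only part taken from the rule itself is the logic: filter by the thresholds, then take the cheapest survivor.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Scorecard:
    """One row of the matrix: the four columns for one candidate (hypothetical shape)."""
    model: str
    quality: float                # fraction of eval cases passed, 0.0 to 1.0
    p95_latency_ms: float
    usd_per_mtok: float           # dollars per million tokens
    adversarial_pass_rate: float  # fraction of malformed/adversarial cases handled


@dataclass(frozen=True)
class Thresholds:
    """The bar the business wrote down (hypothetical shape)."""
    min_quality: float
    max_p95_latency_ms: float
    min_adversarial_pass_rate: float


def passes(card: Scorecard, t: Thresholds) -> bool:
    """A candidate passes only if it clears every threshold."""
    return (
        card.quality >= t.min_quality
        and card.p95_latency_ms <= t.max_p95_latency_ms
        and card.adversarial_pass_rate >= t.min_adversarial_pass_rate
    )


def cheapest_passing(cards: Sequence[Scorecard], t: Thresholds) -> Optional[Scorecard]:
    """Return the cheapest candidate that passes, or None if nothing passes."""
    passing = [c for c in cards if passes(c, t)]
    return min(passing, key=lambda c: c.usd_per_mtok, default=None)


# Illustrative numbers only; these are not measured results.
matrix = [
    Scorecard("frontier", 0.95, 2400.0, 2.25, 0.97),
    Scorecard("mid-tier", 0.91, 800.0, 0.45, 0.93),
    Scorecard("small-fast", 0.78, 300.0, 0.10, 0.80),
]
bar = Thresholds(min_quality=0.90, max_p95_latency_ms=3000.0, min_adversarial_pass_rate=0.90)

winner = cheapest_passing(matrix, bar)
print(winner.model if winner else "no candidate passes")
```

Note that quality never appears in the sort key. Once a candidate clears the bar, extra quality buys nothing, and cost alone breaks the tie.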

What tends to actually win

In the builds we have shipped, the frontier model wins on quality more often than not. It wins on the full scorecard less often than the room expects.

The pattern that keeps repeating: a mid-tier model with a tighter prompt and one structured-output retry passes the same threshold as the frontier model, runs at a third of the latency, and lands the production path under $0.50 per million tokens. The frontier model lands the same path at roughly four to five times that cost. The eval does not care that the frontier model was more impressive in isolation. The eval picks the cheaper one because that is the rule.
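For readers who have not built one, a structured-output retry can be as small as the sketch below. It assumes the model is reachable through a plain callable that takes a prompt string and returns a string; call_model, required_keys, and the wording of the retry message are all hypothetical names chosen for illustration, not any vendor's API.

```python
import json
from typing import Callable, Sequence


def call_with_structured_retry(
    call_model: Callable[[str], str],  # hypothetical: prompt in, raw text out
    prompt: str,
    required_keys: Sequence[str],
    max_retries: int = 1,
) -> dict:
    """Ask for a JSON object; if the reply does not validate, retry with the error attached."""
    attempt_prompt = prompt
    last_error = None
    for _ in range(max_retries + 1):
        raw = call_model(attempt_prompt)
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
            missing = [k for k in required_keys if k not in parsed]
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return parsed
        except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
            last_error = exc
            # Tell the model what went wrong instead of repeating the same prompt blindly.
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply was rejected ({exc}). "
                "Reply with a single valid JSON object only."
            )
    raise RuntimeError(
        f"no valid structured output after {max_retries + 1} attempts"
    ) from last_error
```

The retry costs one extra call on the failing fraction of requests, so the eval should score the path with the retry included, in both the latency column and the cost column.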

Why this matters beyond the cost line

The model bill is the visible win. It is not the most important one.

The most important win is portability. When a new model lands — and one lands every six to ten weeks now — re-running the eval matrix is a Wednesday afternoon, not a migration project. The team owns the harness. The harness picks the model. There is no vendor lock-in to negotiate out of, because the eval is the thing the system was built against, not the model.
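To make the portability point concrete, here is a minimal sketch of a harness in which adding a model is one more dictionary entry. It assumes each candidate is wrapped as a callable from prompt to output and each eval case is a prompt paired with a checker function; run_matrix, p95, and the EvalCase shape are hypothetical names for this sketch. It covers only the quality and latency columns and uses a simple nearest-rank p95.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical case shape: a prompt plus a function that says whether the output is acceptable.
EvalCase = Tuple[str, Callable[[str], bool]]


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def run_matrix(
    candidates: Dict[str, Callable[[str], str]],
    cases: List[EvalCase],
) -> Dict[str, Dict[str, float]]:
    """Run every candidate over every eval case; report quality and p95 latency."""
    if not cases:
        raise ValueError("the eval needs at least one case")
    results: Dict[str, Dict[str, float]] = {}
    for name, call_model in candidates.items():
        latencies: List[float] = []
        passed = 0
        for prompt, check in cases:
            start = time.perf_counter()
            output = call_model(prompt)
            latencies.append(time.perf_counter() - start)
            if check(output):
                passed += 1
        results[name] = {
            "quality": passed / len(cases),
            "p95_latency_s": p95(latencies),
        }
    return results
```

Because the harness knows nothing about any vendor, a new release is evaluated by registering one more callable in candidates and re-running the same cases.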

The second-most-important win is honesty under stress. Production load is not the load the demo ran under. In our experience, smaller models tend to degrade more predictably when load goes up, retries fire, or the input gets weird. The matrix exposes that in week three instead of in the first incident.

What this is not

This is not an argument against frontier models. Frontier when we must. The agents that need long-horizon reasoning, the workflows where a wrong answer is catastrophic, the tasks where the cheaper models genuinely fail the eval — those run on the frontier model. The decision is made by the scorecard, not by the salesperson who landed the room first.

The factory rule is straightforward: the eval makes the call. The cheapest model that passes wins. The vendor relationship is downstream of the eval, never upstream.