What a real user broke on day twelve that no spec would have caught

A proof of concept that nobody outside the build team has touched is not a proof. It is a slide deck with a working demo attached. By week two of every engagement, we put a real human inside whatever exists so far and watch them use it.

Who the user is

Not a beta tester. Not the executive sponsor. Usually one of the operators who would eventually live inside the system — a support lead, a sales engineer, an analyst. Sometimes a single user. Always someone whose day job the system is supposed to change.

We sit with them. We do not interrupt. We do not explain. We watch.

What gets exposed in twenty minutes

On the whiteboard, the architecture is clean. Confidence-gated handoff to a human. Inline citations on every answer. A retry policy for malformed tool calls.

In the first twenty minutes of a real user touching the thing, here is what tends to actually happen. They copy the output into a Word document or an email draft. They ignore the citations. They never trigger the confidence threshold because they have already given up and gone back to the old workflow before the threshold is reached. They retry by hitting the button three times instead of waiting for the structured retry.

None of that was in the spec. All of it changes the architecture.

Three patterns we now expect

The first pattern: output destination matters more than output quality. If the user is going to paste the answer somewhere else, the rendering, the structure, and the metadata of the output have to match where it is going. A confident answer in the wrong format is still extra work.

The second pattern: latency budgets are not abstract. A confidence-gated handoff that takes seven seconds is a confidence-gated handoff that nobody triggers, because the user has already moved on. The latency budget gets written into the eval after week two, never before.

The third pattern: the failure mode the team is most worried about is rarely the one the user encounters first. Engineering teams obsess over hallucination. Users get blocked on data freshness, ambiguous phrasing, and tool-call retries that look like the system froze.

What changes after week two

Every week between "the system exists" and "a real user has touched the system" is a week of architectural debt the team does not yet know it is accruing. The fix is not better discovery. The fix is to shorten the gap.

Architecture decisions get made against a transcript of a real user struggling, not against a diagram on a whiteboard. The transcript is the source of truth. The whiteboard is a draft of what we used to believe.