The 3 a.m. AI runbook

An AI system is not production-ready because it has a deployed endpoint. It is production-ready when an on-call engineer can tell whether quality is degrading, what changed, who owns the decision, and how to roll back without waiting for the team that built it.

That is why the runbook is not paperwork. It is part of the product.

What breaks after the demo

The demo path usually tests one successful answer. Production tests the system boundary all day:

A retrieval path returns the wrong source with high confidence.
A model provider slows down or changes behavior.
A prompt change improves the happy path and breaks refusals.
A customer uploads data the original examples did not cover.
A workflow starts costing 4x more because a frontier model is being used where a smaller model would pass the eval.
A user asks for something the system must escalate, not answer.

Ordinary uptime monitoring will miss most of this. The endpoint can be green while the product is wrong.
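
A minimal sketch of that gap, assuming a sampled-review pass rate is available next to the usual HTTP check. The names HealthSnapshot, product_status, and the 0.9 threshold are illustrative assumptions, not part of any specific monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    http_ok: bool          # what ordinary uptime monitoring sees
    eval_pass_rate: float  # fraction of sampled answers that passed review, 0.0 to 1.0


def product_status(snapshot: HealthSnapshot, min_pass_rate: float = 0.9) -> str:
    """Return 'down', 'degraded', or 'healthy' (hypothetical labels)."""
    if not snapshot.http_ok:
        return "down"
    if snapshot.eval_pass_rate < min_pass_rate:
        # The endpoint is green, but the answers are below the agreed threshold.
        return "degraded"
    return "healthy"


if __name__ == "__main__":
    print(product_status(HealthSnapshot(http_ok=True, eval_pass_rate=0.72)))
```

The point of the sketch is that "degraded" is a state an uptime check alone cannot report.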

The alert has to name the operating failure

An AI runbook that says "LLM error rate high" is not enough. The person on call needs to know what kind of failure they are holding.

Quality drift: eval scores or sampled review results are degrading against a known threshold.
Retrieval failure: the system is answering from weak, stale, or missing sources.
Tool failure: a tool call is malformed, slow, rate-limited, or returning an unexpected schema.
Policy failure: the system is answering a request it should refuse or escalate.
Cost failure: a route is burning tokens outside the budget written for that workflow.
Latency failure: the system is technically correct but too slow for the user to stay in the flow.

Those are different incidents. They have different owners, rollback paths, and customer impact. The runbook has to separate them.
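
One way to make that separation concrete is to attach an owner and a rollback path to each failure type, so the alert text carries both. The sketch below uses the six failure types listed above. The owner names and rollback steps are placeholders we made up for illustration, and every team would fill in its own.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    QUALITY_DRIFT = "quality_drift"
    RETRIEVAL = "retrieval"
    TOOL = "tool"
    POLICY = "policy"
    COST = "cost"
    LATENCY = "latency"


@dataclass(frozen=True)
class Playbook:
    owner: str     # who gets paged (placeholder values below)
    rollback: str  # first rollback step to try (placeholder values below)


# Hypothetical mapping: replace owners and steps with the ones in your runbook.
PLAYBOOKS = {
    FailureKind.QUALITY_DRIFT: Playbook("ml-oncall", "revert to the previous prompt version"),
    FailureKind.RETRIEVAL: Playbook("search-oncall", "point retrieval at the previous index"),
    FailureKind.TOOL: Playbook("integrations-oncall", "disable the failing tool behind its flag"),
    FailureKind.POLICY: Playbook("trust-and-safety", "route affected requests to human review"),
    FailureKind.COST: Playbook("platform-oncall", "switch the route to the smaller model"),
    FailureKind.LATENCY: Playbook("platform-oncall", "fail over to the backup model route"),
}


def format_alert(kind: FailureKind, detail: str) -> str:
    """Build an alert line that names the failure type, its owner, and the rollback."""
    playbook = PLAYBOOKS[kind]
    return f"[{kind.value}] {detail} | owner: {playbook.owner} | rollback: {playbook.rollback}"


if __name__ == "__main__":
    print(format_alert(FailureKind.COST, "summarize route over token budget"))
```

An alert built this way tells the responder what kind of failure they are holding before they open a dashboard.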

What belongs in the runbook

Every production AI handover should include:

Quality signals: eval thresholds, sampled review paths, and drift indicators.
Failure modes: retrieval miss, hallucinated answer, unsafe output, malformed tool call, latency spike, cost spike, provider outage.
Escalation paths: when to wake engineering, product, legal, security, or a human reviewer.
Rollback instructions: prompt version, model route, retrieval index, feature flag, and data migration state.
Known limits: the cases the system is not allowed to handle.

The goal is not to predict every incident. The goal is to make the first response boring.
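
If the runbook is stored as structured data, the handover list above can be checked mechanically. This is a sketch under that assumption: the section keys mirror the five items above, while the dictionary layout and the function name missing_sections are our own illustration.

```python
REQUIRED_SECTIONS = (
    "quality_signals",
    "failure_modes",
    "escalation_paths",
    "rollback_instructions",
    "known_limits",
)


def missing_sections(runbook: dict) -> list:
    """Return the required sections that are absent or empty, in a fixed order."""
    return [name for name in REQUIRED_SECTIONS if not runbook.get(name)]


if __name__ == "__main__":
    # Hypothetical draft runbook with two sections still unwritten.
    draft = {
        "quality_signals": ["eval pass rate threshold", "weekly sampled review"],
        "failure_modes": ["retrieval miss", "malformed tool call"],
        "escalation_paths": ["page engineering on policy failure"],
        "rollback_instructions": [],
    }
    print(missing_sections(draft))
```

A check like this could run in CI so an incomplete runbook blocks the handover rather than surfacing during an incident.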

The first fifteen minutes

For most AI incidents, the first responder should not start by editing the prompt. They should answer the same five questions every time:

Is the failure in model output, retrieval, tool execution, policy, cost, or latency?
Did a prompt, model route, retrieval index, tool schema, or data source change recently?
Is there a feature flag or route-level rollback?
Is a human reviewer or domain owner needed before service resumes?
Should the failed examples enter the blocking set, monitoring set, or backlog?

The last question matters. Every incident is either noise or training data for how the system is operated. The runbook decides which.
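
The five questions can be captured as a small triage record, so an incident cannot be closed while any of them is still unanswered. A minimal sketch follows. The field names and the Triage class are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Triage:
    # One field per question; None means "not answered yet".
    failure_kind: Optional[str] = None          # model output, retrieval, tool, policy, cost, latency
    recent_change: Optional[str] = None         # what changed, or "none found"
    rollback_available: Optional[bool] = None   # feature flag or route-level rollback exists
    needs_human_review: Optional[bool] = None   # reviewer or domain owner required before resuming
    examples_destination: Optional[str] = None  # "blocking", "monitoring", or "backlog"

    def unanswered(self) -> list:
        """Return the names of questions that still have no answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


if __name__ == "__main__":
    triage = Triage(failure_kind="retrieval", rollback_available=True)
    print(triage.unanswered())
```

Recording "none found" for a question is different from leaving it blank. The first is an answer, the second is a gap in the response.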

The handover test

Before we leave a build, your team should be able to run the system live while we stand behind them. If they cannot diagnose a bad answer, rerun the eval, identify the owner, and roll back the risky path, the handover is not done.

Production AI is not a prompt. It is an operating surface.