zoff.tech

May 12, 2026

RAG does not start with embeddings. It starts with answerability.

Before you tune retrieval, prove the question can be answered from the source corpus, with a citation a human would accept.

Most RAG projects start in the wrong place. The first meeting is about embeddings, chunk size, vector databases, hybrid search, rerankers, and which model should synthesize the answer.

Those are implementation details. The first question is simpler and more uncomfortable: can the corpus actually answer the questions the business wants to ask?

The answerability test

Before we build retrieval, we write an answerability set. It has three columns:

  • The real user question.
  • The exact source passage, record, policy, ticket, contract, or transcript that supports the answer.
  • The expected answer, including when the correct answer is "not enough evidence" or "ask a human."

If the team cannot fill the second column, there is no RAG system yet. There is a knowledge-management problem wearing an AI budget.
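The three columns can be sketched as a small data structure with a check for rows whose second column is empty. This is a minimal sketch, not a prescribed format: the names `AnswerabilityRow`, `unsupported_rows`, and `NO_SOURCE_EXPECTED`, the sample rows, and the choice to allow an empty source column on refusal rows are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnswerabilityRow:
    question: str                  # column 1: the real user question
    source_passage: Optional[str]  # column 2: exact supporting passage, or None
    expected_answer: str           # column 3: may be a refusal or an escalation


# Assumption: these two labels mark rows where an empty source column is the
# expected outcome rather than a gap in the corpus.
NO_SOURCE_EXPECTED = {"not enough evidence", "ask a human"}


def unsupported_rows(rows: List[AnswerabilityRow]) -> List[AnswerabilityRow]:
    """Return rows that expect a real answer but have no source passage."""
    gaps = []
    for row in rows:
        has_source = bool(row.source_passage and row.source_passage.strip())
        expects_answer = row.expected_answer.strip().lower() not in NO_SOURCE_EXPECTED
        if expects_answer and not has_source:
            gaps.append(row)
    return gaps


# Hypothetical rows, for illustration only.
rows = [
    AnswerabilityRow(
        "Can this customer receive a refund?",
        "Refunds are available within 30 days of purchase.",
        "Yes, the purchase is inside the refund window.",
    ),
    AnswerabilityRow(
        "What is the discount for enterprise renewals?",
        None,  # nobody could point to a source
        "15 percent",
    ),
    AnswerabilityRow(
        "Will the pricing change next year?",
        None,
        "not enough evidence",
    ),
]

gaps = unsupported_rows(rows)
for row in gaps:
    print("No source for:", row.question)
```

Every row this check returns is a question the corpus cannot yet answer with a citation, and that list is worth reviewing before any retrieval work begins.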

Why similarity is not enough

Semantic similarity finds nearby text. It does not prove that the nearby text answers the question.

That distinction matters in production. A support agent asking "can this customer receive a refund?" does not need the most similar refund-policy paragraph. They need the policy clause, customer state, purchase date, exception rule, and escalation threshold that make the answer defensible.

A retrieval system that returns plausible context but cannot prove answerability is worse than a search box. It creates confidence without accountability.
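One way to make the refund example concrete is to treat the five pieces of evidence as a checklist that retrieved context must satisfy before an answer is allowed. The sketch below is hypothetical: the key names, the `missing_evidence` and `is_defensible` helpers, and the dictionary shape of the retrieved context are assumptions, not part of any particular retrieval library.

```python
from typing import Dict, Optional, Set

# The evidence a defensible refund answer needs, taken from the example above.
REQUIRED_EVIDENCE = (
    "policy_clause",
    "customer_state",
    "purchase_date",
    "exception_rule",
    "escalation_threshold",
)


def missing_evidence(retrieved: Dict[str, Optional[str]]) -> Set[str]:
    """Return the required evidence kinds that retrieval did not supply."""
    return {kind for kind in REQUIRED_EVIDENCE if not retrieved.get(kind)}


def is_defensible(retrieved: Dict[str, Optional[str]]) -> bool:
    """An answer is defensible only when no required evidence is missing."""
    return not missing_evidence(retrieved)


# Hypothetical retrieval result: a similar-looking policy paragraph and nothing else.
similar_only = {"policy_clause": "Refunds are available within 30 days of purchase."}

print(is_defensible(similar_only))
print(sorted(missing_evidence(similar_only)))
```

A similarity hit on the policy paragraph fills one slot out of five, which is the gap between nearby text and an answer someone can defend.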

What the eval should score

For production RAG, we score at least five things separately:

  • Answerability: is the answer present in the corpus at all?
  • Source grounding: does the answer cite the specific source that supports it?
  • Refusal: does the system say it cannot answer when the corpus is insufficient?
  • Freshness: does the system prefer the current source over stale duplicates?
  • Actionability: does the output map to a next step the user is actually able to take?

Only after those five checks are written do we tune chunking, retrieval strategy, reranking, or model choice.
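Scoring the five dimensions separately can be sketched as one pass/fail record per eval case, with a pass rate reported per dimension instead of a single blended number. This is an illustrative sketch under assumptions: the `CaseScore` and `pass_rates` names are hypothetical, and how each boolean gets graded (by a human reviewer or an automated check) is left open.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class CaseScore:
    answerability: bool     # is the answer present in the corpus at all?
    source_grounding: bool  # does the answer cite the supporting source?
    refusal: bool           # does it decline when the corpus is insufficient?
    freshness: bool         # does it prefer the current source over stale copies?
    actionability: bool     # does the output fit what the user can do next?


def pass_rates(scores: List[CaseScore]) -> Dict[str, float]:
    """Return the pass rate for each dimension; no combined score on purpose."""
    if not scores:
        return {}
    return {
        f.name: sum(getattr(s, f.name) for s in scores) / len(scores)
        for f in fields(CaseScore)
    }


# Hypothetical graded cases, for illustration only.
scores = [
    CaseScore(True, True, True, False, True),
    CaseScore(True, False, False, False, True),
]

rates = pass_rates(scores)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

Keeping the dimensions apart means a freshness failure cannot hide behind a strong grounding score, which an averaged metric would allow.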

The failure mode we reject

"RAG over all our documents" is not a scope. It is usually a request to turn a messy corpus into an oracle without naming the questions, owners, or source-of-truth conflicts.

We will build RAG when the questions are concrete, the corpus is inspectable, and the eval can distinguish a grounded answer from a plausible one. If the answerability test fails, the honest recommendation is usually not "more AI." It is source cleanup, ownership, and a narrower workflow.

That is still progress. It saves the build budget from becoming a very expensive search demo.