RAG Failure Modes And Evaluation

Production RAG systems fail in more ways than model hallucination. Retrieval errors, weak metadata policy, poor context assembly, and evaluation drift are often the real sources of bad answers.

Failure modes

Failure | Signal
Bad chunking | Relevant documents exist but retrieval misses useful spans.
Weak metadata | Retrieved context cannot be filtered by tenant, source, or freshness.
Authorization gap | Retrieval returns documents the user should not see.
Context bloat | Prompt grows until latency and cost become unstable.
Reranker regression | Better vector recall still produces worse final context.
Stale index | Answers miss recently changed source material.
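
The weak-metadata, authorization-gap, and stale-index rows all depend on metadata travelling with each chunk. The following is a minimal sketch of a post-retrieval filter, assuming each chunk carries a tenant, a set of allowed groups, and a last-updated timestamp. The `Chunk` type, its field names, and `filter_chunks` are hypothetical and not taken from any particular vector store; where a store supports filtering at query time, that is usually the safer place to enforce the same rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Chunk:
    """Hypothetical retrieved chunk with the metadata needed for filtering."""
    doc_id: str
    tenant: str
    allowed_groups: frozenset
    updated_at: datetime


def filter_chunks(chunks, tenant, user_groups, max_age, now):
    """Keep only chunks the caller may see and that are fresh enough."""
    kept = []
    for chunk in chunks:
        if chunk.tenant != tenant:
            continue  # wrong tenant: would be a cross-tenant leak
        if not (chunk.allowed_groups & user_groups):
            continue  # no shared group: would be an authorization gap
        if now - chunk.updated_at > max_age:
            continue  # older than the freshness window: stale content
        kept.append(chunk)
    return kept


# Illustrative data only; identifiers and dates are made up.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
chunks = [
    Chunk("a", "acme", frozenset({"eng"}), now - timedelta(days=1)),
    Chunk("b", "other", frozenset({"eng"}), now - timedelta(days=1)),
    Chunk("c", "acme", frozenset({"finance"}), now - timedelta(days=1)),
    Chunk("d", "acme", frozenset({"eng"}), now - timedelta(days=90)),
]
kept = filter_chunks(chunks, "acme", frozenset({"eng"}), timedelta(days=30), now)
```

In this sample only chunk "a" survives: "b" belongs to another tenant, "c" is restricted to a group the user is not in, and "d" falls outside the 30-day freshness window.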

Evaluation layers

  • Retrieval recall: did top-k include the expected source?
  • Groundedness: is the answer supported by retrieved evidence?
  • Citation quality: can users inspect the evidence path?
  • Policy compliance: did retrieval obey tenant and access constraints?
  • Latency: did retrieval, rerank, prompt assembly, and generation stay within SLO?
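
Two of the layers above, retrieval recall and policy compliance, can be scored directly from a curated evaluation set. The sketch below assumes each case is a dictionary with a query, one expected source ID, the ranked retrieved IDs, and an optional set of forbidden IDs. That case layout and the function names `hit_rate_at_k` and `policy_violations` are assumptions for illustration, not an established schema.

```python
def hit_rate_at_k(cases, k):
    """Share of evaluation cases whose top-k results include the expected source."""
    if not cases:
        raise ValueError("evaluation set is empty")
    hits = sum(1 for case in cases if case["expected"] in case["retrieved"][:k])
    return hits / len(cases)


def policy_violations(cases):
    """Return (query, doc_id) pairs where retrieval returned a forbidden document."""
    return [
        (case["query"], doc_id)
        for case in cases
        for doc_id in case["retrieved"]
        if doc_id in case.get("forbidden", set())
    ]


# Illustrative evaluation cases; queries and IDs are made up.
cases = [
    {"query": "reset password", "expected": "kb-12",
     "retrieved": ["kb-12", "kb-40", "kb-7"]},
    {"query": "refund policy", "expected": "kb-3",
     "retrieved": ["kb-9", "kb-3", "kb-21"]},
    {"query": "payroll dates", "expected": "hr-5",
     "retrieved": ["kb-1", "hr-9", "hr-5"], "forbidden": {"hr-9"}},
    {"query": "vpn setup", "expected": "it-2",
     "retrieved": ["kb-8", "kb-4", "kb-6"]},
]

print(hit_rate_at_k(cases, 2))   # 0.5
print(hit_rate_at_k(cases, 3))   # 0.75
print(policy_violations(cases))  # [('payroll dates', 'hr-9')]
```

Groundedness and citation quality usually need a judge model or human review, so they are left out of this sketch.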

Kubernetes operating model

  • Run ingestion as versioned Jobs or workflows.
  • Emit traces for retrieve, rerank, prompt assembly, and generation.
  • Keep vector database backup and restore separate from model serving.
  • Deploy evaluation jobs on every chunking, embedding, reranker, or prompt change.
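
Per-stage traces for retrieve, rerank, prompt assembly, and generation can be approximated with a small timing helper. The sketch below uses only the Python standard library. `StageTrace` is a hypothetical stand-in for a real tracing library, and a production deployment would more likely export spans to its existing tracing backend.

```python
import time
from contextlib import contextmanager


class StageTrace:
    """Hypothetical in-memory recorder of (stage name, duration in seconds)."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the span even if the stage raised, so failures stay visible.
            self.spans.append((name, time.perf_counter() - start))

    def total_seconds(self):
        return sum(duration for _, duration in self.spans)


trace = StageTrace()
for stage in ("retrieve", "rerank", "prompt_assembly", "generation"):
    with trace.span(stage):
        pass  # the real stage call would go here
```

Recording each stage separately makes it possible to tell whether a latency regression came from retrieval, reranking, prompt assembly, or generation.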

Release gate

Do not release a RAG change unless it passes the curated evaluation sets, the unauthorized-document tests, and the latency thresholds, and a rollback to the previous index version is ready.
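
The gate can be expressed as a single check that lists every failed condition. This is a minimal sketch: the `GateInputs` fields and the default thresholds (a 0.9 hit rate and a 2000 ms p95 latency) are hypothetical placeholders, and each team would set them from its own SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:
    """Hypothetical summary of one release candidate's evaluation run."""
    eval_hit_rate: float
    unauthorized_docs_returned: int
    p95_latency_ms: float
    previous_index_restorable: bool


def release_gate(inputs, min_hit_rate=0.9, max_p95_latency_ms=2000.0):
    """Return the names of failed checks; an empty list means the gate passes."""
    failures = []
    if inputs.eval_hit_rate < min_hit_rate:
        failures.append("curated evaluation set")
    if inputs.unauthorized_docs_returned > 0:
        failures.append("unauthorized-document tests")
    if inputs.p95_latency_ms > max_p95_latency_ms:
        failures.append("latency threshold")
    if not inputs.previous_index_restorable:
        failures.append("rollback readiness")
    return failures


good = GateInputs(0.95, 0, 1200.0, True)
bad = GateInputs(0.95, 1, 2500.0, True)
print(release_gate(good))  # []
print(release_gate(bad))   # ['unauthorized-document tests', 'latency threshold']
```

Returning all failures, rather than stopping at the first, gives the release owner the full picture in one run.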