RAG On Kubernetes

Retrieval-augmented generation is not just "LLM plus vector database." A production RAG system has two independent lifecycles: offline ingestion, where the main concern is quality, and online serving, where the main concern is latency. Kubernetes is useful when those lifecycles need separate scaling, rollout, observability, and rollback controls.

[Figure: RAG platform on Kubernetes architecture]

System split

| Plane | Responsibilities |
| --- | --- |
| Ingestion | Load sources, normalize, chunk, embed, attach metadata, version indexes. |
| Retrieval | Query rewrite, vector search, hybrid search, reranking, access filtering. |
| Generation | Prompt assembly, model call, streaming, guardrails, citation shaping. |
| Evaluation | Groundedness, answer quality, retrieval recall, latency, human review. |

Kubernetes mapping

  • Use Jobs or workflow engines for ingestion batches.
  • Use separate Deployments for retrieval, reranking, and generation when they scale differently.
  • Keep vector database availability, backup, and restore strategy explicit.
  • Keep tenant authorization in retrieval, not only at the final API layer.
  • Emit trace spans for retrieval, rerank, prompt assembly, and model generation.
  • Version source snapshots, chunking config, embedding model, and index generation together.
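
As a minimal sketch of the first and last points above, the snippet below derives a single index generation ID from the source snapshot, chunking config, and embedding model, then stamps it onto a `batch/v1` Job manifest built as a plain dict. The function names, the image reference, the environment variable names, and the sample values are all hypothetical and chosen only for illustration.

```python
import hashlib
import json


def index_generation_id(source_snapshot: str, chunking: dict, embedding_model: str) -> str:
    """Derive one short ID from everything that determines index contents."""
    payload = json.dumps(
        {"source": source_snapshot, "chunking": chunking, "embedding": embedding_model},
        sort_keys=True,  # stable key order keeps the hash deterministic
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def ingestion_job(source_snapshot: str, chunking: dict, embedding_model: str) -> dict:
    """Build a batch/v1 Job manifest as a plain dict."""
    gen = index_generation_id(source_snapshot, chunking, embedding_model)
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"rag-ingest-{gen}",
            "labels": {"app": "rag-ingest", "index-generation": gen},
        },
        "spec": {
            "backoffLimit": 2,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "ingest",
                            # Hypothetical image; replace with a real ingestion image.
                            "image": "registry.example.com/rag-ingest:1.0.0",
                            "env": [
                                {"name": "SOURCE_SNAPSHOT", "value": source_snapshot},
                                {"name": "CHUNKING_CONFIG",
                                 "value": json.dumps(chunking, sort_keys=True)},
                                {"name": "EMBEDDING_MODEL", "value": embedding_model},
                                {"name": "INDEX_GENERATION", "value": gen},
                            ],
                        }
                    ],
                }
            },
        },
    }


# Sample inputs are illustrative assumptions, not values from a real system.
job = ingestion_job("docs-2024-06-01", {"size": 512, "overlap": 64}, "example-embedder-v1")
print(json.dumps(job, indent=2))
```

Because the ID is a hash of all three inputs, changing any one of them produces a new Job name and a new index generation. The printed JSON could be saved to a file and applied with `kubectl apply -f`, which accepts JSON as well as YAML.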

Ingestion lifecycle

| Step | Platform requirement |
| --- | --- |
| Source loading | Track source version, owner, freshness, and access scope. |
| Normalization | Make document format conversion deterministic and observable. |
| Chunking | Version chunking rules and test retrieval recall after changes. |
| Embedding | Pin embedding model and batch size; monitor latency and cost. |
| Index publish | Use staged index rollout, validation, and rollback to the previous index. |
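
The index publish step can be pictured as a pointer swap guarded by validation. The sketch below is an in-memory illustration only: `IndexRegistry` and its methods are hypothetical names, and a real platform would keep this state in the vector database's alias mechanism or in a config store rather than in process memory.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class IndexRegistry:
    """Tracks which index generation serves traffic and which ones came before it."""
    live: Optional[str] = None
    history: List[str] = field(default_factory=list)

    def publish(self, candidate: str, validate: Callable[[str], bool]) -> bool:
        """Promote the candidate only if validation passes; otherwise keep the live index."""
        if not validate(candidate):
            return False
        if self.live is not None:
            self.history.append(self.live)
        self.live = candidate
        return True

    def rollback(self) -> Optional[str]:
        """Point traffic back at the previous generation, if one exists."""
        if not self.history:
            return None
        self.live = self.history.pop()
        return self.live


registry = IndexRegistry()
registry.publish("gen-a", validate=lambda gen: True)   # first generation goes live
registry.publish("gen-b", validate=lambda gen: False)  # fails validation, gen-a stays live
registry.publish("gen-c", validate=lambda gen: True)   # gen-c goes live, gen-a kept in history
registry.rollback()                                     # back to gen-a
```

The `validate` callback is where a recall check on a curated question set would plug in, so a candidate index never takes traffic without passing it.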

Serving lifecycle

| Step | Platform requirement |
| --- | --- |
| Query handling | Carry user identity, tenant, and authorization context into retrieval. |
| Retrieval | Track top-k, filters, score distribution, and empty-result rate. |
| Reranking | Measure latency and quality impact separately from vector search. |
| Prompt assembly | Track context token count, citation IDs, and policy decisions. |
| Generation | Route to a model serving stack such as vLLM, KServe, or Ray Serve. |
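
To illustrate the prompt assembly row, here is a rough sketch that packs retrieved chunks into a fixed token budget and returns the citation IDs and token count that the platform would record. The `Chunk` type and function names are hypothetical, and `count_tokens` is a crude whitespace split standing in for the serving model's real tokenizer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    citation_id: str
    text: str
    score: float


def count_tokens(text: str) -> int:
    """Crude whitespace count; an assumed stand-in for the model's tokenizer."""
    return len(text.split())


def assemble_context(chunks: List[Chunk], max_tokens: int) -> Tuple[str, List[str], int]:
    """Add chunks in descending score order, skipping any that exceed the budget."""
    parts: List[str] = []
    citations: List[str] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count_tokens(chunk.text)
        if used + cost > max_tokens:
            continue  # this chunk does not fit; a smaller, lower-scored one still might
        parts.append(f"[{chunk.citation_id}] {chunk.text}")
        citations.append(chunk.citation_id)
        used += cost  # counts chunk text only, not the citation marker
    return "\n".join(parts), citations, used


chunks = [
    Chunk("doc-1", "pods are scheduled onto nodes", 0.91),
    Chunk("doc-2", "a deployment manages replica sets and rolling updates", 0.84),
    Chunk("doc-3", "jobs run to completion", 0.40),
]
context, citation_ids, token_count = assemble_context(chunks, max_tokens=10)
```

Returning the citation IDs and token count alongside the context makes them easy to attach to a trace span or emit as metrics for every request.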

Authorization boundary

RAG authorization must happen before context enters the prompt. If unauthorized documents are retrieved and only filtered after generation, the system can leak sensitive data through summaries, citations, or prompt side effects.
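
A minimal sketch of that boundary, assuming each document carries a tenant and a set of allowed groups as metadata: unauthorized candidates are dropped inside retrieval, so prompt assembly never sees them. The `Doc` and `Caller` types and the group-based rule are illustrative assumptions, not a prescribed access model.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant: str
    allowed_groups: FrozenSet[str]
    text: str


@dataclass(frozen=True)
class Caller:
    tenant: str
    groups: FrozenSet[str]


def authorized(doc: Doc, caller: Caller) -> bool:
    """Visible only inside the caller's tenant and to at least one shared group."""
    return doc.tenant == caller.tenant and bool(doc.allowed_groups & caller.groups)


def retrieve_for_prompt(candidates: List[Doc], caller: Caller, top_k: int) -> List[Doc]:
    """Drop unauthorized candidates before anything can reach prompt assembly."""
    visible = [doc for doc in candidates if authorized(doc, caller)]
    return visible[:top_k]


candidates = [
    Doc("d1", "acme", frozenset({"eng"}), "internal runbook"),
    Doc("d2", "acme", frozenset({"finance"}), "payroll summary"),
    Doc("d3", "globex", frozenset({"eng"}), "another tenant's design doc"),
]
caller = Caller(tenant="acme", groups=frozenset({"eng"}))
context_docs = retrieve_for_prompt(candidates, caller, top_k=2)
```

In practice the same predicate would usually be pushed into the vector search as a metadata filter, so that unauthorized hits do not consume top-k slots. The post-filter shown here is the simpler form of the same rule.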

Evaluation loop

  • Maintain curated question sets with expected sources.
  • Measure retrieval recall before judging model answer quality.
  • Track groundedness, citation accuracy, and unauthorized-document tests.
  • Run evaluation jobs after chunking, embedding, reranker, prompt, or model changes.
  • Keep index rollback ready when quality regresses.
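
Retrieval recall on a curated question set can be computed with a few lines. The sketch below assumes each question maps to a set of expected source IDs and that the retriever returns a ranked list of source IDs; the function names and sample data are hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], expected: Set[str], k: int) -> float:
    """Fraction of expected sources that appear in the first k retrieved IDs."""
    if not expected:
        raise ValueError("expected sources must not be empty")
    hits = expected & set(retrieved[:k])
    return len(hits) / len(expected)


def mean_recall(results: Dict[str, List[str]], golden: Dict[str, Set[str]], k: int) -> float:
    """Average recall@k over every question in the curated set."""
    if not golden:
        raise ValueError("curated question set must not be empty")
    scores = [recall_at_k(results.get(q, []), sources, k) for q, sources in golden.items()]
    return sum(scores) / len(scores)


# Curated questions with expected sources (illustrative data).
golden = {"how do I roll back?": {"runbook-7", "faq-2"}, "what is the SLA?": {"policy-1"}}
# Ranked source IDs returned by the retriever under test.
results = {
    "how do I roll back?": ["runbook-7", "blog-9", "faq-2"],
    "what is the SLA?": ["blog-3", "wiki-4", "policy-1"],
}
print(mean_recall(results, golden, k=3))
```

A question missing from `results` scores zero rather than being skipped, so a retriever that silently returns nothing cannot inflate the average.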

Failure modes

  • Chunking strategy changes without re-evaluating answer quality.
  • Retrieval returns documents the user is not allowed to see.
  • The vector database is treated as a disposable cache, then becomes a critical data store with no backup.
  • Prompt context grows until cost and latency become unstable.
  • Ingestion succeeds but publishes stale metadata or incorrect tenant filters.
  • A reranker improves offline quality but breaks p95 latency for interactive routes.

Metrics

  • Retrieval recall on curated test sets.
  • Reranker latency and top-k distribution.
  • Prompt token count and output token count.
  • Groundedness review rate and citation quality.
  • End-to-end p95 and p99 latency by request class.
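
As a rough illustration of the last metric, the sketch below groups latency samples by request class and reports p95 and p99 using the nearest-rank method. A production system would normally take these from histogram metrics in its monitoring stack; the function names and request classes here are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_class(records: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """Group (request_class, latency_ms) records and compute p95 and p99 per class."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for request_class, latency_ms in records:
        grouped[request_class].append(latency_ms)
    return {
        cls: {"p95": percentile(values, 95), "p99": percentile(values, 99)}
        for cls, values in grouped.items()
    }


# Illustrative samples: an interactive route and a batch route.
records = [("interactive", float(ms)) for ms in range(1, 101)]
records += [("batch", 1200.0), ("batch", 3400.0)]
report = latency_by_class(records)
```

Splitting by request class keeps slow batch traffic from hiding a regression on interactive routes, which is where a reranker's added latency tends to matter most.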