RAG on Kubernetes

Retrieval-augmented generation is not just "LLM plus vector database." A production RAG system has two largely independent lifecycles: offline ingestion, which determines index quality, and online serving, which determines request latency. Kubernetes is useful when those lifecycles need separate scaling, rollout, observability, and rollback controls.

Figure: RAG platform on Kubernetes architecture.

System split

Plane	Responsibilities
Ingestion	Load sources, normalize, chunk, embed, attach metadata, version indexes.
Retrieval	Query rewrite, vector search, hybrid search, reranking, access filtering.
Generation	Prompt assembly, model call, streaming, guardrails, citation shaping.
Evaluation	Groundedness, answer quality, retrieval recall, latency, human review.

Kubernetes mapping

Use Jobs or workflow engines for ingestion batches.
Use separate Deployments for retrieval, reranking, and generation when they scale differently.
Keep vector database availability, backup, and restore strategy explicit.
Keep tenant authorization in retrieval, not only at the final API layer.
Emit trace spans for retrieval, rerank, prompt assembly, and model generation.
Version source snapshots, chunking config, embedding model, and index generation together.
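
One way to version these inputs together is to record them in a single manifest and derive the index version from its contents. The sketch below is a minimal illustration of that idea; the `IndexManifest` class, its field names, and the example values are hypothetical and not part of any particular tool.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IndexManifest:
    """Everything that determines the contents of one index generation."""
    source_snapshot: str   # hypothetical ID of the source corpus snapshot
    chunking_config: str   # hypothetical version label of the chunking rules
    embedding_model: str   # hypothetical name of the pinned embedding model
    index_generation: int  # counter that increases with each published index

    def version_id(self) -> str:
        # Serialize with sorted keys so identical inputs always hash the same.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


manifest = IndexManifest(
    source_snapshot="snap-2024-06-01",
    chunking_config="chunk-v3",
    embedding_model="embed-model-a",
    index_generation=7,
)
changed = IndexManifest(
    source_snapshot="snap-2024-06-01",
    chunking_config="chunk-v4",  # only the chunking rules changed
    embedding_model="embed-model-a",
    index_generation=7,
)

print(manifest.version_id())
print(manifest.version_id() != changed.version_id())
```

Because the version is derived from all four inputs, changing any one of them yields a different identifier, which makes it harder to publish an index whose chunking rules or embedding model silently differ from what the label claims.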

Ingestion lifecycle

Step	Platform requirement
Source loading	Track source version, owner, freshness, and access scope.
Normalization	Make document format conversion deterministic and observable.
Chunking	Version chunking rules and test retrieval recall after changes.
Embedding	Pin embedding model and batch size; monitor latency and cost.
Index publish	Use staged index rollout, validation, and rollback to the previous index.
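
The index publish step can be sketched as a pointer that moves only after validation and remembers the previous target. This is a simplified in-memory illustration; `IndexRegistry`, `recall_gate`, the index names, and the recall numbers are all made up for the example, and a real system would keep this state in the vector database or a config store.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class IndexRegistry:
    """Tracks which index generation serves traffic (hypothetical helper)."""
    live: Optional[str] = None
    history: List[str] = field(default_factory=list)

    def publish(self, candidate: str, validate: Callable[[str], bool]) -> bool:
        # Validate the staged index before it receives any traffic.
        if not validate(candidate):
            return False
        if self.live is not None:
            self.history.append(self.live)
        self.live = candidate
        return True

    def rollback(self) -> Optional[str]:
        # Point traffic back at the most recent previous index, if any.
        if self.history:
            self.live = self.history.pop()
        return self.live


def recall_gate(index_name: str) -> bool:
    # Stand-in for a real validation job; these recall values are invented.
    measured_recall = {"idx-v1": 0.91, "idx-v2": 0.93, "idx-v3": 0.62}
    return measured_recall.get(index_name, 0.0) >= 0.85


registry = IndexRegistry()
registry.publish("idx-v1", recall_gate)
registry.publish("idx-v2", recall_gate)
rejected = registry.publish("idx-v3", recall_gate)  # fails the recall gate

print(registry.live)  # idx-v2
```

The design choice worth noting is that a failed validation leaves the live pointer untouched, so a bad ingestion run cannot reach users, and rollback is a pointer move and not a rebuild.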

Serving lifecycle

Step	Platform requirement
Query handling	Carry user identity, tenant, and authorization context into retrieval.
Retrieval	Track top-k, filters, score distribution, and empty-result rate.
Reranking	Measure latency and quality impact separately from vector search.
Prompt assembly	Track context token count, citation IDs, and policy decisions.
Generation	Route to a model serving stack such as vLLM, KServe, or Ray Serve.
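
For the prompt assembly row, a minimal sketch of budgeted context packing might look like the following. It assumes a whitespace word count as a stand-in for a real tokenizer, and the `Chunk` type, `assemble_context` function, and sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    citation_id: str
    text: str
    score: float


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def assemble_context(chunks: List[Chunk], max_tokens: int) -> Tuple[str, List[str], int]:
    """Pack the highest-scoring chunks that fit the budget.

    Returns the context string, the citation IDs used, and the token count
    of the chunk texts (the bracketed citation prefix is not counted).
    """
    parts: List[str] = []
    citations: List[str] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count_tokens(chunk.text)
        if used + cost > max_tokens:
            continue  # skip chunks that would exceed the budget
        parts.append(f"[{chunk.citation_id}] {chunk.text}")
        citations.append(chunk.citation_id)
        used += cost
    return "\n".join(parts), citations, used


chunks = [
    Chunk("doc-1", "Pods are the smallest deployable units", 0.9),
    Chunk("doc-2", "Jobs run batch workloads to completion in the cluster", 0.8),
    Chunk("doc-3", "Deployments manage replicas", 0.5),
]
context, citations, used = assemble_context(chunks, max_tokens=10)
print(citations, used)  # ['doc-1', 'doc-3'] 9
```

Returning the citation IDs and token count alongside the context makes them easy to attach to a trace span or metric, which is what the table asks the platform to track.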

Authorization boundary

RAG authorization must happen before context enters the prompt. If unauthorized documents are retrieved and only filtered after generation, the system can leak sensitive data through summaries, citations, or prompt side effects.
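
As a minimal sketch of this boundary, the function below filters candidates by tenant and group membership before anything is handed to prompt assembly. The `Document` and `Caller` types and the group-overlap rule are assumptions made for illustration; in practice the same predicate would usually be pushed down into the vector search as a metadata filter so that unauthorized documents are never returned at all.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Document:
    doc_id: str
    tenant: str
    allowed_groups: FrozenSet[str]
    text: str


@dataclass(frozen=True)
class Caller:
    user_id: str
    tenant: str
    groups: FrozenSet[str]


def authorized(doc: Document, caller: Caller) -> bool:
    # Assumed rule: same tenant and at least one group in common.
    return doc.tenant == caller.tenant and bool(doc.allowed_groups & caller.groups)


def retrieve_for_prompt(candidates: List[Document], caller: Caller, top_k: int) -> List[Document]:
    # Filter first, then truncate, so unauthorized hits never occupy top-k
    # slots and never reach prompt assembly.
    permitted = [doc for doc in candidates if authorized(doc, caller)]
    return permitted[:top_k]


candidates = [
    Document("d1", "acme", frozenset({"finance"}), "Q3 forecast"),
    Document("d2", "acme", frozenset({"eng"}), "Cluster runbook"),
    Document("d3", "globex", frozenset({"eng"}), "Other tenant notes"),
    Document("d4", "acme", frozenset({"eng", "sre"}), "Incident review"),
]
caller = Caller("u-42", "acme", frozenset({"eng"}))

result = retrieve_for_prompt(candidates, caller, top_k=2)
print([doc.doc_id for doc in result])  # ['d2', 'd4']
```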

Evaluation loop

Maintain curated question sets with expected sources.
Measure retrieval recall before judging model answer quality.
Track groundedness, citation accuracy, and unauthorized-document tests.
Run evaluation jobs after chunking, embedding, reranker, prompt, or model changes.
Keep index rollback ready when quality regresses.
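
Retrieval recall on a curated question set can be computed without involving the model at all. The following sketch assumes each question lists the source IDs it is expected to retrieve; the function names and the sample data are hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], expected: Set[str], k: int) -> float:
    """Fraction of expected sources that appear in the top-k retrieved IDs."""
    if not expected:
        raise ValueError("each curated question needs at least one expected source")
    hits = expected & set(retrieved[:k])
    return len(hits) / len(expected)


def mean_recall(
    expected_by_question: Dict[str, Set[str]],
    retrieved_by_question: Dict[str, List[str]],
    k: int,
) -> float:
    # A question with no retrieval result counts as zero recall.
    scores = [
        recall_at_k(retrieved_by_question.get(question, []), expected, k)
        for question, expected in expected_by_question.items()
    ]
    return sum(scores) / len(scores)


expected_by_question = {
    "How do I roll back an index?": {"runbook-7", "runbook-9"},
    "Who owns the billing source?": {"owners-2"},
}
retrieved_by_question = {
    "How do I roll back an index?": ["runbook-7", "faq-1", "runbook-9", "faq-4"],
    "Who owns the billing source?": ["faq-1", "faq-2", "faq-3", "owners-2"],
}

print(mean_recall(expected_by_question, retrieved_by_question, k=3))  # 0.5
```

Running a check like this as a batch job after each chunking, embedding, or reranker change gives a retrieval-only signal, so a drop in answer quality can be attributed to retrieval or to generation.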

Failure modes

The chunking strategy changes without retrieval recall and answer quality being re-evaluated.
Retrieval returns documents the user is not allowed to see.
The vector DB is treated as a cache but becomes a critical database with no backup or restore plan.
Prompt context grows until cost and latency become unstable.
Ingestion succeeds but publishes stale metadata or incorrect tenant filters.
A reranker improves offline quality but breaks p95 latency for interactive routes.

Metrics

Retrieval recall on curated test sets.
Reranker latency and top-k distribution.
Prompt token count and output token count.
Groundedness review rate and citation quality.
End-to-end p95 and p99 latency by request class.
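
The last metric can be sketched as a per-class percentile over recorded latencies. In production this usually comes from histogram metrics in the monitoring stack; the code below is only an offline illustration using the nearest-rank method, and the request class names and sample values are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def nearest_rank(values: List[float], percentile: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of percentile * n / 100, clamped to at least rank 1.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_by_class(samples: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    # Group latencies (milliseconds) by request class, then summarize each group.
    grouped: Dict[str, List[float]] = defaultdict(list)
    for request_class, latency_ms in samples:
        grouped[request_class].append(latency_ms)
    return {
        request_class: {
            "p95": nearest_rank(latencies, 95),
            "p99": nearest_rank(latencies, 99),
        }
        for request_class, latencies in grouped.items()
    }


samples = [("interactive", float(ms)) for ms in range(1, 101)]
samples += [("batch", 200.0), ("batch", 400.0)]

report = latency_by_class(samples)
print(report["interactive"])  # {'p95': 95.0, 'p99': 99.0}
```

Splitting by request class keeps slow batch or evaluation traffic from hiding a regression on interactive routes.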
