Retrieval-augmented generation is not just "LLM plus vector database." A production RAG system has two independent lifecycles: offline ingestion, which determines index quality, and online serving, which determines latency. Kubernetes is useful when those lifecycles need separate scaling, rollout, observability, and rollback controls.

System split

| Plane | Responsibilities |
|---|---|
| Ingestion | Load sources, normalize, chunk, embed, attach metadata, version indexes. |
| Retrieval | Query rewrite, vector search, hybrid search, reranking, access filtering. |
| Generation | Prompt assembly, model call, streaming, guardrails, citation shaping. |
| Evaluation | Groundedness, answer quality, retrieval recall, latency, human review. |

Kubernetes mapping

- Use Jobs or workflow engines for ingestion batches.
- Use separate Deployments for retrieval, reranking, and generation when they scale differently.
- Keep vector database availability, backup, and restore strategy explicit.
- Keep tenant authorization in retrieval, not only at the final API layer.
- Emit trace spans for retrieval, rerank, prompt assembly, and model generation.
- Version source snapshots, chunking config, embedding model, and index generation together.
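
As a minimal sketch of versioning these inputs together, the snippet below derives a single generation ID from the source snapshot, chunking config, and embedding model. The `IndexManifest` class, its field names, and the sample values are illustrative assumptions, not part of any specific tool.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IndexManifest:
    """Everything that determines the contents of one index generation."""

    source_snapshot: str  # hypothetical snapshot ID of the source corpus
    chunk_size: int       # chunking config: tokens per chunk
    chunk_overlap: int    # chunking config: overlapping tokens between chunks
    embedding_model: str  # pinned embedding model name and version

    def generation_id(self) -> str:
        # Canonical JSON (sorted keys) so identical inputs always hash the same.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Two manifests that differ only in chunk size get different generation IDs.
manifest_a = IndexManifest("snap-2024-06-01", 512, 64, "example-embed-v1")
manifest_b = IndexManifest("snap-2024-06-01", 256, 64, "example-embed-v1")
print(manifest_a.generation_id() != manifest_b.generation_id())
```

Changing any one field yields a different generation ID, so a retrieval regression can be traced to a specific combination of inputs.
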

Ingestion lifecycle

| Step | Platform requirement |
|---|---|
| Source loading | Track source version, owner, freshness, and access scope. |
| Normalization | Make document format conversion deterministic and observable. |
| Chunking | Version chunking rules and test retrieval recall after changes. |
| Embedding | Pin embedding model and batch size; monitor latency and cost. |
| Index publish | Use staged index rollout, validation, and rollback to the previous index. |
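
The index publish step can be sketched as an alias that only moves to a candidate generation after validation passes and remembers the previous generation for rollback. `IndexAlias`, `publish`, and `rollback` are hypothetical names used for illustration; a real vector database would expose its own alias or collection-switch mechanism.

```python
from typing import Callable, List, Optional


class IndexAlias:
    """Tracks which index generation serves traffic (hypothetical helper)."""

    def __init__(self, live: Optional[str] = None) -> None:
        self.live = live
        self.history: List[str] = []  # previously live generations, oldest first

    def publish(self, candidate: str, validate: Callable[[str], bool]) -> bool:
        # Validate the staged candidate before it receives any traffic.
        if not validate(candidate):
            return False
        if self.live is not None:
            self.history.append(self.live)
        self.live = candidate
        return True

    def rollback(self) -> Optional[str]:
        # Repoint to the most recent previous generation, if there is one.
        if self.history:
            self.live = self.history.pop()
        return self.live


alias = IndexAlias("gen-1")
alias.publish("gen-2", validate=lambda gen: True)   # passes validation, goes live
alias.publish("gen-3", validate=lambda gen: False)  # fails validation, ignored
print(alias.live)
```

In practice the `validate` callable would run the curated evaluation set against the staged index before the alias moves.
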

Serving lifecycle

| Step | Platform requirement |
|---|---|
| Query handling | Carry user identity, tenant, and authorization context into retrieval. |
| Retrieval | Track top-k, filters, score distribution, and empty-result rate. |
| Reranking | Measure latency and quality impact separately from vector search. |
| Prompt assembly | Track context token count, citation IDs, and policy decisions. |
| Generation | Route to a model serving stack such as vLLM, KServe, or Ray Serve. |
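
For the prompt assembly step, a rough sketch of tracking context token count and citation IDs is to pack ranked chunks until a token budget is spent. The whitespace-based `count_tokens` is a stand-in assumption for the serving model's real tokenizer, and `pack_context` is a hypothetical helper.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for a real tokenizer.
    return len(text.split())


def pack_context(ranked_chunks: List[Tuple[str, str]], budget: int) -> Tuple[List[str], int]:
    """Take (citation_id, text) pairs in rank order until the budget is spent."""
    used = 0
    citation_ids: List[str] = []
    for citation_id, text in ranked_chunks:
        cost = count_tokens(text)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        citation_ids.append(citation_id)
        used += cost
    return citation_ids, used


ranked = [("c1", "a b c"), ("c2", "d e f g"), ("c3", "h")]
ids, tokens_used = pack_context(ranked, budget=7)
print(ids, tokens_used)
```

The returned citation IDs and token count are the values the table suggests recording for each request.
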

Authorization boundary

RAG authorization must happen before context enters the prompt. If unauthorized documents are retrieved and only filtered after generation, the system can leak sensitive data through summaries, citations, or prompt side effects.
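
A minimal sketch of that boundary, assuming each chunk carries a tenant and a set of allowed groups as metadata: the filter runs before any chunk text is placed in the prompt. `Chunk`, `Caller`, `authorize`, and `build_prompt` are illustrative names, and the same predicate would typically also be applied as a metadata filter inside the vector search itself.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    tenant: str
    allowed_groups: FrozenSet[str]
    text: str


@dataclass(frozen=True)
class Caller:
    tenant: str
    groups: FrozenSet[str]


def authorize(chunks: List[Chunk], caller: Caller) -> List[Chunk]:
    # Keep a chunk only if it belongs to the caller's tenant and shares
    # at least one group with the caller.
    return [
        c for c in chunks
        if c.tenant == caller.tenant and c.allowed_groups & caller.groups
    ]


def build_prompt(question: str, chunks: List[Chunk], caller: Caller) -> str:
    # Filtering happens here, before any text is placed in the prompt.
    visible = authorize(chunks, caller)
    context = "\n".join(f"[{c.doc_id}] {c.text}" for c in visible)
    return f"Context:\n{context}\n\nQuestion: {question}"


retrieved = [
    Chunk("doc-1", "acme", frozenset({"eng"}), "Public roadmap"),
    Chunk("doc-2", "globex", frozenset({"eng"}), "Secret pricing"),
    Chunk("doc-3", "acme", frozenset({"finance"}), "Payroll summary"),
]
caller = Caller("acme", frozenset({"eng"}))
prompt = build_prompt("What is planned?", retrieved, caller)
print(prompt)
```

Because the model never sees the filtered chunks, it cannot summarize or cite them.
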

Evaluation loop

- Maintain curated question sets with expected sources.
- Measure retrieval recall before judging model answer quality.
- Track groundedness, citation accuracy, and unauthorized-document tests.
- Run evaluation jobs after chunking, embedding, reranker, prompt, or model changes.
- Keep index rollback ready when quality regresses.
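
Retrieval recall over a curated question set can be computed with a few lines. The sketch below assumes each question maps to a set of expected source IDs and that the retriever returns ranked source IDs; the question and document names are made up for illustration.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], expected: Set[str], k: int) -> float:
    # Fraction of expected source IDs that appear in the top-k results.
    if not expected:
        return 1.0
    hits = expected & set(retrieved[:k])
    return len(hits) / len(expected)


def mean_recall(results: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int) -> float:
    # Average recall@k over every question in the curated set.
    scores = [recall_at_k(results.get(q, []), sources, k) for q, sources in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0


gold = {"q1": {"doc-a", "doc-b"}, "q2": {"doc-c"}}
results = {"q1": ["doc-a", "doc-x", "doc-b"], "q2": ["doc-y", "doc-c"]}
print(mean_recall(results, gold, k=2))  # q1 scores 0.5, q2 scores 1.0 -> 0.75
```

Running this before judging generated answers separates retrieval regressions from model or prompt regressions.
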

Failure modes

- The chunking strategy changes without re-running retrieval recall and answer-quality evaluation.
- Retrieval returns documents the user is not allowed to see.
- The vector database is treated as a cache, then becomes a critical data store with no backup or restore plan.
- Prompt context grows until cost and latency become unstable.
- Ingestion succeeds but publishes stale metadata or incorrect tenant filters.
- A reranker improves offline quality but breaks p95 latency for interactive routes.

Metrics

- Retrieval recall on curated test sets.
- Reranker latency and top-k distribution.
- Prompt token count and output token count.
- Groundedness review rate and citation quality.
- End-to-end p95 and p99 latency by request class.
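
As a sketch of the last metric, the snippet below groups latency samples by request class and reports p95 and p99 with the nearest-rank method. The request class names and the `(class, latency_ms)` sample shape are assumptions; a production setup would more likely derive these from histogram metrics in its monitoring stack.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(values: List[float], p: int) -> float:
    # Nearest-rank percentile: the smallest sample with at least p% of
    # samples at or below it. Expects a non-empty list.
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_class(samples: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    # Group latencies by request class, then summarize each group.
    grouped: Dict[str, List[float]] = defaultdict(list)
    for request_class, latency_ms in samples:
        grouped[request_class].append(latency_ms)
    return {
        cls: {"p95": percentile(vals, 95), "p99": percentile(vals, 99)}
        for cls, vals in grouped.items()
    }


samples = [("interactive", float(ms)) for ms in range(1, 101)]
samples.append(("batch", 4200.0))
report = latency_by_class(samples)
print(report["interactive"])
```

Keeping classes separate prevents slow batch routes from hiding a regression on interactive routes.
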

Related pages