
RAG Retrieval Lab

This lab focuses on the platform mechanics behind RAG: ingestion, chunking, embedding, vector persistence, metadata filters, retrieval quality, prompt assembly, and answer evaluation.

Lab outcome

You should finish with a RAG path that can explain why a document was retrieved, which metadata policy was applied, and whether the generated answer used the retrieved context correctly.

Minimal architecture

Document source → Ingestion job → Chunk and normalize → Embedding worker → Vector database → RAG API → LLM runtime → Evaluation log

Step 1: define the namespaces

kubectl create namespace rag-system
kubectl create namespace llm-serving
kubectl label namespace rag-system data-class=knowledge

Step 2: create the ingestion contract

Start with a Kubernetes Job so ingestion is explicit, retryable, and reviewable.

apiVersion: batch/v1
kind: Job
metadata:
  name: docs-ingestion
  namespace: rag-system
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ingest
          image: ghcr.io/example/rag-ingest:0.1.0
          env:
            - name: SOURCE_PATH
              value: /data/docs
            - name: VECTOR_COLLECTION
              value: platform-guides
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              memory: 4Gi

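Treat this manifest as a platform review artifact, even if your real ingestion system is more advanced: it states the image, the source path, the target collection, the retry budget, and the resource envelope in one reviewable place.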
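To make the "chunk and normalize" stage of the contract concrete, here is a minimal Python sketch of what an ingest container could do with the two environment variables from the Job. It is an illustration under assumptions, not the behavior of the `rag-ingest` image: the `Chunk` type, the `normalize`, `chunk_document`, and `load_contract` helpers, and the word-window sizes are all hypothetical.

```python
import hashlib
import os
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str    # stable ID so retrieval logs can point back to a source
    source: str      # path or title of the originating document
    collection: str  # target vector collection
    text: str


def normalize(text: str) -> str:
    """Collapse runs of whitespace so identical content hashes identically."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_document(source: str, text: str, collection: str,
                   max_words: int = 200, overlap: int = 20) -> list[Chunk]:
    """Split normalized text into overlapping word windows (assumed strategy)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    cleaned = normalize(text)
    if not cleaned:
        return []
    words = cleaned.split(" ")
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        body = " ".join(words[start:start + max_words])
        # Hash source, offset, and content so re-running ingestion is idempotent.
        digest = hashlib.sha256(f"{source}:{start}:{body}".encode()).hexdigest()[:16]
        chunks.append(Chunk(digest, source, collection, body))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the document
    return chunks


def load_contract() -> tuple[str, str]:
    """Read the same env vars the Job manifest sets; defaults mirror the manifest."""
    return (os.environ.get("SOURCE_PATH", "/data/docs"),
            os.environ.get("VECTOR_COLLECTION", "platform-guides"))
```

Deterministic chunk IDs are the design choice that matters here: they let a later retrieval log say exactly which chunk of which document was used, and they make a retried Job overwrite rather than duplicate.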

Step 3: define retrieval quality cases

Create a small evaluation set before tuning chunk size or vector search parameters.

| Query | Expected source | Expected behavior |
| --- | --- | --- |
| How should GPU node pools be isolated? | GPU scheduling guide | Mentions taints, selectors, quotas, and capacity ownership. |
| When should Ray Serve be preferred? | KServe vs Ray Serve guide | Explains graph complexity and Python application ownership. |
| What signals matter for LLM latency? | Benchmarking guide | Separates TTFT, queue wait, tokens/sec, and cost/request. |
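The evaluation set above can be kept as data and scored automatically. The sketch below records each query with its expected source and computes how often that source appears in the top-k results. It only covers the "Expected source" column; checking the "Expected behavior" column needs answer-level review. The `Retriever` signature and the `fake_retrieve` stand-in are assumptions for illustration, not a real RAG API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RetrievalCase:
    query: str
    expected_source: str


CASES = [
    RetrievalCase("How should GPU node pools be isolated?", "GPU scheduling guide"),
    RetrievalCase("When should Ray Serve be preferred?", "KServe vs Ray Serve guide"),
    RetrievalCase("What signals matter for LLM latency?", "Benchmarking guide"),
]

# Assumed shape: a retriever takes (query, top_k) and returns source titles,
# best match first.
Retriever = Callable[[str, int], list[str]]


def hit_rate_at_k(cases: list[RetrievalCase], retrieve: Retriever, k: int = 5) -> float:
    """Fraction of cases whose expected source appears in the top-k results."""
    if not cases:
        return 0.0
    hits = sum(
        1 for case in cases
        if case.expected_source in retrieve(case.query, k)[:k]
    )
    return hits / len(cases)


def fake_retrieve(query: str, top_k: int) -> list[str]:
    """Stand-in retriever with canned results; replace with the real API call."""
    canned = {
        CASES[0].query: ["GPU scheduling guide", "Benchmarking guide"],
        CASES[1].query: ["Benchmarking guide", "KServe vs Ray Serve guide"],
        CASES[2].query: ["GPU scheduling guide"],
    }
    return canned.get(query, [])[:top_k]
```

Running `hit_rate_at_k(CASES, fake_retrieve, k=5)` before and after a chunk-size change gives a single number to compare, which is the point of writing the cases first.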

Step 4: validate metadata filters

Metadata filtering is a security and relevance boundary. Test it directly.

{
  "query": "How do I debug inference latency?",
  "filters": {
    "audience": "platform-engineering",
    "classification": "public",
    "product_area": "llm-serving"
  },
  "top_k": 5
}

Healthy result:

  • Retrieved chunks match the requested product area.
  • Restricted documents do not appear.
  • The answer cites or logs the source chunk IDs.
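The healthy-result checks can be automated by comparing the metadata on each retrieved chunk against the filters that were requested. This is a sketch under assumptions: the response shape (a list of chunks, each with a `chunk_id` and a `metadata` dict) and the `filter_violations` helper are hypothetical, since the lab does not define the RAG API's response format.

```python
REQUEST = {
    "query": "How do I debug inference latency?",
    "filters": {
        "audience": "platform-engineering",
        "classification": "public",
        "product_area": "llm-serving",
    },
    "top_k": 5,
}

# Hypothetical response: the second chunk simulates a restricted document
# leaking through the filter.
SAMPLE_CHUNKS = [
    {"chunk_id": "c-101", "metadata": {"audience": "platform-engineering",
                                       "classification": "public",
                                       "product_area": "llm-serving"}},
    {"chunk_id": "c-202", "metadata": {"audience": "platform-engineering",
                                       "classification": "restricted",
                                       "product_area": "llm-serving"}},
]


def filter_violations(filters: dict[str, str], chunks: list[dict]) -> list[str]:
    """Return one message per metadata field that disagrees with the filters."""
    problems = []
    for chunk in chunks:
        meta = chunk.get("metadata", {})
        chunk_id = chunk.get("chunk_id", "<no id>")
        for key, wanted in filters.items():
            if meta.get(key) != wanted:
                problems.append(f"{chunk_id}: {key}={meta.get(key)!r}, expected {wanted!r}")
    return problems


violations = filter_violations(REQUEST["filters"], SAMPLE_CHUNKS)
```

An empty `violations` list corresponds to the healthy result. A missing metadata key counts as a violation on purpose, so an unlabeled document cannot slip past the check.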

Step 5: run failure drills

| Drill | What to observe |
| --- | --- |
| Remove one important source document | Does the answer become uncertain, or does it invent details? |
| Change metadata on a restricted document | Does policy block retrieval? |
| Increase top-k too far | Does answer quality drop because irrelevant chunks enter the prompt? |
| Kill the vector database pod | Does the API fail closed with a clear user-facing response? |
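Two of these drills describe behavior the API layer can enforce itself: failing closed when the vector database is down, and declining to answer when nothing relevant is retrieved. The sketch below shows one possible shape for that handler. The `VectorStoreUnavailable` exception, the `answer_query` function, the response fields, and the status codes are all assumptions for illustration.

```python
from typing import Callable


class VectorStoreUnavailable(Exception):
    """Raised by the (hypothetical) search client when the database is unreachable."""


def answer_query(query: str,
                 search: Callable[[str], list[dict]],
                 generate: Callable[[str, list[dict]], str]) -> dict:
    """Answer from retrieved context only; never fall back to the bare model."""
    try:
        chunks = search(query)
    except VectorStoreUnavailable:
        # Fail closed: no retrieval means no generation, plus a clear message.
        return {
            "status": 503,
            "answer": None,
            "error": "Knowledge search is temporarily unavailable. Please try again later.",
            "sources": [],
        }
    if not chunks:
        # Missing source material should surface as uncertainty, not invention.
        return {
            "status": 200,
            "answer": "I could not find this in the indexed documents.",
            "error": None,
            "sources": [],
        }
    return {
        "status": 200,
        "answer": generate(query, chunks),
        "error": None,
        "sources": [chunk["chunk_id"] for chunk in chunks],
    }
```

During the drill, the thing to confirm is that the outage path never calls `generate`; an answer produced without context would look healthy to the user while bypassing every retrieval policy.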

Validation signals

| Signal | Healthy result |
| --- | --- |
| Ingestion lag | The platform knows when content is stale. |
| Retrieval precision | Expected source appears in the top results for known queries. |
| Context use | The answer reflects retrieved context rather than general model memory. |
| Policy enforcement | Metadata filters prevent unauthorized retrieval. |
| Evaluation history | Changes to chunking or embeddings can be compared over time. |
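Two of these signals reduce to small calculations. The sketch below shows one way to express ingestion lag and to compare two evaluation runs. The function names, the metric names, and the 0.02 tolerance are hypothetical choices, not values prescribed by this lab.

```python
from datetime import datetime, timedelta, timezone


def ingestion_lag(source_modified_at: datetime,
                  last_ingested_at: datetime,
                  now: datetime) -> timedelta:
    """How long the newest source change has been waiting to be indexed."""
    if last_ingested_at >= source_modified_at:
        return timedelta(0)  # the index already reflects the latest change
    return now - source_modified_at


def regressions(previous: dict[str, float],
                current: dict[str, float],
                tolerance: float = 0.02) -> dict[str, float]:
    """Metrics that dropped by more than `tolerance` between two evaluation runs."""
    return {
        name: current[name] - previous[name]
        for name in previous
        if name in current and previous[name] - current[name] > tolerance
    }


# Example inputs with made-up timestamps and scores, for illustration only.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
lag = ingestion_lag(
    source_modified_at=datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc),
    last_ingested_at=datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
    now=now,
)
dropped = regressions(
    previous={"hit_rate_at_5": 0.9, "context_use": 0.80},
    current={"hit_rate_at_5": 0.7, "context_use": 0.81},
)
```

Storing one such metrics dictionary per chunking or embedding change is enough to build the evaluation history described above.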