14 docs tagged with "llm"

GPU Capacity Incident

Field note for Kubernetes GPU node pool incidents where accelerators are expensive, underutilized, or unavailable for LLM inference pods.

GPU Node Pool Kubernetes

Senior guide to GPU node pool design, scheduling, taints, labels, autoscaling, and capacity safety for LLM workloads on Kubernetes.

GPU Node Pool Scheduling for LLM Inference

Debug GPU node pool scheduling for LLM inference with labels, taints, tolerations, GPU requests, quotas, autoscaling buffers, and cost signals.

Inference Benchmarking and Cost Model

Benchmark LLM inference on Kubernetes using latency phases, throughput, GPU pressure, and cost per request.

KServe vs Ray Serve

Compare KServe and Ray Serve for LLM serving on Kubernetes by ownership model, CRDs, serving graph complexity, autoscaling, rollout behavior, and team fit.

Kubernetes LLM Labs

Hands-on Kubernetes LLM labs for vLLM inference, RAG retrieval, observability, and production readiness.

Learning Map

Senior learning map for Kubernetes, platform services, and LLM workloads on Kubernetes.

LLM Inference Stack

Reference architecture for LLM inference on Kubernetes.

LLM Observability Challenge

Challenge-style observability lab for Kubernetes LLM workloads covering latency, queueing, GPU saturation, traces, logs, and alerts.

LLM on Kubernetes

Senior guide to Kubernetes LLM infrastructure with GPU node pools, vLLM, KServe, Ray Serve, RAG, benchmarking, and cost controls.

Model Serving Options

Compare vLLM, KServe, Ray Serve, and Triton for Kubernetes LLM serving, with links to the deeper vLLM Kubernetes and KServe vs Ray Serve guides.

RAG Failure Modes and Evaluation

Failure modes and evaluation strategy for production RAG systems on Kubernetes.

RAG on Kubernetes

Production RAG on Kubernetes guide covering ingestion, retrieval, vector databases, serving, evaluation, authorization, observability, and failure modes.

RAG Platform

Reference architecture for retrieval-augmented generation (RAG) on Kubernetes.