Observability Stack

An observability stack must optimize for incident questions, not vendor completeness. Start with user impact, then wire metrics, logs, traces, and events into a shared context model.

Figure: Kubernetes observability pipeline architecture

Stack shape

Layer      | Common tools                    | Key decision
Metrics    | Prometheus, Mimir, Thanos       | Retention, cardinality, HA, remote write.
Logs       | Loki, Fluent Bit, Vector        | Label discipline and tenant isolation.
Traces     | OpenTelemetry, Tempo, Jaeger    | Sampling and context propagation.
Dashboards | Grafana                         | SLO-first layout and ownership.
Alerts     | Alertmanager, incident tooling  | Routing, deduplication, on-call policy.

Label contract

Every signal should carry enough context to join across systems:

cluster, namespace, workload, service, team, environment, version, tenant (where applicable)
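
One way to enforce a contract like this is a small check run in CI or at ingest time. The sketch below is a minimal illustration, not a prescribed implementation: the function name missing_labels and the choice to treat tenant as optional are assumptions made for the example.

```python
# Hypothetical contract check; the names here are illustrative assumptions.
REQUIRED_LABELS = frozenset({
    "cluster", "namespace", "workload", "service",
    "team", "environment", "version",
})
# "tenant" applies only to multi-tenant workloads, so it is not required here.
OPTIONAL_LABELS = frozenset({"tenant"})


def missing_labels(labels: dict[str, str]) -> list[str]:
    """Return the required label names that are absent or empty, sorted."""
    return sorted(name for name in REQUIRED_LABELS if not labels.get(name))


if __name__ == "__main__":
    sample = {"cluster": "prod-eu-1", "namespace": "checkout", "team": "payments"}
    print(missing_labels(sample))
```

Running the same check against metric, log, and trace attributes keeps the three signals joinable on identical keys.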

For LLM serving, add:

model, model_version, route, accelerator_type, request_class
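
Each added label multiplies the number of possible time series, so it can help to estimate the worst case before rolling these labels out. The sketch below assumes every label combination can occur, which overstates real usage; the function name and the example counts are hypothetical.

```python
from math import prod


def worst_case_series(distinct_values: dict[str, int]) -> int:
    """Upper bound on series per metric name: product of distinct values per label."""
    return prod(distinct_values.values())


if __name__ == "__main__":
    # Illustrative counts only; substitute numbers observed in your own fleet.
    llm_labels = {
        "model": 4,
        "model_version": 3,
        "route": 5,
        "accelerator_type": 2,
        "request_class": 3,
    }
    print(worst_case_series(llm_labels))  # 4 * 3 * 5 * 2 * 3 = 360
```

The bound is per metric name and applies on top of the base labels, so a label such as model_version that changes on every deploy can dominate the total.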

Failure modes

  • Cardinality grows faster than the storage budget.
  • Logs and metrics use different labels, making correlation slow.
  • Alerts fire on infrastructure symptoms but not on SLO burn.
  • Traces miss edge or model-serving spans, hiding queue time.
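
To make the SLO-burn failure mode concrete, the sketch below computes a burn rate (observed error ratio divided by the error budget) and pages only when both a long and a short window exceed a threshold. The function names are hypothetical, and the default threshold of 14.4 is an assumption borrowed from common multi-window burn-rate practice for a 30-day SLO; tune it to your own policy.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The short window confirms that the problem is still ongoing.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # 2% errors against a 99.9% SLO is roughly a 20x burn in both windows.
    print(should_page(0.02, 0.02, slo_target=0.999))
    # The short window has recovered, so this does not page.
    print(should_page(0.02, 0.001, slo_target=0.999))
```

Alerting on burn rate ties pages to user impact, while infrastructure symptoms remain useful as dashboard context.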