Observability Baseline

Observability should answer operational questions under pressure: what broke, who is affected, whether the fix is working, and what risk remains.

[Figure: Kubernetes observability pipeline architecture]

Signal model

Signal   Purpose
Metrics  Fast health, SLOs, capacity, alerting.
Logs     Detailed context and forensic trail.
Traces   Request path, dependency latency, fan-out behavior.
Events   Kubernetes lifecycle and scheduling decisions.

Minimum dashboards

  • Cluster health: API server latency, node readiness, pod scheduling failures, DNS latency.
  • Workload health: request rate, error rate, latency, saturation.
  • Autoscaling: desired replicas, actual replicas, pending pods, node provisioning time.
  • Security operations: admission rejections, audit anomalies, policy exceptions.
  • LLM serving: time to first token (TTFT), tokens/sec, queue wait, GPU memory, model load time, cost/request.
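
As an illustration of the LLM serving row, the per-request latency metrics can be derived from four timestamps. This is a minimal sketch with hypothetical names (RequestTiming and its fields are not from any particular serving stack). It assumes TTFT is measured from the moment the request is enqueued, so it includes queue wait, and that tokens/sec is output tokens divided by the time from inference start to the last token. Other definitions are in use, so the dashboard should state which one it plots.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request timestamps, in seconds on one monotonic clock."""
    enqueued_at: float     # request accepted and placed in the queue
    started_at: float      # request left the queue and inference began
    first_token_at: float  # first output token emitted
    finished_at: float     # last output token emitted
    output_tokens: int     # number of tokens generated


def queue_wait(t: RequestTiming) -> float:
    """Seconds spent waiting before inference started."""
    return t.started_at - t.enqueued_at


def ttft(t: RequestTiming) -> float:
    """Time to first token, measured from enqueue (assumption: includes queue wait)."""
    return t.first_token_at - t.enqueued_at


def tokens_per_second(t: RequestTiming) -> float:
    """Output tokens divided by time from inference start to last token."""
    elapsed = t.finished_at - t.started_at
    if elapsed <= 0:
        return 0.0  # avoid division by zero on degenerate timings
    return t.output_tokens / elapsed
```

Keeping queue wait as its own series alongside TTFT makes it possible to tell a saturated queue from a slow model.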

Alerting stance

Alert on user impact and fast-burning SLOs. Page less often on symptoms that are already visible in dashboards, unless they directly threaten availability or data safety.
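
A fast-burn check can be sketched as a burn-rate calculation: the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below assumes a 30-day SLO window and a threshold of 14.4, which corresponds to spending 2% of a 30-day budget in one hour. The function names and the two-window check (a long window to confirm the burn, a short window to confirm it is still happening) are illustrative assumptions, not a prescribed policy.

```python
from typing import Tuple


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Return how many times faster than sustainable the error budget is burning.

    A value of 1.0 means the budget would be exactly used up over the SLO window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(
    long_window: Tuple[int, int],   # (errors, total) over e.g. the last hour
    short_window: Tuple[int, int],  # (errors, total) over e.g. the last 5 minutes
    slo_target: float,
    threshold: float = 14.4,        # assumed: 2% of a 30-day budget in 1 hour
) -> bool:
    """Page only when both windows exceed the burn-rate threshold."""
    long_burn = burn_rate(long_window[0], long_window[1], slo_target)
    short_burn = burn_rate(short_window[0], short_window[1], slo_target)
    return long_burn >= threshold and short_burn >= threshold
```

For example, with a 99.9% target, 20 errors in 1,000 requests over the long window is a burn rate of about 20, so the check pages if the short window is also above the threshold. Requiring both windows keeps a brief spike that has already recovered from paging anyone.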

Failure modes

  • High-cardinality labels explode storage cost.
  • Dashboards track infrastructure but not user impact.
  • Logs exist but lack tenant, request, model, or version context.
  • Traces are sampled away exactly when incidents happen.
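
Two of these failure modes lend themselves to a quick check. The sketch below is illustrative only, and its function names, thresholds, and label names are assumptions. The first function estimates the worst-case number of time series for one metric as the product of distinct values per label. The second is a simple tail-sampling rule that always keeps traces with errors or high latency and keeps a small deterministic share of the rest.

```python
from math import prod
from typing import Mapping


def estimated_series(label_cardinalities: Mapping[str, int]) -> int:
    """Worst-case series count for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def keep_trace(
    has_error: bool,
    duration_ms: float,
    trace_id: int,
    slow_ms: float = 1000.0,    # assumed latency threshold for "slow"
    baseline_percent: int = 1,  # assumed share of ordinary traces to keep
) -> bool:
    """Decide after a trace completes whether to retain it."""
    if has_error or duration_ms >= slow_ms:
        return True  # never drop the traces an incident review will need
    # Deterministic baseline sample: same trace_id always gets the same decision.
    return trace_id % 100 < baseline_percent


# Adding a per-user label multiplies the series count by the number of users.
bounded = estimated_series({"service": 20, "status": 5})
unbounded = estimated_series({"service": 20, "status": 5, "user_id": 10_000})
```

Here bounded is 100 series and unbounded is 1,000,000, which is why unbounded identifiers such as user or request IDs usually belong in logs and traces rather than in metric labels.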