Observability Pipeline
Intent
Collect signals from workloads and cluster components, standardize their labels, store each signal type according to how it is queried, and expose operational views tied to SLOs.
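The label standardization step can be sketched as a small rename pass applied before storage. This is a minimal illustration, not a prescribed schema: the alias table and the canonical names (service, namespace, environment) are assumptions, since this document does not define a label set.

```python
from __future__ import annotations

# Hypothetical alias table: maps label keys seen in the wild to one
# canonical key. The real mapping depends on the agreed label schema.
CANONICAL_LABELS = {
    "svc": "service",
    "service_name": "service",
    "app": "service",
    "ns": "namespace",
    "kubernetes_namespace": "namespace",
    "env": "environment",
}


def standardize_labels(labels: dict[str, str]) -> dict[str, str]:
    """Lowercase label keys and rename known aliases to canonical keys.

    Unknown keys pass through with only the lowercasing applied. If two
    input keys map to the same canonical key, the first one seen wins.
    """
    out: dict[str, str] = {}
    for key, value in labels.items():
        lowered = key.lower()
        canonical = CANONICAL_LABELS.get(lowered, lowered)
        # setdefault keeps the first value when aliases collide.
        out.setdefault(canonical, value)
    return out
```

Applying the same function to metrics, logs, and traces at ingestion is one way to let a single label, such as service, join all three signal types in a query.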
Key decisions
- Metrics are optimized for alerting and dashboards.
- Logs carry enough context for forensic debugging without uncontrolled cardinality.
- Traces show the request path and the latency of each dependency.
- Events explain Kubernetes scheduling, lifecycle, and policy outcomes.
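One way to keep log context rich without uncontrolled cardinality is to index only an allowlist of low-cardinality fields and leave everything else in the log body. The sketch below assumes a hypothetical allowlist and record shape; neither is specified in this document.

```python
import json

# Hypothetical allowlist of low-cardinality fields that are safe to index.
INDEXED_FIELDS = frozenset({"service", "namespace", "level", "route"})


def split_log_record(record: dict) -> tuple[dict, str]:
    """Split a log record into (indexed labels, JSON-encoded body).

    Only allowlisted fields become labels. Every other field, including
    high-cardinality identifiers such as request or trace IDs, stays in
    the body so it does not multiply the number of indexed label sets.
    """
    labels = {k: str(v) for k, v in record.items() if k in INDEXED_FIELDS}
    body = {k: v for k, v in record.items() if k not in INDEXED_FIELDS}
    # sort_keys gives a stable encoding for identical bodies.
    return labels, json.dumps(body, sort_keys=True)
```

With this split, adding a new debugging field to a log line changes only the body. A field has to be added to the allowlist deliberately before it can affect index cardinality.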
Review signals
- Every alert has an owner and an action.
- Dashboards start from user impact.
- LLM routes expose TTFT, tokens/sec, queue wait, and GPU pressure.
- Storage retention matches incident and audit requirements.
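The per-request LLM signals named above can be derived from four timestamps and a token count. The sketch below is illustrative only: the field names are hypothetical, TTFT is assumed to be measured from request receipt (so it includes queue wait), and tokens/sec is assumed to mean decode throughput after the first token. Other definitions are in use, so the chosen one should be stated on the dashboard. GPU pressure is left out because it comes from device telemetry rather than request timing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTiming:
    """Timing for one LLM request, in seconds (hypothetical field names)."""
    received: float      # request accepted by the route
    started: float       # request left the queue and began processing
    first_token: float   # first output token emitted
    last_token: float    # final output token emitted
    output_tokens: int   # total output tokens emitted


def queue_wait(t: RequestTiming) -> float:
    """Time spent waiting before processing began."""
    return t.started - t.received


def ttft(t: RequestTiming) -> float:
    """Time to first token, measured from receipt (includes queue wait)."""
    return t.first_token - t.received


def tokens_per_sec(t: RequestTiming) -> float:
    """Decode throughput: tokens after the first, per second of decoding.

    Returns 0.0 when fewer than two tokens were emitted or the decode
    interval is not positive, since no rate can be computed.
    """
    decode_seconds = t.last_token - t.first_token
    if t.output_tokens < 2 or decode_seconds <= 0:
        return 0.0
    return (t.output_tokens - 1) / decode_seconds
```

For example, a request received at 0.0 s, started at 0.5 s, with its first token at 1.5 s, its last token at 3.5 s, and 41 output tokens yields a queue wait of 0.5 s, a TTFT of 1.5 s, and 20 tokens/sec.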