📖 Introduction
The AIOS Observability Stack provides a unified platform for monitoring, debugging, and analyzing distributed AI workloads. It combines metrics, tracing, and logging into a cohesive system, enabling developers and operators to gain end-to-end visibility across agents, blocks, clusters, and vDAGs.
Designed as a stack rather than a single tool, it includes:
- Kubernetes Deployments → Shell scripts & Helm charts for deploying observability services (Prometheus, Loki, Tempo, Grafana, Redis, Thanos).
- SDKs → Python libraries (`aios_metrics`, `aios_tracing`) for easy in-app instrumentation.
- System Services → Long-running collectors and databases for block, cluster, and vDAG metrics persistence.
- Global Metrics Layer → Aggregation services that unify node-level and block-level telemetry into a consistent, queryable model.
The stack is Kubernetes-first and agent-oriented, built to support large-scale, multi-cluster AI deployments. It provides the observability backbone for the AIOS ecosystem and can also be reused as a standalone monitoring platform in other distributed systems.
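To make the idea of a Global Metrics Layer more concrete, here is a minimal sketch of rolling node-level samples up into block-level and cluster-level views. It uses only the Python standard library and is not the AIOS implementation: the `Sample` fields, the `aggregate` function, and the `gpu_util` metric name are all hypothetical, chosen only to illustrate a consistent, queryable model.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One node-level telemetry reading. All field names are hypothetical."""
    cluster: str
    block: str
    node: str
    metric: str
    value: float


def aggregate(samples, level):
    """Average raw samples per (scope, metric).

    scope is (cluster, block) for level="block" and (cluster,) for
    level="cluster".
    """
    if level not in ("block", "cluster"):
        raise ValueError(f"unsupported level: {level}")
    buckets = defaultdict(list)
    for s in samples:
        scope = (s.cluster, s.block) if level == "block" else (s.cluster,)
        buckets[(scope, s.metric)].append(s.value)
    return {key: mean(values) for key, values in buckets.items()}


# Illustrative data only: three nodes spread over two blocks in one cluster.
samples = [
    Sample("c1", "b1", "n1", "gpu_util", 0.25),
    Sample("c1", "b1", "n2", "gpu_util", 0.75),
    Sample("c1", "b2", "n3", "gpu_util", 1.0),
]

by_block = aggregate(samples, "block")
by_cluster = aggregate(samples, "cluster")
```

Keying every result by `(scope, metric)` means block-level and cluster-level views share one shape, so the same lookup works at either level. A real aggregation service would likely also handle time windows and other statistics, which this sketch leaves out.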
✅ Key Goals of the Observability Stack:
- Provide reliable, low-latency monitoring for compute-heavy AI blocks.
- Enable causal tracing across clusters for debugging distributed pipelines.
- Support scalable logging with retention, persistence, and S3/MinIO backends.
- Allow applications to self-instrument via lightweight SDKs.
- Deliver global, aggregated views of system health (block, cluster, vDAG).
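The self-instrumentation goal can be pictured with a small, standard-library-only sketch. It does not use the `aios_metrics` or `aios_tracing` APIs, whose interfaces are not described in this section. The `timed` decorator, the `LATENCIES` store, and the metric name are hypothetical stand-ins for whatever the SDK actually provides.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process store: metric name -> observed durations in seconds.
LATENCIES = defaultdict(list)


def timed(metric_name):
    """Record the wall-clock duration of each call under metric_name."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even if func raises, so failures are still timed.
                LATENCIES[metric_name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("block_inference_seconds")
def run_inference(x):
    # Placeholder for a compute-heavy AI block.
    return x * 2


result = run_inference(21)
```

A decorator keeps the instrumentation to one line per function, which is the kind of low-friction usage a lightweight SDK aims for. In a real deployment the recorded values would be exported to the metrics backend rather than held in a module-level dictionary.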