📖 Introduction

The AIOS Observability Stack provides a unified platform for monitoring, debugging, and analyzing distributed AI workloads. It combines metrics, tracing, and logging into a cohesive system, enabling developers and operators to gain end-to-end visibility across agents, blocks, clusters, and vDAGs.

Designed as a stack rather than a single tool, it includes:

Kubernetes Deployments → Shell scripts and Helm charts that deploy the observability services (Prometheus, Loki, Tempo, Grafana, Redis, Thanos).
SDKs → Python libraries (aios_metrics, aios_tracing) for easy in-app instrumentation.
System Services → Long-running collectors and databases that persist block, cluster, and vDAG metrics.
Global Metrics Layer → Aggregation services that unify node-level and block-level telemetry into a consistent, queryable model.
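
As a rough illustration of what the Global Metrics Layer does conceptually, the sketch below rolls node-level samples up into block-level and cluster-level averages using only the Python standard library. It is not the AIOS implementation: the `Sample` fields, the `aggregate` function, and the `gpu_util` metric name are all hypothetical, chosen only to show how one set of readings can be queried at more than one scope.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One node-level telemetry reading. All field names are hypothetical."""
    cluster: str
    block: str
    node: str
    metric: str
    value: float


def aggregate(samples, level):
    """Average samples per (scope, metric); scope is a block or a whole cluster."""
    if level not in ("block", "cluster"):
        raise ValueError("level must be 'block' or 'cluster'")
    groups = defaultdict(list)
    for s in samples:
        # A block is identified by (cluster, block); a cluster by (cluster,).
        scope = (s.cluster, s.block) if level == "block" else (s.cluster,)
        groups[(scope, s.metric)].append(s.value)
    return {key: mean(values) for key, values in groups.items()}


# Illustrative readings from three nodes across two blocks of one cluster.
samples = [
    Sample("c1", "b1", "n1", "gpu_util", 0.5),
    Sample("c1", "b1", "n2", "gpu_util", 0.75),
    Sample("c1", "b2", "n3", "gpu_util", 1.0),
]

block_view = aggregate(samples, "block")      # one entry per block and metric
cluster_view = aggregate(samples, "cluster")  # one entry per cluster and metric

for key, value in sorted(block_view.items()):
    print("block", key, value)
for key, value in sorted(cluster_view.items()):
    print("cluster", key, value)
```

A real aggregation service would also have to handle time windows, missing nodes, and metric types that should not be averaged (counters, for example); this sketch leaves all of that out.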

The stack is Kubernetes-first and agent-oriented, built to support large-scale, multi-cluster AI deployments. It provides the observability backbone for the AIOS ecosystem and can also be reused as a standalone monitoring platform in other distributed systems.

✅ Key Goals of the Observability Stack:

Provide reliable, low-latency monitoring for compute-heavy AI blocks.
Enable causal tracing across clusters for debugging distributed pipelines.
Support scalable logging with retention policies and persistent storage on S3/MinIO backends.
Allow applications to self-instrument via lightweight SDKs.
Deliver global, aggregated views of system health (block, cluster, vDAG).