🚀 Deploying the Observability Stack

This section covers end-to-end deployment of the AIOS Observability Stack on Kubernetes using the helper scripts in k8s/. You'll get quickstarts, full flag tables, and verification steps for:

  • Metrics stack (Prometheus, Grafana, optional Loki/Tempo/Thanos) → k8s/deploy_observability_stack.sh
  • Tracing stack (Tempo + OpenTelemetry Collector + optional Grafana) → k8s/deploy_tracing.sh
  • Logging stack (Loki + Promtail + optional Grafana) → k8s/deploy_logging.sh

✅ Prerequisites

  • A working Kubernetes cluster (v1.24+ recommended)
  • kubectl and helm installed and pointing to your cluster
  • (Optional) S3/MinIO credentials if you plan to use object storage
  • A StorageClass with sufficient capacity for the PVCs (or set the --storage-class flags explicitly)

Tip: Point kubectl and helm at a specific cluster with KUBECONFIG=/path/to/kubeconfig, or pass --kubeconfig on each invocation as needed.
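
A quick pre-flight check before installing anything (standard kubectl/helm commands; the kubeconfig path is illustrative):

export KUBECONFIG=/path/to/kubeconfig
kubectl config current-context   # confirm you are on the intended cluster
kubectl get nodes                # cluster reachable and nodes Ready?
helm version --short             # helm installed and working
kubectl get storageclass         # storage classes available for PVCs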


🧭 What gets installed (at a glance)

| Stack | Components (typical) | Namespace |
|---|---|---|
| Metrics | Prometheus, Alertmanager, Grafana, kube-state-metrics, node-exporter, optional Loki, optional Tempo, Thanos Ruler | observability |
| Tracing | Grafana Tempo (single/distributed), OpenTelemetry Collector (agent + gateway), optional Grafana | tracing |
| Logging | Loki (single/distributed), Promtail, optional Grafana | logging |

โšก๏ธ TL;DR โ€” One-liners

Metrics (Prometheus + Grafana; no Loki/Tempo)

k8s/deploy_observability_stack.sh \
  --action install \
  --namespace observability \
  --with-loki false \
  --with-tempo false
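
Once the install returns, the same helper's status action (documented in the flag reference below) gives a quick health check:

k8s/deploy_observability_stack.sh --action status --namespace observability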

Tracing (Tempo distributed + OTel Collector + Grafana via NodePort)

k8s/deploy_tracing.sh \
  --action install \
  --namespace tracing \
  --with-grafana true \
  --grafana-enable-nodeport true --grafana-nodeport 32010 \
  --tempo-objstore filesystem

Logging (Loki single + Promtail + Grafana via NodePort)

k8s/deploy_logging.sh \
  --action install \
  --mode single \
  --with-promtail true \
  --with-grafana true \
  --persistence false \
  --grafana-enable-nodeport true --grafana-nodeport 32199
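
All three helpers accept the same --action verbs (install | uninstall | status | port-forward), so post-install checks follow the same pattern, e.g.:

k8s/deploy_tracing.sh --action status --namespace tracing
k8s/deploy_logging.sh --action port-forward --namespace logging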

📦 Metrics Stack — deploy_observability_stack.sh

Common examples

Full stack with Loki + Thanos Ruler, Grafana via Ingress

k8s/deploy_observability_stack.sh \
  --action install \
  --namespace observability \
  --with-loki true \
  --with-tempo false \
  --with-thanos-ruler true \
  --thanos-s3-bucket aios-metrics \
  --thanos-s3-endpoint minio.minio.svc.cluster.local:9000 \
  --thanos-s3-access-key minioadmin \
  --thanos-s3-secret-key minioadmin \
  --thanos-s3-insecure true \
  --storage-class rook-ceph-block \
  --grafana-enable-ingress true --grafana-ingress-host grafana.example.com

Minimal metrics only (Prometheus + Grafana via port-forward)

k8s/deploy_observability_stack.sh \
  --action install \
  --with-loki false \
  --with-tempo false \
  --grafana-enable-nodeport false

Flag reference

| Flag | Description | Default |
|---|---|---|
| --action | install \| uninstall \| status \| port-forward | install |
| --namespace | K8s namespace to install into | observability |
| --with-loki | Add Loki + Promtail alongside metrics | true |
| --with-tempo | Add Tempo (tracing) via Grafana chart (basic) | false |
| --storage-class | Storage class for all PVCs (if set) | "" |
| --prom-retention | Prometheus retention window | 15d |
| --pvc-prom-size | Prometheus PVC size | 20Gi |
| --pvc-grafana-size | Grafana PVC size | 10Gi |
| --grafana-admin-user | Grafana admin username | admin |
| --grafana-admin-password | Grafana admin password (auto-generated if empty) | "" |
| --grafana-enable-ingress | Enable Grafana Ingress | false |
| --grafana-ingress-host | Host for Grafana Ingress | "" |
| --grafana-enable-nodeport | Expose Grafana as NodePort | false |
| --grafana-nodeport | NodePort for Grafana | 32000 |
| --prom-req/lim-* | Prometheus CPU/memory requests/limits | see script defaults |

Thanos Ruler & Sidecar

| Flag | Description | Default |
|---|---|---|
| --with-thanos-ruler | Enable Thanos Ruler and Prometheus sidecar | false |
| --thanos-ruler-replicas | Ruler replicas | 1 |
| --pvc-thanos-ruler-size | Ruler PVC size | 10Gi |
| --thanos-objstore-type | Object store type (s3) | s3 |
| --thanos-s3-bucket | S3 bucket for Thanos | required if enabled |
| --thanos-s3-endpoint | S3/MinIO endpoint | required if enabled |
| --thanos-s3-access-key | Access key | required if enabled |
| --thanos-s3-secret-key | Secret key | required if enabled |
| --thanos-s3-insecure | Allow HTTP/self-signed | false |
| --thanos-s3-prefix | Optional bucket prefix | "" |
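
Thanos expects the bucket to already exist. If it doesn't, create it first, e.g. with the MinIO client; the alias name is illustrative and the endpoint/credentials mirror the full-stack example above (run this from a pod or host that can reach the endpoint):

mc alias set aios-minio http://minio.minio.svc.cluster.local:9000 minioadmin minioadmin
mc mb aios-minio/aios-metrics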

Verify

kubectl -n observability get pods
kubectl -n observability get svc
# Grafana (port-forward if no Ingress/NodePort):
kubectl -n observability port-forward svc/kps-grafana 3000:80
open http://localhost:3000
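
You can also hit the Prometheus API directly. The service name below assumes the kube-prometheus-stack chart with a kps release prefix (matching the kps-grafana service above); adjust it to whatever kubectl get svc shows:

kubectl -n observability port-forward svc/kps-kube-prometheus-stack-prometheus 9090:9090 &
sleep 2
# every healthy scrape target reports up == 1
curl -s 'http://localhost:9090/api/v1/query?query=up' | head -c 400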

🔎 Tracing Stack — deploy_tracing.sh

Common examples

Distributed Tempo + OTel Collector + Grafana via NodePort

k8s/deploy_tracing.sh \
  --action install \
  --namespace tracing \
  --with-grafana true \
  --grafana-enable-nodeport true --grafana-nodeport 32010 \
  --tempo-objstore filesystem

Tempo single-binary + OTel Collector (dev)

k8s/deploy_tracing.sh \
  --action install \
  --tempo-mode single \
  --tempo-objstore filesystem \
  --with-grafana false

Tempo distributed with MinIO

k8s/deploy_tracing.sh \
  --action install \
  --namespace tracing \
  --tempo-objstore s3 \
  --tempo-s3-bucket traces \
  --tempo-s3-endpoint minio.minio.svc.cluster.local:9000 \
  --tempo-s3-access-key minioadmin \
  --tempo-s3-secret-key minioadmin \
  --tempo-s3-insecure true \
  --with-grafana true

Flag reference

| Flag | Description | Default |
|---|---|---|
| --action | install \| uninstall \| status \| port-forward | install |
| --namespace | Target namespace | tracing |
| --with-tempo | Deploy Tempo | true |
| --tempo-mode | distributed \| single | distributed |
| --tempo-retention | Retention window (48h, etc.) | 48h |
| --tempo-objstore | s3 \| gcs \| azure \| filesystem | s3 |
| --tempo-s3-* | S3/MinIO connection (bucket/endpoint/keys/insecure/prefix) | — |
| --pvc-tempo-* | PVC sizes for WAL/ingester/storegw/compactor/querier | see script |
| --with-otel | Deploy OTel Collector (agent + gateway) | true |
| --otel-sampling-ratio | 0.0–1.0 probability sampling | 1.0 |
| --otel-gateway-replicas | OTel gateway replicas | 2 |
| --otel-enable-logs-pipeline | Also forward OTLP logs | false |
| --with-grafana | Deploy Grafana + Tempo datasource | false |
| --grafana-* | Admin / ingress / nodeport / PVC options | see script |

Verify

kubectl -n tracing get pods
# Port-forward Tempo query-frontend:
kubectl -n tracing port-forward svc/tempo-distributed-query-frontend 3100:3100
# Grafana (if enabled):
kubectl -n tracing port-forward svc/grafana 3000:80

Send a test trace (Python)

# Point at your OTel Collector service in the 'tracing' namespace.
from aios_tracing.tracing import TracingSDK

sdk = TracingSDK(service_name="smoke-test", otlp_endpoint="otel-collector.tracing.svc:4317")

@sdk.trace()
def hello():
    return "world"

hello()
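
With the query-frontend port-forward from the Verify step still running, you can look for the span via Tempo's search API (the service.name tag matches the SDK example above):

curl -s 'http://localhost:3100/api/search?tags=service.name%3Dsmoke-test' | head -c 400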

🪵 Logging Stack — deploy_logging.sh

Loki-distributed (HA) + Promtail + Grafana via Ingress

k8s/deploy_logging.sh \
  --action install \
  --mode distributed \
  --with-promtail true \
  --with-grafana true \
  --grafana-enable-ingress true \
  --grafana-ingress-host grafana.example.com \
  --objstore s3 \
  --s3-bucket logs \
  --s3-endpoint s3.amazonaws.com \
  --s3-access-key AKIA... \
  --s3-secret-key ... \
  --retention-hours 168

Flag reference

| Flag | Description | Default |
|---|---|---|
| --action | install \| uninstall \| status \| port-forward | install |
| --namespace | Namespace to deploy to | logging |
| --release | Helm release prefix/name | loki |
| --mode | single (loki-stack) \| distributed (loki-distributed) | single |
| --with-promtail | Deploy Promtail daemonset | true |
| --with-grafana | Deploy Grafana | true |
| --persistence | Enable Loki PVCs (single mode) | false |
| --pvc-size | Loki PVC size (single mode) | 20Gi |
| --storage-class | PVC storageClassName (all PVCs) | "" |
| --objstore | filesystem \| s3 (for chunks/index) | filesystem |
| --s3-bucket | S3/MinIO bucket | — |
| --s3-endpoint | S3 endpoint or host:port for MinIO | — |
| --s3-access-key | S3 access key | — |
| --s3-secret-key | S3 secret key | — |
| --s3-insecure | Allow HTTP/self-signed (MinIO) | false |
| --s3-prefix | Optional key prefix | "" |
| --retention-hours | Log retention (compactor/limits) | 168 |
| --promtail-extra-labels | Extra labels k=v,k2=v2 | "" |
| --promtail-namespaces | all or comma list (ns1,ns2) | all |
| --promtail-host-logs | Collect /var/log/*log | true |
| --grafana-admin-user | Grafana admin user | admin |
| --grafana-admin-password | Grafana admin password | admin |
| --grafana-enable-ingress | Enable Ingress for Grafana | false |
| --grafana-ingress-host | Ingress host | "" |
| --grafana-enable-nodeport | Expose Grafana via NodePort | true |
| --grafana-nodeport | Grafana NodePort | 32199 |
| --grafana-pvc-size | Grafana PVC size | 10Gi |
| --pvc-ingester-size | (distributed) Ingester PVC | 20Gi |
| --pvc-storegw-size | (distributed) Store-Gateway PVC | 20Gi |
| --pvc-compactor-size | (distributed) Compactor PVC | 10Gi |

Verify

# Pods & services
kubectl -n logging get pods
kubectl -n logging get svc

# Loki endpoint:
#  - single:      svc/loki (port 3100)
#  - distributed: svc/loki-query-frontend (port 3100)
kubectl -n logging port-forward svc/loki-query-frontend 3100:3100  # if distributed

# Grafana (NodePort or port-forward)
kubectl -n logging port-forward svc/loki-grafana 3000:80

Promtail quick sanity checks

# Ensure promtail is scraping pods
kubectl -n logging get ds -l app=promtail
kubectl -n logging logs ds/loki-promtail -c promtail --tail=200

# Query in Grafana Explore with datasource "Loki"
{job="kubernetes-pods"} |= "ERROR"
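
The same LogQL query can be run against Loki's HTTP API over the port-forward from the Verify step:

curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="kubernetes-pods"} |= "ERROR"' \
  --data-urlencode 'limit=10'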

🧪 Post-install smoke tests

Grafana log-in + datasource checks

  1. Open Grafana (Ingress URL, NodePort, or kubectl port-forward).
  2. Log in with the admin credentials.
  3. Go to Connections → Data sources and confirm each datasource is reachable:
     • Prometheus (metrics stack) ✅
     • Loki (logging stack) ✅
     • Tempo (tracing stack) ✅
  4. In the Explore tab, run a sample query against each datasource.

End-to-end trace + log correlation (optional)

If you installed tracing and logging:

  • Emit a test span with your aios_tracing SDK.
  • Log a line with the trace_id injected (via logging correlation).
  • In Grafana Explore:
      • Start from a trace in Tempo → jump to its logs (trace-to-logs).
      • Or start from a log line → jump to the trace (logs-to-trace).

🧰 Troubleshooting

| Symptom | Likely Cause | Fix |
|---|---|---|
| helm upgrade --install hangs | Webhooks / CRDs not ready | kubectl get events -A; retry with --wait or check CRDs |
| Grafana 502 via Ingress | Wrong host/TLS or service type | Verify --grafana-ingress-host; check Ingress controller logs |
| Loki returns 500 on queries | Schema/object store mismatch | Ensure schema_config matches the objstore choice; check the compactor |
| Promtail pods CrashLoopBackOff | HostPath or RBAC issues | Check DaemonSet events; disable --promtail-host-logs to isolate |
| Tempo "no blocks" | Object store creds or retention too short | Verify --tempo-s3-* flags; check compactor logs |
| Prometheus out of space | PVC too small | Increase --pvc-prom-size and --prom-retention accordingly |

Log digging tips:

kubectl -n observability logs deploy/kps-grafana --tail=200
kubectl -n logging logs deploy/loki-distributed-distributor --tail=200
kubectl -n tracing logs deploy/tempo-distributed-gateway --tail=200
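
When a pod is stuck Pending or CrashLoopBackOff, recent events usually name the culprit (the pod name below is a placeholder):

kubectl get events -A --sort-by=.lastTimestamp | tail -n 30
kubectl -n observability describe pod <pod-name>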

๐Ÿ” Security & hardening tips

  • Credentials: Avoid passing secrets via flags in CI logs. Prefer Kubernetes Secrets and reference them in your Helm values (see the sketch after this list).
  • Network: Use Ingress with TLS and NetworkPolicies to restrict access to Grafana and data services.
  • RBAC: Restrict Promtail/OTel permissions to only what they need.
  • Multitenancy: Use Prometheus tenants or separate namespaces/releases if isolating teams.
  • Retention: Set realistic --retention-hours and object store lifecycle rules to control costs.
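
A minimal sketch of keeping object-store credentials out of CI logs: store them in a Secret and feed the flags from environment variables instead of literals. The secret name and keys are illustrative; how your charts consume the Secret depends on their values schema:

kubectl -n observability create secret generic thanos-objstore \
  --from-literal=access_key="$THANOS_S3_ACCESS_KEY" \
  --from-literal=secret_key="$THANOS_S3_SECRET_KEY"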

🧹 Teardown & cleanup

# Uninstall stacks
k8s/deploy_observability_stack.sh --action uninstall --namespace observability
k8s/deploy_tracing.sh --action uninstall --namespace tracing
k8s/deploy_logging.sh --action uninstall --namespace logging

# (Optional) delete namespaces and PVs — irreversible if reclaim policy=Delete
kubectl delete ns observability tracing logging
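
Helm uninstall often leaves StatefulSet PVCs behind. List them before deleting anything; the label selector below is an assumption and may differ per chart:

kubectl get pvc -A | grep -E 'observability|tracing|logging'
kubectl -n logging delete pvc -l app.kubernetes.io/name=loki   # label is chart-dependent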

💡 Operator tips

  • StorageClass: When in doubt, set --storage-class <class> explicitly to avoid Pending PVCs.
  • S3/MinIO: For development, MinIO endpoints often require --s3-insecure true and path-style addressing (s3forcepathstyle).
  • Scaling: Use the distributed modes for Loki/Tempo when you need HA and horizontal scaling.
  • Dashboards: Pre-provision Grafana dashboards and alerting rules through Helm values or ConfigMaps.
  • CI/CD: Parameterize these scripts in your pipeline (env vars → flags) for consistent cluster bring-up; see the sketch below.
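
A minimal sketch of env-var-driven bring-up (the variable names are illustrative):

# Drive the installer from CI environment variables, falling back to defaults.
NAMESPACE="${OBS_NAMESPACE:-observability}"
STORAGE_CLASS="${OBS_STORAGE_CLASS:-}"

k8s/deploy_observability_stack.sh \
  --action install \
  --namespace "$NAMESPACE" \
  ${STORAGE_CLASS:+--storage-class "$STORAGE_CLASS"}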