# 🚀 Deploying the Observability Stack

This section covers end-to-end deployment of the AIOS Observability Stack on Kubernetes using the helper scripts in `k8s/`. You'll get quickstarts, full flag tables, and verification steps for:

- Metrics stack (Prometheus, Grafana, optional Loki/Tempo/Thanos) → `k8s/deploy_observability_stack.sh`
- Tracing stack (Tempo + OpenTelemetry Collector + optional Grafana) → `k8s/deploy_tracing.sh`
- Logging stack (Loki + Promtail + optional Grafana) → `k8s/deploy_logging.sh`
## ✅ Prerequisites

- A working Kubernetes cluster (v1.24+ recommended)
- `kubectl` and `helm` installed and pointing at your cluster
- (Optional) S3/MinIO credentials if you plan to use object storage
- A suitable storage class available for PVCs (or set the `--storage-class` flags)

> Tip: Point at a specific kubeconfig with `KUBECONFIG=/path/to/kubeconfig`, or add `--kubeconfig` to your `kubectl`/`helm` invocations as needed.
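Before running any of the deploy scripts, it can be handy to confirm the two required CLIs are actually on `PATH`. A minimal sketch (the tool list is just the two CLIs above; the version subcommands are standard `kubectl`/`helm` flags):

```python
import shutil
import subprocess

def missing_tools(tools=("kubectl", "helm")):
    """Return the subset of required CLIs that are not on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print(f"Missing required tools: {', '.join(missing)}")
    else:
        # Both CLIs found; print versions for the record.
        subprocess.run(["kubectl", "version", "--client"], check=False)
        subprocess.run(["helm", "version", "--short"], check=False)
```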
## 🧭 What gets installed (at a glance)

| Stack | Components (typical) | Namespace |
|---|---|---|
| Metrics | Prometheus, Alertmanager, Grafana, kube-state-metrics, node-exporter, optional Loki, optional Tempo, optional Thanos Ruler | `observability` |
| Tracing | Grafana Tempo (single/distributed), OpenTelemetry Collector (agent + gateway), optional Grafana | `tracing` |
| Logging | Loki (single/distributed), Promtail, optional Grafana | `logging` |
## ⚡️ TL;DR — One-liners

**Metrics** (Prometheus + Grafana; no Loki/Tempo):

```bash
k8s/deploy_observability_stack.sh \
  --action install \
  --namespace observability \
  --with-loki false \
  --with-tempo false
```

**Tracing** (Tempo distributed + OTel Collector + Grafana via NodePort):

```bash
k8s/deploy_tracing.sh \
  --action install \
  --namespace tracing \
  --with-grafana true \
  --grafana-enable-nodeport true --grafana-nodeport 32010 \
  --tempo-objstore filesystem
```

**Logging** (Loki single + Promtail + Grafana via NodePort):

```bash
k8s/deploy_logging.sh \
  --action install \
  --mode single \
  --with-promtail true \
  --with-grafana true \
  --persistence false \
  --grafana-enable-nodeport true --grafana-nodeport 32199
```
## 📦 Metrics Stack — deploy_observability_stack.sh

### Common examples

**Full stack with Loki + Thanos Ruler, Grafana via Ingress:**

```bash
k8s/deploy_observability_stack.sh \
  --action install \
  --namespace observability \
  --with-loki true \
  --with-tempo false \
  --with-thanos-ruler true \
  --thanos-s3-bucket aios-metrics \
  --thanos-s3-endpoint minio.minio.svc.cluster.local:9000 \
  --thanos-s3-access-key minioadmin \
  --thanos-s3-secret-key minioadmin \
  --thanos-s3-insecure true \
  --storage-class rook-ceph-block \
  --grafana-enable-ingress true --grafana-ingress-host grafana.example.com
```

**Minimal metrics only (Prometheus + Grafana via port-forward):**

```bash
k8s/deploy_observability_stack.sh \
  --action install \
  --with-loki false \
  --with-tempo false \
  --grafana-enable-nodeport false
```
### Flag reference

| Flag | Description | Default |
|---|---|---|
| `--action` | `install` \| `uninstall` \| `status` \| `port-forward` | `install` |
| `--namespace` | K8s namespace to install into | `observability` |
| `--with-loki` | Add Loki + Promtail alongside metrics | `true` |
| `--with-tempo` | Add Tempo (tracing) via Grafana chart (basic) | `false` |
| `--storage-class` | Storage class for all PVCs (if set) | `""` |
| `--prom-retention` | Prometheus retention window | `15d` |
| `--pvc-prom-size` | Prometheus PVC size | `20Gi` |
| `--pvc-grafana-size` | Grafana PVC size | `10Gi` |
| `--grafana-admin-user` | Grafana admin username | `admin` |
| `--grafana-admin-password` | Grafana admin password (auto-generated if empty) | `""` |
| `--grafana-enable-ingress` | Enable Grafana Ingress | `false` |
| `--grafana-ingress-host` | Host for Grafana Ingress | `""` |
| `--grafana-enable-nodeport` | Expose Grafana as NodePort | `false` |
| `--grafana-nodeport` | NodePort for Grafana | `32000` |
| `--prom-req/lim-*` | Prometheus CPU/memory requests/limits | see script defaults |
### Thanos Ruler & Sidecar

| Flag | Description | Default |
|---|---|---|
| `--with-thanos-ruler` | Enable Thanos Ruler and Prometheus sidecar | `false` |
| `--thanos-ruler-replicas` | Ruler replicas | `1` |
| `--pvc-thanos-ruler-size` | Ruler PVC size | `10Gi` |
| `--thanos-objstore-type` | Object store type (`s3`) | `s3` |
| `--thanos-s3-bucket` | S3 bucket for Thanos | required if enabled |
| `--thanos-s3-endpoint` | S3/MinIO endpoint | required if enabled |
| `--thanos-s3-access-key` | Access key | required if enabled |
| `--thanos-s3-secret-key` | Secret key | required if enabled |
| `--thanos-s3-insecure` | Allow HTTP/self-signed | `false` |
| `--thanos-s3-prefix` | Optional bucket prefix | `""` |
### Verify

```bash
kubectl -n observability get pods
kubectl -n observability get svc

# Grafana (port-forward if no Ingress/NodePort):
kubectl -n observability port-forward svc/kps-grafana 3000:80
open http://localhost:3000
```
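Beyond checking pods, you can confirm Prometheus is actually scraping by hitting its standard HTTP API. A minimal sketch, assuming you have port-forwarded the Prometheus service to `localhost:9090` (the exact service name depends on the Helm release, so check `kubectl -n observability get svc` first):

```python
import json
import urllib.parse
import urllib.request

def prometheus_query_url(base: str, promql: str) -> str:
    """Build an instant-query URL for the standard Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"

if __name__ == "__main__":
    # Assumes: kubectl -n observability port-forward svc/<prometheus-svc> 9090:9090
    url = prometheus_query_url("http://localhost:9090", "up")
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            body = json.load(resp)
        # A healthy stack answers {"status": "success", ...} with one 'up' sample per target.
        print(body["status"], "targets:", len(body["data"]["result"]))
    except OSError as e:
        print("Prometheus not reachable (is the port-forward running?):", e)
```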
## 🔍 Tracing Stack — deploy_tracing.sh

### Common examples

**Distributed Tempo + OTel Collector + Grafana via NodePort:**

```bash
k8s/deploy_tracing.sh \
  --action install \
  --namespace tracing \
  --with-grafana true \
  --grafana-enable-nodeport true --grafana-nodeport 32010 \
  --tempo-objstore filesystem
```

**Tempo single-binary + OTel Collector (dev):**

```bash
k8s/deploy_tracing.sh \
  --action install \
  --tempo-mode single \
  --tempo-objstore filesystem \
  --with-grafana false
```

**Tempo distributed with MinIO:**

```bash
k8s/deploy_tracing.sh \
  --action install \
  --namespace tracing \
  --tempo-objstore s3 \
  --tempo-s3-bucket traces \
  --tempo-s3-endpoint minio.minio.svc.cluster.local:9000 \
  --tempo-s3-access-key minioadmin \
  --tempo-s3-secret-key minioadmin \
  --tempo-s3-insecure true \
  --with-grafana true
```
### Flag reference

| Flag | Description | Default |
|---|---|---|
| `--action` | `install` \| `uninstall` \| `status` \| `port-forward` | `install` |
| `--namespace` | Target namespace | `tracing` |
| `--with-tempo` | Deploy Tempo | `true` |
| `--tempo-mode` | `distributed` \| `single` | `distributed` |
| `--tempo-retention` | Retention window (`48h`, etc.) | `48h` |
| `--tempo-objstore` | `s3` \| `gcs` \| `azure` \| `filesystem` | `s3` |
| `--tempo-s3-*` | S3/MinIO connection (bucket/endpoint/keys/insecure/prefix) | — |
| `--pvc-tempo-*` | PVC sizes for WAL/ingester/storegw/compactor/querier | see script |
| `--with-otel` | Deploy OTel Collector (agent + gateway) | `true` |
| `--otel-sampling-ratio` | 0.0–1.0 probability sampling | `1.0` |
| `--otel-gateway-replicas` | OTel gateway replicas | `2` |
| `--otel-enable-logs-pipeline` | Also forward OTLP logs | `false` |
| `--with-grafana` | Deploy Grafana + Tempo datasource | `false` |
| `--grafana-*` | Admin / ingress / nodeport / PVC options | see script |
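`--otel-sampling-ratio` configures probabilistic head sampling: the keep/drop verdict is conceptually a deterministic function of the trace ID compared against the ratio, so every component that sees the same trace makes the same decision and sampled traces stay complete. A minimal sketch of the idea (illustrative only, not the Collector's exact hash function):

```python
import hashlib

def keep_trace(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head-sampling decision: hash the trace ID into [0, 1)
    and keep the trace if it falls below the configured ratio."""
    digest = hashlib.sha256(trace_id_hex.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < ratio

# ratio 1.0 keeps everything; ratio 0.0 drops everything
print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 1.0))
```

Because the decision depends only on the trace ID, a span and all of its children are sampled (or dropped) together, which is why the flag applies at the Collector rather than per service.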
### Verify

```bash
kubectl -n tracing get pods

# Port-forward Tempo query-frontend:
kubectl -n tracing port-forward svc/tempo-distributed-query-frontend 3100:3100

# Grafana (if enabled):
kubectl -n tracing port-forward svc/grafana 3000:80
```

### Send a test trace (Python)

```python
# Point at the OTel Collector service in the 'tracing' namespace.
from aios_tracing.tracing import TracingSDK

sdk = TracingSDK(service_name="smoke-test", otlp_endpoint="otel-collector.tracing.svc:4317")

@sdk.trace()
def hello():
    return "world"

hello()
```
## 🪵 Logging Stack — deploy_logging.sh

**Loki-distributed (HA) + Promtail + Grafana via Ingress:**

```bash
k8s/deploy_logging.sh \
  --action install \
  --mode distributed \
  --with-promtail true \
  --with-grafana true \
  --grafana-enable-ingress true \
  --grafana-ingress-host grafana.example.com \
  --objstore s3 \
  --s3-bucket logs \
  --s3-endpoint s3.amazonaws.com \
  --s3-access-key AKIA... \
  --s3-secret-key ... \
  --retention-hours 168
```
### Flag reference

| Flag | Description | Default |
|---|---|---|
| `--action` | `install` \| `uninstall` \| `status` \| `port-forward` | `install` |
| `--namespace` | Namespace to deploy to | `logging` |
| `--release` | Helm release prefix/name | `loki` |
| `--mode` | `single` (loki-stack) \| `distributed` (loki-distributed) | `single` |
| `--with-promtail` | Deploy Promtail DaemonSet | `true` |
| `--with-grafana` | Deploy Grafana | `true` |
| `--persistence` | Enable Loki PVCs (single mode) | `false` |
| `--pvc-size` | Loki PVC size (single mode) | `20Gi` |
| `--storage-class` | PVC `storageClassName` (all PVCs) | `""` |
| `--objstore` | `filesystem` \| `s3` (for chunks/index) | `filesystem` |
| `--s3-bucket` | S3/MinIO bucket | — |
| `--s3-endpoint` | S3 endpoint or `host:port` for MinIO | — |
| `--s3-access-key` | S3 access key | — |
| `--s3-secret-key` | S3 secret key | — |
| `--s3-insecure` | Allow HTTP/self-signed (MinIO) | `false` |
| `--s3-prefix` | Optional key prefix | `""` |
| `--retention-hours` | Log retention (compactor/limits) | `168` |
| `--promtail-extra-labels` | Extra labels `k=v,k2=v2` | `""` |
| `--promtail-namespaces` | `all` or comma list (`ns1,ns2`) | `all` |
| `--promtail-host-logs` | Collect `/var/log/*log` | `true` |
| `--grafana-admin-user` | Grafana admin user | `admin` |
| `--grafana-admin-password` | Grafana admin password | `admin` |
| `--grafana-enable-ingress` | Enable Ingress for Grafana | `false` |
| `--grafana-ingress-host` | Ingress host | `""` |
| `--grafana-enable-nodeport` | Expose Grafana via NodePort | `true` |
| `--grafana-nodeport` | Grafana NodePort | `32199` |
| `--grafana-pvc-size` | Grafana PVC size | `10Gi` |
| `--pvc-ingester-size` | (distributed) Ingester PVC | `20Gi` |
| `--pvc-storegw-size` | (distributed) Store-Gateway PVC | `20Gi` |
| `--pvc-compactor-size` | (distributed) Compactor PVC | `10Gi` |
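For reference, `--promtail-extra-labels` takes a comma-separated `k=v` list that becomes static labels on every scraped stream. A sketch of how such a string decomposes (illustrative parsing, not the script's exact code):

```python
def parse_extra_labels(spec: str) -> dict:
    """Split 'k=v,k2=v2' into a label dict; an empty spec means no extra labels."""
    labels = {}
    for pair in filter(None, spec.split(",")):
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"invalid label pair: {pair!r}")
        labels[key.strip()] = value.strip()
    return labels

print(parse_extra_labels("team=ai,env=dev"))  # {'team': 'ai', 'env': 'dev'}
```

Keep the label set small and low-cardinality: every distinct label combination creates a separate Loki stream.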
### Verify

```bash
# Pods & services
kubectl -n logging get pods
kubectl -n logging get svc

# Loki endpoint:
# - single: svc/loki (port 3100)
# - distributed: svc/loki-query-frontend (port 3100)
kubectl -n logging port-forward svc/loki-query-frontend 3100:3100  # if distributed

# Grafana (NodePort or port-forward)
kubectl -n logging port-forward svc/loki-grafana 3000:80
```

### Promtail quick sanity checks

```bash
# Ensure Promtail is scraping pods
kubectl -n logging get ds -l app=promtail
kubectl -n logging logs ds/loki-promtail -c promtail --tail=200
```

Query in Grafana Explore with the "Loki" datasource:

```logql
{job="kubernetes-pods"} |= "ERROR"
```
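You can also push a synthetic log line straight to Loki and query it back, which exercises the ingest path without waiting for Promtail. A minimal sketch, assuming Loki is port-forwarded to `localhost:3100`; the `/loki/api/v1/push` endpoint and its streams/values payload shape are Loki's standard push API:

```python
import json
import time
import urllib.request

def loki_push_payload(line: str, labels: dict) -> bytes:
    """Build a Loki push-API body: one stream, one entry, ns-precision timestamp."""
    entry = [str(time.time_ns()), line]
    return json.dumps({"streams": [{"stream": labels, "values": [entry]}]}).encode()

if __name__ == "__main__":
    # Assumes: kubectl -n logging port-forward svc/loki 3100:3100 (single mode)
    req = urllib.request.Request(
        "http://localhost:3100/loki/api/v1/push",
        data=loki_push_payload("smoke-test line", {"job": "smoke-test"}),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            # Loki answers 204 No Content on a successful push.
            print("push status:", resp.status)
    except OSError as e:
        print("Loki not reachable (is the port-forward running?):", e)
```

After a successful push, `{job="smoke-test"}` in Grafana Explore should return the line.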
## 🧪 Post-install smoke tests

### Grafana log-in + datasource checks

- Open Grafana (Ingress URL, NodePort, or `kubectl port-forward`).
- Log in with the admin credentials.
- Go to Connections → Data sources and confirm each one is reachable:
  - Prometheus (metrics stack) ✅
  - Loki (logging stack) ✅
  - Tempo (tracing stack) ✅
- In the Explore tab, run a sample query for each datasource.
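The same checks can be scripted against Grafana's HTTP API (`/api/health` and `/api/datasources` are standard Grafana endpoints). A sketch assuming Grafana is reachable on `localhost:3000` with the admin credentials from the install:

```python
import base64
import json
import urllib.request

def basic_auth_header(user: str, password: str) -> str:
    """HTTP Basic auth header value for the Grafana API."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

def grafana_get(base: str, path: str, user: str, password: str):
    req = urllib.request.Request(
        f"{base}{path}", headers={"Authorization": basic_auth_header(user, password)}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)

if __name__ == "__main__":
    try:
        # /api/health reports {"database": "ok", ...} on a healthy instance.
        print(grafana_get("http://localhost:3000", "/api/health", "admin", "admin"))
        # Configured datasources (Prometheus / Loki / Tempo should appear).
        for ds in grafana_get("http://localhost:3000", "/api/datasources", "admin", "admin"):
            print(ds["name"], ds["type"])
    except OSError as e:
        print("Grafana not reachable (is the port-forward running?):", e)
```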
### End-to-end trace + log correlation (optional)

If you installed both tracing and logging:

- Emit a test span with your `aios_tracing` SDK.
- Log a line with the `trace_id` injected (via logging correlation).
- In Grafana Explore, either:
  - start from a trace in Tempo → jump to logs (trace-to-logs), or
  - start from logs → jump to the trace (logs-to-trace).
## 🧰 Troubleshooting

| Symptom | Likely Cause | Fix |
|---|---|---|
| `helm upgrade --install` hangs | Webhooks / CRDs not ready | `kubectl get events -A`; retry with `--wait` or check CRDs |
| Grafana 502 via Ingress | Wrong host/TLS or service type | Verify `--grafana-ingress-host`; check Ingress controller logs |
| Loki returns 500 on queries | Schema/object store mismatch | Ensure `schema_config` matches the objstore choice; check the compactor |
| Promtail pods CrashLoopBackOff | HostPath or RBAC issues | Check DaemonSet events; disable `--promtail-host-logs` to isolate |
| Tempo "no blocks" | Object store creds or retention too short | Verify `--tempo-s3-*` flags; check compactor logs |
| Prometheus out of space | PVC too small | Increase `--pvc-prom-size` and adjust `--prom-retention` accordingly |

Log digging tips:

```bash
kubectl -n observability logs deploy/kps-grafana --tail=200
kubectl -n logging logs deploy/loki-distributed-distributor --tail=200
kubectl -n tracing logs deploy/tempo-distributed-gateway --tail=200
```
## 🔐 Security & hardening tips

- **Credentials:** Avoid passing secrets via flags in CI logs. Prefer Kubernetes Secrets and reference them in values.
- **Network:** Use Ingress with TLS and NetworkPolicies to restrict access to Grafana and the data services.
- **RBAC:** Restrict Promtail/OTel permissions to only what they need.
- **Multitenancy:** Use Prometheus tenants or separate namespaces/releases when isolating teams.
- **Retention:** Set realistic `--retention-hours` values and object-store lifecycle rules to control costs.
## 🧹 Teardown & cleanup

```bash
# Uninstall stacks
k8s/deploy_observability_stack.sh --action uninstall --namespace observability
k8s/deploy_tracing.sh --action uninstall --namespace tracing
k8s/deploy_logging.sh --action uninstall --namespace logging

# (Optional) delete namespaces and PVs; irreversible if reclaim policy is Delete
kubectl delete ns observability tracing logging
```
## 💡 Operator tips

- **StorageClass:** When in doubt, set `--storage-class <class>` explicitly to avoid Pending PVCs.
- **S3/MinIO:** For development, MinIO endpoints often require `--s3-insecure true` and `s3forcepathstyle`.
- **Scaling:** Use the distributed modes for Loki/Tempo when you need HA and horizontal scaling.
- **Dashboards:** Pre-provision Grafana dashboards and alerting rules through Helm values or ConfigMaps.
- **CI/CD:** Parameterize these scripts in your pipeline (env vars → flags) for consistent cluster bring-up.