# 🚀 Deploying the Observability Stack

This section covers end-to-end deployment of the AIOS Observability Stack on Kubernetes using the helper scripts in `k8s/`. You'll get quickstarts, full flag tables, and verification steps for:

- Metrics stack (Prometheus, Grafana, optional Loki/Tempo/Thanos) → `k8s/deploy_observability_stack.sh`
- Tracing stack (Tempo + OpenTelemetry Collector + optional Grafana) → `k8s/deploy_tracing.sh`
- Logging stack (Loki + Promtail + optional Grafana) → `k8s/deploy_logging.sh`
## ✅ Prerequisites

- A working Kubernetes cluster (v1.24+ recommended)
- `kubectl` and `helm` installed and pointing at your cluster
- (Optional) S3/MinIO credentials if you plan to use object storage
- A suitable storage class available for PVCs (or set the `--storage-class` flags)

> Tip: Point at a specific kubeconfig with `KUBECONFIG=/path/to/kubeconfig`, or add `--kubeconfig` to your `kubectl`/`helm` invocations as needed.
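Before running any of the deploy scripts, it can be handy to confirm the two required CLIs are actually on `PATH`. A minimal sketch (the tool list is just the two CLIs above; the version subcommands are standard `kubectl`/`helm` flags):

```python
import shutil
import subprocess

def missing_tools(tools=("kubectl", "helm")):
    """Return the subset of required CLIs that are not on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print(f"Missing required tools: {', '.join(missing)}")
    else:
        # Both CLIs found; print versions for the record.
        subprocess.run(["kubectl", "version", "--client"], check=False)
        subprocess.run(["helm", "version", "--short"], check=False)
```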
## 🧭 What gets installed (at a glance)

| Stack | Components (typical) | Namespace |
|---|---|---|
| Metrics | Prometheus, Alertmanager, Grafana, kube-state-metrics, node-exporter, optional Loki, optional Tempo, optional Thanos Ruler | `observability` |
| Tracing | Grafana Tempo (single/distributed), OpenTelemetry Collector (agent + gateway), optional Grafana | `tracing` |
| Logging | Loki (single/distributed), Promtail, optional Grafana | `logging` |
## ⚡️ TL;DR — One-liners

**Metrics** (Prometheus + Grafana; no Loki/Tempo):

```bash
k8s/deploy_observability_stack.sh \
  --action install \
  --namespace observability \
  --with-loki false \
  --with-tempo false
```

**Tracing** (Tempo distributed + OTel Collector + Grafana via NodePort):

```bash
k8s/deploy_tracing.sh \
  --action install \
  --namespace tracing \
  --with-grafana true \
  --grafana-enable-nodeport true --grafana-nodeport 32010 \
  --tempo-objstore filesystem
```

**Logging** (Loki single + Promtail + Grafana via NodePort):

```bash
k8s/deploy_logging.sh \
  --action install \
  --mode single \
  --with-promtail true \
  --with-grafana true \
  --persistence false \
  --grafana-enable-nodeport true --grafana-nodeport 32199
```
## 📦 Metrics Stack — deploy_observability_stack.sh

### Common examples

**Full stack with Loki + Thanos Ruler, Grafana via Ingress:**

```bash
k8s/deploy_observability_stack.sh \
  --action install \
  --namespace observability \
  --with-loki true \
  --with-tempo false \
  --with-thanos-ruler true \
  --thanos-s3-bucket aios-metrics \
  --thanos-s3-endpoint minio.minio.svc.cluster.local:9000 \
  --thanos-s3-access-key minioadmin \
  --thanos-s3-secret-key minioadmin \
  --thanos-s3-insecure true \
  --storage-class rook-ceph-block \
  --grafana-enable-ingress true --grafana-ingress-host grafana.example.com
```

**Minimal metrics only (Prometheus + Grafana via port-forward):**

```bash
k8s/deploy_observability_stack.sh \
  --action install \
  --with-loki false \
  --with-tempo false \
  --grafana-enable-nodeport false
```
### Flag reference

| Flag | Description | Default |
|---|---|---|
| `--action` | `install` \| `uninstall` \| `status` \| `port-forward` | `install` |
| `--namespace` | K8s namespace to install into | `observability` |
| `--with-loki` | Add Loki + Promtail alongside metrics | `true` |
| `--with-tempo` | Add Tempo (tracing) via Grafana chart (basic) | `false` |
| `--storage-class` | Storage class for all PVCs (if set) | `""` |
| `--prom-retention` | Prometheus retention window | `15d` |
| `--pvc-prom-size` | Prometheus PVC size | `20Gi` |
| `--pvc-grafana-size` | Grafana PVC size | `10Gi` |
| `--grafana-admin-user` | Grafana admin username | `admin` |
| `--grafana-admin-password` | Grafana admin password (auto-generated if empty) | `""` |
| `--grafana-enable-ingress` | Enable Grafana Ingress | `false` |
| `--grafana-ingress-host` | Host for Grafana Ingress | `""` |
| `--grafana-enable-nodeport` | Expose Grafana as NodePort | `false` |
| `--grafana-nodeport` | NodePort for Grafana | `32000` |
| `--prom-req/lim-*` | Prometheus CPU/memory requests/limits | see script defaults |
### Thanos Ruler & Sidecar

| Flag | Description | Default |
|---|---|---|
| `--with-thanos-ruler` | Enable Thanos Ruler and Prometheus sidecar | `false` |
| `--thanos-ruler-replicas` | Ruler replicas | `1` |
| `--pvc-thanos-ruler-size` | Ruler PVC size | `10Gi` |
| `--thanos-objstore-type` | Object store type (`s3`) | `s3` |
| `--thanos-s3-bucket` | S3 bucket for Thanos | required if enabled |
| `--thanos-s3-endpoint` | S3/MinIO endpoint | required if enabled |
| `--thanos-s3-access-key` | Access key | required if enabled |
| `--thanos-s3-secret-key` | Secret key | required if enabled |
| `--thanos-s3-insecure` | Allow HTTP/self-signed | `false` |
| `--thanos-s3-prefix` | Optional bucket prefix | `""` |
### Verify

```bash
kubectl -n observability get pods
kubectl -n observability get svc

# Grafana (port-forward if no Ingress/NodePort):
kubectl -n observability port-forward svc/kps-grafana 3000:80
open http://localhost:3000
```
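Beyond checking pods, you can confirm Prometheus is actually scraping by hitting its standard HTTP API. A minimal sketch, assuming you have port-forwarded the Prometheus service to `localhost:9090` (the exact service name depends on the Helm release, so check `kubectl -n observability get svc` first):

```python
import json
import urllib.parse
import urllib.request

def prometheus_query_url(base: str, promql: str) -> str:
    """Build an instant-query URL for the standard Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"

if __name__ == "__main__":
    # Assumes: kubectl -n observability port-forward svc/<prometheus-svc> 9090:9090
    url = prometheus_query_url("http://localhost:9090", "up")
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            body = json.load(resp)
        # A healthy stack answers {"status": "success", ...} with one 'up' sample per target.
        print(body["status"], "targets:", len(body["data"]["result"]))
    except OSError as e:
        print("Prometheus not reachable (is the port-forward running?):", e)
```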
## 🔍 Tracing Stack — deploy_tracing.sh

### Common examples

**Distributed Tempo + OTel Collector + Grafana via NodePort:**

```bash
k8s/deploy_tracing.sh \
  --action install \
  --namespace tracing \
  --with-grafana true \
  --grafana-enable-nodeport true --grafana-nodeport 32010 \
  --tempo-objstore filesystem
```

**Tempo single-binary + OTel Collector (dev):**

```bash
k8s/deploy_tracing.sh \
  --action install \
  --tempo-mode single \
  --tempo-objstore filesystem \
  --with-grafana false
```

**Tempo distributed with MinIO:**

```bash
k8s/deploy_tracing.sh \
  --action install \
  --namespace tracing \
  --tempo-objstore s3 \
  --tempo-s3-bucket traces \
  --tempo-s3-endpoint minio.minio.svc.cluster.local:9000 \
  --tempo-s3-access-key minioadmin \
  --tempo-s3-secret-key minioadmin \
  --tempo-s3-insecure true \
  --with-grafana true
```
### Flag reference

| Flag | Description | Default |
|---|---|---|
| `--action` | `install` \| `uninstall` \| `status` \| `port-forward` | `install` |
| `--namespace` | Target namespace | `tracing` |
| `--with-tempo` | Deploy Tempo | `true` |
| `--tempo-mode` | `distributed` \| `single` | `distributed` |
| `--tempo-retention` | Retention window (`48h`, etc.) | `48h` |
| `--tempo-objstore` | `s3` \| `gcs` \| `azure` \| `filesystem` | `s3` |
| `--tempo-s3-*` | S3/MinIO connection (bucket/endpoint/keys/insecure/prefix) | — |
| `--pvc-tempo-*` | PVC sizes for WAL/ingester/storegw/compactor/querier | see script |
| `--with-otel` | Deploy OTel Collector (agent + gateway) | `true` |
| `--otel-sampling-ratio` | 0.0–1.0 probability sampling | `1.0` |
| `--otel-gateway-replicas` | OTel gateway replicas | `2` |
| `--otel-enable-logs-pipeline` | Also forward OTLP logs | `false` |
| `--with-grafana` | Deploy Grafana + Tempo datasource | `false` |
| `--grafana-*` | Admin / ingress / nodeport / PVC options | see script |
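`--otel-sampling-ratio` configures probabilistic head sampling: the keep/drop verdict is conceptually a deterministic function of the trace ID compared against the ratio, so every component that sees the same trace makes the same decision and sampled traces stay complete. A minimal sketch of the idea (illustrative only, not the Collector's exact hash function):

```python
import hashlib

def keep_trace(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head-sampling decision: hash the trace ID into [0, 1)
    and keep the trace if it falls below the configured ratio."""
    digest = hashlib.sha256(trace_id_hex.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < ratio

# ratio 1.0 keeps everything; ratio 0.0 drops everything
print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 1.0))
```

Because the decision depends only on the trace ID, a span and all of its children are sampled (or dropped) together, which is why the flag applies at the Collector rather than per service.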
### Verify

```bash
kubectl -n tracing get pods

# Port-forward Tempo query-frontend:
kubectl -n tracing port-forward svc/tempo-distributed-query-frontend 3100:3100

# Grafana (if enabled):
kubectl -n tracing port-forward svc/grafana 3000:80
```

### Send a test trace (Python)

```python
# Point at the OTel Collector service in the 'tracing' namespace.
from aios_tracing.tracing import TracingSDK

sdk = TracingSDK(service_name="smoke-test", otlp_endpoint="otel-collector.tracing.svc:4317")

@sdk.trace()
def hello():
    return "world"

hello()
```
## 🪵 Logging Stack — deploy_logging.sh

**Loki-distributed (HA) + Promtail + Grafana via Ingress:**

```bash
k8s/deploy_logging.sh \
  --action install \
  --mode distributed \
  --with-promtail true \
  --with-grafana true \
  --grafana-enable-ingress true \
  --grafana-ingress-host grafana.example.com \
  --objstore s3 \
  --s3-bucket logs \
  --s3-endpoint s3.amazonaws.com \
  --s3-access-key AKIA... \
  --s3-secret-key ... \
  --retention-hours 168
```
### Flag reference

| Flag | Description | Default |
|---|---|---|
| `--action` | `install` \| `uninstall` \| `status` \| `port-forward` | `install` |
| `--namespace` | Namespace to deploy to | `logging` |
| `--release` | Helm release prefix/name | `loki` |
| `--mode` | `single` (loki-stack) \| `distributed` (loki-distributed) | `single` |
| `--with-promtail` | Deploy Promtail DaemonSet | `true` |
| `--with-grafana` | Deploy Grafana | `true` |
| `--persistence` | Enable Loki PVCs (single mode) | `false` |
| `--pvc-size` | Loki PVC size (single mode) | `20Gi` |
| `--storage-class` | PVC `storageClassName` (all PVCs) | `""` |
| `--objstore` | `filesystem` \| `s3` (for chunks/index) | `filesystem` |
| `--s3-bucket` | S3/MinIO bucket | — |
| `--s3-endpoint` | S3 endpoint or `host:port` for MinIO | — |
| `--s3-access-key` | S3 access key | — |
| `--s3-secret-key` | S3 secret key | — |
| `--s3-insecure` | Allow HTTP/self-signed (MinIO) | `false` |
| `--s3-prefix` | Optional key prefix | `""` |
| `--retention-hours` | Log retention (compactor/limits) | `168` |
| `--promtail-extra-labels` | Extra labels `k=v,k2=v2` | `""` |
| `--promtail-namespaces` | `all` or comma list (`ns1,ns2`) | `all` |
| `--promtail-host-logs` | Collect `/var/log/*log` | `true` |
| `--grafana-admin-user` | Grafana admin user | `admin` |
| `--grafana-admin-password` | Grafana admin password | `admin` |
| `--grafana-enable-ingress` | Enable Ingress for Grafana | `false` |
| `--grafana-ingress-host` | Ingress host | `""` |
| `--grafana-enable-nodeport` | Expose Grafana via NodePort | `true` |
| `--grafana-nodeport` | Grafana NodePort | `32199` |
| `--grafana-pvc-size` | Grafana PVC size | `10Gi` |
| `--pvc-ingester-size` | (distributed) Ingester PVC | `20Gi` |
| `--pvc-storegw-size` | (distributed) Store-Gateway PVC | `20Gi` |
| `--pvc-compactor-size` | (distributed) Compactor PVC | `10Gi` |
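For reference, `--promtail-extra-labels` takes a comma-separated `k=v` list that becomes static labels on every scraped stream. A sketch of how such a string decomposes (illustrative parsing, not the script's exact code):

```python
def parse_extra_labels(spec: str) -> dict:
    """Split 'k=v,k2=v2' into a label dict; an empty spec means no extra labels."""
    labels = {}
    for pair in filter(None, spec.split(",")):
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"invalid label pair: {pair!r}")
        labels[key.strip()] = value.strip()
    return labels

print(parse_extra_labels("team=ai,env=dev"))  # {'team': 'ai', 'env': 'dev'}
```

Keep the label set small and low-cardinality: every distinct label combination creates a separate Loki stream.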
### Verify

```bash
# Pods & services
kubectl -n logging get pods
kubectl -n logging get svc

# Loki endpoint:
# - single: svc/loki (port 3100)
# - distributed: svc/loki-query-frontend (port 3100)
kubectl -n logging port-forward svc/loki-query-frontend 3100:3100  # if distributed

# Grafana (NodePort or port-forward)
kubectl -n logging port-forward svc/loki-grafana 3000:80
```

### Promtail quick sanity checks

```bash
# Ensure Promtail is scraping pods
kubectl -n logging get ds -l app=promtail
kubectl -n logging logs ds/loki-promtail -c promtail --tail=200
```

Query in Grafana Explore with the "Loki" datasource:

```logql
{job="kubernetes-pods"} |= "ERROR"
```
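You can also push a synthetic log line straight to Loki and query it back, which exercises the ingest path without waiting for Promtail. A minimal sketch, assuming Loki is port-forwarded to `localhost:3100`; the `/loki/api/v1/push` endpoint and its streams/values payload shape are Loki's standard push API:

```python
import json
import time
import urllib.request

def loki_push_payload(line: str, labels: dict) -> bytes:
    """Build a Loki push-API body: one stream, one entry, ns-precision timestamp."""
    entry = [str(time.time_ns()), line]
    return json.dumps({"streams": [{"stream": labels, "values": [entry]}]}).encode()

if __name__ == "__main__":
    # Assumes: kubectl -n logging port-forward svc/loki 3100:3100 (single mode)
    req = urllib.request.Request(
        "http://localhost:3100/loki/api/v1/push",
        data=loki_push_payload("smoke-test line", {"job": "smoke-test"}),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            # Loki answers 204 No Content on a successful push.
            print("push status:", resp.status)
    except OSError as e:
        print("Loki not reachable (is the port-forward running?):", e)
```

After a successful push, `{job="smoke-test"}` in Grafana Explore should return the line.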
## 🧪 Post-install smoke tests

### Grafana log-in + datasource checks

- Open Grafana (Ingress URL, NodePort, or `kubectl port-forward`).
- Log in with the admin credentials.
- Go to Connections → Data sources and confirm each one is reachable:
  - Prometheus (metrics stack) ✅
  - Loki (logging stack) ✅
  - Tempo (tracing stack) ✅
- In the Explore tab, run a sample query for each datasource.
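The same checks can be scripted against Grafana's HTTP API (`/api/health` and `/api/datasources` are standard Grafana endpoints). A sketch assuming Grafana is reachable on `localhost:3000` with the admin credentials from the install:

```python
import base64
import json
import urllib.request

def basic_auth_header(user: str, password: str) -> str:
    """HTTP Basic auth header value for the Grafana API."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

def grafana_get(base: str, path: str, user: str, password: str):
    req = urllib.request.Request(
        f"{base}{path}", headers={"Authorization": basic_auth_header(user, password)}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)

if __name__ == "__main__":
    try:
        # /api/health reports {"database": "ok", ...} on a healthy instance.
        print(grafana_get("http://localhost:3000", "/api/health", "admin", "admin"))
        # Configured datasources (Prometheus / Loki / Tempo should appear).
        for ds in grafana_get("http://localhost:3000", "/api/datasources", "admin", "admin"):
            print(ds["name"], ds["type"])
    except OSError as e:
        print("Grafana not reachable (is the port-forward running?):", e)
```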
### End-to-end trace + log correlation (optional)

If you installed both tracing and logging:

- Emit a test span with your `aios_tracing` SDK.
- Log a line with the `trace_id` injected (via logging correlation).
- In Grafana Explore, either:
  - start from a trace in Tempo → jump to logs (trace-to-logs), or
  - start from logs → jump to the trace (logs-to-trace).
## 🧰 Troubleshooting

| Symptom | Likely Cause | Fix |
|---|---|---|
| `helm upgrade --install` hangs | Webhooks / CRDs not ready | `kubectl get events -A`; retry with `--wait` or check CRDs |
| Grafana 502 via Ingress | Wrong host/TLS or service type | Verify `--grafana-ingress-host`; check Ingress controller logs |
| Loki returns 500 on queries | Schema/object store mismatch | Ensure `schema_config` matches the objstore choice; check the compactor |
| Promtail pods CrashLoopBackOff | HostPath or RBAC issues | Check DaemonSet events; disable `--promtail-host-logs` to isolate |
| Tempo "no blocks" | Object store creds or retention too short | Verify `--tempo-s3-*` flags; check compactor logs |
| Prometheus out of space | PVC too small | Increase `--pvc-prom-size` and adjust `--prom-retention` accordingly |

Log digging tips:

```bash
kubectl -n observability logs deploy/kps-grafana --tail=200
kubectl -n logging logs deploy/loki-distributed-distributor --tail=200
kubectl -n tracing logs deploy/tempo-distributed-gateway --tail=200
```
## 🔐 Security & hardening tips

- **Credentials:** Avoid passing secrets via flags in CI logs. Prefer Kubernetes Secrets and reference them in values.
- **Network:** Use Ingress with TLS and NetworkPolicies to restrict access to Grafana and the data services.
- **RBAC:** Restrict Promtail/OTel permissions to only what they need.
- **Multitenancy:** Use Prometheus tenants or separate namespaces/releases when isolating teams.
- **Retention:** Set realistic `--retention-hours` values and object-store lifecycle rules to control costs.
## 🧹 Teardown & cleanup

```bash
# Uninstall stacks
k8s/deploy_observability_stack.sh --action uninstall --namespace observability
k8s/deploy_tracing.sh --action uninstall --namespace tracing
k8s/deploy_logging.sh --action uninstall --namespace logging

# (Optional) delete namespaces and PVs; irreversible if reclaim policy is Delete
kubectl delete ns observability tracing logging
```
## 💡 Operator tips

- **StorageClass:** When in doubt, set `--storage-class <class>` explicitly to avoid Pending PVCs.
- **S3/MinIO:** For development, MinIO endpoints often require `--s3-insecure true` and `s3forcepathstyle`.
- **Scaling:** Use the distributed modes for Loki/Tempo when you need HA and horizontal scaling.
- **Dashboards:** Pre-provision Grafana dashboards and alerting rules through Helm values or ConfigMaps.
- **CI/CD:** Parameterize these scripts in your pipeline (env vars → flags) for consistent cluster bring-up.