Observability Setup for 01Agents

This guide covers the optional observability stack for monitoring, logging, and tracing 01Agents.

Components

The observability stack consists of:

  • Prometheus & Grafana: Metrics collection and visualization.
  • Loki: Log aggregation.
  • Tempo: Distributed tracing.
  • OpenTelemetry (OTEL): Standardized telemetry collection.

1. Prometheus with Grafana

helm upgrade --install prometheus oci://ghcr.io/prometheus-community/charts/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.service.type=NodePort \
  --set prometheus.service.nodePort=30090 \
  --set grafana.service.type=NodePort \
  --set grafana.service.nodePort=30080 \
  --set grafana.adminPassword=admin \
  --wait

Prometheus Scrape Config (Agent Metrics)

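To scrape metrics from the agents, define the scrape job in helm-chart/additional-scrape-configs.yaml and pass the file to the chart. The --reuse-values flag keeps the settings from the install command above:

helm upgrade --install prometheus oci://ghcr.io/prometheus-community/charts/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --reuse-values \
  --set-file prometheus.prometheusSpec.additionalScrapeConfigs=helm-chart/additional-scrape-configs.yaml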
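
This guide does not show the contents of helm-chart/additional-scrape-configs.yaml. The following is a minimal sketch of what it might look like, assuming the agent exposes Prometheus metrics at /metrics. The job name, service address, and port are hypothetical placeholders, not values taken from the 01Agents chart.

```shell
# Create the directory used throughout this guide if it does not exist yet.
mkdir -p helm-chart

# Write a single static scrape job. Everything below the job_name line is an
# assumption: replace the target with the agent's real Service name and port.
cat > helm-chart/additional-scrape-configs.yaml <<'EOF'
- job_name: l1-agent
  metrics_path: /metrics
  scrape_interval: 30s
  static_configs:
    - targets:
        - l1-agent.default.svc.cluster.local:8000
EOF
```

The file holds a plain list of Prometheus scrape configs, which is the shape the additionalScrapeConfigs value expects. After the Helm upgrade, the new job should show up under Status → Targets in the Prometheus UI.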

2. Loki (Log Aggregation)

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install loki grafana/loki \
  --namespace logging \
  --create-namespace \
  --set loki.auth_enabled=false \
  --set deploymentMode=SingleBinary \
  --set singleBinary.replicas=1 \
  --set loki.commonConfig.replication_factor=1 \
  --set loki.storage.type=filesystem \
  --set minio.enabled=false \
  --set backend.replicas=0 \
  --set read.replicas=0 \
  --set write.replicas=0 \
  --set loki.useTestSchema=true

3. Tempo (Distributed Tracing)

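Enable the metrics generator by creating helm-chart/tempo-values.yaml. The remote write URL points at the Prometheus service installed in section 1; Prometheus only accepts these writes when its remote write receiver is enabled (in kube-prometheus-stack, prometheus.prometheusSpec.enableRemoteWriteReceiver=true):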

tempo:
  metricsGenerator:
    enabled: true
    remoteWriteUrl: "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090/api/v1/write"
  storage:
    trace:
      backend: local

Install/Upgrade Tempo:

helm upgrade --install tempo grafana/tempo --namespace tracing -f helm-chart/tempo-values.yaml --create-namespace

4. OpenTelemetry Operator

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry \
  --create-namespace \
  --set manager.collectorImage.repository=otel/opentelemetry-collector-contrib \
  --set admissionWebhooks.certManager.enabled=false \
  --set admissionWebhooks.autoGenerateCert.enabled=true

5. OpenTelemetry Collector

Create helm-chart/otel-collector.yaml (see original README for content) and apply it:

kubectl apply -f helm-chart/otel-collector.yaml

6. Grafana Datasources

Add these in Grafana (Connections → Data Sources):

Datasource   URL
Prometheus   http://prometheus-kube-prometheus-prometheus.monitoring:9090/
Tempo        http://tempo.tracing.svc.cluster.local:3200
Loki         http://loki-gateway.logging.svc.cluster.local
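
As an alternative to adding the datasources by hand, the Grafana bundled with kube-prometheus-stack can usually load them from a labelled ConfigMap through its datasource sidecar. The sketch below assumes the sidecar is enabled and watches for the label grafana_datasource: "1" in the monitoring namespace. The file name and ConfigMap name are hypothetical. The Prometheus datasource is left out because the chart normally provisions it itself.

```shell
mkdir -p helm-chart

# ConfigMap carrying a Grafana datasource provisioning file.
# The URLs are the Tempo and Loki addresses listed in this section.
cat > helm-chart/grafana-datasources.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: agents-extra-datasources   # hypothetical name
  namespace: monitoring
  labels:
    grafana_datasource: "1"        # assumed sidecar label
data:
  agents-datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Tempo
        type: tempo
        access: proxy
        url: http://tempo.tracing.svc.cluster.local:3200
      - name: Loki
        type: loki
        access: proxy
        url: http://loki-gateway.logging.svc.cluster.local
EOF
```

Apply it with kubectl apply -f helm-chart/grafana-datasources.yaml. If the datasources do not appear, check the sidecar settings of the installed chart version; Grafana may also need a restart to pick them up.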

7. Import Dashboards

You can import the Level-1 Agent dashboard by following these steps:

  1. Open Grafana.
  2. Go to Dashboards → New → Import.
  3. Copy and paste the content of level-1-agent.json into the "Import via panel json" box, or upload the file.
  4. Click Load and then Import.

Agent Configuration for Observability

Telemetry features are disabled by default. Enable them only after the corresponding observability components (OTEL Collector, Tempo, Loki) have been deployed as described in sections 1-5 above; otherwise the agents export to endpoints that do not exist yet.

1. LangChain / LangSmith (via OTEL)

To enable tracing for LangChain/LangGraph workflows using the OTEL Collector:

  1. Ensure the OTEL Collector is running and reachable.
  2. Update values.yaml for the l1 and l2 agents:
env:
  LANGSMITH_TRACING: "true"
  LANGSMITH_OTEL_ENABLED: "true"
  LANGSMITH_OTEL_ONLY: "true"
  OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector-collector.opentelemetry.svc.cluster.local:4318
  OTEL_EXPORTER_OTLP_PROTOCOL: http/protobuf
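
Before enabling these variables, it can help to confirm that the collector accepts OTLP over HTTP on port 4318. The following is a minimal sketch, assuming the collector's OTLP HTTP receiver is enabled and reachable (for example through a port-forward to localhost:4318). The file name, function name, service name, and span name are hypothetical.

```shell
# Build a minimal OTLP/HTTP JSON payload with one span that lasts one second.
now="$(date +%s)"
cat > otlp-smoke-trace.json <<EOF
{
  "resourceSpans": [{
    "resource": {
      "attributes": [
        {"key": "service.name", "value": {"stringValue": "otel-smoke-test"}}
      ]
    },
    "scopeSpans": [{
      "scope": {"name": "manual-smoke-test"},
      "spans": [{
        "traceId": "5b8efff798038103d269b633813fc60c",
        "spanId": "eee19b7ec3c1b174",
        "name": "smoke-test-span",
        "kind": 1,
        "startTimeUnixNano": "${now}000000000",
        "endTimeUnixNano": "$((now + 1))000000000"
      }]
    }]
  }]
}
EOF

# Post the payload to the traces endpoint and print only the HTTP status code.
# OTLP_ENDPOINT defaults to a local port-forward; override it in-cluster.
send_smoke_trace() {
  curl -sS -o /dev/null -w '%{http_code}\n' \
    -X POST "${OTLP_ENDPOINT:-http://localhost:4318}/v1/traces" \
    -H 'Content-Type: application/json' \
    --data @otlp-smoke-trace.json
}
```

Call send_smoke_trace once the collector is reachable. A 200 status suggests the collector accepted the span, and searching Tempo for the service otel-smoke-test should then show it, provided the collector pipeline forwards traces to Tempo.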

2. Traceloop (OpenLLMetry)

To enable Traceloop for LLM instrumentation:

  1. Ensure your OTEL Collector is configured to receive Traceloop data.
  2. Update values.yaml for the agents:
env:
  TRACELOOP_ENABLED: "true"
  TRACELOOP_BASE_URL: http://otel-collector-collector.opentelemetry.svc.cluster.local:4318

Note

When LANGSMITH_OTEL_ENABLED is true, LangChain traces are sent to the OTEL Collector. Traceloop can be used on its own or together with the LangSmith OTEL export, depending on your requirements.

3. OTEL Logs (to Loki)

To send application logs to Loki via the OTEL Collector:

env:
  OTEL_ENABLED: "true"
  OTEL_SERVICE_NAME: l1-agent # or l2-agent
  OTEL_EXPORTER_OTLP_LOGS_ENDPOINT: http://otel-collector-collector.opentelemetry.svc.cluster.local:4318/v1/logs

Port Forwards

# OTEL Collector
kubectl port-forward svc/otel-collector-collector -n opentelemetry 4318:4318

# Grafana (Login: admin/admin)
kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80

# Prometheus
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090
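
Once the port-forwards are running, a quick probe can confirm that each service answers. This is a minimal sketch: check_endpoint is a hypothetical helper name, and the paths used are the standard health endpoints of Grafana (/api/health) and Prometheus (/-/ready).

```shell
# Probe one HTTP endpoint and report whether it answered with a success status.
check_endpoint() {
  name="$1"
  url="$2"
  if curl -fsS --max-time 5 -o /dev/null "$url"; then
    echo "$name: ok"
  else
    echo "$name: unreachable ($url)"
    return 1
  fi
}

# Run these after starting the Grafana and Prometheus port-forwards above:
# check_endpoint grafana    http://localhost:3000/api/health
# check_endpoint prometheus http://localhost:9090/-/ready
```

The function returns a non-zero status on failure, so it can also be used in a script that should stop when a component is not reachable.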