Observability Stack
OpenTelemetry tracing, Prometheus metrics, and structured logging across the entire platform.
The Problem
When an AI agent reasons through a multi-step task -- dispatching jobs, querying memory, calling extensions -- you need to understand what happened and why. Traditional application logging tells you what occurred in a single process, but an agent spans multiple pods, NATS queues, and external services. Without distributed tracing, debugging a bad response means guessing which component went wrong. Without metrics, you cannot tell if the system is healthy until a user complains.
How Baker Street Solves It
Baker Street ships an optional observability stack that deploys to a separate Kubernetes namespace. Every component is instrumented with OpenTelemetry, providing end-to-end visibility across the entire agent pipeline.
The stack includes:
- OpenTelemetry Collector -- receives OTLP telemetry from the Brain and Workers, fans out to storage backends.
- Tempo -- distributed trace storage. Every API response includes an X-Trace-Id header. Trace context propagates through NATS messages, so a single user request can be traced from Brain to NATS to Worker and back.
- Loki -- log aggregation with structured JSON logging. Filter by service, conversation ID, job ID, or any custom label.
- Prometheus -- metrics collection for request rates, latencies, job queue depth, memory store size, LLM token usage, and error rates. Supports remote-write to external Prometheus or Grafana Cloud.
- Grafana -- pre-built dashboards for agent health, job throughput, memory utilization, and LLM cost tracking.
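The NATS propagation described above works by carrying trace context in message headers, following the W3C Trace Context standard. A minimal stdlib-only sketch of that handoff (the traceparent format comes from the W3C spec; the message shape and field names here are illustrative assumptions, not Baker Street's actual wire format):

```python
import re
import secrets

def make_traceparent(sampled=True):
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 16-byte trace id, hex-encoded
    span_id = secrets.token_hex(8)    # 8-byte parent span id, hex-encoded
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Recover trace context from an incoming header; None if malformed."""
    m = TRACEPARENT_RE.match(header)
    return m.groupdict() if m else None

# Brain side: attach the trace context to an outgoing message's headers.
outgoing = {"headers": {"traceparent": make_traceparent()}, "payload": b"job"}

# Worker side: recover the same trace id, so the worker's spans join the
# trace that the Brain started.
ctx = parse_traceparent(outgoing["headers"]["traceparent"])
```

In practice the OpenTelemetry SDK's propagators do this injection and extraction automatically; the sketch only shows what travels over the wire.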
LLM calls are instrumented as OpenTelemetry spans with model name, role, iteration count, and token usage as attributes. Tool invocations appear as child spans. You can see exactly how the agent reasoned through a request: which tools it called, in what order, how long each step took, and what the token cost was.
Example
# k8s/observability/otel-collector.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: baker-street-observability
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
    exporters:
      otlp/tempo:
        endpoint: tempo.baker-street-observability:4317
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889
      loki:
        endpoint: http://loki.baker-street-observability:3100/loki/api/v1/push
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
        logs:
          receivers: [otlp]
          processors: [batch]
          exporters: [loki]
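The remote-write support mentioned earlier is configured on the Prometheus side rather than in the collector. A hedged sketch of a prometheus.yml fragment (the endpoint URL, username, and secret path are placeholders, not shipped defaults):

```yaml
# prometheus.yml fragment -- forward metrics to an external Prometheus
# or Grafana Cloud endpoint in addition to local storage.
remote_write:
  - url: https://prometheus.example.com/api/v1/write
    basic_auth:
      username: baker-street
      password_file: /etc/prometheus/secrets/remote-write-password
```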
Learn More
See the Observability documentation for deployment instructions, custom dashboard creation, and alerting configuration.