Observability Stack
OpenTelemetry tracing, Prometheus metrics, and structured logging across the entire platform.
The Problem
When an AI agent reasons through a multi-step task -- dispatching jobs, querying memory, calling extensions -- you need to understand what happened and why. Traditional application logging tells you what occurred in a single process, but an agent spans multiple pods, NATS queues, and external services. Without distributed tracing, debugging a bad response means guessing which component went wrong. Without metrics, you cannot tell if the system is healthy until a user complains.
How Baker Street Solves It
Baker Street ships an optional observability stack that deploys to a separate Kubernetes namespace. Every component is instrumented with OpenTelemetry, providing end-to-end visibility across the entire agent pipeline.
The stack includes:
- OpenTelemetry Collector -- receives OTLP telemetry from the Brain and Workers, fans out to storage backends.
- Tempo -- distributed trace storage. Every API response includes an X-Trace-Id header. Trace context propagates through NATS messages, so a single user request can be traced from Brain to NATS to Worker and back.
- Loki -- log aggregation with structured JSON logging. Filter by service, conversation ID, job ID, or any custom label.
- Prometheus -- metrics collection for request rates, latencies, job queue depth, memory store size, LLM token usage, and error rates. Supports remote-write to external Prometheus or Grafana Cloud.
- Grafana -- pre-built dashboards for agent health, job throughput, memory utilization, and LLM cost tracking.
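The NATS propagation described above works by carrying trace context in message headers, following the W3C Trace Context standard. A minimal stdlib-only sketch of that handoff (the traceparent format comes from the W3C spec; the message shape and field names here are illustrative assumptions, not Baker Street's actual wire format):

```python
import re
import secrets

def make_traceparent(sampled=True):
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 16-byte trace id, hex-encoded
    span_id = secrets.token_hex(8)    # 8-byte parent span id, hex-encoded
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Recover trace context from an incoming header; None if malformed."""
    m = TRACEPARENT_RE.match(header)
    return m.groupdict() if m else None

# Brain side: attach the trace context to an outgoing message's headers.
outgoing = {"headers": {"traceparent": make_traceparent()}, "payload": b"job"}

# Worker side: recover the same trace id, so the worker's spans join the
# trace that the Brain started.
ctx = parse_traceparent(outgoing["headers"]["traceparent"])
```

In practice the OpenTelemetry SDK's propagators do this injection and extraction automatically; the sketch only shows what travels over the wire.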
LLM calls are instrumented as OpenTelemetry spans with model name, role, iteration count, and token usage as attributes. Tool invocations appear as child spans. You can see exactly how the agent reasoned through a request: which tools it called, in what order, how long each step took, and what the token cost was.
Example
# k8s/observability/otel-collector.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: baker-street-observability
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
    exporters:
      otlp/tempo:
        endpoint: tempo.baker-street-observability:4317
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889
      loki:
        endpoint: http://loki.baker-street-observability:3100/loki/api/v1/push
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
        logs:
          receivers: [otlp]
          processors: [batch]
          exporters: [loki]
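The remote-write support mentioned earlier is configured on the Prometheus side rather than in the collector. A hedged sketch of a prometheus.yml fragment (the endpoint URL, username, and secret path are placeholders, not shipped defaults):

```yaml
# prometheus.yml fragment -- forward metrics to an external Prometheus
# or Grafana Cloud endpoint in addition to local storage.
remote_write:
  - url: https://prometheus.example.com/api/v1/write
    basic_auth:
      username: baker-street
      password_file: /etc/prometheus/secrets/remote-write-password
```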
Learn More
See the Observability documentation for deployment instructions, custom dashboard creation, and alerting configuration.