Guardrails

Configure AI guardrail middleware for tool call validation, approval workflows, and output sanitization.

The guardrail system is a composable middleware chain that intercepts every tool call before execution and sanitizes the tool's output before it is returned to the LLM. Each guardrail performs one specific check and can approve, reject, or escalate the call. Guardrails run in sequence, and a single rejection stops the chain, so the tool call never executes.
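
The sequential, stop-on-rejection behavior can be sketched as follows. This is a minimal illustration, not the @bakerst/core implementation: the local ToolCall, GuardrailResult, and Guardrail types and the runChain function are hypothetical, and the way an escalation is carried to the end of the chain is an assumption.

```typescript
// Hypothetical local types; the real ones live in @bakerst/core and may differ.
type ToolCall = { tool: string; args: Record<string, unknown> };
type GuardrailResult = { action: 'approve' | 'reject' | 'escalate'; reason?: string };

interface Guardrail {
  name: string;
  check(call: ToolCall): Promise<GuardrailResult>;
}

// Run guardrails in order. The first rejection ends the chain immediately.
// An escalation is remembered (assumption) so the caller can route the call
// to an approval step once every guardrail has run.
async function runChain(chain: Guardrail[], call: ToolCall): Promise<GuardrailResult> {
  let escalated = false;
  for (const guardrail of chain) {
    const result = await guardrail.check(call);
    if (result.action === 'reject') {
      const reason = `${guardrail.name}: ${result.reason ?? 'no reason given'}`;
      return { action: 'reject', reason };
    }
    if (result.action === 'escalate') {
      escalated = true;
    }
  }
  return { action: escalated ? 'escalate' : 'approve' };
}
```

Because the loop returns on the first rejection, guardrails later in the chain are never invoked for a call that an earlier guardrail has already rejected.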

Guardrail Chain

The default enterprise configuration includes these guardrails in order:

Prompt Injection Detection -- scans tool inputs for known injection patterns
Schema Validation -- validates tool call arguments against the tool's JSON Schema
Destructive Operation Gate -- flags operations that modify infrastructure (delete, restart, scale)
Human-in-the-Loop Approval -- routes flagged operations to a human for approval
Output Sanitization -- scrubs secrets and PII from tool outputs before returning to the LLM

Configuration

Guardrails are configured via a YAML file mounted as a ConfigMap:

# k8s/config/guardrails.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: guardrail-config
  namespace: baker-street
data:
  guardrails.yaml: |
    chain:
      - name: prompt-injection
        enabled: true
        config:
          sensitivity: high
          patterns:
            - "ignore previous instructions"
            - "you are now"
            - "system prompt override"

      - name: schema-validation
        enabled: true

      - name: destructive-gate
        enabled: true
        config:
          flagged_tools:
            - "command_execute"
            - "launch_task"
          flagged_patterns:
            - "delete"
            - "rm -rf"
            - "kubectl delete"
            - "DROP TABLE"
          action: require_approval

      - name: human-approval
        enabled: true
        config:
          timeout_seconds: 300
          notification_channel: "telegram"
          auto_deny_on_timeout: true

      - name: output-sanitization
        enabled: true
        config:
          patterns:
            - type: regex
              match: "sk-ant-[a-zA-Z0-9]{20,}"
              replace: "[REDACTED_API_KEY]"
            - type: regex
              match: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
              replace: "[REDACTED_EMAIL]"
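
To illustrate what the output-sanitization patterns above do, here is a minimal sketch of applying them to a tool's output. The SanitizePattern type and the sanitize function are hypothetical names for illustration; the actual guardrail may compile or apply its patterns differently.

```typescript
// Hypothetical shape mirroring the `patterns` entries in the YAML above.
type SanitizePattern = { type: 'regex'; match: string; replace: string };

const patterns: SanitizePattern[] = [
  { type: 'regex', match: 'sk-ant-[a-zA-Z0-9]{20,}', replace: '[REDACTED_API_KEY]' },
  {
    type: 'regex',
    match: '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}',
    replace: '[REDACTED_EMAIL]',
  },
];

// Apply each pattern to the whole output, in the order listed.
// A replacer function is used so that any "$" in the replacement text
// is inserted literally instead of being read as a capture-group reference.
function sanitize(output: string, rules: SanitizePattern[]): string {
  return rules.reduce(
    (text, rule) => text.replace(new RegExp(rule.match, 'g'), () => rule.replace),
    output,
  );
}
```

Each rule is applied globally, so every occurrence of a key or email address in the output is replaced, not only the first.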

Command Allowlisting

Workers enforce a strict allowlist for shell command execution. Only pre-approved binaries can run:

command_allowlist:
  - kubectl
  - helm
  - curl
  - jq
  - uptime
  - df
  - free
  - ps

Commands not on the allowlist are rejected before execution. Environment variables whose names match secret patterns such as *_KEY, *_TOKEN, or *_SECRET are stripped from the shell environment.
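
The two checks described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the worker's actual code: isAllowed and stripSecrets are hypothetical helpers, the binary is taken to be the first whitespace-separated token of the command, and secret variables are matched by name suffix only.

```typescript
const commandAllowlist = new Set(['kubectl', 'helm', 'curl', 'jq', 'uptime', 'df', 'free', 'ps']);
const secretSuffixes = ['_KEY', '_TOKEN', '_SECRET'];

// Allow a command only if its first token (the binary name) is on the allowlist.
function isAllowed(command: string): boolean {
  const [binary = ''] = command.trim().split(/\s+/);
  return commandAllowlist.has(binary);
}

// Return a copy of the environment without variables whose names end in
// a secret-looking suffix. Unset (undefined) entries are dropped as well.
function stripSecrets(env: Record<string, string | undefined>): Record<string, string> {
  const clean: Record<string, string> = {};
  for (const [name, value] of Object.entries(env)) {
    if (value === undefined) continue;
    if (secretSuffixes.some((suffix) => name.endsWith(suffix))) continue;
    clean[name] = value;
  }
  return clean;
}
```

Checking only the first token is not sufficient on its own: a string such as "kubectl get pods; rm -rf /" would pass this check if it were handed to a shell. A real implementation also has to account for shell metacharacters, pipes, and command substitution, or avoid the shell entirely by spawning the binary with an argument array.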

Approval Workflow

When a guardrail flags an operation for human approval:

The tool call is paused and written to a pending approvals queue
A notification is sent to the configured channel (Telegram, Discord, or Web UI)
The notification includes the tool name, arguments, and the conversation context
The human approves or denies within the configured timeout
On approval, the tool call proceeds. On denial, or on timeout when auto_deny_on_timeout is true, the call is rejected and a message explaining the rejection is returned to Claude

[Approval Request]
Tool: command_execute
Command: kubectl delete pod worker-abc123 -n baker-street
Context: User asked to restart the worker pod
Timeout: 5 minutes

[Approve] [Deny]
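
The timeout behavior in the workflow above can be sketched with a race between the human's decision and a timer. This is a minimal illustration assuming auto_deny_on_timeout is true; the Decision type and awaitApproval function are hypothetical, and the real system's queueing and notification delivery are not shown.

```typescript
type Decision = 'approved' | 'denied';

// Wait for a human decision, falling back to 'denied' once the timeout elapses.
// `decision` stands in for whatever resolves when someone clicks Approve or Deny.
function awaitApproval(decision: Promise<Decision>, timeoutMs: number): Promise<Decision> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<Decision>((resolve) => {
    timer = setTimeout(() => resolve('denied'), timeoutMs);
  });
  // Whichever promise settles first wins; the timer is always cleaned up
  // so that an early decision does not leave a pending timeout behind.
  return Promise.race([decision, timeout]).finally(() => clearTimeout(timer));
}
```

With timeout_seconds: 300 from the configuration, the call would be awaitApproval(decision, 300 * 1000).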

Custom Guardrails

Implement the GuardrailHook interface from @bakerst/core to create custom guardrails:

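import { GuardrailHook, ToolCall, GuardrailResult } from '@bakerst/core';

// Settings for this guardrail. Passing them through the constructor is an
// assumption; adapt this to however your deployment supplies guardrail config.
interface CostLimitConfig {
  maxDailyCost: number; // in dollars
}

export class CostLimitGuardrail implements GuardrailHook {
  name = 'cost-limit';

  // getDailyLLMCost is application-specific, so it is injected here
  // rather than imported from @bakerst/core.
  constructor(
    private readonly config: CostLimitConfig,
    private readonly getDailyLLMCost: () => Promise<number>,
  ) {}

  async check(call: ToolCall): Promise<GuardrailResult> {
    const dailyCost = await this.getDailyLLMCost();
    // Reject every tool call once the daily spend exceeds the limit.
    if (dailyCost > this.config.maxDailyCost) {
      return {
        action: 'reject',
        reason: `Daily LLM cost limit ($${this.config.maxDailyCost}) exceeded`,
      };
    }
    return { action: 'approve' };
  }
}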