
Outpost resilience and graceful degradation

The Outpost is designed to keep processing AI requests even when it cannot reach the Arbitex Platform management plane. This document describes the circuit breaker pattern, graceful degradation mode, and the audit queue that preserves audit integrity during outages.


The Outpost communicates with the Platform for two purposes:

  1. Policy sync — pulling updated policy bundles, routing rules, and DLP rule sets.
  2. Audit sync — pushing audit events to the Platform’s tamper-evident log.

When the Platform is unreachable, the Outpost:

  • Serves AI requests using the last cached policy bundle (in-memory).
  • Writes audit events to a local SQLite queue instead of dropping them.
  • Blocks new Platform calls through the circuit breaker to prevent cascading timeouts.
  • Flushes the audit queue automatically when connectivity is restored.
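The behaviour above can be sketched in a few lines. This is an illustrative model only, not the real Outpost internals; the class and attribute names (`OutpostSketch`, `local_audit_queue`, `cached_bundle`) are assumptions, and a plain list stands in for the SQLite queue described later:

```python
class OutpostSketch:
    """Illustrative request path during a Platform outage.
    Names here are hypothetical, not the real Outpost API."""

    def __init__(self):
        self.cached_bundle = {"version": "v41", "rules": []}  # last synced policy bundle
        self.platform_reachable = False                       # circuit breaker open
        self.local_audit_queue = []                           # stands in for the SQLite queue

    def handle_request(self, request):
        # Policy decisions always use the in-memory bundle, so the
        # proxy never blocks on a Platform round-trip.
        decision = self.evaluate(request)
        event = {"request": request, "decision": decision}
        if self.platform_reachable:
            self.push_to_platform(event)
        else:
            self.local_audit_queue.append(event)  # preserved locally, not dropped
        return decision

    def evaluate(self, request):
        return "allow"  # stand-in for evaluation against self.cached_bundle

    def push_to_platform(self, event):
        raise ConnectionError("unreachable in this sketch")

outpost = OutpostSketch()
print(outpost.handle_request({"model": "gpt-4"}))  # → allow
print(len(outpost.local_audit_queue))              # → 1
```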

The Outpost uses a three-state circuit breaker to prevent repeated failed calls to the Platform from stacking up timeouts.

```
              failure_threshold reached
   CLOSED ─────────────────────────────► OPEN
     ▲                                    │
     │                                    │ recovery_timeout_seconds elapsed
     │                                    ▼
     └────── probe succeeds ───────── HALF_OPEN
             (probe fails → OPEN, timer resets)
```
| State | Behaviour |
| --- | --- |
| CLOSED | Normal operation. All Platform calls pass through. Consecutive failures are counted. |
| OPEN | Circuit is tripped. All Platform calls are rejected immediately (no network I/O). The recovery timer runs. |
| HALF_OPEN | Recovery probe state. A limited number of test calls are allowed through. Success closes the circuit; failure reopens it and resets the timer. |

| Parameter | Default | Description |
| --- | --- | --- |
| failure_threshold | 5 | Consecutive failures before opening |
| recovery_timeout_seconds | 60 | Seconds in OPEN state before probing |
| half_open_max_calls | 1 | Probe calls allowed in HALF_OPEN |

On a successful probe call, the failure counter is reset and the circuit closes. On a probe failure, the circuit reopens and the recovery timer (recovery_timeout_seconds, default 60 seconds) restarts.
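A minimal sketch of the three-state breaker described above. This is illustrative, not the Outpost's actual implementation: the class and method names are assumptions, and half_open_max_calls is simplified to a single implicit probe.

```python
import time

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

class CircuitBreaker:
    """Sketch of the documented three-state breaker (names are illustrative)."""

    def __init__(self, failure_threshold=5, recovery_timeout_seconds=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout_seconds
        self.state = CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == OPEN:
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = HALF_OPEN  # recovery timer elapsed: allow a probe
            else:
                raise RuntimeError("circuit open: call rejected without I/O")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == HALF_OPEN:
            self.state = OPEN                 # probe failed: reopen, restart timer
            self.opened_at = time.monotonic()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = OPEN
            self.opened_at = time.monotonic()

    def _on_success(self):
        self.failures = 0                     # probe or normal call succeeded
        self.state = CLOSED

breaker = CircuitBreaker(failure_threshold=2, recovery_timeout_seconds=0.1)
def flaky(): raise ConnectionError("platform down")
for _ in range(2):
    try: breaker.call(flaky)
    except ConnectionError: pass
print(breaker.state)                              # → OPEN
time.sleep(0.15)
print(breaker.call(lambda: "ok"), breaker.state)  # → ok CLOSED
```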

The Outpost maintains a per-dependency registry of breakers. Each dependency (policy_sync, audit_sync, etc.) has its own isolated breaker so a Platform audit endpoint outage does not block policy syncs.

The overall degradation state is True when any named breaker is OPEN.
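The registry and the aggregate degradation flag amount to a small amount of bookkeeping. A hedged sketch (`BreakerRegistry` is an assumed name; a dict stands in for the breaker objects):

```python
class BreakerRegistry:
    """One breaker per Platform dependency, so an audit_sync outage
    does not trip policy_sync. Names here are illustrative."""

    def __init__(self):
        self.breakers = {}

    def get(self, name):
        # Lazily create an isolated breaker per named dependency.
        if name not in self.breakers:
            self.breakers[name] = {"state": "CLOSED"}  # stand-in for a breaker object
        return self.breakers[name]

    @property
    def degradation_mode(self):
        # Degraded if ANY named breaker is currently OPEN.
        return any(b["state"] == "OPEN" for b in self.breakers.values())

registry = BreakerRegistry()
registry.get("policy_sync")["state"] = "OPEN"
registry.get("audit_sync")  # stays CLOSED
print(registry.degradation_mode)  # → True
```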

The /admin/status endpoint (port 8301) reports circuit breaker state:

```json
{
  "status": "warn",
  "degradation_mode": true,
  "degraded_since": "2026-03-12T14:22:00Z",
  "checks": [
    { "name": "policy_sync", "status": "warn", "message": "circuit open" },
    { "name": "audit_sync", "status": "pass", "message": "ok" }
  ]
}
```

When at least one circuit breaker enters OPEN state, the Outpost enters degradation mode.

| Feature | Normal operation | Degradation mode |
| --- | --- | --- |
| Policy enforcement | Live policy bundle | Cached policy bundle (last successful sync) |
| Audit events | Sync to Platform in real time | Queued to local SQLite |
| Policy sync | Polled every POLICY_SYNC_INTERVAL_SECONDS | Blocked (circuit open) |
| AI proxy traffic | Continues normally | Continues normally |
| /readyz probe | 200 OK | 200 OK (Outpost is still ready) |

The Outpost intentionally keeps serving traffic in degradation mode. Policy decisions use the last cached bundle — a stale bundle is preferable to blocking all AI requests.

The cached policy bundle is held in memory. On startup the Outpost loads the bundle from a local cache file before attempting a live sync, so cold restarts during a Platform outage are handled gracefully.

The bundle includes a freshness guard: bundles older than 1 hour (BUNDLE_MAX_AGE_SECONDS = 3600) are rejected and the Outpost will not serve traffic until a fresh bundle is loaded.
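The freshness guard reduces to an age check at load time. A minimal sketch: BUNDLE_MAX_AGE_SECONDS is the documented constant, but the function name and the `synced_at` field are assumptions made for illustration.

```python
import time

BUNDLE_MAX_AGE_SECONDS = 3600  # 1 hour, as documented

def load_cached_bundle(bundle, now=None):
    """Reject cached bundles older than the freshness limit.
    `bundle` is assumed to carry a `synced_at` Unix timestamp."""
    now = time.time() if now is None else now
    age = now - bundle["synced_at"]
    if age > BUNDLE_MAX_AGE_SECONDS:
        raise ValueError(f"bundle too stale ({age:.0f}s > {BUNDLE_MAX_AGE_SECONDS}s)")
    return bundle

fresh = {"version": "v41", "synced_at": 1_000_000}
print(load_cached_bundle(fresh, now=1_000_000 + 600)["version"])  # → v41
try:
    load_cached_bundle(fresh, now=1_000_000 + 4000)
except ValueError:
    print("rejected")  # → rejected
```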

```
WARN  outpost.degradation — Outpost entering degradation mode — reason=circuit breaker opened degraded_since=1741789320.123
INFO  outpost.degradation — Outpost exiting degradation mode — platform reachable again (was degraded for 187.4s)
```

Monitor for "entering degradation mode" log lines in your SIEM or Grafana log dashboard.


When Platform audit sync fails, events are buffered to a local SQLite database. This prevents audit data loss during outages.

```sql
CREATE TABLE IF NOT EXISTS local_audit_queue (
  id          INTEGER PRIMARY KEY AUTOINCREMENT,
  event_json  TEXT NOT NULL,
  queued_at   REAL NOT NULL, -- Unix timestamp
  retry_count INTEGER NOT NULL DEFAULT 0
);
```

  • Enqueue: Each audit event is serialised to JSON and inserted with queued_at = now().
  • Dequeue: The sync worker fetches events oldest-first in batches of 50 when the circuit closes.
  • Mark synced: Rows are deleted after confirmed delivery to Platform.
  • Retry count: Incremented on each failed flush attempt. High retry counts indicate persistent sync issues.
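The queue operations above map onto a handful of SQL statements against the documented schema. A self-contained sketch (the Python function names `enqueue` and `flush_batch` are assumptions, not the Outpost's actual API):

```python
import json
import sqlite3
import time

db = sqlite3.connect(":memory:")  # in production: AUDIT_QUEUE_DB_PATH
db.execute("""
    CREATE TABLE IF NOT EXISTS local_audit_queue (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        event_json  TEXT NOT NULL,
        queued_at   REAL NOT NULL,
        retry_count INTEGER NOT NULL DEFAULT 0
    )
""")

def enqueue(event):
    # Serialise to JSON and stamp with the current Unix time.
    db.execute("INSERT INTO local_audit_queue (event_json, queued_at) VALUES (?, ?)",
               (json.dumps(event), time.time()))

def flush_batch(deliver, batch_size=50):
    """Oldest-first batch flush: delete on confirmed delivery,
    bump retry_count on failure."""
    rows = db.execute("SELECT id, event_json FROM local_audit_queue "
                      "ORDER BY queued_at LIMIT ?", (batch_size,)).fetchall()
    for row_id, event_json in rows:
        try:
            deliver(json.loads(event_json))
        except Exception:
            db.execute("UPDATE local_audit_queue SET retry_count = retry_count + 1 "
                       "WHERE id = ?", (row_id,))
        else:
            db.execute("DELETE FROM local_audit_queue WHERE id = ?", (row_id,))

for i in range(3):
    enqueue({"seq": i})
flush_batch(lambda e: None)  # pretend delivery to Platform succeeds
print(db.execute("SELECT COUNT(*) FROM local_audit_queue").fetchone()[0])  # → 0
```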

The SQLite database path is configured via AUDIT_QUEUE_DB_PATH (default: data/audit_queue.db). In Kubernetes deployments, this path should be on a persistent volume so the queue survives pod restarts.

```yaml
# Helm values: ensure audit queue survives pod restarts
outpost:
  persistence:
    enabled: true
    mountPath: /app/data
  env:
    AUDIT_QUEUE_DB_PATH: /app/data/audit_queue.db
```

The admin API (port 8301) exposes three endpoints for managing the audit queue. All require a valid Authorization: Bearer <admin-key> header.

GET /admin/audit-queue/status

Returns current queue statistics.

```sh
curl -s http://localhost:8301/admin/audit-queue/status \
  -H "Authorization: Bearer $OUTPOST_ADMIN_KEY"
```

Response:

```json
{
  "count": 142,
  "oldest_queued_at": "2026-03-12T14:22:00Z",
  "disk_usage_bytes": 49152,
  "degradation_mode": true
}
```

| Field | Description |
| --- | --- |
| count | Number of events currently queued |
| oldest_queued_at | ISO-8601 timestamp of the oldest unsynced event |
| disk_usage_bytes | Size of the SQLite database file |
| degradation_mode | Whether the Outpost is currently in degraded mode |
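When alerting on this endpoint, a useful derived number is queue lag: how long the oldest event has been waiting. A small example that parses the sample response (the `queue_lag_seconds` helper is hypothetical):

```python
import json
from datetime import datetime, timezone

status = json.loads("""
{
  "count": 142,
  "oldest_queued_at": "2026-03-12T14:22:00Z",
  "disk_usage_bytes": 49152,
  "degradation_mode": true
}
""")

def queue_lag_seconds(status, now):
    # fromisoformat() in older Pythons does not accept a trailing "Z".
    oldest = datetime.fromisoformat(status["oldest_queued_at"].replace("Z", "+00:00"))
    return (now - oldest).total_seconds()

now = datetime(2026, 3, 12, 14, 32, 0, tzinfo=timezone.utc)
print(queue_lag_seconds(status, now))  # → 600.0
```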

POST /admin/audit-queue/flush

Triggers an immediate flush of all queued events to Platform. Returns 409 if the queue is empty (not degraded).

```sh
curl -s -X POST http://localhost:8301/admin/audit-queue/flush \
  -H "Authorization: Bearer $OUTPOST_ADMIN_KEY"
```

Response (success):

```json
{
  "flushed": 142,
  "failed": 0,
  "duration_seconds": 1.24
}
```

Use this endpoint after confirming Platform connectivity is restored to accelerate queue draining without waiting for the next sync interval.

DELETE /admin/audit-queue/purge

⚠ Destructive. Permanently deletes all events from the local queue. Purged events are not sent to Platform and will be absent from the audit log.

Requires the X-Purge-Confirm: yes header to prevent accidental data loss.

```sh
curl -s -X DELETE http://localhost:8301/admin/audit-queue/purge \
  -H "Authorization: Bearer $OUTPOST_ADMIN_KEY" \
  -H "X-Purge-Confirm: yes"
```

Response:

```json
{ "purged": 142 }
```

Only purge if you have confirmed that the queued events are corrupt or irrelevant (e.g., after a test environment reset). For production incidents, always prefer flush over purge.


| Scenario | Expected duration | Recommended action |
| --- | --- | --- |
| Platform rolling restart | < 2 minutes | Monitor; auto-recovers |
| Platform maintenance window | Minutes–hours | Pre-notify Outpost operators; queue will flush on recovery |
| Network partition | Unknown | Monitor queue size; alert if queue > 10,000 events |
| Platform total outage | Hours+ | Escalate; consider purge only if events are non-compliant noise |

Set a Prometheus alert on:

```yaml
- alert: OutpostDegradedMode
  expr: outpost_platform_reachable == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Outpost {{ $labels.outpost_id }} is in degraded mode"

- alert: OutpostAuditQueueHigh
  expr: outpost_audit_queue_depth > 1000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Outpost audit queue depth is high ({{ $value }} events)"
```

Navigate to http://localhost:8301 in a browser (requires kubectl port-forward in Kubernetes). The Health panel shows:

  • Platform reachability status
  • Degradation mode indicator and duration
  • Audit queue depth with a flush button