
Outpost resilience and graceful degradation

The Outpost is designed to keep processing AI requests even when it cannot reach the Arbitex Platform management plane. This document describes the circuit breaker pattern, graceful degradation mode, and the audit queue that preserves audit integrity during outages.


The Outpost communicates with the Platform for two purposes:

  1. Policy sync — pulling updated policy bundles, routing rules, and DLP rule sets.
  2. Audit sync — pushing audit events to the Platform’s tamper-evident log.

When the Platform is unreachable, the Outpost:

  • Serves AI requests using the last cached policy bundle (in-memory).
  • Writes audit events to a local SQLite queue instead of dropping them.
  • Blocks new Platform calls through the circuit breaker to prevent cascading timeouts.
  • Flushes the audit queue automatically when connectivity is restored.
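The behaviour above can be sketched in a few lines. This is an illustrative model only, not the real Outpost internals; the class and attribute names (`OutpostSketch`, `local_audit_queue`, `cached_bundle`) are assumptions, and a plain list stands in for the SQLite queue described later:

```python
class OutpostSketch:
    """Illustrative request path during a Platform outage.
    Names here are hypothetical, not the real Outpost API."""

    def __init__(self):
        self.cached_bundle = {"version": "v41", "rules": []}  # last synced policy bundle
        self.platform_reachable = False                       # circuit breaker open
        self.local_audit_queue = []                           # stands in for the SQLite queue

    def handle_request(self, request):
        # Policy decisions always use the in-memory bundle, so the
        # proxy never blocks on a Platform round-trip.
        decision = self.evaluate(request)
        event = {"request": request, "decision": decision}
        if self.platform_reachable:
            self.push_to_platform(event)
        else:
            self.local_audit_queue.append(event)  # preserved locally, not dropped
        return decision

    def evaluate(self, request):
        return "allow"  # stand-in for evaluation against self.cached_bundle

    def push_to_platform(self, event):
        raise ConnectionError("unreachable in this sketch")

outpost = OutpostSketch()
print(outpost.handle_request({"model": "gpt-4"}))  # → allow
print(len(outpost.local_audit_queue))              # → 1
```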

The Outpost uses a three-state circuit breaker to prevent repeated failed calls to the Platform from stacking up timeouts.

```
              failure_threshold reached
   CLOSED ─────────────────────────────► OPEN
     ▲                                    │
     │                                    │ recovery_timeout_seconds elapsed
     │                                    ▼
     └────── probe succeeds ───────── HALF_OPEN
             (probe fails → OPEN, timer resets)
```
| State | Behaviour |
| --- | --- |
| CLOSED | Normal operation. All Platform calls pass through. Consecutive failures are counted. |
| OPEN | Circuit is tripped. All Platform calls are rejected immediately (no network I/O). The recovery timer runs. |
| HALF_OPEN | Recovery probe state. A limited number of test calls are allowed through. Success closes the circuit; failure reopens it and resets the timer. |

| Parameter | Default | Description |
| --- | --- | --- |
| failure_threshold | 5 | Consecutive failures before opening |
| recovery_timeout_seconds | 60 | Seconds in OPEN state before probing |
| half_open_max_calls | 1 | Probe calls allowed in HALF_OPEN |

On a successful probe call, the failure counter is reset and the circuit closes. On a probe failure, the circuit reopens and the recovery timer (recovery_timeout_seconds, default 60 seconds) restarts.
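A minimal sketch of the three-state breaker described above. This is illustrative, not the Outpost's actual implementation: the class and method names are assumptions, and half_open_max_calls is simplified to a single implicit probe.

```python
import time

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

class CircuitBreaker:
    """Sketch of the documented three-state breaker (names are illustrative)."""

    def __init__(self, failure_threshold=5, recovery_timeout_seconds=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout_seconds
        self.state = CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == OPEN:
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = HALF_OPEN  # recovery timer elapsed: allow a probe
            else:
                raise RuntimeError("circuit open: call rejected without I/O")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == HALF_OPEN:
            self.state = OPEN                 # probe failed: reopen, restart timer
            self.opened_at = time.monotonic()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = OPEN
            self.opened_at = time.monotonic()

    def _on_success(self):
        self.failures = 0                     # probe or normal call succeeded
        self.state = CLOSED

breaker = CircuitBreaker(failure_threshold=2, recovery_timeout_seconds=0.1)
def flaky(): raise ConnectionError("platform down")
for _ in range(2):
    try: breaker.call(flaky)
    except ConnectionError: pass
print(breaker.state)                              # → OPEN
time.sleep(0.15)
print(breaker.call(lambda: "ok"), breaker.state)  # → ok CLOSED
```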

The Outpost maintains a per-dependency registry of breakers. Each dependency (policy_sync, audit_sync, etc.) has its own isolated breaker so a Platform audit endpoint outage does not block policy syncs.

The overall degradation state is True when any named breaker is OPEN.
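The registry and the aggregate degradation flag amount to a small amount of bookkeeping. A hedged sketch (`BreakerRegistry` is an assumed name; a dict stands in for the breaker objects):

```python
class BreakerRegistry:
    """One breaker per Platform dependency, so an audit_sync outage
    does not trip policy_sync. Names here are illustrative."""

    def __init__(self):
        self.breakers = {}

    def get(self, name):
        # Lazily create an isolated breaker per named dependency.
        if name not in self.breakers:
            self.breakers[name] = {"state": "CLOSED"}  # stand-in for a breaker object
        return self.breakers[name]

    @property
    def degradation_mode(self):
        # Degraded if ANY named breaker is currently OPEN.
        return any(b["state"] == "OPEN" for b in self.breakers.values())

registry = BreakerRegistry()
registry.get("policy_sync")["state"] = "OPEN"
registry.get("audit_sync")  # stays CLOSED
print(registry.degradation_mode)  # → True
```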

The /admin/status endpoint (port 8301) reports circuit breaker state:

```json
{
  "status": "warn",
  "degradation_mode": true,
  "degraded_since": "2026-03-12T14:22:00Z",
  "checks": [
    { "name": "policy_sync", "status": "warn", "message": "circuit open" },
    { "name": "audit_sync", "status": "pass", "message": "ok" }
  ]
}
```

When at least one circuit breaker enters OPEN state, the Outpost enters degradation mode.

| Feature | Normal operation | Degradation mode |
| --- | --- | --- |
| Policy enforcement | Live policy bundle | Cached policy bundle (last successful sync) |
| Audit events | Sync to Platform in real time | Queued to local SQLite |
| Policy sync | Polled every POLICY_SYNC_INTERVAL_SECONDS | Blocked (circuit open) |
| AI proxy traffic | Continues normally | Continues normally |
| /readyz probe | 200 OK | 200 OK (Outpost is still ready) |

The Outpost intentionally keeps serving traffic in degradation mode. Policy decisions use the last cached bundle — a stale bundle is preferable to blocking all AI requests.

The cached policy bundle is held in memory. On startup the Outpost loads the bundle from a local cache file before attempting a live sync, so cold restarts during a Platform outage are handled gracefully.

The bundle includes a freshness guard: bundles older than 1 hour (BUNDLE_MAX_AGE_SECONDS = 3600) are rejected and the Outpost will not serve traffic until a fresh bundle is loaded.
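The freshness guard reduces to an age check at load time. A minimal sketch: BUNDLE_MAX_AGE_SECONDS is the documented constant, but the function name and the `synced_at` field are assumptions made for illustration.

```python
import time

BUNDLE_MAX_AGE_SECONDS = 3600  # 1 hour, as documented

def load_cached_bundle(bundle, now=None):
    """Reject cached bundles older than the freshness limit.
    `bundle` is assumed to carry a `synced_at` Unix timestamp."""
    now = time.time() if now is None else now
    age = now - bundle["synced_at"]
    if age > BUNDLE_MAX_AGE_SECONDS:
        raise ValueError(f"bundle too stale ({age:.0f}s > {BUNDLE_MAX_AGE_SECONDS}s)")
    return bundle

fresh = {"version": "v41", "synced_at": 1_000_000}
print(load_cached_bundle(fresh, now=1_000_000 + 600)["version"])  # → v41
try:
    load_cached_bundle(fresh, now=1_000_000 + 4000)
except ValueError:
    print("rejected")  # → rejected
```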

```
WARN  outpost.degradation — Outpost entering degradation mode — reason=circuit breaker opened degraded_since=1741789320.123
INFO  outpost.degradation — Outpost exiting degradation mode — platform reachable again (was degraded for 187.4s)
```

Monitor for "entering degradation mode" log lines in your SIEM or Grafana log dashboard.


When Platform audit sync fails, events are buffered to a local SQLite database. This prevents audit data loss during outages.

```sql
CREATE TABLE IF NOT EXISTS local_audit_queue (
  id          INTEGER PRIMARY KEY AUTOINCREMENT,
  event_json  TEXT NOT NULL,
  queued_at   REAL NOT NULL, -- Unix timestamp
  retry_count INTEGER NOT NULL DEFAULT 0
);
```

  • Enqueue: Each audit event is serialised to JSON and inserted with queued_at = now().
  • Dequeue: The sync worker fetches events oldest-first in batches of 50 when the circuit closes.
  • Mark synced: Rows are deleted after confirmed delivery to Platform.
  • Retry count: Incremented on each failed flush attempt. High retry counts indicate persistent sync issues.
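The queue operations above map onto a handful of SQL statements against the documented schema. A self-contained sketch (the Python function names `enqueue` and `flush_batch` are assumptions, not the Outpost's actual API):

```python
import json
import sqlite3
import time

db = sqlite3.connect(":memory:")  # in production: AUDIT_QUEUE_DB_PATH
db.execute("""
    CREATE TABLE IF NOT EXISTS local_audit_queue (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        event_json  TEXT NOT NULL,
        queued_at   REAL NOT NULL,
        retry_count INTEGER NOT NULL DEFAULT 0
    )
""")

def enqueue(event):
    # Serialise to JSON and stamp with the current Unix time.
    db.execute("INSERT INTO local_audit_queue (event_json, queued_at) VALUES (?, ?)",
               (json.dumps(event), time.time()))

def flush_batch(deliver, batch_size=50):
    """Oldest-first batch flush: delete on confirmed delivery,
    bump retry_count on failure."""
    rows = db.execute("SELECT id, event_json FROM local_audit_queue "
                      "ORDER BY queued_at LIMIT ?", (batch_size,)).fetchall()
    for row_id, event_json in rows:
        try:
            deliver(json.loads(event_json))
        except Exception:
            db.execute("UPDATE local_audit_queue SET retry_count = retry_count + 1 "
                       "WHERE id = ?", (row_id,))
        else:
            db.execute("DELETE FROM local_audit_queue WHERE id = ?", (row_id,))

for i in range(3):
    enqueue({"seq": i})
flush_batch(lambda e: None)  # pretend delivery to Platform succeeds
print(db.execute("SELECT COUNT(*) FROM local_audit_queue").fetchone()[0])  # → 0
```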

The SQLite database path is configured via AUDIT_QUEUE_DB_PATH (default: data/audit_queue.db). In Kubernetes deployments, this path should be on a persistent volume so the queue survives pod restarts.

```yaml
# Helm values: ensure audit queue survives pod restarts
outpost:
  persistence:
    enabled: true
    mountPath: /app/data
  env:
    AUDIT_QUEUE_DB_PATH: /app/data/audit_queue.db
```

The admin API (port 8301) exposes three endpoints for managing the audit queue. All require a valid Authorization: Bearer <admin-key> header.

GET /admin/audit-queue/status

Returns current queue statistics.

```sh
curl -s http://localhost:8301/admin/audit-queue/status \
  -H "Authorization: Bearer $OUTPOST_ADMIN_KEY"
```

Response:

```json
{
  "count": 142,
  "oldest_queued_at": "2026-03-12T14:22:00Z",
  "disk_usage_bytes": 49152,
  "degradation_mode": true
}
```

| Field | Description |
| --- | --- |
| count | Number of events currently queued |
| oldest_queued_at | ISO-8601 timestamp of the oldest unsynced event |
| disk_usage_bytes | Size of the SQLite database file |
| degradation_mode | Whether the Outpost is currently in degraded mode |
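When alerting on this endpoint, a useful derived number is queue lag: how long the oldest event has been waiting. A small example that parses the sample response (the `queue_lag_seconds` helper is hypothetical):

```python
import json
from datetime import datetime, timezone

status = json.loads("""
{
  "count": 142,
  "oldest_queued_at": "2026-03-12T14:22:00Z",
  "disk_usage_bytes": 49152,
  "degradation_mode": true
}
""")

def queue_lag_seconds(status, now):
    # fromisoformat() in older Pythons does not accept a trailing "Z".
    oldest = datetime.fromisoformat(status["oldest_queued_at"].replace("Z", "+00:00"))
    return (now - oldest).total_seconds()

now = datetime(2026, 3, 12, 14, 32, 0, tzinfo=timezone.utc)
print(queue_lag_seconds(status, now))  # → 600.0
```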

POST /admin/audit-queue/flush

Triggers an immediate flush of all queued events to Platform. Returns 409 if the queue is empty (not degraded).

```sh
curl -s -X POST http://localhost:8301/admin/audit-queue/flush \
  -H "Authorization: Bearer $OUTPOST_ADMIN_KEY"
```

Response (success):

```json
{
  "flushed": 142,
  "failed": 0,
  "duration_seconds": 1.24
}
```

Use this endpoint after confirming Platform connectivity is restored to accelerate queue draining without waiting for the next sync interval.

DELETE /admin/audit-queue/purge

⚠ Destructive. Permanently deletes all events from the local queue. Purged events are not sent to Platform and will be absent from the audit log.

Requires the X-Purge-Confirm: yes header to prevent accidental data loss.

```sh
curl -s -X DELETE http://localhost:8301/admin/audit-queue/purge \
  -H "Authorization: Bearer $OUTPOST_ADMIN_KEY" \
  -H "X-Purge-Confirm: yes"
```

Response:

```json
{ "purged": 142 }
```

Only purge if you have confirmed that the queued events are corrupt or irrelevant (e.g., after a test environment reset). For production incidents, always prefer flush over purge.


| Scenario | Expected duration | Recommended action |
| --- | --- | --- |
| Platform rolling restart | < 2 minutes | Monitor; auto-recovers |
| Platform maintenance window | Minutes–hours | Pre-notify Outpost operators; queue will flush on recovery |
| Network partition | Unknown | Monitor queue size; alert if queue > 10,000 events |
| Platform total outage | Hours+ | Escalate; consider purge only if events are non-compliant noise |

Set a Prometheus alert on:

```yaml
- alert: OutpostDegradedMode
  expr: outpost_platform_reachable == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Outpost {{ $labels.outpost_id }} is in degraded mode"

- alert: OutpostAuditQueueHigh
  expr: outpost_audit_queue_depth > 1000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Outpost audit queue depth is high ({{ $value }} events)"
```

Navigate to http://localhost:8301 in a browser (requires kubectl port-forward in Kubernetes). The Health panel shows:

  • Platform reachability status
  • Degradation mode indicator and duration
  • Audit queue depth with a flush button