Outpost resilience and graceful degradation
The Outpost is designed to keep processing AI requests even when it cannot reach the Arbitex Platform management plane. This document describes the circuit breaker pattern, graceful degradation mode, and the audit queue that preserves audit integrity during outages.
Overview
The Outpost communicates with the Platform for two purposes:
- Policy sync — pulling updated policy bundles, routing rules, and DLP rule sets.
- Audit sync — pushing audit events to the Platform’s tamper-evident log.
When the Platform is unreachable, the Outpost:
- Serves AI requests using the last cached policy bundle (in-memory).
- Writes audit events to a local SQLite queue instead of dropping them.
- Blocks new Platform calls through the circuit breaker to prevent cascading timeouts.
- Flushes the audit queue automatically when connectivity is restored.
Circuit breaker
The Outpost uses a three-state circuit breaker to prevent repeated failed calls to the Platform from stacking up timeouts.
States
```
                 failure_threshold reached
CLOSED ─────────────────────────────► OPEN
  ▲                                     │
  │                                     │ recovery_timeout_seconds elapsed
  │                                     ▼
  └──── probe succeeds ──────────── HALF_OPEN
        (probe fails → OPEN, timer resets)
```

| State | Behaviour |
|---|---|
| CLOSED | Normal operation. All Platform calls pass through. Consecutive failures are counted. |
| OPEN | Circuit is tripped. All Platform calls are rejected immediately (no network I/O). Recovery timer runs. |
| HALF_OPEN | Recovery probe state. A limited number of test calls are allowed through. Success closes the circuit; failure reopens it and resets the timer. |
Default thresholds
| Parameter | Default | Description |
|---|---|---|
| failure_threshold | 5 | Consecutive failures before opening |
| recovery_timeout_seconds | 60 | Seconds in OPEN state before probing |
| half_open_max_calls | 1 | Probe calls allowed in HALF_OPEN |
On a successful probe call, the failure counter is reset and the circuit closes. On a probe failure, the circuit reopens and the 60-second recovery timer resets.
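The state transitions above can be sketched in a few lines. This is a minimal illustration of the three-state logic using the default thresholds; the class and method names are hypothetical and do not reflect the Outpost's actual implementation.

```python
import time


class CircuitBreaker:
    """Illustrative three-state breaker (names are hypothetical)."""

    def __init__(self, failure_threshold=5, recovery_timeout_seconds=60,
                 half_open_max_calls=1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout_seconds
        self.half_open_max_calls = half_open_max_calls
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0
        self.half_open_calls = 0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                # Recovery timer elapsed: move to HALF_OPEN and allow probes.
                self.state = "HALF_OPEN"
                self.half_open_calls = 0
            else:
                return False  # rejected immediately, no network I/O
        if self.state == "HALF_OPEN":
            if self.half_open_calls >= self.half_open_max_calls:
                return False  # probe budget exhausted
            self.half_open_calls += 1
        return True

    def record_success(self):
        # Successful probe (or normal call): reset counter, close circuit.
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            # Trip the circuit and restart the recovery timer.
            self.state = "OPEN"
            self.opened_at = time.monotonic()
```

Note that the OPEN state rejects calls without any network I/O, which is what prevents an unreachable Platform from stacking up request timeouts.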
Named breakers
The Outpost maintains a per-dependency registry of breakers. Each dependency (policy_sync, audit_sync, etc.) has its own isolated breaker so a Platform audit endpoint outage does not block policy syncs.
The overall degradation state is True when any named breaker is OPEN.
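The "any breaker open" rule can be expressed directly; this is a hypothetical sketch, not the Outpost's internal registry API:

```python
def degradation_mode(breaker_states: dict[str, str]) -> bool:
    """True when ANY named breaker is OPEN (per-dependency registry)."""
    return any(state == "OPEN" for state in breaker_states.values())


# Example: audit sync is down, but policy sync still works.
states = {"policy_sync": "CLOSED", "audit_sync": "OPEN"}
```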
Health check exposure
The /admin/status endpoint (port 8301) reports circuit breaker state:
```json
{
  "status": "warn",
  "degradation_mode": true,
  "degraded_since": "2026-03-12T14:22:00Z",
  "checks": [
    { "name": "policy_sync", "status": "warn", "message": "circuit open" },
    { "name": "audit_sync", "status": "pass", "message": "ok" }
  ]
}
```

Graceful degradation mode
When at least one circuit breaker enters OPEN state, the Outpost enters degradation mode.
What changes in degradation mode
| Feature | Normal operation | Degradation mode |
|---|---|---|
| Policy enforcement | Live policy bundle | Cached policy bundle (last successful sync) |
| Audit events | Sync to Platform in real time | Queued to local SQLite |
| Policy sync | Polled every POLICY_SYNC_INTERVAL_SECONDS | Blocked (circuit open) |
| AI proxy traffic | Continues normally | Continues normally |
| /readyz probe | 200 OK | 200 OK (Outpost is still ready) |
The Outpost intentionally keeps serving traffic in degradation mode. Policy decisions use the last cached bundle — a stale bundle is preferable to blocking all AI requests.
Policy bundle cache
The cached policy bundle is held in memory. On startup the Outpost loads the bundle from a local cache file before attempting a live sync, so cold restarts during a Platform outage are handled gracefully.
The bundle includes a freshness guard: bundles older than 1 hour (BUNDLE_MAX_AGE_SECONDS = 3600) are rejected and the Outpost will not serve traffic until a fresh bundle is loaded.
Degradation log messages
```
WARN  outpost.degradation — Outpost entering degradation mode — reason=circuit breaker opened degraded_since=1741789320.123
INFO  outpost.degradation — Outpost exiting degradation mode — platform reachable again (was degraded for 187.4s)
```

Monitor for "entering degradation mode" log lines in your SIEM or Grafana log dashboard.
Local SQLite audit queue
When Platform audit sync fails, events are buffered to a local SQLite database. This prevents audit data loss during outages.
Queue schema
```sql
CREATE TABLE IF NOT EXISTS local_audit_queue (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    event_json  TEXT NOT NULL,
    queued_at   REAL NOT NULL,  -- Unix timestamp
    retry_count INTEGER NOT NULL DEFAULT 0
);
```

Queue behaviour
- Enqueue: Each audit event is serialised to JSON and inserted with queued_at = now().
- Dequeue: The sync worker fetches events oldest-first in batches of 50 when the circuit closes.
- Mark synced: Rows are deleted after confirmed delivery to Platform.
- Retry count: Incremented on each failed flush attempt. High retry counts indicate persistent sync issues.
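The four queue operations above map directly onto SQL against the documented schema. This is a minimal sketch using Python's stdlib sqlite3 module; the function names are illustrative, not the Outpost's actual code.

```python
import json
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS local_audit_queue (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    event_json  TEXT NOT NULL,
    queued_at   REAL NOT NULL,
    retry_count INTEGER NOT NULL DEFAULT 0
)"""


def enqueue(conn: sqlite3.Connection, event: dict) -> None:
    # Serialise to JSON and stamp with the current time.
    conn.execute(
        "INSERT INTO local_audit_queue (event_json, queued_at) VALUES (?, ?)",
        (json.dumps(event), time.time()),
    )


def dequeue_batch(conn: sqlite3.Connection, limit: int = 50):
    # Oldest-first, matching the documented sync-worker behaviour.
    rows = conn.execute(
        "SELECT id, event_json FROM local_audit_queue "
        "ORDER BY queued_at LIMIT ?",
        (limit,),
    ).fetchall()
    return [(row_id, json.loads(payload)) for row_id, payload in rows]


def mark_synced(conn: sqlite3.Connection, ids: list[int]) -> None:
    # Delete rows only after confirmed delivery to Platform.
    conn.executemany(
        "DELETE FROM local_audit_queue WHERE id = ?", [(i,) for i in ids]
    )


def mark_failed(conn: sqlite3.Connection, ids: list[int]) -> None:
    # Bump retry_count on a failed flush attempt.
    conn.executemany(
        "UPDATE local_audit_queue SET retry_count = retry_count + 1 "
        "WHERE id = ?",
        [(i,) for i in ids],
    )
```

Deleting only after confirmed delivery means a crash mid-flush can re-send events but never lose them, which is the right trade-off for an audit log.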
Queue location
The SQLite database path is configured via AUDIT_QUEUE_DB_PATH (default: data/audit_queue.db). In Kubernetes deployments, this path should be on a persistent volume so the queue survives pod restarts.
```yaml
# Helm values: ensure audit queue survives pod restarts
outpost:
  persistence:
    enabled: true
    mountPath: /app/data
  env:
    AUDIT_QUEUE_DB_PATH: /app/data/audit_queue.db
```

Audit queue management endpoints
The admin API (port 8301) exposes three endpoints for managing the audit queue. All require a valid Authorization: Bearer &lt;admin-key&gt; header.
GET /admin/audit-queue/status
Returns current queue statistics.
```shell
curl -s http://localhost:8301/admin/audit-queue/status \
  -H "Authorization: Bearer $OUTPOST_ADMIN_KEY"
```

Response:
```json
{
  "count": 142,
  "oldest_queued_at": "2026-03-12T14:22:00Z",
  "disk_usage_bytes": 49152,
  "degradation_mode": true
}
```

| Field | Description |
|---|---|
| count | Number of events currently queued |
| oldest_queued_at | ISO-8601 timestamp of the oldest unsynced event |
| disk_usage_bytes | Size of the SQLite database file |
| degradation_mode | Whether the Outpost is currently in degraded mode |
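For automation, the status response is easy to check against the queue-depth threshold suggested in the operational guidance (alert when the queue exceeds 10,000 events). A minimal sketch; the threshold and function name are illustrative:

```python
import json

QUEUE_ALERT_THRESHOLD = 10_000  # from the operational guidance in this doc


def should_alert(status_body: str, threshold: int = QUEUE_ALERT_THRESHOLD) -> bool:
    """Parse a /admin/audit-queue/status response and decide whether to page."""
    status = json.loads(status_body)
    return status["count"] > threshold
```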
POST /admin/audit-queue/flush
Triggers an immediate flush of all queued events to Platform. Returns 409 if the queue is empty (not degraded).
```shell
curl -s -X POST http://localhost:8301/admin/audit-queue/flush \
  -H "Authorization: Bearer $OUTPOST_ADMIN_KEY"
```

Response (success):
```json
{
  "flushed": 142,
  "failed": 0,
  "duration_seconds": 1.24
}
```

Use this endpoint after confirming Platform connectivity is restored to accelerate queue draining without waiting for the next sync interval.
DELETE /admin/audit-queue/purge
⚠ Destructive. Permanently deletes all events from the local queue. Purged events are not sent to Platform and will be absent from the audit log.
Requires the X-Purge-Confirm: yes header to prevent accidental data loss.
```shell
curl -s -X DELETE http://localhost:8301/admin/audit-queue/purge \
  -H "Authorization: Bearer $OUTPOST_ADMIN_KEY" \
  -H "X-Purge-Confirm: yes"
```

Response:
```json
{ "purged": 142 }
```

Only purge if you have confirmed that the queued events are corrupt or irrelevant (e.g., after a test environment reset). For production incidents, always prefer flush over purge.
Operational guidance
When degradation mode is expected
| Scenario | Expected duration | Recommended action |
|---|---|---|
| Platform rolling restart | < 2 minutes | Monitor; auto-recovers |
| Platform maintenance window | Minutes–hours | Pre-notify Outpost operators; queue will flush on recovery |
| Network partition | Unknown | Monitor queue size; alert if queue > 10,000 events |
| Platform total outage | Hours+ | Escalate; consider purge only if events are non-compliant noise |
Alerting
Set a Prometheus alert on:
```yaml
- alert: OutpostDegradedMode
  expr: outpost_platform_reachable == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Outpost {{ $labels.outpost_id }} is in degraded mode"

- alert: OutpostAuditQueueHigh
  expr: outpost_audit_queue_depth > 1000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Outpost audit queue depth is high ({{ $value }} events)"
```

Checking status via admin UI
Navigate to http://localhost:8301 in a browser (requires kubectl port-forward in Kubernetes). The Health panel shows:
- Platform reachability status
- Degradation mode indicator and duration
- Audit queue depth with a flush button
Related docs
- Outpost deployment architecture — component overview, ports, health probes
- Outpost software updates — update process and rollback
- Outpost OTel configuration — tracing and metrics
- Outpost admin API reference — full admin endpoint reference