Outpost Reliability Features

Arbitex Outpost is designed to keep serving AI requests even when connectivity to the Arbitex platform is degraded or lost. This guide covers the reliability features shipped in outpost sprints 0029 through 0032: circuit breakers, graceful degradation, local audit queue, certificate rotation, air-gap operation, and the admin CLI.


The circuit breaker prevents an offline or overloaded platform from causing request storms. It is applied to every outbound platform API call: policy sync, heartbeat, and audit log sync.

CLOSED ──(failure threshold)──▶ OPEN ──(recovery timeout)──▶ HALF_OPEN ──(probe success)──▶ CLOSED
                                 ▲                               │
                                 └────────(probe failure)────────┘

| State | Behavior |
|---|---|
| CLOSED | Normal operation; all calls pass through |
| OPEN | All calls fail immediately with CircuitOpenError (no network attempt) |
| HALF_OPEN | One probe call is allowed through; success → CLOSED, failure → OPEN |

| Parameter | Default | Description |
|---|---|---|
| failure_threshold | 5 | Consecutive failures to trip CLOSED → OPEN |
| recovery_timeout_seconds | 60 | Seconds in OPEN before entering HALF_OPEN |
| half_open_max_calls | 1 | Probe calls allowed in HALF_OPEN before decision |

State transitions are logged at INFO level. All circuit breaker instances are tracked in CircuitBreakerRegistry (singleton).
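
The state machine and parameters above can be sketched as follows. This is a minimal illustration and not the outpost's actual implementation: the class layout, the `call` method, and the injectable `clock` argument are assumptions made for the example, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without a network attempt."""


class CircuitBreaker:
    """Hypothetical breaker using the documented parameter names."""

    def __init__(self, name, failure_threshold=5, recovery_timeout_seconds=60,
                 half_open_max_calls=1, clock=time.monotonic):
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout_seconds = recovery_timeout_seconds
        self.half_open_max_calls = half_open_max_calls
        self._clock = clock  # injectable so tests do not need to sleep
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0
        self._probes = 0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.recovery_timeout_seconds:
                raise CircuitOpenError(self.name)
            # Recovery timeout elapsed: allow probe calls.
            self.state, self._probes = "HALF_OPEN", 0
        if self.state == "HALF_OPEN":
            if self._probes >= self.half_open_max_calls:
                raise CircuitOpenError(self.name)
            self._probes += 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        # Any success closes the breaker and resets the consecutive count.
        self.state, self._failures = "CLOSED", 0
        return result

    def _on_failure(self):
        self._failures += 1
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state, self._opened_at = "OPEN", self._clock()
```

A caller would wrap each platform request, for example `breaker.call(send_heartbeat)`, and treat `CircuitOpenError` as "platform unavailable, do not retry yet".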

Three platform API call sites use circuit breakers:

| Client | What it protects |
|---|---|
| PolicySyncClient | Policy rules download from platform |
| HeartbeatSender | Outpost → platform keepalive |
| AuditSyncWorker | Audit event upload to platform |

Each has its own named circuit breaker. If only audit sync is failing, the audit breaker opens independently — policy sync and heartbeats continue unaffected.
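
To illustrate how one singleton registry can hold independent named breakers, here is a small sketch. The `Breaker` stand-in and the `get` and `open_breakers` methods are hypothetical; only the `CircuitBreakerRegistry` name and the breaker names (`policy_sync`, `heartbeat`, `audit_sync`, as shown in the status output) come from this guide.

```python
import threading


class Breaker:
    """Stand-in that holds only a name and a state."""

    def __init__(self, name):
        self.name = name
        self.state = "CLOSED"


class CircuitBreakerRegistry:
    """Process-wide singleton mapping breaker names to breakers."""

    _instance = None
    _instance_lock = threading.Lock()

    def __new__(cls):
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._breakers = {}
            return cls._instance

    def get(self, name):
        # Create on first use so each call site owns exactly one breaker.
        if name not in self._breakers:
            self._breakers[name] = Breaker(name)
        return self._breakers[name]

    def open_breakers(self):
        return sorted(n for n, b in self._breakers.items() if b.state == "OPEN")
```

Because each call site looks up its own name, tripping `audit_sync` leaves the `policy_sync` and `heartbeat` entries untouched.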


When any circuit breaker is open, the outpost enters degradation mode. In this mode:

  • Requests continue to be served using cached policy rules.
  • Audit events are queued locally to SQLite instead of being uploaded immediately.
  • Policy sync returns the last successfully downloaded policy instead of failing.

The DegradationManager (outpost/degradation.py) monitors circuit breaker state and logs each transition into and out of degradation mode:

  • Enter: Logged at WARNING level with timestamp.
  • Exit: Logged at INFO level with degradation duration.

The admin health endpoint reports degradation state:

{
  "status": "degraded",
  "degradation_mode": true,
  "degraded_since": "2026-03-12T14:23:01Z"
}
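
The enter/exit behavior and the health payload can be sketched like this. It is an assumption-laden illustration, not the contents of outpost/degradation.py: the `update` and `health` methods are hypothetical, and treating HALF_OPEN as still degraded is a choice made for the example.

```python
import logging
from datetime import datetime, timezone

log = logging.getLogger("outpost.degradation")


class DegradationManager:
    def __init__(self, now=lambda: datetime.now(timezone.utc)):
        self._now = now  # injectable clock for testing
        self.degraded_since = None

    def update(self, breaker_states):
        """breaker_states maps breaker name to CLOSED, OPEN or HALF_OPEN."""
        # Assumption: any breaker that is not CLOSED keeps the outpost degraded.
        unhealthy = any(state != "CLOSED" for state in breaker_states.values())
        if unhealthy and self.degraded_since is None:
            self.degraded_since = self._now()
            log.warning("entering degradation mode at %s",
                        self.degraded_since.isoformat())
        elif not unhealthy and self.degraded_since is not None:
            duration = self._now() - self.degraded_since
            log.info("leaving degradation mode after %s", duration)
            self.degraded_since = None

    def health(self):
        if self.degraded_since is None:
            return {"status": "healthy", "degradation_mode": False}
        stamp = self.degraded_since.strftime("%Y-%m-%dT%H:%M:%SZ")
        return {"status": "degraded", "degradation_mode": True,
                "degraded_since": stamp}
```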

When the outpost is degraded, audit events are written to a SQLite queue (audit_queue/audit_queue.db) instead of being sent to the platform.

| Operation | Description |
|---|---|
| enqueue | Append event to SQLite; thread-safe via threading.Lock |
| dequeue_batch | Fetch N oldest events for sync attempt |
| mark_synced | Remove successfully uploaded events |
| increment_retry | Increment retry counter on upload failure |
| purge | Delete all queued events (requires admin confirmation) |

Events are retried in FIFO order on recovery. The queue is persistent across outpost restarts — events survive process crashes or rolling restarts.
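
A compact sketch of a queue with these five operations is shown below. The table name, column names, and the extra `depth` helper are assumptions for illustration; the real schema is not documented here.

```python
import json
import sqlite3
import threading


class LocalAuditQueue:
    def __init__(self, path="audit_queue/audit_queue.db"):
        self._lock = threading.Lock()
        self._db = sqlite3.connect(path, check_same_thread=False)
        # Hypothetical schema: autoincrement id gives FIFO order.
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "payload TEXT NOT NULL, "
            "retries INTEGER NOT NULL DEFAULT 0)"
        )
        self._db.commit()

    def enqueue(self, event):
        with self._lock, self._db:  # the connection context commits on success
            self._db.execute("INSERT INTO events (payload) VALUES (?)",
                             (json.dumps(event),))

    def dequeue_batch(self, n):
        with self._lock:
            rows = self._db.execute(
                "SELECT id, payload FROM events ORDER BY id LIMIT ?", (n,)
            ).fetchall()
        return [(row_id, json.loads(payload)) for row_id, payload in rows]

    def mark_synced(self, ids):
        with self._lock, self._db:
            self._db.executemany("DELETE FROM events WHERE id = ?",
                                 [(i,) for i in ids])

    def increment_retry(self, ids):
        with self._lock, self._db:
            self._db.executemany(
                "UPDATE events SET retries = retries + 1 WHERE id = ?",
                [(i,) for i in ids])

    def purge(self):
        with self._lock, self._db:
            self._db.execute("DELETE FROM events")

    def depth(self):
        with self._lock:
            return self._db.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Note that `dequeue_batch` only reads; events leave the queue when `mark_synced` deletes them, which is what lets unsent events survive a crash between the two calls.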

When circuit breakers return to CLOSED (the platform becomes reachable again), AuditSyncWorker._flush_local_queue() automatically drains queued events to the platform before resuming normal sync operation.
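
One way such a drain could work is sketched below. This is not the body of AuditSyncWorker._flush_local_queue(); the function signature, the `upload` callable, and the `batch_size` default are assumptions, and the `queue` argument only needs the dequeue_batch, mark_synced, and increment_retry operations listed above.

```python
def flush_local_queue(queue, upload, batch_size=100):
    """Drain `queue` oldest-first and stop at the first failed upload.

    `upload` takes a list of events and raises on failure.
    Returns the number of events uploaded.
    """
    sent = 0
    while True:
        batch = queue.dequeue_batch(batch_size)
        if not batch:
            return sent
        ids = [event_id for event_id, _ in batch]
        try:
            upload([event for _, event in batch])
        except Exception:
            # Keep the events queued so the next attempt retries them.
            queue.increment_retry(ids)
            return sent
        queue.mark_synced(ids)
        sent += len(batch)
```

Stopping at the first failure preserves FIFO order: later batches are never uploaded ahead of an earlier batch that is still queued.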

AUDIT_QUEUE_DB_PATH=audit_queue/audit_queue.db # Path to SQLite queue file

The default path is relative to the outpost working directory. Use an absolute path for production deployments.

| Endpoint | Method | Description |
|---|---|---|
| /admin/audit-queue/status | GET | Queue depth, oldest event timestamp, disk usage |
| /admin/audit-queue/flush | POST | Trigger immediate sync attempt (409 if queue empty) |
| /admin/audit-queue/purge | DELETE | Delete all queued events (requires X-Purge-Confirm: yes header) |
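
As a rough illustration of calling these endpoints from a script, the helper below builds (but does not send) the request for each one. The helper name and the base URL in the usage note are hypothetical, and authentication is deliberately left out because the required admin credentials header is not described in this section.

```python
import urllib.request


def audit_queue_request(base_url, action):
    """Build a request for one audit-queue admin endpoint without sending it."""
    routes = {
        "status": ("GET", "/admin/audit-queue/status", {}),
        "flush": ("POST", "/admin/audit-queue/flush", {}),
        # The purge endpoint requires the confirmation header.
        "purge": ("DELETE", "/admin/audit-queue/purge", {"X-Purge-Confirm": "yes"}),
    }
    method, path, headers = routes[action]
    return urllib.request.Request(
        base_url.rstrip("/") + path, method=method, headers=headers
    )
```

Sending is then `urllib.request.urlopen(audit_queue_request("http://127.0.0.1:9000", "status"))`, where the address is a placeholder. `urlopen` raises `urllib.error.HTTPError` for non-2xx responses, so a flush against an empty queue would surface as an exception with `code == 409`.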

The outpost supports automatic certificate bundle download from the platform at startup, as well as manual cert rotation.

When CERT_BUNDLE_AUTO_DOWNLOAD=true, the outpost calls CertBundleDownloader.download_and_install() at startup:

  1. Fetches the cert bundle from the platform using mTLS.
  2. Verifies the RSA or EC certificate chain (leaf → intermediate → CA).
  3. Writes leaf.pem, intermediate.pem, and ca.pem to the configured cert directory.
  4. Fails open on any error — if download fails, the outpost starts with existing certs.

Old cert bundle                New cert bundle
─────────────────────────────────────────────────────
Startup: download+install      File overwrite (atomic)
Chain verify: leaf→int→CA      Verified before write
On failure:                    Old certs preserved

CERT_BUNDLE_AUTO_DOWNLOAD=false # Set to true to download the cert bundle at startup (default: false)
CERT_DIR=/path/to/certs # Directory for leaf.pem / intermediate.pem / ca.pem
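
The verify-then-write behavior described for the startup download can be sketched as follows. This is not CertBundleDownloader itself: `install_bundle` and the `verify_chain` callable are hypothetical, and chain verification is passed in rather than implemented so the example stays within the standard library.

```python
import os
import tempfile

CERT_FILES = ("leaf.pem", "intermediate.pem", "ca.pem")


def install_bundle(bundle, cert_dir, verify_chain):
    """Install `bundle` (file name -> PEM bytes) into cert_dir.

    Returns True on success. Returns False if verification or staging
    fails, in which case the existing files are left untouched.
    """
    staged = []
    try:
        verify_chain(bundle)  # must raise on a bad chain, before any write
        for name in CERT_FILES:
            fd, tmp = tempfile.mkstemp(dir=cert_dir, prefix=name + ".")
            staged.append((tmp, os.path.join(cert_dir, name)))
            with os.fdopen(fd, "wb") as fh:
                fh.write(bundle[name])
        for tmp, final in staged:
            os.replace(tmp, final)  # atomic per file on the same filesystem
        return True
    except Exception:
        # Fail open: clean up staged files and keep whatever was there.
        for tmp, _ in staged:
            if os.path.exists(tmp):
                os.remove(tmp)
        return False
```

Staging in the target directory keeps `os.replace` on one filesystem, which is what makes each overwrite atomic. The replacement is atomic per file, not across all three files.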

The outpost validate-config command runs the startup checks on the cert files:

| Condition | Result |
|---|---|
| Certificate expired | [FAIL], exit code 1 |
| Certificate expires in < 30 days | [WARN], continues |
| Certificate valid | [PASS] |
| Missing cert file | No crash; the condition is logged |
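
The expiry rules in the table reduce to a small classification function. The sketch below assumes the certificate's notAfter timestamp has already been parsed into a timezone-aware datetime; the function name is hypothetical, and treating exactly 30 days remaining as a pass is an assumption.

```python
from datetime import datetime, timedelta, timezone

WARN_WINDOW = timedelta(days=30)


def classify_expiry(not_after, now=None):
    """Return "FAIL", "WARN" or "PASS" for a certificate expiry time."""
    now = now or datetime.now(timezone.utc)
    if not_after <= now:
        return "FAIL"  # already expired
    if not_after - now < WARN_WINDOW:
        return "WARN"  # valid, but fewer than 30 days remain
    return "PASS"
```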

Air-gap mode allows the outpost to run in isolated networks with no connectivity to the Arbitex platform. All platform-dependent sync operations are disabled; the outpost runs entirely from a local bundle.

| Feature | Behavior |
|---|---|
| Policy sync | Skipped; loads from local AirGapLoader bundle |
| Heartbeat | Skipped |
| Audit sync to platform | Skipped; events remain in local queue |
| Cert bundle auto-download | Skipped (guard prevents attempt) |

Air-gap bundles are loaded by AirGapLoader. A missing bundle is handled gracefully — the outpost logs a warning and continues with last-known-good policy or an empty policy.

AIR_GAP=true # Enable air-gap operation

When AIR_GAP=true:

  • All HTTP calls to platform URLs are suppressed at the call site (guard checks before attempt, not at network layer).
  • The circuit breaker is bypassed — there is no “circuit open” in air-gap mode since no calls are attempted.
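
A call-site guard of the kind described above might look like this. The function names and the case-insensitive parsing of `AIR_GAP` are assumptions for illustration; the two callables stand in for the platform client and the local bundle loader.

```python
import os


def air_gap_enabled(environ=os.environ):
    # Assumption: only the literal value "true" (any case) enables air-gap mode.
    return environ.get("AIR_GAP", "false").strip().lower() == "true"


def load_policy(fetch_from_platform, load_local_bundle, environ=os.environ):
    if air_gap_enabled(environ):
        # Guard at the call site: the platform client is never invoked,
        # so no circuit breaker is involved either.
        return load_local_bundle()
    return fetch_from_platform()
```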

The outpost CLI provides operational commands for checking status and validating configuration without restarting the process.

The CLI is installed as an entry point via pyproject.toml:

outpost status
outpost validate-config
# or
python -m outpost.cli status
python -m outpost.cli validate-config

The outpost status command queries the outpost admin API and displays the current operational state.

outpost status [--json] [--admin-port PORT] [--admin-key KEY]

Human-readable output:

Outpost Status
  Status:        healthy
  Circuit open:  false
  Degradation:   false
  Policy age:    42s
  Queue depth:   0
  Audit sync:    2026-03-12T14:20:00Z
  Circuit Breakers:
    policy_sync  CLOSED
    heartbeat    CLOSED
    audit_sync   CLOSED

JSON output (--json):

{
  "status": "healthy",
  "circuit_breakers": {
    "policy_sync": "CLOSED",
    "heartbeat": "CLOSED",
    "audit_sync": "CLOSED"
  },
  "degradation_mode": false,
  "audit_queue_depth": 0,
  "last_sync": "2026-03-12T14:20:00Z"
}

Exit codes:

| Code | Condition |
|---|---|
| 0 | Healthy |
| 1 | Connection error, auth failure (401), server error (500), or timeout |
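
For monitoring scripts, the JSON output is easier to consume than the human-readable form. The sketch below summarizes a payload shaped like the example above; the `summarize_status` helper and its return keys are hypothetical.

```python
import json


def summarize_status(raw_json):
    """Reduce `outpost status --json` output to a few alerting fields."""
    status = json.loads(raw_json)
    breakers = status.get("circuit_breakers", {})
    not_closed = sorted(n for n, s in breakers.items() if s != "CLOSED")
    return {
        "ok": status.get("status") == "healthy" and not not_closed,
        "not_closed": not_closed,
        "queue_depth": status.get("audit_queue_depth", 0),
    }
```

The input could come from `subprocess.run(["outpost", "status", "--json"], capture_output=True, text=True).stdout`, with the process exit code checked first, since a code of 1 means the admin API could not be queried.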

The outpost validate-config command runs all startup configuration checks and reports the results. It does not require the outpost to be running.

outpost validate-config [--json]

Output:

[PASS] mTLS client certificate: valid (expires 2027-01-15)
[PASS] mTLS CA certificate: valid (expires 2028-06-30)
[WARN] mTLS intermediate: expires in 22 days (2026-04-03)
[PASS] GeoIP MMDB: valid (MaxMind magic bytes present)
[PASS] Platform URL: reachable
[FAIL] Audit queue directory: not writable

Checks performed:

| Check | Pass condition | Fail condition |
|---|---|---|
| mTLS cert expiry | Valid and > 30 days remaining | Expired |
| mTLS cert expiry (warning) | Valid and > 30 days remaining | < 30 days remaining (WARN) |
| GeoIP MMDB | Magic bytes present (\xab\xcd\xef) | Absent (WARN) |
| Platform URL reachability | HTTP 200 or 401 | Connection error |
| Audit queue directory | Writable | Not writable |

The command exits with code 1 if any check reports [FAIL] and with code 0 otherwise; warnings do not affect the exit code.
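
The exit-code rule can be expressed over the text report shown above. This sketch assumes each result line starts with a bracketed `[PASS]`, `[WARN]`, or `[FAIL]` tag, as in the sample output; the helper name is hypothetical.

```python
import re

RESULT_LINE = re.compile(r"^\[(PASS|WARN|FAIL)\]\s+(.*)$")


def exit_code_for(report):
    """Return 1 if any line of the report is a [FAIL], else 0."""
    levels = [m.group(1) for m in map(RESULT_LINE.match, report.splitlines()) if m]
    return 1 if "FAIL" in levels else 0
```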


The outpost exposes the following reliability-related metrics at /metrics:

| Metric | Type | Labels | Description |
|---|---|---|---|
| outpost_circuit_breaker_open | Gauge | breaker=<name> | 1 if circuit open, 0 if closed |
| outpost_degradation_mode | Gauge | | 1 if degraded, 0 if normal |
| outpost_audit_queue_depth | Gauge | | Number of events in local SQLite queue |
| outpost_requests_total | Counter | status=success or error | Total requests handled |
| outpost_policy_sync_age_seconds | Gauge | | Seconds since last successful policy sync |

These metrics are refreshed on every GET /metrics request.
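
If the endpoint serves the Prometheus text exposition format (an assumption; the format is not stated here), a scrape can be inspected with a small parser like the one below. It is a simplified sketch that ignores optional timestamps and escaped characters in label values.

```python
import re

SAMPLE = re.compile(r"^(\w+)(?:\{(.*)\})?\s+(\S+)$")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, dict(LABEL.findall(labels or "")), float(value)))
    return samples


def open_breakers(samples):
    """Names of breakers whose outpost_circuit_breaker_open gauge is 1."""
    return sorted(
        labels["breaker"]
        for name, labels, value in samples
        if name == "outpost_circuit_breaker_open" and value == 1
    )
```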