Outpost Reliability Features
Arbitex Outpost is designed to keep serving AI requests even when connectivity to the Arbitex platform is degraded or lost. This guide covers the reliability features shipped in outpost sprints 0029 through 0032: circuit breakers, graceful degradation, local audit queue, certificate rotation, air-gap operation, and the admin CLI.
Circuit Breaker
The circuit breaker prevents an offline or overloaded platform from causing request storms. It is applied to every outbound platform API call: policy sync, heartbeat, and audit log sync.
States

    CLOSED ──(failure threshold)──▶ OPEN ──(recovery timeout)──▶ HALF_OPEN ──(probe success)──▶ CLOSED
                                     ▲                               │
                                     └────────(probe failure)────────┘

| State | Behavior |
|---|---|
| CLOSED | Normal operation — all calls pass through |
| OPEN | All calls fail immediately with CircuitOpenError (no network attempt) |
| HALF_OPEN | One probe call is allowed through; success → CLOSED, failure → OPEN |
Defaults
| Parameter | Default | Description |
|---|---|---|
| `failure_threshold` | 5 | Consecutive failures to trip CLOSED → OPEN |
| `recovery_timeout_seconds` | 60 | Seconds in OPEN before entering HALF_OPEN |
| `half_open_max_calls` | 1 | Probe calls allowed in HALF_OPEN before decision |
State transitions are logged at INFO level. All circuit breaker instances are tracked in CircuitBreakerRegistry (singleton).
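
The state machine and defaults above can be sketched roughly as follows. This is a minimal, single-threaded illustration and not the outpost's actual implementation: the `call` method, the injectable `clock` parameter, and the choice to close on the first successful probe are assumptions made for the example. Only the parameter names, the state names, and `CircuitOpenError` come from this guide.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without a network attempt."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout_seconds=60,
                 half_open_max_calls=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_seconds = recovery_timeout_seconds
        self.half_open_max_calls = half_open_max_calls
        self._clock = clock       # injectable so tests can control time
        self.state = "CLOSED"
        self._failures = 0        # consecutive failures since the last success
        self._opened_at = 0.0     # clock reading when the breaker last opened
        self._probes = 0          # probe calls admitted while HALF_OPEN

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.recovery_timeout_seconds:
                raise CircuitOpenError("circuit is open")
            # Recovery timeout elapsed: allow probe traffic.
            self.state, self._probes = "HALF_OPEN", 0
        if self.state == "HALF_OPEN":
            if self._probes >= self.half_open_max_calls:
                raise CircuitOpenError("probe limit reached")
            self._probes += 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        # Any success closes the breaker and resets the failure streak.
        self.state, self._failures = "CLOSED", 0
        return result

    def _on_failure(self):
        self._failures += 1
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state, self._opened_at = "OPEN", self._clock()
```

A production breaker would also need a lock around the state changes, since the probe limit only matters when several threads call through the breaker at once.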
Three platform API call sites use circuit breakers:
| Client | What it protects |
|---|---|
| `PolicySyncClient` | Policy rules download from platform |
| `HeartbeatSender` | Outpost → platform keepalive |
| `AuditSyncWorker` | Audit event upload to platform |
Each has its own named circuit breaker. If only audit sync is failing, the audit breaker opens independently — policy sync and heartbeats continue unaffected.
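
To illustrate how independently named breakers might be tracked in one place, here is a small sketch of a singleton registry. The `instance`, `set_state`, and `open_breakers` methods are hypothetical names chosen for this example; the guide only states that `CircuitBreakerRegistry` is a singleton that tracks all breaker instances. The breaker names are taken from the `outpost status` output.

```python
import threading


class CircuitBreakerRegistry:
    """Process-wide lookup of breaker states by name (hypothetical interface)."""

    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self):
        self._states = {}
        self._lock = threading.Lock()

    @classmethod
    def instance(cls):
        # Create the shared registry on first use, then always return it.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def set_state(self, name, state):
        with self._lock:
            self._states[name] = state

    def open_breakers(self):
        """Return the sorted names of breakers currently in the OPEN state."""
        with self._lock:
            return sorted(n for n, s in self._states.items() if s == "OPEN")


registry = CircuitBreakerRegistry.instance()
for name in ("policy_sync", "heartbeat", "audit_sync"):
    registry.set_state(name, "CLOSED")

# Only the audit breaker trips; the other two stay CLOSED.
registry.set_state("audit_sync", "OPEN")
```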
Graceful Degradation
When any circuit breaker is open, the outpost enters degradation mode. In this mode:
- Requests continue to be served using cached policy rules.
- Audit events are queued locally to SQLite instead of being uploaded immediately.
- Policy sync returns the last successfully downloaded policy instead of failing.
The DegradationManager (outpost/degradation.py) monitors circuit breaker state and logs each transition into and out of degradation mode:
- Enter: Logged at WARNING level with timestamp.
- Exit: Logged at INFO level with degradation duration.
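
The enter and exit behavior described above could look something like the following sketch. The `update` and `health` methods, the logger name, and the decision to treat only `OPEN` (not `HALF_OPEN`) as degraded are assumptions for illustration. The shape of the healthy payload is also assumed.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("outpost.degradation")  # assumed logger name


class DegradationManager:
    def __init__(self):
        self.degraded_since = None  # datetime while degraded, else None

    def update(self, breaker_states, now=None):
        """Recompute degradation from a {name: state} mapping."""
        now = now or datetime.now(timezone.utc)
        any_open = any(state == "OPEN" for state in breaker_states.values())
        if any_open and self.degraded_since is None:
            self.degraded_since = now
            logger.warning("entering degradation mode at %s", now.isoformat())
        elif not any_open and self.degraded_since is not None:
            duration = (now - self.degraded_since).total_seconds()
            logger.info("leaving degradation mode after %.0fs", duration)
            self.degraded_since = None
        return self.degraded_since is not None

    def health(self):
        """Build a payload shaped like the admin health response."""
        if self.degraded_since is None:
            return {"status": "healthy", "degradation_mode": False}
        return {
            "status": "degraded",
            "degradation_mode": True,
            "degraded_since": self.degraded_since.strftime("%Y-%m-%dT%H:%M:%SZ"),
        }
```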
The admin health endpoint reports degradation state:

    {
      "status": "degraded",
      "degradation_mode": true,
      "degraded_since": "2026-03-12T14:23:01Z"
    }

Local Audit Queue
When the outpost is degraded, audit events are written to a SQLite queue (audit_queue/audit_queue.db) instead of being sent to the platform.
Queue behavior
| Operation | Description |
|---|---|
| `enqueue` | Append event to SQLite; thread-safe via threading.Lock |
| `dequeue_batch` | Fetch N oldest events for sync attempt |
| `mark_synced` | Remove successfully uploaded events |
| `increment_retry` | Increment retry counter on upload failure |
| `purge` | Delete all queued events (requires admin confirmation) |
Events are retried in FIFO order on recovery. The queue is persistent across outpost restarts — events survive process crashes or rolling restarts.
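
A persistent FIFO queue with these operations can be sketched with the standard `sqlite3` module. The table schema, the `LocalAuditQueue` class name, and the `depth` helper are assumptions for illustration; only the operation names and the use of `threading.Lock` come from the table above.

```python
import json
import sqlite3
import threading


class LocalAuditQueue:
    def __init__(self, path="audit_queue/audit_queue.db"):
        self._lock = threading.Lock()
        self._db = sqlite3.connect(path, check_same_thread=False)
        # Hypothetical schema: an autoincrement id gives FIFO ordering.
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS audit_queue ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "payload TEXT NOT NULL, "
            "retries INTEGER NOT NULL DEFAULT 0)"
        )
        self._db.commit()

    def enqueue(self, event):
        with self._lock:
            self._db.execute(
                "INSERT INTO audit_queue (payload) VALUES (?)", (json.dumps(event),)
            )
            self._db.commit()

    def dequeue_batch(self, n):
        """Return up to n oldest events as (id, event) pairs without deleting them."""
        with self._lock:
            rows = self._db.execute(
                "SELECT id, payload FROM audit_queue ORDER BY id LIMIT ?", (n,)
            ).fetchall()
        return [(row_id, json.loads(payload)) for row_id, payload in rows]

    def mark_synced(self, ids):
        with self._lock:
            self._db.executemany(
                "DELETE FROM audit_queue WHERE id = ?", [(i,) for i in ids]
            )
            self._db.commit()

    def increment_retry(self, ids):
        with self._lock:
            self._db.executemany(
                "UPDATE audit_queue SET retries = retries + 1 WHERE id = ?",
                [(i,) for i in ids],
            )
            self._db.commit()

    def depth(self):
        with self._lock:
            return self._db.execute("SELECT COUNT(*) FROM audit_queue").fetchone()[0]
```

Because `dequeue_batch` leaves rows in place until `mark_synced` removes them, an upload that fails or a process that crashes mid-sync does not lose events.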
Sync-on-recovery
When circuit breakers return to CLOSED (the platform becomes reachable again), AuditSyncWorker._flush_local_queue() automatically drains queued events to the platform before resuming normal sync operation.
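
A drain loop of this kind might look like the sketch below. It is not the body of `AuditSyncWorker._flush_local_queue()`: the `flush_local_queue` function, the `upload` callable, the `batch_size` parameter, and the in-memory `deque` standing in for the SQLite queue are all assumptions made to keep the example self-contained.

```python
from collections import deque


def flush_local_queue(queue, upload, batch_size=100):
    """Drain `queue` oldest-first; stop at the first failed batch.

    Returns the number of events uploaded.
    """
    sent = 0
    while queue:
        batch = [queue[i] for i in range(min(batch_size, len(queue)))]
        try:
            upload(batch)
        except Exception:
            break  # leave the remaining events queued for the next attempt
        for _ in batch:
            queue.popleft()  # remove events only after a successful upload
        sent += len(batch)
    return sent


# Usage: five pending events drained in batches of two.
pending = deque({"event": i} for i in range(5))
uploaded = []
count = flush_local_queue(pending, uploaded.append, batch_size=2)
```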
Configuration

    AUDIT_QUEUE_DB_PATH=audit_queue/audit_queue.db  # Path to SQLite queue file

The default path is relative to the outpost working directory. Use an absolute path for production deployments.
Admin endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/admin/audit-queue/status` | GET | Queue depth, oldest event timestamp, disk usage |
| `/admin/audit-queue/flush` | POST | Trigger immediate sync attempt (409 if queue empty) |
| `/admin/audit-queue/purge` | DELETE | Delete all queued events (requires X-Purge-Confirm: yes header) |
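
As an illustration of calling these endpoints from a script, the sketch below builds the three requests with the standard library. The base address `http://localhost:8080` and the `audit_queue_request` helper are assumptions. Authentication is left out because this guide does not describe how the admin key is transmitted.

```python
import urllib.request

ADMIN_BASE = "http://localhost:8080"  # assumed admin address; adjust to your deployment


def audit_queue_request(action):
    """Build (but do not send) a request for an audit-queue admin endpoint."""
    routes = {
        "status": ("GET", {}),
        "flush": ("POST", {}),
        # Purge is destructive, so the confirmation header is mandatory.
        "purge": ("DELETE", {"X-Purge-Confirm": "yes"}),
    }
    method, headers = routes[action]
    return urllib.request.Request(
        f"{ADMIN_BASE}/admin/audit-queue/{action}", method=method, headers=headers
    )
```

Passing the result to `urllib.request.urlopen` sends it. A 409 from the flush endpoint surfaces as `urllib.error.HTTPError`, so callers that treat an empty queue as normal should catch that case.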
Certificate Rotation
The outpost supports automatic certificate bundle download from the platform at startup, as well as manual cert rotation.
Auto-download at startup
When CERT_BUNDLE_AUTO_DOWNLOAD=true, the outpost calls CertBundleDownloader.download_and_install() at startup:
- Fetches the cert bundle from the platform using mTLS.
- Verifies the RSA or EC certificate chain (leaf → intermediate → CA).
- Writes `leaf.pem`, `intermediate.pem`, and `ca.pem` to the configured cert directory.
- Fails open on any error: if the download fails, the outpost starts with its existing certs.
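
The verify-then-write and fail-open steps above can be sketched as follows. This is an illustration under assumptions, not the code behind `CertBundleDownloader.download_and_install()`: the `install_bundle` function is hypothetical, `bundle` is assumed to be a mapping from file name to PEM bytes, and `verify` is a caller-supplied stand-in for the chain verification, which is not implemented here.

```python
import os
import tempfile

CERT_FILES = ("leaf.pem", "intermediate.pem", "ca.pem")


def install_bundle(cert_dir, bundle, verify):
    """Install a cert bundle only if `verify(bundle)` passes.

    Returns True when the new files were installed, and False when the
    existing files were left in place (fail open).
    """
    try:
        if not verify(bundle):
            return False
        # Look up every payload first so a missing entry fails before any write.
        payloads = [(name, bundle[name]) for name in CERT_FILES]
        staged = []
        for name, data in payloads:
            fd, tmp = tempfile.mkstemp(dir=cert_dir, suffix=".tmp")
            with os.fdopen(fd, "wb") as fh:
                fh.write(data)
            staged.append((tmp, os.path.join(cert_dir, name)))
        for tmp, final in staged:
            os.replace(tmp, final)  # atomic rename for each file
        return True
    except Exception:
        return False  # any error: keep running with the old certs
```

Staging every file before the first rename keeps the window in which old and new files are mixed as short as possible. Each rename is atomic, but the three renames together are not.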
Rotation lifecycle

    Old cert bundle                New cert bundle
    ─────────────────────────────────────────────────────
    Startup: download+install      File overwrite (atomic)
    Chain verify: leaf→int→CA      Verified before write
    On failure:                    Old certs preserved

Configuration

    CERT_BUNDLE_AUTO_DOWNLOAD=false  # Enable cert bundle download at startup (default: false)
    CERT_DIR=/path/to/certs          # Directory for leaf.pem / intermediate.pem / ca.pem

Validation
The outpost validate-config CLI checks cert files at startup:
| Condition | Result |
|---|---|
| Certificate expired | [FAIL] — exit code 1 |
| Certificate expires in < 30 days | [WARN] — continues |
| Certificate valid | [PASS] |
| Missing cert file | No crash — logged |
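
The expiry rules in this table reduce to a small classification function. The sketch below assumes the certificate's `notAfter` timestamp has already been parsed into a timezone-aware `datetime`. The `classify_expiry` name is hypothetical, and treating exactly 30 days remaining as a pass is an assumption, since the table does not specify that boundary.

```python
from datetime import datetime, timedelta, timezone

WARN_WINDOW = timedelta(days=30)


def classify_expiry(not_after, now=None):
    """Map a certificate expiry time to PASS, WARN, or FAIL."""
    now = now or datetime.now(timezone.utc)
    if not_after <= now:
        return "FAIL"   # already expired
    if not_after - now < WARN_WINDOW:
        return "WARN"   # valid, but fewer than 30 days remain
    return "PASS"
```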
Air-Gap Mode Operation
Air-gap mode allows the outpost to run in isolated networks with no connectivity to the Arbitex platform. All platform-dependent sync operations are disabled; the outpost runs entirely from a local bundle.
What is disabled in air-gap mode
| Feature | Behavior |
|---|---|
| Policy sync | Skipped — loads from local AirGapLoader bundle |
| Heartbeat | Skipped |
| Audit sync to platform | Skipped — events remain in local queue |
| Cert bundle auto-download | Skipped (guard prevents attempt) |
Bundle loading
Air-gap bundles are loaded by AirGapLoader. A missing bundle is handled gracefully: the outpost logs a warning and continues with last-known-good policy or an empty policy.
Enabling air-gap mode

    AIR_GAP=true  # Enable air-gap operation

When AIR_GAP=true:
- All HTTP calls to platform URLs are suppressed at the call site (guard checks before attempt, not at network layer).
- The circuit breaker is bypassed — there is no “circuit open” in air-gap mode since no calls are attempted.
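
A call-site guard of the kind described above might look like this sketch. The function names, the two callables, and the case-insensitive parsing of `AIR_GAP` are assumptions for illustration; only the environment variable itself comes from this guide.

```python
import os


def air_gap_enabled(env=None):
    """Return True when AIR_GAP is set to "true" (case-insensitive, assumed)."""
    env = os.environ if env is None else env
    return env.get("AIR_GAP", "false").strip().lower() == "true"


def sync_policy(fetch_from_platform, load_local_bundle, env=None):
    """Choose the policy source before any network call is attempted."""
    if air_gap_enabled(env):
        # The guard runs at the call site, so no HTTP request is made
        # and no circuit breaker is consulted.
        return load_local_bundle()
    return fetch_from_platform()
```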
Admin CLI
The outpost CLI provides operational commands for checking status and validating configuration without restarting the process.
Installation
The CLI is installed as an entry point via pyproject.toml:

    outpost status
    outpost validate-config
    # or
    python -m outpost.cli status
    python -m outpost.cli validate-config

outpost status
Queries the outpost admin API and displays current operational state.

    outpost status [--json] [--admin-port PORT] [--admin-key KEY]

Human-readable output:

    Outpost Status
      Status:        healthy
      Circuit open:  false
      Degradation:   false
      Policy age:    42s
      Queue depth:   0
      Audit sync:    2026-03-12T14:20:00Z

    Circuit Breakers:
      policy_sync    CLOSED
      heartbeat      CLOSED
      audit_sync     CLOSED

JSON output (--json):

    {
      "status": "healthy",
      "circuit_breakers": {
        "policy_sync": "CLOSED",
        "heartbeat": "CLOSED",
        "audit_sync": "CLOSED"
      },
      "degradation_mode": false,
      "audit_queue_depth": 0,
      "last_sync": "2026-03-12T14:20:00Z"
    }

Exit codes:
| Code | Condition |
|---|---|
| `0` | Healthy |
| `1` | Connection error, auth failure (401), server error (500), or timeout |
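
For monitoring scripts, the JSON form is easier to consume than the exit code alone. The sketch below parses a payload shaped like the `--json` sample above. The `summarize_status` helper and the keys of its return value are hypothetical.

```python
import json


def summarize_status(raw_json):
    """Reduce `outpost status --json` output to a few fields for alerting."""
    status = json.loads(raw_json)
    not_closed = sorted(
        name
        for name, state in status.get("circuit_breakers", {}).items()
        if state != "CLOSED"
    )
    return {
        "healthy": status.get("status") == "healthy" and not not_closed,
        "not_closed": not_closed,
        "queue_depth": status.get("audit_queue_depth", 0),
    }


sample = """
{ "status": "healthy",
  "circuit_breakers": { "policy_sync": "CLOSED", "heartbeat": "CLOSED", "audit_sync": "CLOSED" },
  "degradation_mode": false, "audit_queue_depth": 0, "last_sync": "2026-03-12T14:20:00Z" }
"""
summary = summarize_status(sample)
```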
outpost validate-config
Runs all startup configuration checks and reports results. Does not require the outpost to be running.

    outpost validate-config [--json]

Output:

    [PASS] mTLS client certificate: valid (expires 2027-01-15)
    [PASS] mTLS CA certificate: valid (expires 2028-06-30)
    [WARN] mTLS intermediate: expires in 22 days (2026-04-03)
    [PASS] GeoIP MMDB: valid (MaxMind magic bytes present)
    [PASS] Platform URL: reachable
    [FAIL] Audit queue directory: not writable

Checks performed:
| Check | Pass condition | Fail condition |
|---|---|---|
| mTLS cert expiry | Valid and > 30 days | Expired |
| mTLS cert expiry (warning) | Valid and > 30 days | < 30 days remaining (WARN) |
| GeoIP MMDB | Magic bytes present (\xab\xcd\xef) | Absent (WARN) |
| Platform URL reachability | HTTP 200 or 401 | Connection error |
| Audit queue directory | Writable | Not writable |
The command exits with code 1 if any check reports [FAIL] and with code 0 otherwise; warnings do not affect the exit code.
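
The exit-code rule can be expressed as a short function over the text report. This sketch assumes the line format shown in the sample output. The `exit_code_for` helper is hypothetical and only mirrors the rule, without running any checks itself.

```python
import re

# Matches report lines such as "[WARN] mTLS intermediate: expires in 22 days".
RESULT_LINE = re.compile(r"^\[(PASS|WARN|FAIL)\]\s+(.*)$")


def exit_code_for(report):
    """Return 1 if any line reports FAIL, otherwise 0 (WARN is ignored)."""
    results = [m.group(1) for m in map(RESULT_LINE.match, report.splitlines()) if m]
    return 1 if "FAIL" in results else 0


report = """\
[PASS] mTLS client certificate: valid (expires 2027-01-15)
[WARN] mTLS intermediate: expires in 22 days (2026-04-03)
[FAIL] Audit queue directory: not writable
"""
code = exit_code_for(report)
```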
Prometheus Metrics
The outpost exposes the following reliability-related metrics at /metrics:
| Metric | Type | Labels | Description |
|---|---|---|---|
| `outpost_circuit_breaker_open` | Gauge | `breaker=<name>` | 1 if circuit open, 0 if closed |
| `outpost_degradation_mode` | Gauge | — | 1 if degraded, 0 if normal |
| `outpost_audit_queue_depth` | Gauge | — | Number of events in local SQLite queue |
| `outpost_requests_total` | Counter | `status=success\|error` | Total requests handled |
| `outpost_policy_sync_age_seconds` | Gauge | — | Seconds since last successful policy sync |
These metrics are refreshed on every GET /metrics request.
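
To show what a scrape of the gauge metrics might return, the sketch below renders them in the Prometheus text exposition format. The `render_metrics` function and its parameters are hypothetical, the request counter is omitted for brevity, and reporting `HALF_OPEN` as 0 is an assumption, since the table only defines values for open and closed.

```python
def render_metrics(breaker_states, degraded, queue_depth, policy_age_seconds):
    """Render reliability gauges in Prometheus text exposition format."""
    lines = ["# TYPE outpost_circuit_breaker_open gauge"]
    for name in sorted(breaker_states):
        value = 1 if breaker_states[name] == "OPEN" else 0
        lines.append(f'outpost_circuit_breaker_open{{breaker="{name}"}} {value}')
    lines += [
        "# TYPE outpost_degradation_mode gauge",
        f"outpost_degradation_mode {1 if degraded else 0}",
        "# TYPE outpost_audit_queue_depth gauge",
        f"outpost_audit_queue_depth {queue_depth}",
        "# TYPE outpost_policy_sync_age_seconds gauge",
        f"outpost_policy_sync_age_seconds {policy_age_seconds}",
    ]
    return "\n".join(lines) + "\n"


text = render_metrics(
    {"policy_sync": "CLOSED", "heartbeat": "CLOSED", "audit_sync": "OPEN"},
    degraded=True,
    queue_depth=12,
    policy_age_seconds=42,
)
```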
Related Documentation
- Outpost Deployment: installation and initial configuration
- Outpost JWT Validation: authentication setup
- Distributed Tracing: OTel configuration for outpost telemetry
- Audit Log Management: understanding audit events