# Multi-Outpost Deployment

Multiple Outpost instances can be deployed in a cluster to provide horizontal scale and fault tolerance. Cluster-aware health aggregation (shipping in outpost-0038) gives the Cloud Portal and external load balancers a single health view across all peers in the cluster.
## Architecture Overview

Each Outpost node in a cluster operates independently: it holds its own DLP model, policy bundle, and audit queue. Cluster membership is established through peer discovery. Once peers are discovered, each node:
- Polls its peers' `/health` endpoints on a configurable interval.
- Aggregates the peer health states into a cluster-level health object.
- Exposes the aggregated view on its own `/cluster/health` endpoint.
- Applies a per-peer circuit breaker to prevent cascading failures when a peer is unreachable.
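The aggregation step above can be sketched as follows. This is a minimal illustration of the documented rules, not the actual Outpost implementation; the function name is hypothetical:

```python
# Hypothetical sketch of cluster-level health aggregation.
# Per the docs: a node reports "unhealthy" based on its own state only;
# peer problems can at most make the cluster "degraded".

def aggregate_cluster_status(node_status: str, peer_statuses: list[str]) -> str:
    """Derive cluster_status from this node's status and observed peer states.

    peer_statuses entries are "healthy", "degraded", or "unreachable"
    (unreachable = circuit breaker open).
    """
    if node_status == "unhealthy":
        return "unhealthy"          # this node itself is failing
    if any(s != "healthy" for s in peer_statuses):
        return "degraded"           # peer trouble is a monitoring signal only
    return "healthy"


print(aggregate_cluster_status("healthy", ["healthy", "healthy"]))      # healthy
print(aggregate_cluster_status("healthy", ["healthy", "unreachable"]))  # degraded
print(aggregate_cluster_status("unhealthy", ["healthy", "healthy"]))    # unhealthy
```

Note that peer state never escalates the cluster to `unhealthy`; that value is reserved for failures on the reporting node itself.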
The Cloud Portal's heartbeat mechanism treats the cluster health endpoint as the authoritative health signal when `OUTPOST_CLUSTER_MODE=true`.
## Peer Discovery

### Static Configuration

Peers are configured via `OUTPOST_PEERS` — a comma-separated list of peer base URLs:
```shell
OUTPOST_PEERS=https://outpost-1.internal:8443,https://outpost-2.internal:8443,https://outpost-3.internal:8443
```

Each entry must include protocol and port. The node does not include itself in the peer list.
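Parsing this variable amounts to splitting on commas and excluding the node's own URL. A minimal sketch (the helper name is illustrative, and the self-filter is belt-and-suspenders since operators should omit the node from its own list anyway):

```python
# Hypothetical sketch: parse an OUTPOST_PEERS-style value and drop this
# node's own base URL. The env var format (comma-separated base URLs,
# each with protocol and port) is as documented.

def parse_peers(peers_env: str, self_url: str) -> list[str]:
    peers = [p.strip() for p in peers_env.split(",") if p.strip()]
    # A node never polls itself, so filter out its own base URL.
    return [p for p in peers if p != self_url]


peers = parse_peers(
    "https://outpost-1.internal:8443,https://outpost-2.internal:8443,https://outpost-3.internal:8443",
    self_url="https://outpost-1.internal:8443",
)
print(peers)  # ['https://outpost-2.internal:8443', 'https://outpost-3.internal:8443']
```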
### Kubernetes Service Discovery

In Kubernetes deployments, use a headless Service to enumerate pod IPs dynamically. Set `OUTPOST_PEER_DISCOVERY=dns` and provide the headless service DNS name:
```shell
OUTPOST_PEER_DISCOVERY=dns
OUTPOST_PEER_DISCOVERY_DNS=outpost-headless.arbitex.svc.cluster.local
OUTPOST_PEER_DISCOVERY_PORT=8443
```

On each discovery interval (`OUTPOST_PEER_DISCOVERY_INTERVAL_S`, default 60), the node re-resolves the DNS name and updates its peer list. Pods that disappear from DNS are removed from the peer list and their circuit breakers are reset.
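A headless Service resolves to one address per ready pod, so discovery reduces to a DNS lookup plus URL construction. A minimal sketch (the function name is hypothetical; the real node also maps resolved addresses back to node IDs and circuit breakers):

```python
# Hypothetical sketch of DNS-based peer discovery against a headless
# Service: resolve the service name, dedupe addresses, and build one
# peer base URL per resolved IP on the configured port.
import socket

def discover_peers(dns_name: str, port: int) -> list[str]:
    infos = socket.getaddrinfo(dns_name, port, type=socket.SOCK_STREAM)
    ips = sorted({info[4][0] for info in infos})  # dedupe, stable order
    return [f"https://{ip}:{port}" for ip in ips]
```

Re-running this on the discovery interval naturally picks up pods that were added or removed, matching the peer-list refresh behavior described above.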
### Discovery Interval

```shell
OUTPOST_PEER_POLL_INTERVAL_S=15       # How often to poll each peer's /health endpoint (default: 15s)
OUTPOST_PEER_DISCOVERY_INTERVAL_S=60  # How often to re-resolve DNS peers (default: 60s)
```

## Cluster Health Endpoint
### GET /cluster/health

Returns the aggregated health of the cluster as observed by this node.

Authentication: none required (same as `/health`; restrict via network policy in production).
#### Request

```shell
curl https://outpost-1.internal:8443/cluster/health
```

#### Response

```json
{
  "cluster_status": "healthy",
  "node_id": "outpost-1",
  "node_status": "healthy",
  "peer_count": 2,
  "healthy_peers": 2,
  "degraded_peers": 0,
  "unreachable_peers": 0,
  "peers": [
    {
      "node_id": "outpost-2",
      "url": "https://outpost-2.internal:8443",
      "status": "healthy",
      "last_checked": "2026-03-13T21:00:05Z",
      "latency_ms": 3,
      "circuit_breaker": "closed"
    },
    {
      "node_id": "outpost-3",
      "url": "https://outpost-3.internal:8443",
      "status": "healthy",
      "last_checked": "2026-03-13T21:00:06Z",
      "latency_ms": 5,
      "circuit_breaker": "closed"
    }
  ],
  "evaluated_at": "2026-03-13T21:00:06Z"
}
```

### cluster_status Values

| Value | Meaning |
|---|---|
| `healthy` | This node is healthy; all peers healthy |
| `degraded` | This node is healthy; one or more peers degraded or unreachable |
| `unhealthy` | This node itself is unhealthy (DLP model, policy bundle, or DB failure) |
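A monitoring consumer of this endpoint typically parses the JSON and surfaces which peers are unreachable. A minimal sketch, using a payload trimmed to the relevant fields (a degraded example rather than the healthy one above):

```python
# Minimal sketch of a monitoring consumer for /cluster/health.
# The payload shape follows the documented response; only the fields
# used here are included.
import json

raw = """{
  "cluster_status": "degraded",
  "node_id": "outpost-1",
  "unreachable_peers": 1,
  "peers": [
    {"node_id": "outpost-2", "status": "healthy", "circuit_breaker": "closed"},
    {"node_id": "outpost-3", "status": "degraded", "circuit_breaker": "open"}
  ]
}"""

health = json.loads(raw)

# Peers with an open circuit breaker are the ones counted in unreachable_peers.
open_breakers = [p["node_id"] for p in health["peers"]
                 if p["circuit_breaker"] == "open"]
print(health["cluster_status"], open_breakers)  # degraded ['outpost-3']
```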
### Field Reference

| Field | Type | Description |
|---|---|---|
| `cluster_status` | string | Aggregate cluster health from this node's perspective |
| `node_id` | string | Identifier of this node (from the `OUTPOST_NODE_ID` env var) |
| `node_status` | string | Health of this node: `healthy`, `degraded`, `unhealthy` |
| `peer_count` | integer | Total configured peers |
| `healthy_peers` | integer | Peers currently in `healthy` state |
| `degraded_peers` | integer | Peers currently in `degraded` state |
| `unreachable_peers` | integer | Peers with an open circuit breaker |
| `peers[]` | array | Per-peer health detail |
| `peers[].node_id` | string | Peer's self-reported node ID |
| `peers[].url` | string | Peer's base URL |
| `peers[].status` | string | Last observed status from the peer's own `/health` |
| `peers[].last_checked` | ISO 8601 | Timestamp of the last successful poll |
| `peers[].latency_ms` | integer | Round-trip latency of the last poll |
| `peers[].circuit_breaker` | string | Circuit breaker state: `closed`, `open`, `half_open` |
| `evaluated_at` | ISO 8601 | Timestamp this response was constructed |
## Circuit Breaker Per Peer

Each peer has an independent circuit breaker. This prevents a slow or unresponsive peer from degrading health-poll latency across the entire cluster.
### States

| State | Description |
|---|---|
| `closed` | Normal operation. Health polls proceed. |
| `open` | Peer is considered unreachable. Polls suspended. Peer is counted in `unreachable_peers`. |
| `half_open` | Recovery probe. One poll is allowed through to test whether the peer has recovered. |
### Thresholds

| Parameter | Default | Env Var |
|---|---|---|
| Failures to open | 3 consecutive | `OUTPOST_CB_FAILURE_THRESHOLD` |
| Open duration before half-open | 60 seconds | `OUTPOST_CB_RESET_TIMEOUT_S` |
| Successes in half-open to close | 1 | `OUTPOST_CB_SUCCESS_THRESHOLD` |
State transitions:

```text
closed    →(3 consecutive failures)→ open
open      →(60s elapsed)→ half_open
half_open →(1 success)→ closed
half_open →(any failure)→ open
```

### Circuit Breaker Behavior

When a circuit breaker is open:
- The peer is excluded from health polling until the reset timeout elapses.
- The peer's `status` field in `/cluster/health` is reported as its last known status (stale), and `circuit_breaker: "open"` is set.
- The peer is counted in `unreachable_peers`, which contributes to `cluster_status: "degraded"`.
- Traffic routing (if Arbitex is used behind a load balancer) is unaffected — the circuit breaker applies only to health monitoring polls, not to inference traffic.
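The documented thresholds and transitions can be sketched as a small state machine. This is an illustration only, not Outpost's internals; the class and method names are hypothetical, and the defaults mirror `OUTPOST_CB_FAILURE_THRESHOLD=3`, `OUTPOST_CB_RESET_TIMEOUT_S=60`, and a success threshold of 1:

```python
import time

class PeerCircuitBreaker:
    """Per-peer breaker following the documented transitions:
    closed →(N consecutive failures)→ open →(timeout)→ half_open,
    half_open →(success)→ closed, half_open →(failure)→ open.
    With OUTPOST_CB_SUCCESS_THRESHOLD=1, a single success closes it."""

    def __init__(self, failure_threshold=3, reset_timeout_s=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def allow_poll(self) -> bool:
        """True if a health poll may proceed right now."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout_s:
                self.state = "half_open"   # allow one recovery probe
            else:
                return False               # polls suspended while open
        return True

    def record_success(self):
        self.state = "closed"
        self.failures = 0

    def record_failure(self):
        if self.state == "half_open":
            self._open()                   # failed probe: reopen immediately
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.failures = 0
```

Injecting the clock makes the open-duration transition testable without real sleeps, which is also how such breakers are usually unit-tested.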
## Degradation States

The following table describes how different failure patterns affect `cluster_status` and the expected behavior:
| Scenario | cluster_status | Behavior |
|---|---|---|
| All peers healthy | healthy | Full operation |
| 1 of 3 peers unreachable (CB open) | degraded | Full operation; 1 peer excluded from health view |
| 1 of 3 peers reports degraded | degraded | Full operation; degraded peer still routes traffic |
| This node’s DLP model fails to load | unhealthy | This node returns 503 on inference endpoints |
| This node’s policy bundle is stale > threshold | degraded | This node warns but continues serving |
| All peers unreachable | degraded | This node continues serving from its own state |
Key principle: the cluster health endpoint reflects the observability plane. Peer health polling does not gate inference routing. A node with `cluster_status: "degraded"` continues to process requests normally — the degraded status is a monitoring signal, not a service interruption.
## Node-Level Degradation

A node enters `node_status: "degraded"` when:

- Policy bundle age exceeds `OUTPOST_POLICY_STALE_THRESHOLD_S` (default: `3600` — 1 hour)
- The GeoIP MMDB is absent and the GeoIP subsystem is disabled (logged but non-fatal)
- Peer discovery DNS resolution has failed for more than `OUTPOST_PEER_DISCOVERY_STALE_S` seconds
A node enters `node_status: "unhealthy"` when:

- The DLP model (`/models/deberta-v3-base/`) fails to load or inference returns errors
- The policy bundle is missing (not stale — completely absent)
- The database (SQLite for the audit queue) is unavailable and `OUTPOST_AUDIT_FAIL_MODE=closed`
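Taken together, the two lists above form a simple precedence: hard failures win over staleness. A sketch of that evaluation, with illustrative input names standing in for whatever the real node gathers from its subsystems (combinations the docs do not list fall through to `healthy` here):

```python
# Hypothetical sketch of node_status evaluation, mirroring the documented
# degraded/unhealthy conditions. Input names are assumptions for
# illustration, not Outpost internals.

def evaluate_node_status(
    dlp_model_ok: bool,
    policy_bundle_present: bool,
    policy_bundle_age_s: float,
    audit_db_ok: bool,
    audit_fail_mode: str,                  # "open" or "closed"
    policy_stale_threshold_s: float = 3600,  # OUTPOST_POLICY_STALE_THRESHOLD_S
) -> str:
    # Unhealthy: hard failures that stop the node from serving safely.
    if not dlp_model_ok:
        return "unhealthy"                 # DLP model failed to load / errors
    if not policy_bundle_present:
        return "unhealthy"                 # bundle completely absent
    if not audit_db_ok and audit_fail_mode == "closed":
        return "unhealthy"                 # audit queue down, fail-closed
    # Degraded: the node warns but continues serving.
    if policy_bundle_age_s > policy_stale_threshold_s:
        return "degraded"                  # stale (but present) policy bundle
    return "healthy"
```

Note that a stale bundle only degrades the node, while a missing bundle makes it unhealthy — the checks must run in that order.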
## Kubernetes Deployment Pattern

A reference cluster deployment uses a StatefulSet for stable network identities:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: outpost
  namespace: arbitex
spec:
  serviceName: outpost-headless
  replicas: 3
  selector:
    matchLabels:
      app: outpost
  template:
    metadata:
      labels:
        app: outpost
    spec:
      containers:
        - name: outpost
          image: arbitex/outpost:latest
          env:
            - name: OUTPOST_NODE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name  # outpost-0, outpost-1, outpost-2
            - name: OUTPOST_CLUSTER_MODE
              value: "true"
            - name: OUTPOST_PEER_DISCOVERY
              value: "dns"
            - name: OUTPOST_PEER_DISCOVERY_DNS
              value: "outpost-headless.arbitex.svc.cluster.local"
            - name: OUTPOST_PEER_DISCOVERY_PORT
              value: "8443"
            - name: OUTPOST_CB_FAILURE_THRESHOLD
              value: "3"
            - name: OUTPOST_CB_RESET_TIMEOUT_S
              value: "60"
          ports:
            - containerPort: 8443
          readinessProbe:
            httpGet:
              path: /health
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 10
            periodSeconds: 15
          livenessProbe:
            httpGet:
              path: /health
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 30
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: outpost-headless
  namespace: arbitex
spec:
  clusterIP: None
  selector:
    app: outpost
  ports:
    - port: 8443
      targetPort: 8443
```

The StatefulSet gives each pod a stable DNS name (`outpost-0.outpost-headless`, `outpost-1.outpost-headless`, etc.). Peer discovery resolves `outpost-headless.arbitex.svc.cluster.local` to all pod IPs.
## Cloud Portal Integration

When `OUTPOST_CLUSTER_MODE=true`, configure the heartbeat URL to point at the cluster health endpoint:

```shell
CLOUD_HEARTBEAT_URL=https://api.arbitex.ai/api/admin/outpost/heartbeat
OUTPOST_CLUSTER_HEALTH_PATH=/cluster/health
```

The Cloud Portal heartbeat receiver accepts cluster health payloads and displays per-node status in Admin → Outposts → [Outpost] → Cluster Health.
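Conceptually, the heartbeat sender in cluster mode reads the local `/cluster/health` view and forwards it to the portal. A minimal sketch under stated assumptions: the function names and the forwarded payload fields are illustrative, as the actual heartbeat schema is defined by the Cloud Portal receiver:

```python
# Hypothetical sketch of a cluster-mode heartbeat sender. build_heartbeat
# selects illustrative fields from the /cluster/health document; the real
# payload schema belongs to the Cloud Portal receiver.
import json
import urllib.request

def build_heartbeat(cluster_health: dict) -> bytes:
    return json.dumps({
        "node_id": cluster_health["node_id"],
        "cluster_status": cluster_health["cluster_status"],
        "peers": cluster_health.get("peers", []),
    }).encode()

def send_heartbeat(url: str, cluster_health: dict) -> None:
    req = urllib.request.Request(
        url,
        data=build_heartbeat(cluster_health),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # a real sender adds auth, timeouts, retries
```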
The `cluster_status` field maps to portal indicators:

| `cluster_status` | Portal indicator |
|---|---|
| `healthy` | Green — all nodes healthy |
| `degraded` | Amber — one or more peers degraded/unreachable |
| `unhealthy` | Red — this node is failing |