Multi-Outpost Deployment

Multiple Outpost instances can be deployed in a cluster to provide horizontal scale and fault tolerance. Cluster-aware health aggregation (shipping in outpost-0038) gives the Cloud Portal and external load balancers a single health view across all peers in the cluster.


Each Outpost node in a cluster operates independently: it holds its own DLP model, policy bundle, and audit queue. Cluster membership is established through peer discovery. Once peers are discovered, each node:

  1. Polls its peers’ /health endpoints on a configurable interval.
  2. Aggregates the peer health states into a cluster-level health object.
  3. Exposes the aggregated view on its own /cluster/health endpoint.
  4. Applies a per-peer circuit breaker to prevent cascading failures when a peer is unreachable.

The Cloud Portal’s heartbeat mechanism treats the cluster health endpoint as the authoritative health signal when OUTPOST_CLUSTER_MODE=true.
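The aggregation step can be sketched in a few lines. This is illustrative only — the function and field names below are assumptions, not the actual Outpost implementation; only the status values and the peer fields mirror the documented `/cluster/health` response.

```python
from typing import Dict, List

def aggregate_cluster_status(node_status: str, peers: List[Dict]) -> str:
    """Derive cluster_status from this node's own health and its peers' last
    observed states (hypothetical helper; field names follow /cluster/health)."""
    if node_status == "unhealthy":
        return "unhealthy"  # this node itself is failing; peers are irrelevant
    troubled = [p for p in peers
                if p["status"] != "healthy" or p["circuit_breaker"] == "open"]
    return "degraded" if troubled else "healthy"

peers = [
    {"status": "healthy", "circuit_breaker": "closed"},
    {"status": "degraded", "circuit_breaker": "closed"},
]
print(aggregate_cluster_status("healthy", peers))  # degraded
```

Note that a node never reports `cluster_status: "unhealthy"` because of its peers — peer trouble only degrades the aggregate, matching the failure-scenario table below.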


Peers are configured via OUTPOST_PEERS — a comma-separated list of peer base URLs:

OUTPOST_PEERS=https://outpost-1.internal:8443,https://outpost-2.internal:8443,https://outpost-3.internal:8443

Each entry must include protocol and port. The node does not include itself in the peer list.
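Parsing the static list is straightforward; a minimal sketch, assuming a `parse_peers` helper (the function and `self_url` parameter are illustrative names, not part of Outpost):

```python
def parse_peers(raw: str, self_url: str) -> list[str]:
    """Split the comma-separated OUTPOST_PEERS value and drop this
    node's own base URL if an operator accidentally included it."""
    urls = [u.strip() for u in raw.split(",") if u.strip()]
    return [u for u in urls if u != self_url]

raw = ("https://outpost-1.internal:8443,"
       "https://outpost-2.internal:8443,"
       "https://outpost-3.internal:8443")
print(parse_peers(raw, "https://outpost-1.internal:8443"))
# ['https://outpost-2.internal:8443', 'https://outpost-3.internal:8443']
```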

In Kubernetes deployments, use headless Services to enumerate pod IPs dynamically. Set OUTPOST_PEER_DISCOVERY=dns and provide the headless service DNS name:

OUTPOST_PEER_DISCOVERY=dns
OUTPOST_PEER_DISCOVERY_DNS=outpost-headless.arbitex.svc.cluster.local
OUTPOST_PEER_DISCOVERY_PORT=8443

On each discovery interval (OUTPOST_PEER_DISCOVERY_INTERVAL_S, default 60), the node resolves the DNS name and updates its peer list. Pods that disappear from DNS are removed from the peer list and their circuit breakers are reset.
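The discovery refresh can be sketched as follows. This is a hedged illustration, not the shipped code: `discover_peers` and `refresh` are hypothetical names, and the real implementation may track richer circuit-breaker state than the string used here.

```python
import socket

def discover_peers(dns_name: str, port: int, self_ip: str) -> set[str]:
    """Resolve the headless-service name to pod IPs and build peer base URLs."""
    infos = socket.getaddrinfo(dns_name, port, proto=socket.IPPROTO_TCP)
    ips = {info[4][0] for info in infos}
    ips.discard(self_ip)  # a node never includes itself in the peer list
    return {f"https://{ip}:{port}" for ip in ips}

def refresh(current: dict, discovered: set[str]) -> dict:
    """Drop breakers for pods gone from DNS; new pods start with a closed breaker."""
    kept = {url: cb for url, cb in current.items() if url in discovered}
    for url in discovered - kept.keys():
        kept[url] = "closed"
    return kept
```

A pod that falls out of DNS is removed entirely, so if it later reappears its breaker starts fresh in the `closed` state — consistent with the reset behavior described above.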

OUTPOST_PEER_POLL_INTERVAL_S=15 # How often to poll each peer's /health endpoint (default: 15s)
OUTPOST_PEER_DISCOVERY_INTERVAL_S=60 # How often to re-resolve DNS peers (default: 60s)

GET /cluster/health returns the aggregated health of the cluster as observed by this node.

Authentication: None required (same as /health; restrict via network policy in production).

curl https://outpost-1.internal:8443/cluster/health
{
  "cluster_status": "healthy",
  "node_id": "outpost-1",
  "node_status": "healthy",
  "peer_count": 2,
  "healthy_peers": 2,
  "degraded_peers": 0,
  "unreachable_peers": 0,
  "peers": [
    {
      "node_id": "outpost-2",
      "url": "https://outpost-2.internal:8443",
      "status": "healthy",
      "last_checked": "2026-03-13T21:00:05Z",
      "latency_ms": 3,
      "circuit_breaker": "closed"
    },
    {
      "node_id": "outpost-3",
      "url": "https://outpost-3.internal:8443",
      "status": "healthy",
      "last_checked": "2026-03-13T21:00:06Z",
      "latency_ms": 5,
      "circuit_breaker": "closed"
    }
  ],
  "evaluated_at": "2026-03-13T21:00:06Z"
}
cluster_status values:

| Value | Meaning |
| --- | --- |
| healthy | This node is healthy; all peers healthy |
| degraded | This node is healthy; one or more peers degraded or unreachable |
| unhealthy | This node itself is unhealthy (DLP model, policy bundle, or DB failure) |

Response fields:

| Field | Type | Description |
| --- | --- | --- |
| cluster_status | string | Aggregate cluster health from this node’s perspective |
| node_id | string | Identifier of this node (from OUTPOST_NODE_ID env var) |
| node_status | string | Health of this node: healthy, degraded, unhealthy |
| peer_count | integer | Total configured peers |
| healthy_peers | integer | Peers currently in healthy state |
| degraded_peers | integer | Peers currently in degraded state |
| unreachable_peers | integer | Peers with open circuit breaker |
| peers[] | array | Per-peer health detail |
| peers[].node_id | string | Peer’s self-reported node ID |
| peers[].url | string | Peer’s base URL |
| peers[].status | string | Last observed status from peer’s own /health |
| peers[].last_checked | ISO 8601 | Timestamp of last successful poll |
| peers[].latency_ms | integer | Round-trip latency of last poll |
| peers[].circuit_breaker | string | Circuit breaker state: closed, open, half_open |
| evaluated_at | ISO 8601 | Timestamp this response was constructed |

Each peer has an independent circuit breaker. This prevents a slow or unresponsive peer from degrading health-poll latency across the entire cluster.

| State | Description |
| --- | --- |
| closed | Normal operation. Health polls proceed. |
| open | Peer is considered unreachable. Polls suspended. Peer is counted in unreachable_peers. |
| half_open | Recovery probe. One poll is allowed through to test if the peer has recovered. |

| Parameter | Default | Env Var |
| --- | --- | --- |
| Failures to open | 3 consecutive | OUTPOST_CB_FAILURE_THRESHOLD |
| Open duration before half-open | 60 seconds | OUTPOST_CB_RESET_TIMEOUT_S |
| Successes in half-open to close | 1 | OUTPOST_CB_SUCCESS_THRESHOLD |

State transitions:

closed →(3 consecutive failures)→ open
open →(60s elapsed)→ half_open
half_open →(1 success)→ closed
half_open →(any failure)→ open
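The transitions above can be expressed as a small state machine. The class below is a minimal sketch under the documented defaults, not the Outpost source; time is passed in explicitly so the behavior is deterministic.

```python
import time

class PeerCircuitBreaker:
    """Per-peer breaker implementing the closed/open/half_open transitions
    above (illustrative sketch; class and method names are assumptions)."""

    def __init__(self, failure_threshold=3, reset_timeout_s=60, success_threshold=1):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.success_threshold = success_threshold
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_poll(self, now=None) -> bool:
        """True if a health poll may be sent; moves open -> half_open on timeout."""
        now = time.monotonic() if now is None else now
        if self.state == "open" and now - self.opened_at >= self.reset_timeout_s:
            self.state = "half_open"   # allow one recovery probe through
            self.successes = 0
        return self.state != "open"

    def record_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state, self.failures = "closed", 0
        else:
            self.failures = 0          # any success resets the failure streak

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "half_open":
            self.state, self.opened_at = "open", now   # failed probe reopens
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state, self.opened_at = "open", now
```

With the defaults, three consecutive failed polls open the breaker, polls stay suspended for 60 seconds, and a single successful probe in `half_open` closes it again.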

When a circuit breaker is open:

  • The peer is excluded from health polling until the reset timeout elapses.
  • The peer’s status field in /cluster/health is reported as its last known status (stale), and circuit_breaker: "open" is set.
  • The peer is counted in unreachable_peers, which contributes to cluster_status: "degraded".
  • Traffic routing (if Arbitex is used behind a load balancer) is unaffected — the circuit breaker applies only to health monitoring polls, not to inference traffic.

The following table describes how different failure patterns affect cluster_status and expected behavior:

| Scenario | cluster_status | Behavior |
| --- | --- | --- |
| All peers healthy | healthy | Full operation |
| 1 of 3 peers unreachable (CB open) | degraded | Full operation; 1 peer excluded from health view |
| 1 of 3 peers reports degraded | degraded | Full operation; degraded peer still routes traffic |
| This node’s DLP model fails to load | unhealthy | This node returns 503 on inference endpoints |
| This node’s policy bundle is stale > threshold | degraded | This node warns but continues serving |
| All peers unreachable | degraded | This node continues serving from its own state |

Key principle: The cluster health endpoint reflects the observability plane. Peer health polling does not gate inference routing. A node with cluster_status: "degraded" continues to process requests normally — the degraded status is a monitoring signal, not a service interruption.

A node enters node_status: "degraded" when:

  • Policy bundle age exceeds OUTPOST_POLICY_STALE_THRESHOLD_S (default: 3600 — 1 hour)
  • GeoIP MMDB is absent and the GeoIP subsystem is disabled (logged but non-fatal)
  • Peer discovery DNS resolution has failed for more than OUTPOST_PEER_DISCOVERY_STALE_S seconds

A node enters node_status: "unhealthy" when:

  • DLP model (/models/deberta-v3-base/) fails to load or inference returns errors
  • Policy bundle is missing (not stale — completely absent)
  • Database (SQLite for audit queue) is unavailable and OUTPOST_AUDIT_FAIL_MODE=closed
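The two lists above amount to a precedence rule: any unhealthy condition wins over any degraded one. A minimal sketch of that mapping, assuming hypothetical input names (`model_ok`, `bundle_age_s`, etc. are not real Outpost identifiers; the 3600-second threshold matches the documented default):

```python
def derive_node_status(model_ok: bool, bundle_present: bool, bundle_age_s: float,
                       db_ok: bool, audit_fail_mode: str,
                       stale_threshold_s: float = 3600) -> str:
    """Map the documented failure conditions to node_status.
    unhealthy conditions are checked first so they always take precedence."""
    if not model_ok:
        return "unhealthy"                      # DLP model failed to load
    if not bundle_present:
        return "unhealthy"                      # policy bundle completely absent
    if not db_ok and audit_fail_mode == "closed":
        return "unhealthy"                      # audit DB down in fail-closed mode
    if bundle_age_s > stale_threshold_s:
        return "degraded"                       # bundle present but stale
    return "healthy"
```

In fail-open audit mode (anything other than `closed`), a database outage alone does not make the node unhealthy, consistent with the condition list above.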

A reference cluster deployment uses a StatefulSet for stable network identities:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: outpost
  namespace: arbitex
spec:
  serviceName: outpost-headless
  replicas: 3
  selector:
    matchLabels:
      app: outpost
  template:
    metadata:
      labels:
        app: outpost
    spec:
      containers:
        - name: outpost
          image: arbitex/outpost:latest
          env:
            - name: OUTPOST_NODE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name # outpost-0, outpost-1, outpost-2
            - name: OUTPOST_CLUSTER_MODE
              value: "true"
            - name: OUTPOST_PEER_DISCOVERY
              value: "dns"
            - name: OUTPOST_PEER_DISCOVERY_DNS
              value: "outpost-headless.arbitex.svc.cluster.local"
            - name: OUTPOST_PEER_DISCOVERY_PORT
              value: "8443"
            - name: OUTPOST_CB_FAILURE_THRESHOLD
              value: "3"
            - name: OUTPOST_CB_RESET_TIMEOUT_S
              value: "60"
          ports:
            - containerPort: 8443
          readinessProbe:
            httpGet:
              path: /health
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 10
            periodSeconds: 15
          livenessProbe:
            httpGet:
              path: /health
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 30
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: outpost-headless
  namespace: arbitex
spec:
  clusterIP: None
  selector:
    app: outpost
  ports:
    - port: 8443
      targetPort: 8443

The StatefulSet gives each pod a stable DNS name (outpost-0.outpost-headless, outpost-1.outpost-headless, etc.). Peer discovery resolves outpost-headless.arbitex.svc.cluster.local to all pod IPs.


When OUTPOST_CLUSTER_MODE=true, configure the heartbeat URL to point at the cluster health endpoint:

CLOUD_HEARTBEAT_URL=https://api.arbitex.ai/api/admin/outpost/heartbeat
OUTPOST_CLUSTER_HEALTH_PATH=/cluster/health

The Cloud Portal heartbeat receiver accepts cluster health payloads and displays per-node status in Admin → Outposts → [Outpost] → Cluster Health.

The cluster_status field maps to portal indicators:

| cluster_status | Portal indicator |
| --- | --- |
| healthy | Green — all nodes healthy |
| degraded | Amber — one or more peers degraded/unreachable |
| unhealthy | Red — this node is failing |