Multi-Outpost Deployment

Multiple Outpost instances can be deployed in a cluster to provide horizontal scale and fault tolerance. Cluster-aware health aggregation (shipping in outpost-0038) gives the Cloud Portal and external load balancers a single health view across all peers in the cluster.


Each Outpost node in a cluster operates independently: it holds its own DLP model, policy bundle, and audit queue. Cluster membership is established through peer discovery. Once peers are discovered, each node:

  1. Polls its peers’ /health endpoints on a configurable interval.
  2. Aggregates the peer health states into a cluster-level health object.
  3. Exposes the aggregated view on its own /cluster/health endpoint.
  4. Applies a per-peer circuit breaker to prevent cascading failures when a peer is unreachable.

The Cloud Portal’s heartbeat mechanism treats the cluster health endpoint as the authoritative health signal when OUTPOST_CLUSTER_MODE=true.
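The aggregation step can be sketched in a few lines. This is illustrative only — the function and field names below are assumptions, not the actual Outpost implementation; only the status values and the peer fields mirror the documented `/cluster/health` response.

```python
from typing import Dict, List

def aggregate_cluster_status(node_status: str, peers: List[Dict]) -> str:
    """Derive cluster_status from this node's own health and its peers' last
    observed states (hypothetical helper; field names follow /cluster/health)."""
    if node_status == "unhealthy":
        return "unhealthy"  # this node itself is failing; peers are irrelevant
    troubled = [p for p in peers
                if p["status"] != "healthy" or p["circuit_breaker"] == "open"]
    return "degraded" if troubled else "healthy"

peers = [
    {"status": "healthy", "circuit_breaker": "closed"},
    {"status": "degraded", "circuit_breaker": "closed"},
]
print(aggregate_cluster_status("healthy", peers))  # degraded
```

Note that a node never reports `cluster_status: "unhealthy"` because of its peers — peer trouble only degrades the aggregate, matching the failure-scenario table below.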


Peers are configured via OUTPOST_PEERS — a comma-separated list of peer base URLs:

OUTPOST_PEERS=https://outpost-1.internal:8443,https://outpost-2.internal:8443,https://outpost-3.internal:8443

Each entry must include protocol and port. The node does not include itself in the peer list.
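Parsing the static list is straightforward; a minimal sketch, assuming a `parse_peers` helper (the function and `self_url` parameter are illustrative names, not part of Outpost):

```python
def parse_peers(raw: str, self_url: str) -> list[str]:
    """Split the comma-separated OUTPOST_PEERS value and drop this
    node's own base URL if an operator accidentally included it."""
    urls = [u.strip() for u in raw.split(",") if u.strip()]
    return [u for u in urls if u != self_url]

raw = ("https://outpost-1.internal:8443,"
       "https://outpost-2.internal:8443,"
       "https://outpost-3.internal:8443")
print(parse_peers(raw, "https://outpost-1.internal:8443"))
# ['https://outpost-2.internal:8443', 'https://outpost-3.internal:8443']
```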

In Kubernetes deployments, use headless Services to enumerate pod IPs dynamically. Set OUTPOST_PEER_DISCOVERY=dns and provide the headless service DNS name:

OUTPOST_PEER_DISCOVERY=dns
OUTPOST_PEER_DISCOVERY_DNS=outpost-headless.arbitex.svc.cluster.local
OUTPOST_PEER_DISCOVERY_PORT=8443

On each discovery interval (OUTPOST_PEER_DISCOVERY_INTERVAL_S, default 60), the node resolves the DNS name and updates its peer list. Pods that disappear from DNS are removed from the peer list and their circuit breakers are reset.
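The discovery refresh can be sketched as follows. This is a hedged illustration, not the shipped code: `discover_peers` and `refresh` are hypothetical names, and the real implementation may track richer circuit-breaker state than the string used here.

```python
import socket

def discover_peers(dns_name: str, port: int, self_ip: str) -> set[str]:
    """Resolve the headless-service name to pod IPs and build peer base URLs."""
    infos = socket.getaddrinfo(dns_name, port, proto=socket.IPPROTO_TCP)
    ips = {info[4][0] for info in infos}
    ips.discard(self_ip)  # a node never includes itself in the peer list
    return {f"https://{ip}:{port}" for ip in ips}

def refresh(current: dict, discovered: set[str]) -> dict:
    """Drop breakers for pods gone from DNS; new pods start with a closed breaker."""
    kept = {url: cb for url, cb in current.items() if url in discovered}
    for url in discovered - kept.keys():
        kept[url] = "closed"
    return kept
```

A pod that falls out of DNS is removed entirely, so if it later reappears its breaker starts fresh in the `closed` state — consistent with the reset behavior described above.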

OUTPOST_PEER_POLL_INTERVAL_S=15 # How often to poll each peer's /health endpoint (default: 15s)
OUTPOST_PEER_DISCOVERY_INTERVAL_S=60 # How often to re-resolve DNS peers (default: 60s)

GET /cluster/health returns the aggregated health of the cluster as observed by this node.

Authentication: None required (same as /health; restrict via network policy in production).

curl https://outpost-1.internal:8443/cluster/health
{
  "cluster_status": "healthy",
  "node_id": "outpost-1",
  "node_status": "healthy",
  "peer_count": 2,
  "healthy_peers": 2,
  "degraded_peers": 0,
  "unreachable_peers": 0,
  "peers": [
    {
      "node_id": "outpost-2",
      "url": "https://outpost-2.internal:8443",
      "status": "healthy",
      "last_checked": "2026-03-13T21:00:05Z",
      "latency_ms": 3,
      "circuit_breaker": "closed"
    },
    {
      "node_id": "outpost-3",
      "url": "https://outpost-3.internal:8443",
      "status": "healthy",
      "last_checked": "2026-03-13T21:00:06Z",
      "latency_ms": 5,
      "circuit_breaker": "closed"
    }
  ],
  "evaluated_at": "2026-03-13T21:00:06Z"
}
cluster_status values:

| Value | Meaning |
| --- | --- |
| healthy | This node is healthy; all peers healthy |
| degraded | This node is healthy; one or more peers degraded or unreachable |
| unhealthy | This node itself is unhealthy (DLP model, policy bundle, or DB failure) |

Response fields:

| Field | Type | Description |
| --- | --- | --- |
| cluster_status | string | Aggregate cluster health from this node’s perspective |
| node_id | string | Identifier of this node (from OUTPOST_NODE_ID env var) |
| node_status | string | Health of this node: healthy, degraded, unhealthy |
| peer_count | integer | Total configured peers |
| healthy_peers | integer | Peers currently in healthy state |
| degraded_peers | integer | Peers currently in degraded state |
| unreachable_peers | integer | Peers with open circuit breaker |
| peers[] | array | Per-peer health detail |
| peers[].node_id | string | Peer’s self-reported node ID |
| peers[].url | string | Peer’s base URL |
| peers[].status | string | Last observed status from peer’s own /health |
| peers[].last_checked | ISO 8601 | Timestamp of last successful poll |
| peers[].latency_ms | integer | Round-trip latency of last poll |
| peers[].circuit_breaker | string | Circuit breaker state: closed, open, half_open |
| evaluated_at | ISO 8601 | Timestamp this response was constructed |

Each peer has an independent circuit breaker. This prevents a slow or unresponsive peer from degrading health-poll latency across the entire cluster.

| State | Description |
| --- | --- |
| closed | Normal operation. Health polls proceed. |
| open | Peer is considered unreachable. Polls suspended. Peer is counted in unreachable_peers. |
| half_open | Recovery probe. One poll is allowed through to test if the peer has recovered. |

| Parameter | Default | Env Var |
| --- | --- | --- |
| Failures to open | 3 consecutive | OUTPOST_CB_FAILURE_THRESHOLD |
| Open duration before half-open | 60 seconds | OUTPOST_CB_RESET_TIMEOUT_S |
| Successes in half-open to close | 1 | OUTPOST_CB_SUCCESS_THRESHOLD |

State transitions:

closed →(3 consecutive failures)→ open
open →(60s elapsed)→ half_open
half_open →(1 success)→ closed
half_open →(any failure)→ open
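The transitions above can be expressed as a small state machine. The class below is a minimal sketch under the documented defaults, not the Outpost source; time is passed in explicitly so the behavior is deterministic.

```python
import time

class PeerCircuitBreaker:
    """Per-peer breaker implementing the closed/open/half_open transitions
    above (illustrative sketch; class and method names are assumptions)."""

    def __init__(self, failure_threshold=3, reset_timeout_s=60, success_threshold=1):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.success_threshold = success_threshold
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_poll(self, now=None) -> bool:
        """True if a health poll may be sent; moves open -> half_open on timeout."""
        now = time.monotonic() if now is None else now
        if self.state == "open" and now - self.opened_at >= self.reset_timeout_s:
            self.state = "half_open"   # allow one recovery probe through
            self.successes = 0
        return self.state != "open"

    def record_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state, self.failures = "closed", 0
        else:
            self.failures = 0          # any success resets the failure streak

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "half_open":
            self.state, self.opened_at = "open", now   # failed probe reopens
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state, self.opened_at = "open", now
```

With the defaults, three consecutive failed polls open the breaker, polls stay suspended for 60 seconds, and a single successful probe in `half_open` closes it again.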

When a circuit breaker is open:

  • The peer is excluded from health polling until the reset timeout elapses.
  • The peer’s status field in /cluster/health is reported as its last known status (stale), and circuit_breaker: "open" is set.
  • The peer is counted in unreachable_peers, which contributes to cluster_status: "degraded".
  • Traffic routing (if Arbitex is used behind a load balancer) is unaffected — the circuit breaker applies only to health monitoring polls, not to inference traffic.

The following table describes how different failure patterns affect cluster_status and expected behavior:

| Scenario | cluster_status | Behavior |
| --- | --- | --- |
| All peers healthy | healthy | Full operation |
| 1 of 3 peers unreachable (CB open) | degraded | Full operation; 1 peer excluded from health view |
| 1 of 3 peers reports degraded | degraded | Full operation; degraded peer still routes traffic |
| This node’s DLP model fails to load | unhealthy | This node returns 503 on inference endpoints |
| This node’s policy bundle is stale > threshold | degraded | This node warns but continues serving |
| All peers unreachable | degraded | This node continues serving from its own state |

Key principle: The cluster health endpoint reflects the observability plane. Peer health polling does not gate inference routing. A node with cluster_status: "degraded" continues to process requests normally — the degraded status is a monitoring signal, not a service interruption.

A node enters node_status: "degraded" when:

  • Policy bundle age exceeds OUTPOST_POLICY_STALE_THRESHOLD_S (default: 3600 — 1 hour)
  • GeoIP MMDB is absent and the GeoIP subsystem is disabled (logged but non-fatal)
  • Peer discovery DNS resolution has failed for more than OUTPOST_PEER_DISCOVERY_STALE_S seconds

A node enters node_status: "unhealthy" when:

  • DLP model (/models/deberta-v3-base/) fails to load or inference returns errors
  • Policy bundle is missing (not stale — completely absent)
  • Database (SQLite for audit queue) is unavailable and OUTPOST_AUDIT_FAIL_MODE=closed
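The two lists above amount to a precedence rule: any unhealthy condition wins over any degraded one. A minimal sketch of that mapping, assuming hypothetical input names (`model_ok`, `bundle_age_s`, etc. are not real Outpost identifiers; the 3600-second threshold matches the documented default):

```python
def derive_node_status(model_ok: bool, bundle_present: bool, bundle_age_s: float,
                       db_ok: bool, audit_fail_mode: str,
                       stale_threshold_s: float = 3600) -> str:
    """Map the documented failure conditions to node_status.
    unhealthy conditions are checked first so they always take precedence."""
    if not model_ok:
        return "unhealthy"                      # DLP model failed to load
    if not bundle_present:
        return "unhealthy"                      # policy bundle completely absent
    if not db_ok and audit_fail_mode == "closed":
        return "unhealthy"                      # audit DB down in fail-closed mode
    if bundle_age_s > stale_threshold_s:
        return "degraded"                       # bundle present but stale
    return "healthy"
```

In fail-open audit mode (anything other than `closed`), a database outage alone does not make the node unhealthy, consistent with the condition list above.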

A reference cluster deployment uses a StatefulSet for stable network identities:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: outpost
  namespace: arbitex
spec:
  serviceName: outpost-headless
  replicas: 3
  selector:
    matchLabels:
      app: outpost
  template:
    metadata:
      labels:
        app: outpost
    spec:
      containers:
        - name: outpost
          image: arbitex/outpost:latest
          env:
            - name: OUTPOST_NODE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name # outpost-0, outpost-1, outpost-2
            - name: OUTPOST_CLUSTER_MODE
              value: "true"
            - name: OUTPOST_PEER_DISCOVERY
              value: "dns"
            - name: OUTPOST_PEER_DISCOVERY_DNS
              value: "outpost-headless.arbitex.svc.cluster.local"
            - name: OUTPOST_PEER_DISCOVERY_PORT
              value: "8443"
            - name: OUTPOST_CB_FAILURE_THRESHOLD
              value: "3"
            - name: OUTPOST_CB_RESET_TIMEOUT_S
              value: "60"
          ports:
            - containerPort: 8443
          readinessProbe:
            httpGet:
              path: /health
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 10
            periodSeconds: 15
          livenessProbe:
            httpGet:
              path: /health
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 30
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: outpost-headless
  namespace: arbitex
spec:
  clusterIP: None
  selector:
    app: outpost
  ports:
    - port: 8443
      targetPort: 8443

The StatefulSet gives each pod a stable DNS name (outpost-0.outpost-headless, outpost-1.outpost-headless, etc.). Peer discovery resolves outpost-headless.arbitex.svc.cluster.local to all pod IPs.


When OUTPOST_CLUSTER_MODE=true, configure the heartbeat URL to point at the cluster health endpoint:

CLOUD_HEARTBEAT_URL=https://api.arbitex.ai/api/admin/outpost/heartbeat
OUTPOST_CLUSTER_HEALTH_PATH=/cluster/health

The Cloud Portal heartbeat receiver accepts cluster health payloads and displays per-node status in Admin → Outposts → [Outpost] → Cluster Health.

The cluster_status field maps to portal indicators:

| cluster_status | Portal indicator |
| --- | --- |
| healthy | Green — all nodes healthy |
| degraded | Amber — one or more peers degraded/unreachable |
| unhealthy | Red — this node is failing |