Outpost Health Monitoring
Arbitex Hybrid Outpost reports health to the management plane via a dual-track heartbeat system. Every deployed outpost sends periodic heartbeats both to the Platform management plane (policy sync channel) and to the Cloud Portal (operational dashboard). This document explains the architecture, configuration, and how to use the portal dashboard.
Heartbeat Architecture
Two heartbeat channels

    ┌────────────────────────────────────────────────────────────────┐
    │  Hybrid Outpost                                                │
    │                                                                │
    │  HeartbeatSender                                               │
    │  ├── Platform heartbeat ───────────────────────────────────────► Platform management plane
    │  │     POST /v1/orgs/{org_id}/outposts/{outpost_id}/heartbeat  │
    │  │     Auth: mTLS (same cert as policy sync)                   │
    │  │     Interval: 120s (with backoff on failure)                │
    │  │                                                             │
    │  └── Enhanced heartbeat ───────────────────────────────────────► Cloud Portal
    │        POST {CLOUD_HEARTBEAT_URL}/v1/outpost/heartbeat         │
    │        Auth: mTLS preferred; Bearer token fallback             │
    │        Interval: CLOUD_HEARTBEAT_INTERVAL (default 60s)        │
    └────────────────────────────────────────────────────────────────┘

Platform heartbeat carries the operational state used for policy sync decisions: version, uptime, policy sync status, DLP tier 3 activation, and pending audit event count.
Enhanced heartbeat carries the richer telemetry displayed in the Cloud Portal dashboard: DLP tier list, certificate expiry, resource usage (CPU/memory/disk).
The enhanced heartbeat fires after every platform heartbeat attempt, regardless of whether the platform heartbeat succeeded.
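The ordering described above can be sketched as a single cycle function. This is a minimal illustration rather than the outpost's actual HeartbeatSender code: `run_heartbeat_cycle` and the two sender callables are hypothetical names, and swallowing errors from the enhanced heartbeat is an assumption.

```python
from typing import Callable


def run_heartbeat_cycle(
    send_platform: Callable[[], None],
    send_enhanced: Callable[[], None],
) -> bool:
    """Run one heartbeat cycle and report whether the platform heartbeat succeeded.

    Both callables are hypothetical stand-ins for the real HTTP senders.
    """
    try:
        send_platform()
        platform_ok = True
    except Exception:
        # Network errors, timeouts, and non-2xx responses all count as failures.
        platform_ok = False

    # The enhanced heartbeat is attempted whether or not the platform one succeeded.
    try:
        send_enhanced()
    except Exception:
        # Assumption: an enhanced-heartbeat failure does not affect the platform channel.
        pass

    return platform_ok
```

The returned flag is what a caller would feed into the backoff logic for the platform channel.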
Backoff on failure
When platform heartbeats fail (network errors, timeouts, HTTP non-2xx), the outpost applies exponential backoff:

    delay = min(120s × 2^(failures−1), 900s) × ±10% jitter

- First failure: 120s
- Second failure: 240s
- …capped at 900 seconds (15 minutes)
On the next successful heartbeat the interval resets to the base 120s. Jitter (±10%) prevents thundering-herd reconnection when many outposts recover simultaneously.
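The backoff formula can be sketched in a few lines of Python. The function and constant names are illustrative, not taken from the outpost source, and the jitter is assumed to be a uniform multiplier in the range 0.9 to 1.1.

```python
import random
from typing import Optional

BASE_INTERVAL_S = 120   # hardcoded platform heartbeat interval
MAX_BACKOFF_S = 900     # 15-minute cap
JITTER_FRACTION = 0.10  # ±10%


def nominal_delay(failures: int) -> int:
    """Delay in seconds before jitter, for a count of consecutive failures."""
    if failures < 1:
        # No failures: the base interval applies.
        return BASE_INTERVAL_S
    return min(BASE_INTERVAL_S * 2 ** (failures - 1), MAX_BACKOFF_S)


def backoff_delay(failures: int, rng: Optional[random.Random] = None) -> float:
    """Nominal delay scaled by a jitter factor (assumed uniform in [0.9, 1.1])."""
    rng = rng or random.Random()
    factor = rng.uniform(1 - JITTER_FRACTION, 1 + JITTER_FRACTION)
    return nominal_delay(failures) * factor
```

With these definitions `nominal_delay(3)` is 480 and `nominal_delay(4)` is 900, because 960 exceeds the cap. Since the jitter is applied after the cap in the formula as written, a jittered delay can slightly exceed 900 seconds.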
Heartbeat Payload Fields
Platform management plane payload
Sent to `POST /v1/orgs/{org_id}/outposts/{outpost_id}/heartbeat`:
| Field | Type | Description |
|---|---|---|
| `version` | string | Outpost software version (e.g. `0.1.0`) |
| `uptime` | int | Seconds since the outpost process started |
| `policy_version` | string | Version hash of the currently active policy bundle |
| `last_sync_at` | ISO-8601 string or null | Timestamp of the most recent successful policy sync |
| `dlp_model_version` | string | DeBERTa ONNX model identifier, or `none` if Tier 3 is inactive |
| `pending_audit_events` | int | Approximate count of unsynced audit events (capped at 100); `-1` = error reading count |
| `tier3_active` | bool | Whether DeBERTa (Tier 3) contextual DLP is currently loaded and available |
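As an illustration of the fields in this table, the sketch below assembles a platform payload. The helper names are hypothetical, the literal string `"none"` for an inactive Tier 3 model is an assumption about how the table's "none" is encoded, and uptime is measured from module import as a stand-in for process start.

```python
import json
import time
from typing import Optional

PENDING_CAP = 100
_STARTED = time.monotonic()  # stand-in for the process start time


def reported_pending(count: Optional[int]) -> int:
    """Cap the unsynced audit event count at 100; None (read error) becomes -1."""
    if count is None:
        return -1
    return min(count, PENDING_CAP)


def build_platform_payload(
    version: str,
    policy_version: str,
    last_sync_at: Optional[str],
    dlp_model_version: Optional[str],
    pending_count: Optional[int],
) -> dict:
    """Assemble the platform heartbeat body (illustrative only)."""
    tier3_active = dlp_model_version is not None
    return {
        "version": version,
        "uptime": int(time.monotonic() - _STARTED),
        "policy_version": policy_version,
        "last_sync_at": last_sync_at,  # None serialises to JSON null
        # Assumption: "none" is sent as a literal string when Tier 3 is inactive.
        "dlp_model_version": dlp_model_version if tier3_active else "none",
        "pending_audit_events": reported_pending(pending_count),
        "tier3_active": tier3_active,
    }


body = json.dumps(build_platform_payload("0.1.0", "abc123", None, None, 250))
```

Here a backlog of 250 unsynced events is reported as 100, matching the cap described in the table.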
Cloud Portal enhanced payload
Sent to `POST {CLOUD_HEARTBEAT_URL}/v1/outpost/heartbeat`:
| Field | Type | Description |
|---|---|---|
| `outpost_id` | UUID | Outpost identifier |
| `version` | string | Outpost software version |
| `uptime_seconds` | int | Seconds since the outpost process started |
| `last_policy_sync` | ISO-8601 string or null | Timestamp of the most recent successful policy sync |
| `dlp_tiers_active` | string[] | Active DLP tiers: subset of `["regex", "ner", "deberta", "credint"]` |
| `cert_expiry` | ISO-8601 string or null | mTLS certificate expiry date; null if unreadable |
| `resource_usage` | object | CPU/memory/disk percentages: `{cpu_percent, memory_percent, disk_percent}` |
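The enhanced payload can be sketched the same way. The function name is hypothetical, and rejecting unknown tier names with an exception is an assumption; the table only states that the list is a subset of the four known tiers.

```python
from typing import Dict, List, Optional

KNOWN_TIERS = ("regex", "ner", "deberta", "credint")


def build_enhanced_payload(
    outpost_id: str,
    version: str,
    uptime_seconds: int,
    last_policy_sync: Optional[str],
    dlp_tiers_active: List[str],
    cert_expiry: Optional[str],
    resource_usage: Optional[Dict[str, float]] = None,
) -> dict:
    """Assemble the Cloud Portal heartbeat body (illustrative only)."""
    unknown = [tier for tier in dlp_tiers_active if tier not in KNOWN_TIERS]
    if unknown:
        # Assumption: unknown tier names are treated as a programming error.
        raise ValueError(f"unknown DLP tiers: {unknown}")
    payload = {
        "outpost_id": outpost_id,
        "version": version,
        "uptime_seconds": uptime_seconds,
        "last_policy_sync": last_policy_sync,
        "dlp_tiers_active": list(dlp_tiers_active),
        "cert_expiry": cert_expiry,  # None when the certificate is unreadable
    }
    if resource_usage is not None:
        payload["resource_usage"] = resource_usage
    return payload
```

Leaving `resource_usage` out when no metrics are available keeps the rest of the payload valid.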
Platform acknowledgement
The Platform management plane responds to a successful heartbeat with HTTP 200 and, optionally, a JSON body containing `latest_version`. For example, if the outpost is running an outdated version, the body may look like this:

    {"latest_version": "0.2.0"}

The outpost then logs a warning: `Outpost version outdated: running=0.1.0 latest=0.2.0 — update recommended`. No automatic action is taken; the operator must deploy the update.
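A minimal sketch of how an acknowledgement body might be checked follows. The function and logger names are hypothetical, and comparing versions as dotted integers is an assumption about the version scheme; anything that cannot be parsed is treated as "no update information".

```python
import json
import logging
from typing import Optional, Tuple

logger = logging.getLogger("outpost.heartbeat")  # hypothetical logger name


def _parse(version: str) -> Tuple[int, ...]:
    """Assumed scheme: dotted integers such as 0.1.0."""
    return tuple(int(part) for part in version.split("."))


def check_for_update(running: str, body: bytes) -> Optional[str]:
    """Return the advertised latest version if it is newer than the running one."""
    if not body.strip():
        return None  # the JSON body is optional
    try:
        latest = json.loads(body)["latest_version"]
        outdated = _parse(latest) > _parse(running)
    except (ValueError, KeyError, TypeError, AttributeError):
        return None  # malformed body or unexpected version format
    if not outdated:
        return None
    logger.warning(
        "Outpost version outdated: running=%s latest=%s — update recommended",
        running,
        latest,
    )
    return latest
```

Tuple comparison keeps `0.10.0` newer than `0.9.0`, which a plain string comparison would get wrong.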
Configuration
Set these environment variables on the outpost:
| Variable | Required | Default | Description |
|---|---|---|---|
| `CLOUD_HEARTBEAT_URL` | No | `""` | Base URL of the Cloud Portal heartbeat receiver. When empty, enhanced heartbeats are silently skipped. |
| `CLOUD_HEARTBEAT_INTERVAL` | No | `60` | Interval in seconds between enhanced heartbeats. |
| `OUTPOST_CERT_PATH` | Yes (production) | `certs/outpost.pem` | Path to the outpost mTLS client certificate. |
| `OUTPOST_KEY_PATH` | Yes (production) | `certs/outpost.key` | Path to the outpost mTLS private key. |
| `OUTPOST_CA_PATH` | Yes (production) | `certs/ca.pem` | Path to the Platform CA certificate for server verification. |
| `PLATFORM_MANAGEMENT_URL` | Yes | `""` | Platform management plane base URL. Heartbeats are skipped if empty. |
| `OUTPOST_ID` | Yes | `""` | Outpost UUID from the Cloud Portal registration. |
| `ORG_ID` | Yes | `""` | Organisation UUID. Required for the heartbeat URL path. |
Note: The platform heartbeat interval is hardcoded at 120 seconds and is not configurable via environment variable. The CLOUD_HEARTBEAT_INTERVAL setting applies only to the enhanced (Cloud Portal) heartbeat channel.
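To make the defaults concrete, here is a sketch of reading these variables into a typed object. The `HeartbeatConfig` class and `load_config` function are illustrative names, not part of the outpost; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HeartbeatConfig:
    platform_management_url: str
    cloud_heartbeat_url: str
    cloud_heartbeat_interval: int
    outpost_id: str
    org_id: str
    cert_path: str
    key_path: str
    ca_path: str

    @property
    def platform_enabled(self) -> bool:
        # Platform heartbeats are skipped when the URL is empty.
        return bool(self.platform_management_url)

    @property
    def enhanced_enabled(self) -> bool:
        # Enhanced heartbeats are silently skipped when the URL is empty.
        return bool(self.cloud_heartbeat_url)


def load_config(env: Optional[Mapping[str, str]] = None) -> HeartbeatConfig:
    """Read heartbeat settings, applying the documented defaults."""
    env = os.environ if env is None else env
    return HeartbeatConfig(
        platform_management_url=env.get("PLATFORM_MANAGEMENT_URL", ""),
        cloud_heartbeat_url=env.get("CLOUD_HEARTBEAT_URL", ""),
        cloud_heartbeat_interval=int(env.get("CLOUD_HEARTBEAT_INTERVAL") or "60"),
        outpost_id=env.get("OUTPOST_ID", ""),
        org_id=env.get("ORG_ID", ""),
        cert_path=env.get("OUTPOST_CERT_PATH") or "certs/outpost.pem",
        key_path=env.get("OUTPOST_KEY_PATH") or "certs/outpost.key",
        ca_path=env.get("OUTPOST_CA_PATH") or "certs/ca.pem",
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.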
Cloud Portal Dashboard
Outpost list (/portal/outposts)
The Outposts page in the Cloud Portal shows all registered outposts for the organisation. Each row shows:
- Outpost name and region
- Last heartbeat timestamp
- Status badge (green / amber / red — see thresholds below)
- Software version and whether an update is available
- Active DLP tiers
- Certificate expiry date (with warning when < 30 days remaining)
Status thresholds
| Colour | Condition | Meaning |
|---|---|---|
| Green (healthy) | Heartbeat received within the last 5 minutes | Outpost is operating normally |
| Amber (stale) | Last heartbeat 5–30 minutes ago | Outpost may be experiencing connectivity issues or is under high backoff |
| Red (offline) | No heartbeat for > 30 minutes, or deregistered status | Outpost is unreachable or deregistered |
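The thresholds in this table can be expressed as a small classification function. This is a sketch, not the portal's implementation: the function name is hypothetical, and two edge cases are assumptions, namely that an age of exactly 5 minutes counts as green and that an outpost with no recorded heartbeat is shown as red.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HEALTHY_WINDOW = timedelta(minutes=5)
STALE_WINDOW = timedelta(minutes=30)


def status_for(
    last_heartbeat_at: Optional[datetime],
    now: datetime,
    deregistered: bool = False,
) -> str:
    """Map the age of the last heartbeat to a badge colour."""
    if deregistered or last_heartbeat_at is None:
        # Assumption: a never-seen outpost is treated like an offline one.
        return "red"
    age = now - last_heartbeat_at
    if age <= HEALTHY_WINDOW:
        return "green"
    if age <= STALE_WINDOW:
        return "amber"
    return "red"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
badge = status_for(now - timedelta(minutes=10), now)
```

In this example `badge` is `"amber"`, since a 10-minute-old heartbeat falls in the 5 to 30 minute band.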
Heartbeat history
Navigate to an individual outpost and click Heartbeat History to view the last 50 heartbeat records (paginated, newest first). The admin API behind this view:

    GET /v1/admin/outposts/{outpost_id}/heartbeats?limit=50&offset=0
    Authorization: X-API-Key <admin-key>

Each record includes: `received_at`, `status`, `version`, `uptime_seconds`, `policy_version`, `last_sync_at`, `dlp_tiers_active`, `cert_expiry`, `resource_usage`.
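A request for one page of history can be built with the standard library as sketched below. The base URL is a placeholder, the function name is hypothetical, and the header is written exactly as the example above shows it (an `Authorization` header whose value starts with `X-API-Key`); confirm the header format against your deployment before relying on it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def heartbeat_history_request(
    base_url: str,
    outpost_id: str,
    admin_key: str,
    limit: int = 50,
    offset: int = 0,
) -> Request:
    """Build (but do not send) a GET request for one page of heartbeat records."""
    query = urlencode({"limit": limit, "offset": offset})
    url = f"{base_url.rstrip('/')}/v1/admin/outposts/{outpost_id}/heartbeats?{query}"
    return Request(
        url,
        method="GET",
        # Assumption: header shape mirrors the documented example literally.
        headers={"Authorization": f"X-API-Key {admin_key}"},
    )


# Second page of results for a placeholder outpost on a placeholder host.
req = heartbeat_history_request("https://portal.example.com", "OUTPOST-ID", "ADMIN-KEY", offset=50)
```

Passing `req` to `urllib.request.urlopen` would send it; increasing `offset` by `limit` walks through older records.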
All-org outpost list (multi-org admins)

    GET /v1/admin/outposts
    Authorization: X-API-Key <admin-key>

Returns all outposts across all organisations, ordered by most-recently-seen first.
Troubleshooting
Outpost shows amber/red but is running
Cause: The outpost process is running and generating heartbeats, but they are not reaching the Platform or Cloud Portal.
Checks:
- Firewall rules. The outpost must be able to make outbound HTTPS connections to `PLATFORM_MANAGEMENT_URL` and `CLOUD_HEARTBEAT_URL`. Verify there is no egress firewall blocking TCP 443.
- mTLS certificate validity. The outpost will refuse to send heartbeats if any of `OUTPOST_CERT_PATH`, `OUTPOST_KEY_PATH`, or `OUTPOST_CA_PATH` is missing. Check outpost logs for `mTLS certificates required but missing`.
- Backoff state. After repeated failures the outpost may be sleeping for up to 15 minutes between attempts. Check logs for `Heartbeat backoff — sleeping Xs (failure=N)`. Wait for the next attempt or restart the outpost to reset the backoff counter.
- Proxy/load balancer. If the outpost connects via a forward proxy, confirm the proxy allows connections to both the Platform management plane and the Cloud Portal.
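The certificate check in the list above can be run by hand with a short script. This is a diagnostic sketch, not part of the outpost: the function name is hypothetical, and it only tests that the files exist, not that the certificate is valid or unexpired.

```python
import os
from typing import List, Mapping, Optional

# Documented defaults for the three mTLS file paths.
CERT_DEFAULTS = {
    "OUTPOST_CERT_PATH": "certs/outpost.pem",
    "OUTPOST_KEY_PATH": "certs/outpost.key",
    "OUTPOST_CA_PATH": "certs/ca.pem",
}


def missing_cert_files(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of variables whose target file does not exist."""
    env = os.environ if env is None else env
    return [
        var
        for var, default in CERT_DEFAULTS.items()
        if not os.path.isfile(env.get(var) or default)
    ]
```

An empty list means all three files are present; any names returned point at the paths to fix first.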
Stale status in portal after outpost restart
The Cloud Portal status is derived from last_heartbeat_at. After restart there is a 120-second window before the first platform heartbeat and up to CLOUD_HEARTBEAT_INTERVAL seconds before the first enhanced heartbeat. The status will update automatically once the first heartbeat is received.
Missed heartbeats after policy sync disruption
The heartbeat sender is independent of the policy sync client. A failed policy sync does not prevent heartbeats from being sent. If heartbeats are missing while policy sync is also failing, the root cause is likely a network connectivity issue or an expired mTLS certificate.
Certificate expiry warning in portal
When `cert_expiry` in the heartbeat is within 30 days, the portal shows a warning badge on the outpost row. Renew the outpost certificate before it expires:

    # Via Cloud admin API
    POST /v1/orgs/{org_id}/outposts/{outpost_id}/renew
    Authorization: X-API-Key <admin-key>

The renewed certificate bundle (cert + key + CA) must be deployed to the outpost’s `OUTPOST_CERT_PATH` and `OUTPOST_KEY_PATH`. The outpost process picks up the new cert on next mTLS client creation (next heartbeat cycle after the files are replaced in-place).
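The 30-day warning rule can be reproduced from a heartbeat record to decide when to schedule a renewal. The sketch below is illustrative: the function name is hypothetical, timestamps without an offset are assumed to be UTC, and a null `cert_expiry` is assumed not to raise a warning.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WARNING_WINDOW = timedelta(days=30)


def cert_needs_renewal(cert_expiry: Optional[str], now: datetime) -> bool:
    """True when the certificate expires in under 30 days (or already has)."""
    if cert_expiry is None:
        # Assumption: an unreadable expiry produces no warning here.
        return False
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    text = cert_expiry[:-1] + "+00:00" if cert_expiry.endswith("Z") else cert_expiry
    expiry = datetime.fromisoformat(text)
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)  # assumed UTC
    return expiry - now < WARNING_WINDOW


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
due = cert_needs_renewal("2024-01-20T00:00:00Z", now)
```

Here `due` is `True`, because 19 days remain. The `now` argument must be timezone-aware so the subtraction is well defined.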
Resource usage fields missing from history
`resource_usage` is populated by the psutil library. If psutil is not installed in the outpost container image, resource fields are omitted from the enhanced heartbeat payload. This does not affect other heartbeat functionality. Install psutil to enable CPU/memory/disk reporting:

    RUN pip install psutil
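The optional dependency described here is commonly handled with a guarded import. The sketch below shows that pattern under stated assumptions: the function names are hypothetical, and sampling disk usage for the root path `/` is a guess at what the outpost measures. The psutil calls themselves (`cpu_percent`, `virtual_memory`, `disk_usage`) are part of its public API.

```python
from typing import Optional

try:
    import psutil  # optional dependency
except ImportError:
    psutil = None


def collect_resource_usage() -> Optional[dict]:
    """Return CPU/memory/disk percentages, or None when psutil is unavailable."""
    if psutil is None:
        return None
    return {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "memory_percent": psutil.virtual_memory().percent,
        # Assumption: the root filesystem is the one worth reporting.
        "disk_percent": psutil.disk_usage("/").percent,
    }


def with_resource_usage(payload: dict) -> dict:
    """Attach resource_usage only when metrics could be collected."""
    usage = collect_resource_usage()
    if usage is None:
        return dict(payload)
    return {**payload, "resource_usage": usage}
```

Without psutil the payload is returned unchanged, which matches the behaviour described above: the fields are simply absent and the heartbeat still goes out.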