API Reference: System Health

Base paths: /api/health and /api/admin/system

System health endpoints expose liveness/readiness probes for Kubernetes, detailed component status for operations dashboards, dependency health checks, and version metadata.


Liveness

Returns 200 OK if the platform process is alive. Used by the Kubernetes livenessProbe.

GET /api/health/live

No authentication required.

Response 200 OK:

{"status": "ok"}

Response 503 Service Unavailable (process is in a bad state; the kubelet will restart the container):

{"status": "unhealthy", "reason": "deadlock_detected"}

Readiness

Returns 200 OK if the platform is ready to serve traffic (all critical dependencies are reachable). Used by the Kubernetes readinessProbe.

GET /api/health/ready

No authentication required.

Response 200 OK:

{"status": "ready"}

Response 503 Service Unavailable (one or more critical dependencies unavailable):

{
  "status": "not_ready",
  "failing_checks": ["database", "redis"]
}

Traffic is not routed to pods returning 503 on this endpoint.
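A deployment script or sidecar that polls the readiness endpoint can interpret the payload as follows. This is a minimal sketch; `interpret_readiness` is a hypothetical helper, not part of any platform SDK:

```python
import json

def interpret_readiness(status_code: int, body: str):
    """Interpret a /api/health/ready response.

    Returns (ready, failing_checks): a 200 with status "ready" means the
    pod can receive traffic; a 503 carries the failing dependency checks.
    """
    payload = json.loads(body)
    if status_code == 200 and payload.get("status") == "ready":
        return True, []
    return False, payload.get("failing_checks", [])
```

For example, the 503 body shown above yields `(False, ["database", "redis"])`.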


Startup

Returns 200 OK once the platform has completed initialization (migrations applied, caches warmed). Used by the Kubernetes startupProbe.

GET /api/health/startup

No authentication required.

Response 200 OK:

{
  "status": "started",
  "startup_duration_ms": 4230
}

Response 503 during initialization:

{
  "status": "starting",
  "phase": "applying_migrations",
  "elapsed_ms": 1200
}
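Outside Kubernetes (e.g. in a deployment script), the startup endpoint can be polled until initialization completes. A minimal sketch; `wait_for_startup` and its `probe` callable (which returns the decoded JSON body of GET /api/health/startup) are assumptions for illustration:

```python
import time

def wait_for_startup(probe, timeout_s: float = 120.0, interval_s: float = 5.0) -> dict:
    """Poll a startup probe callable until it reports "started".

    Raises TimeoutError if initialization does not finish within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        body = probe()
        if body.get("status") == "started":
            return body
        time.sleep(interval_s)
    raise TimeoutError("platform did not finish initialization")
```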

Detailed Health

Full component health status for operations dashboards. Requires an admin token.

GET /api/admin/system/health
Authorization: Bearer <admin-token>

Response 200 OK:

{
  "status": "healthy",
  "timestamp": "2026-03-12T10:00:00Z",
  "version": "1.14.2",
  "components": {
    "database": {
      "status": "healthy",
      "latency_ms": 2.1,
      "pool": {
        "total": 20,
        "active": 4,
        "idle": 16,
        "waiting": 0
      }
    },
    "redis": {
      "status": "healthy",
      "latency_ms": 0.4,
      "memory_used_mb": 128,
      "memory_peak_mb": 256,
      "connected_clients": 12
    },
    "ner_service": {
      "status": "healthy",
      "latency_ms": 18,
      "queue_depth": 0,
      "workers": {
        "running": 2,
        "total": 2
      }
    },
    "deberta_service": {
      "status": "healthy",
      "latency_ms": 145,
      "queue_depth": 3,
      "workers": {
        "running": 1,
        "total": 1
      }
    },
    "celery_workers": {
      "status": "healthy",
      "active_tasks": 2,
      "reserved_tasks": 5,
      "worker_count": 3
    },
    "providers": {
      "anthropic": {
        "status": "healthy",
        "latency_ms": 320,
        "last_checked": "2026-03-12T09:59:30Z"
      },
      "openai": {
        "status": "degraded",
        "latency_ms": 2100,
        "last_checked": "2026-03-12T09:59:30Z",
        "detail": "Elevated latency detected"
      },
      "azure_openai": {
        "status": "healthy",
        "latency_ms": 410,
        "last_checked": "2026-03-12T09:59:30Z"
      }
    }
  }
}

Overall status values:

| Value | Description |
| --- | --- |
| healthy | All components operating normally |
| degraded | One or more components degraded (non-critical) |
| unhealthy | Critical component failure |

Component status values: healthy | degraded | unhealthy | unknown
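The overall status rolls up from component statuses. A minimal sketch of that aggregation, assuming a flat map of components (the nested providers group would need flattening first); `overall_status` is a hypothetical helper illustrating the documented semantics:

```python
def overall_status(components: dict) -> str:
    """Roll per-component statuses up into the overall value.

    Any unhealthy component makes the platform unhealthy; otherwise any
    degraded component makes it degraded; otherwise it is healthy.
    """
    statuses = {c["status"] for c in components.values()}
    if "unhealthy" in statuses:
        return "unhealthy"
    if "degraded" in statuses:
        return "degraded"
    return "healthy"
```

For the sample response above, the degraded openai provider yields an overall status of degraded once providers are flattened in.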


Component Health

Get the health status of a specific component:

GET /api/admin/system/health/components/{component}
Authorization: Bearer <admin-token>

Available components: database, redis, ner_service, deberta_service, celery_workers, providers

Response 200 OK: Component-specific health object from the full health response.


Provider Connectivity Check

Test connectivity to a specific AI provider:

POST /api/admin/system/check-provider/{provider_id}
Authorization: Bearer <admin-token>

Sends a minimal test request to the provider to verify credentials and connectivity. Response 200 OK:

{
  "provider_id": "anthropic",
  "status": "reachable",
  "latency_ms": 312,
  "model_tested": "claude-3-haiku-20240307",
  "checked_at": "2026-03-12T10:00:00Z"
}

Error response when provider is unreachable:

{
  "provider_id": "openai",
  "status": "unreachable",
  "error": "connection_timeout",
  "detail": "Request to api.openai.com timed out after 5000ms",
  "checked_at": "2026-03-12T10:00:00Z"
}
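A monitoring script consuming either shape of this response might render it as a one-line log message. A minimal sketch; `summarize_provider_check` is a hypothetical helper, not part of the API:

```python
def summarize_provider_check(result: dict) -> str:
    """Render a provider connectivity check result as a one-line message."""
    if result["status"] == "reachable":
        return f'{result["provider_id"]}: reachable in {result["latency_ms"]}ms'
    return f'{result["provider_id"]}: {result["error"]} ({result["detail"]})'
```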

Version

Returns version and build metadata for the running deployment.

GET /api/admin/system/version
Authorization: Bearer <admin-token>

Response 200 OK:

{
  "platform_version": "1.14.2",
  "api_version": "v1",
  "build_date": "2026-03-10T08:00:00Z",
  "git_commit": "a3f8d2c",
  "environment": "production",
  "schema_version": "87",
  "feature_flags": {
    "dlp_enabled": true,
    "budget_enforcement": true,
    "outpost_enabled": true
  }
}

Metrics Summary

High-level metrics for an operations overview. For full metrics, use the Prometheus endpoint.

GET /api/admin/system/metrics-summary
Authorization: Bearer <admin-token>

Response 200 OK:

{
  "period": "last_1h",
  "requests": {
    "total": 48293,
    "success": 47801,
    "error": 492,
    "error_rate": 0.0102
  },
  "latency": {
    "p50_ms": 210,
    "p95_ms": 820,
    "p99_ms": 1540
  },
  "dlp": {
    "requests_scanned": 48293,
    "triggers": {
      "block": 12,
      "redact": 89,
      "flag": 203,
      "require_approval": 5
    }
  },
  "active_connections": 42,
  "cache_hit_rate": 0.234
}
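The error_rate field is simply error divided by total (492 / 48293 ≈ 0.0102), and the summary lends itself to simple threshold alerting. A minimal sketch; the helper names and the 5% / 2000 ms thresholds are illustrative assumptions, not platform defaults:

```python
def error_rate(requests: dict) -> float:
    """Recompute the summary's error_rate field from the raw counts."""
    return round(requests["error"] / requests["total"], 4)

def should_alert(summary: dict, max_error_rate: float = 0.05, max_p99_ms: int = 2000) -> bool:
    """Flag the summary when error rate or tail latency exceeds a threshold."""
    return (summary["requests"]["error_rate"] > max_error_rate
            or summary["latency"]["p99_ms"] > max_p99_ms)
```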

Maintenance Mode

Put the platform into maintenance mode to drain connections before upgrades.

POST /api/admin/system/maintenance
Authorization: Bearer <admin-token>
Content-Type: application/json

{
  "enabled": true,
  "message": "Scheduled maintenance — back in 30 minutes",
  "drain_timeout_seconds": 60
}

Response 200 OK:

{
  "maintenance_enabled": true,
  "enabled_at": "2026-03-12T10:00:00Z",
  "message": "Scheduled maintenance — back in 30 minutes",
  "active_requests_drained": true
}

While in maintenance mode, new requests receive:

HTTP/1.1 503 Service Unavailable
Retry-After: 1800
Content-Type: application/json

{
  "error": "maintenance_mode",
  "message": "Scheduled maintenance — back in 30 minutes"
}

Health probes (/api/health/ready) return 503 during maintenance, so Kubernetes removes the pod from load balancer rotation.
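Clients should honor the Retry-After header rather than retrying immediately. A minimal sketch of that decision; `retry_delay` and its 30-second fallback are illustrative assumptions:

```python
def retry_delay(status_code: int, headers: dict, default_s: int = 30):
    """Decide how long a client should wait before retrying.

    Returns None when the request should not be retried (non-503),
    otherwise the Retry-After value in seconds, or a default fallback.
    """
    if status_code != 503:
        return None
    value = headers.get("Retry-After")
    return int(value) if value and value.isdigit() else default_s
```

For the maintenance response above, `retry_delay(503, {"Retry-After": "1800"})` returns 1800 seconds (30 minutes).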

To disable maintenance mode, post enabled: false to the same endpoint:

POST /api/admin/system/maintenance
Authorization: Bearer <admin-token>
Content-Type: application/json

{"enabled": false}

Prometheus Metrics

The platform exposes a standard Prometheus metrics endpoint. The endpoint is unauthenticated but typically network-restricted to the monitoring subnet.

GET /metrics

Key metric families:

| Metric | Type | Description |
| --- | --- | --- |
| arbitex_http_requests_total | counter | HTTP requests by method, path, status |
| arbitex_http_request_duration_seconds | histogram | Request latency |
| arbitex_dlp_pipeline_requests_total | counter | DLP pipeline invocations |
| arbitex_dlp_pipeline_duration_seconds | histogram | DLP pipeline duration |
| arbitex_dlp_triggers_total | counter | DLP trigger events by action |
| arbitex_db_pool_available_connections | gauge | Available DB connections |
| arbitex_db_pool_max_connections | gauge | Max DB pool size |
| arbitex_group_spend_current_month_usd | gauge | Per-group MTD spend |
| arbitex_group_budget_limit_usd | gauge | Per-group budget limit |
| arbitex_outpost_mtls_cert_expiry_timestamp_seconds | gauge | Outpost cert expiry |
| arbitex_mfa_verifications_total | counter | MFA verifications by status |
| arbitex_deberta_queue_depth | gauge | DeBERTa inference queue depth |
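If you consume /metrics outside Prometheus, the text exposition format is straightforward to parse. A minimal sketch that ignores timestamps and assumes label values contain no spaces; `parse_metrics` is a hypothetical helper:

```python
def parse_metrics(text: str) -> dict:
    """Parse metric samples from Prometheus text exposition output.

    Keeps the full sample name (including any {label="..."} set) as the
    key. Comment lines (# HELP / # TYPE) and blank lines are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples
```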

Scrape configuration for Prometheus:

scrape_configs:
  - job_name: arbitex-platform
    static_configs:
      - targets: ['arbitex-platform:9090']
    scrape_interval: 15s
    metrics_path: /metrics

For Kubernetes, use a ServiceMonitor (Prometheus Operator):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arbitex-platform
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: arbitex-platform
  endpoints:
    - port: metrics
      interval: 15s

Kubernetes Probe Configuration

Recommended probe settings for platform deployments:

# In platform Pod spec
livenessProbe:
  httpGet:
    path: /api/health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /api/health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /api/health/startup
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 24  # Allow up to 2 min for startup
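The startup window these settings allow can be checked with simple arithmetic: initialDelaySeconds plus failureThreshold probe periods (10 + 24 × 5 = 130 seconds, roughly two minutes). A sketch, with `max_startup_window_s` as an illustrative helper:

```python
def max_startup_window_s(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    """Upper bound on how long Kubernetes waits before the startup probe
    fails the pod: initialDelaySeconds + failureThreshold * periodSeconds."""
    return initial_delay_s + failure_threshold * period_s
```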

Error Responses

| Status | Code | Description |
| --- | --- | --- |
| 401 | unauthorized | Missing or invalid token |
| 403 | forbidden | Token lacks admin scope |
| 404 | not_found | Component name not recognized |
| 503 | unhealthy | Platform is in a degraded state |