API Reference: System Health

Base path: /api/health and /api/admin/system

System health endpoints expose liveness/readiness probes for Kubernetes, detailed component status for operations dashboards, dependency health checks, and version metadata.
Liveness Probe

Returns 200 OK if the platform process is alive. Used by the Kubernetes livenessProbe.

    GET /api/health/live

No authentication required.

Response 200 OK:

    {"status": "ok"}

Response 503 Service Unavailable (the process is in a bad state and will be restarted by the kubelet):

    {"status": "unhealthy", "reason": "deadlock_detected"}

Readiness Probe

Returns 200 OK if the platform is ready to serve traffic (all critical dependencies are reachable). Used by the Kubernetes readinessProbe.

    GET /api/health/ready

No authentication required.

Response 200 OK:

    {"status": "ready"}

Response 503 Service Unavailable (one or more critical dependencies are unavailable):

    {
      "status": "not_ready",
      "failing_checks": ["database", "redis"]
    }

Traffic is not routed to pods that return 503 on this endpoint.
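
A monitoring script or deploy gate may want to read the same signal Kubernetes does. The following is a minimal sketch, assuming only the response shapes documented above; the function names and the `base_url` parameter are illustrative and not part of the platform.

```python
import json
import urllib.error
import urllib.request


def interpret_readiness(status_code: int, body: str) -> tuple[bool, list[str]]:
    """Return (ready, failing_checks) for a /api/health/ready response."""
    payload = json.loads(body) if body else {}
    if status_code == 200 and payload.get("status") == "ready":
        return True, []
    # On 503 the documented body lists the failing dependency checks.
    return False, list(payload.get("failing_checks", []))


def check_ready(base_url: str, timeout: float = 2.0) -> tuple[bool, list[str]]:
    """Fetch the readiness probe; base_url is a placeholder for your deployment."""
    url = f"{base_url}/api/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_readiness(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the 503 body is still readable from the error.
        return interpret_readiness(err.code, err.read().decode("utf-8"))
```

Keeping the parsing in `interpret_readiness` separate from the HTTP call makes the decision logic easy to test without a running platform.
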
Startup Probe

Returns 200 OK once the platform has completed initialization (migrations applied, caches warmed). Used by the Kubernetes startupProbe.

    GET /api/health/startup

No authentication required.

Response 200 OK:

    {
      "status": "started",
      "startup_duration_ms": 4230
    }

Response 503 Service Unavailable (during initialization):

    {
      "status": "starting",
      "phase": "applying_migrations",
      "elapsed_ms": 1200
    }

Detailed Health Status

Full component health status for operations dashboards. Requires an admin token.

    GET /api/admin/system/health
    Authorization: Bearer <admin-token>

Response 200 OK:

    {
      "status": "healthy",
      "timestamp": "2026-03-12T10:00:00Z",
      "version": "1.14.2",
      "components": {
        "database": {
          "status": "healthy",
          "latency_ms": 2.1,
          "pool": { "total": 20, "active": 4, "idle": 16, "waiting": 0 }
        },
        "redis": {
          "status": "healthy",
          "latency_ms": 0.4,
          "memory_used_mb": 128,
          "memory_peak_mb": 256,
          "connected_clients": 12
        },
        "ner_service": {
          "status": "healthy",
          "latency_ms": 18,
          "queue_depth": 0,
          "workers": { "running": 2, "total": 2 }
        },
        "deberta_service": {
          "status": "healthy",
          "latency_ms": 145,
          "queue_depth": 3,
          "workers": { "running": 1, "total": 1 }
        },
        "celery_workers": {
          "status": "healthy",
          "active_tasks": 2,
          "reserved_tasks": 5,
          "worker_count": 3
        },
        "providers": {
          "anthropic": {
            "status": "healthy",
            "latency_ms": 320,
            "last_checked": "2026-03-12T09:59:30Z"
          },
          "openai": {
            "status": "degraded",
            "latency_ms": 2100,
            "last_checked": "2026-03-12T09:59:30Z",
            "detail": "Elevated latency detected"
          },
          "azure_openai": {
            "status": "healthy",
            "latency_ms": 410,
            "last_checked": "2026-03-12T09:59:30Z"
          }
        }
      }
    }

Overall status values:

| Value | Description |
|---|---|
| healthy | All components operating normally |
| degraded | One or more non-critical components degraded |
| unhealthy | Critical component failure |

Component status values: healthy, degraded, unhealthy, unknown.
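
For dashboards or alerting glue, it can help to flatten the response into a list of components that are not healthy. The sketch below assumes the response shape shown above, where `providers` is a group of per-provider entries rather than a single component with its own `status`; the function name and dotted labels are illustrative.

```python
def non_healthy_components(health: dict) -> dict[str, str]:
    """Map component name to status for every entry that is not 'healthy'."""
    found: dict[str, str] = {}
    for name, component in health.get("components", {}).items():
        if "status" in component:
            entries = {name: component}
        else:
            # Assumed to be a group such as "providers": report each child entry.
            entries = {
                f"{name}.{child}": value
                for child, value in component.items()
                if isinstance(value, dict)
            }
        for label, entry in entries.items():
            status = entry.get("status", "unknown")
            if status != "healthy":
                found[label] = status
    return found


# Trimmed version of the sample response from this section.
sample = {
    "status": "healthy",
    "components": {
        "database": {"status": "healthy", "latency_ms": 2.1},
        "providers": {
            "anthropic": {"status": "healthy", "latency_ms": 320},
            "openai": {"status": "degraded", "latency_ms": 2100},
        },
    },
}
print(non_healthy_components(sample))
```

An entry with no `status` field is reported as `unknown`, matching the fourth component status value listed above.
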
Component Health Details

Get the health status for a specific component:

    GET /api/admin/system/health/components/{component}
    Authorization: Bearer <admin-token>

Available components: database, redis, ner_service, deberta_service, celery_workers, providers

Response 200 OK: the component's health object, in the same shape as its entry under components in the full health response.
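
Because an unrecognized component name returns 404, a client can validate the name locally before sending the request. This is a sketch using only the Python standard library; `base_url`, `admin_token`, and the function name are placeholders, and the component list is copied from the text above.

```python
import urllib.request
from urllib.parse import quote

# Component names as listed in this section.
KNOWN_COMPONENTS = (
    "database", "redis", "ner_service",
    "deberta_service", "celery_workers", "providers",
)


def component_health_request(base_url: str, component: str,
                             admin_token: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for one component."""
    if component not in KNOWN_COMPONENTS:
        raise ValueError(f"unknown component: {component!r}")
    root = base_url.rstrip("/")
    url = f"{root}/api/admin/system/health/components/{quote(component)}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {admin_token}"},
        method="GET",
    )
```

The returned request can be passed to `urllib.request.urlopen` when the caller is ready to send it.
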
Provider Connectivity Check

Test connectivity to a specific AI provider:

    POST /api/admin/system/check-provider/{provider_id}
    Authorization: Bearer <admin-token>

Sends a minimal test request to the provider to verify credentials and connectivity:

    {
      "provider_id": "anthropic",
      "status": "reachable",
      "latency_ms": 312,
      "model_tested": "claude-3-haiku-20240307",
      "checked_at": "2026-03-12T10:00:00Z"
    }

Error response when the provider is unreachable:

    {
      "provider_id": "openai",
      "status": "unreachable",
      "error": "connection_timeout",
      "detail": "Request to api.openai.com timed out after 5000ms",
      "checked_at": "2026-03-12T10:00:00Z"
    }

System Version

    GET /api/admin/system/version
    Authorization: Bearer <admin-token>

Response:

    {
      "platform_version": "1.14.2",
      "api_version": "v1",
      "build_date": "2026-03-10T08:00:00Z",
      "git_commit": "a3f8d2c",
      "environment": "production",
      "schema_version": "87",
      "feature_flags": {
        "dlp_enabled": true,
        "budget_enforcement": true,
        "outpost_enabled": true
      }
    }

System Metrics Summary

High-level metrics for an operations overview. For full metrics, use the Prometheus endpoint.

    GET /api/admin/system/metrics-summary
    Authorization: Bearer <admin-token>

Response:

    {
      "period": "last_1h",
      "requests": {
        "total": 48293,
        "success": 47801,
        "error": 492,
        "error_rate": 0.0102
      },
      "latency": { "p50_ms": 210, "p95_ms": 820, "p99_ms": 1540 },
      "dlp": {
        "requests_scanned": 48293,
        "triggers": {
          "block": 12,
          "redact": 89,
          "flag": 203,
          "require_approval": 5
        }
      },
      "active_connections": 42,
      "cache_hit_rate": 0.234
    }

Maintenance Mode

Put the platform in maintenance mode to drain connections before upgrades.

Enable Maintenance Mode

    POST /api/admin/system/maintenance
    Authorization: Bearer <admin-token>
    Content-Type: application/json

    {
      "enabled": true,
      "message": "Scheduled maintenance - back in 30 minutes",
      "drain_timeout_seconds": 60
    }

Response 200 OK:

    {
      "maintenance_enabled": true,
      "enabled_at": "2026-03-12T10:00:00Z",
      "message": "Scheduled maintenance - back in 30 minutes",
      "active_requests_drained": true
    }

While in maintenance mode, new requests receive:

    HTTP/1.1 503 Service Unavailable
    Retry-After: 1800
    Content-Type: application/json

    {
      "error": "maintenance_mode",
      "message": "Scheduled maintenance - back in 30 minutes"
    }

The readiness probe (/api/health/ready) also returns 503 during maintenance, so Kubernetes removes the pod from load balancer rotation.
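
API clients can treat the maintenance response as a signal to back off rather than as a hard failure. The following sketch shows one way to derive a wait time from the documented 503 response. It assumes `Retry-After` is given in whole seconds, as in the example above; the function name and the `default_delay` fallback are hypothetical.

```python
import json
from typing import Mapping, Optional


def maintenance_retry_delay(status_code: int, headers: Mapping[str, str],
                            body: str, default_delay: float = 60.0) -> Optional[float]:
    """Return seconds to wait if the response signals maintenance mode, else None."""
    if status_code != 503:
        return None
    try:
        payload = json.loads(body)
    except ValueError:
        return None
    if not isinstance(payload, dict) or payload.get("error") != "maintenance_mode":
        # Some other 503, for example a failing readiness check.
        return None
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    value = lowered.get("retry-after", "").strip()
    # default_delay is an assumed fallback when the header is absent or not an integer.
    return float(value) if value.isdigit() else default_delay
```

Returning `None` for other 503 bodies lets the caller distinguish planned maintenance from a dependency outage.
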
Disable Maintenance Mode

    POST /api/admin/system/maintenance
    Authorization: Bearer <admin-token>
    Content-Type: application/json

    {"enabled": false}

Prometheus Metrics Endpoint

The platform exposes a standard Prometheus metrics endpoint. The endpoint is unauthenticated and is typically network-restricted to the monitoring subnet.

    GET /metrics

Key metric families:

| Metric | Type | Description |
|---|---|---|
| arbitex_http_requests_total | counter | HTTP requests by method, path, status |
| arbitex_http_request_duration_seconds | histogram | Request latency |
| arbitex_dlp_pipeline_requests_total | counter | DLP pipeline invocations |
| arbitex_dlp_pipeline_duration_seconds | histogram | DLP pipeline duration |
| arbitex_dlp_triggers_total | counter | DLP trigger events by action |
| arbitex_db_pool_available_connections | gauge | Available DB connections |
| arbitex_db_pool_max_connections | gauge | Max DB pool size |
| arbitex_group_spend_current_month_usd | gauge | Per-group month-to-date spend |
| arbitex_group_budget_limit_usd | gauge | Per-group budget limit |
| arbitex_outpost_mtls_cert_expiry_timestamp_seconds | gauge | Outpost mTLS certificate expiry (Unix timestamp) |
| arbitex_mfa_verifications_total | counter | MFA verifications by status |
| arbitex_deberta_queue_depth | gauge | DeBERTa inference queue depth |

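
As an illustration of how two of these gauges combine, the sketch below derives database pool utilization from the text exposition format. It assumes both pool gauges are exposed without labels, which this page does not state; the sample text and function names are made up for the example, and labelled series are skipped by the parser.

```python
def read_unlabelled_samples(text: str) -> dict[str, float]:
    """Parse 'name value' lines from Prometheus text output, skipping comments."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("#") or "{" in fields[0]:
            continue  # blank line, HELP/TYPE comment, or a labelled series
        samples[fields[0]] = float(fields[1])
    return samples


def db_pool_utilization(samples: dict[str, float]) -> float:
    """Fraction of the pool in use: (max - available) / max."""
    maximum = samples["arbitex_db_pool_max_connections"]
    available = samples["arbitex_db_pool_available_connections"]
    return (maximum - available) / maximum


# Hypothetical scrape output, not captured from a real deployment.
scrape = """\
# TYPE arbitex_db_pool_available_connections gauge
arbitex_db_pool_available_connections 16
# TYPE arbitex_db_pool_max_connections gauge
arbitex_db_pool_max_connections 20
"""
print(db_pool_utilization(read_unlabelled_samples(scrape)))
```

In practice the same ratio is usually computed in PromQL or a dashboard; the Python version is only meant to make the relationship between the two gauges explicit.
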
Scrape configuration for Prometheus:

    scrape_configs:
      - job_name: arbitex-platform
        static_configs:
          - targets: ['arbitex-platform:9090']
        scrape_interval: 15s
        metrics_path: /metrics

For Kubernetes, use a ServiceMonitor (Prometheus Operator):

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: arbitex-platform
      namespace: monitoring
    spec:
      selector:
        matchLabels:
          app: arbitex-platform
      endpoints:
        - port: metrics
          interval: 15s

Kubernetes Probe Configuration

Recommended probe settings for platform deployments:

    # In the platform Pod spec
    livenessProbe:
      httpGet:
        path: /api/health/live
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3

    readinessProbe:
      httpGet:
        path: /api/health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 3

    startupProbe:
      httpGet:
        path: /api/health/startup
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 24  # 24 x 5 s = up to 2 min for startup, after the initial delay

Error Reference

| Status | Code | Description |
|---|---|---|
| 401 | unauthorized | Missing or invalid token |
| 403 | forbidden | Token lacks admin scope |
| 404 | not_found | Component name not recognized |
| 503 | unhealthy | Platform is in a degraded state |
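
A client can encode this table as a small lookup and use it to decide whether a retry is worthwhile. The sketch below is one possible mapping. Treating only 503 as retryable is an assumption made for the example, not documented platform behavior, and the names are illustrative.

```python
# Status codes and error codes copied from the table above.
ERROR_CODES = {
    401: "unauthorized",
    403: "forbidden",
    404: "not_found",
    503: "unhealthy",
}

# Assumption: only 503 is transient; auth and lookup errors need a client-side fix.
RETRYABLE_STATUSES = frozenset({503})


def classify_error(status_code: int) -> tuple[str, bool]:
    """Return (error_code, retryable) for a health API error status."""
    code = ERROR_CODES.get(status_code, "unknown")
    return code, status_code in RETRYABLE_STATUSES


for status in (401, 404, 503):
    print(status, classify_error(status))
```
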