API Reference: System Health

Base paths: /api/health and /api/admin/system

System health endpoints expose liveness/readiness probes for Kubernetes, detailed component status for operations dashboards, dependency health checks, and version metadata.


Liveness

Returns 200 OK if the platform process is alive. Used by the Kubernetes livenessProbe.

GET /api/health/live

No authentication required.

Response 200 OK:

{"status": "ok"}

Response 503 Service Unavailable (process is in a bad state; the kubelet will restart the container):

{"status": "unhealthy", "reason": "deadlock_detected"}

Readiness

Returns 200 OK if the platform is ready to serve traffic (all critical dependencies are reachable). Used by the Kubernetes readinessProbe.

GET /api/health/ready

No authentication required.

Response 200 OK:

{"status": "ready"}

Response 503 Service Unavailable (one or more critical dependencies unavailable):

{
  "status": "not_ready",
  "failing_checks": ["database", "redis"]
}

Traffic is not routed to pods returning 503 on this endpoint.
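A deployment script or sidecar that polls the readiness endpoint can interpret the payload as follows. This is a minimal sketch; `interpret_readiness` is a hypothetical helper, not part of any platform SDK:

```python
import json

def interpret_readiness(status_code: int, body: str):
    """Interpret a /api/health/ready response.

    Returns (ready, failing_checks): a 200 with status "ready" means the
    pod can receive traffic; a 503 carries the failing dependency checks.
    """
    payload = json.loads(body)
    if status_code == 200 and payload.get("status") == "ready":
        return True, []
    return False, payload.get("failing_checks", [])
```

For example, the 503 body shown above yields `(False, ["database", "redis"])`.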


Startup

Returns 200 OK once the platform has completed initialization (migrations applied, caches warmed). Used by the Kubernetes startupProbe.

GET /api/health/startup

No authentication required.

Response 200 OK:

{
  "status": "started",
  "startup_duration_ms": 4230
}

Response 503 during initialization:

{
  "status": "starting",
  "phase": "applying_migrations",
  "elapsed_ms": 1200
}
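Outside Kubernetes (e.g. in a deployment script), the startup endpoint can be polled until initialization completes. A minimal sketch; `wait_for_startup` and its `probe` callable (which returns the decoded JSON body of GET /api/health/startup) are assumptions for illustration:

```python
import time

def wait_for_startup(probe, timeout_s: float = 120.0, interval_s: float = 5.0) -> dict:
    """Poll a startup probe callable until it reports "started".

    Raises TimeoutError if initialization does not finish within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        body = probe()
        if body.get("status") == "started":
            return body
        time.sleep(interval_s)
    raise TimeoutError("platform did not finish initialization")
```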

Detailed Health

Full component health status for operations dashboards. Requires an admin token.

GET /api/admin/system/health
Authorization: Bearer <admin-token>

Response 200 OK:

{
  "status": "healthy",
  "timestamp": "2026-03-12T10:00:00Z",
  "version": "1.14.2",
  "components": {
    "database": {
      "status": "healthy",
      "latency_ms": 2.1,
      "pool": {
        "total": 20,
        "active": 4,
        "idle": 16,
        "waiting": 0
      }
    },
    "redis": {
      "status": "healthy",
      "latency_ms": 0.4,
      "memory_used_mb": 128,
      "memory_peak_mb": 256,
      "connected_clients": 12
    },
    "ner_service": {
      "status": "healthy",
      "latency_ms": 18,
      "queue_depth": 0,
      "workers": {
        "running": 2,
        "total": 2
      }
    },
    "deberta_service": {
      "status": "healthy",
      "latency_ms": 145,
      "queue_depth": 3,
      "workers": {
        "running": 1,
        "total": 1
      }
    },
    "celery_workers": {
      "status": "healthy",
      "active_tasks": 2,
      "reserved_tasks": 5,
      "worker_count": 3
    },
    "providers": {
      "anthropic": {
        "status": "healthy",
        "latency_ms": 320,
        "last_checked": "2026-03-12T09:59:30Z"
      },
      "openai": {
        "status": "degraded",
        "latency_ms": 2100,
        "last_checked": "2026-03-12T09:59:30Z",
        "detail": "Elevated latency detected"
      },
      "azure_openai": {
        "status": "healthy",
        "latency_ms": 410,
        "last_checked": "2026-03-12T09:59:30Z"
      }
    }
  }
}

Overall status values:

| Value | Description |
| --- | --- |
| healthy | All components operating normally |
| degraded | One or more components degraded (non-critical) |
| unhealthy | Critical component failure |

Component status values: healthy | degraded | unhealthy | unknown
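The overall status rolls up from component statuses. A minimal sketch of that aggregation, assuming a flat map of components (the nested providers group would need flattening first); `overall_status` is a hypothetical helper illustrating the documented semantics:

```python
def overall_status(components: dict) -> str:
    """Roll per-component statuses up into the overall value.

    Any unhealthy component makes the platform unhealthy; otherwise any
    degraded component makes it degraded; otherwise it is healthy.
    """
    statuses = {c["status"] for c in components.values()}
    if "unhealthy" in statuses:
        return "unhealthy"
    if "degraded" in statuses:
        return "degraded"
    return "healthy"
```

For the sample response above, the degraded openai provider yields an overall status of degraded once providers are flattened in.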


Component Health

Get the health status of a specific component:

GET /api/admin/system/health/components/{component}
Authorization: Bearer <admin-token>

Available components: database, redis, ner_service, deberta_service, celery_workers, providers

Response 200 OK: Component-specific health object from the full health response.


Provider Connectivity Check

Test connectivity to a specific AI provider:

POST /api/admin/system/check-provider/{provider_id}
Authorization: Bearer <admin-token>

Sends a minimal test request to the provider to verify credentials and connectivity. Response 200 OK:

{
  "provider_id": "anthropic",
  "status": "reachable",
  "latency_ms": 312,
  "model_tested": "claude-3-haiku-20240307",
  "checked_at": "2026-03-12T10:00:00Z"
}

Error response when provider is unreachable:

{
  "provider_id": "openai",
  "status": "unreachable",
  "error": "connection_timeout",
  "detail": "Request to api.openai.com timed out after 5000ms",
  "checked_at": "2026-03-12T10:00:00Z"
}
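A monitoring script consuming either shape of this response might render it as a one-line log message. A minimal sketch; `summarize_provider_check` is a hypothetical helper, not part of the API:

```python
def summarize_provider_check(result: dict) -> str:
    """Render a provider connectivity check result as a one-line message."""
    if result["status"] == "reachable":
        return f'{result["provider_id"]}: reachable in {result["latency_ms"]}ms'
    return f'{result["provider_id"]}: {result["error"]} ({result["detail"]})'
```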

Version

Returns version and build metadata for the running deployment.

GET /api/admin/system/version
Authorization: Bearer <admin-token>

Response 200 OK:

{
  "platform_version": "1.14.2",
  "api_version": "v1",
  "build_date": "2026-03-10T08:00:00Z",
  "git_commit": "a3f8d2c",
  "environment": "production",
  "schema_version": "87",
  "feature_flags": {
    "dlp_enabled": true,
    "budget_enforcement": true,
    "outpost_enabled": true
  }
}

Metrics Summary

High-level metrics for an operations overview. For full metrics, use the Prometheus endpoint.

GET /api/admin/system/metrics-summary
Authorization: Bearer <admin-token>

Response 200 OK:

{
  "period": "last_1h",
  "requests": {
    "total": 48293,
    "success": 47801,
    "error": 492,
    "error_rate": 0.0102
  },
  "latency": {
    "p50_ms": 210,
    "p95_ms": 820,
    "p99_ms": 1540
  },
  "dlp": {
    "requests_scanned": 48293,
    "triggers": {
      "block": 12,
      "redact": 89,
      "flag": 203,
      "require_approval": 5
    }
  },
  "active_connections": 42,
  "cache_hit_rate": 0.234
}
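The error_rate field is simply error divided by total (492 / 48293 ≈ 0.0102), and the summary lends itself to simple threshold alerting. A minimal sketch; the helper names and the 5% / 2000 ms thresholds are illustrative assumptions, not platform defaults:

```python
def error_rate(requests: dict) -> float:
    """Recompute the summary's error_rate field from the raw counts."""
    return round(requests["error"] / requests["total"], 4)

def should_alert(summary: dict, max_error_rate: float = 0.05, max_p99_ms: int = 2000) -> bool:
    """Flag the summary when error rate or tail latency exceeds a threshold."""
    return (summary["requests"]["error_rate"] > max_error_rate
            or summary["latency"]["p99_ms"] > max_p99_ms)
```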

Maintenance Mode

Put the platform into maintenance mode to drain connections before upgrades.

POST /api/admin/system/maintenance
Authorization: Bearer <admin-token>
Content-Type: application/json

{
  "enabled": true,
  "message": "Scheduled maintenance — back in 30 minutes",
  "drain_timeout_seconds": 60
}

Response 200 OK:

{
  "maintenance_enabled": true,
  "enabled_at": "2026-03-12T10:00:00Z",
  "message": "Scheduled maintenance — back in 30 minutes",
  "active_requests_drained": true
}

While in maintenance mode, new requests receive:

HTTP/1.1 503 Service Unavailable
Retry-After: 1800
Content-Type: application/json

{
  "error": "maintenance_mode",
  "message": "Scheduled maintenance — back in 30 minutes"
}

Health probes (/api/health/ready) return 503 during maintenance, so Kubernetes removes the pod from load balancer rotation.
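Clients should honor the Retry-After header rather than retrying immediately. A minimal sketch of that decision; `retry_delay` and its 30-second fallback are illustrative assumptions:

```python
def retry_delay(status_code: int, headers: dict, default_s: int = 30):
    """Decide how long a client should wait before retrying.

    Returns None when the request should not be retried (non-503),
    otherwise the Retry-After value in seconds, or a default fallback.
    """
    if status_code != 503:
        return None
    value = headers.get("Retry-After")
    return int(value) if value and value.isdigit() else default_s
```

For the maintenance response above, `retry_delay(503, {"Retry-After": "1800"})` returns 1800 seconds (30 minutes).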

To disable maintenance mode, post enabled: false to the same endpoint:

POST /api/admin/system/maintenance
Authorization: Bearer <admin-token>
Content-Type: application/json

{"enabled": false}

Prometheus Metrics

The platform exposes a standard Prometheus metrics endpoint. The endpoint is unauthenticated but typically network-restricted to the monitoring subnet.

GET /metrics

Key metric families:

| Metric | Type | Description |
| --- | --- | --- |
| arbitex_http_requests_total | counter | HTTP requests by method, path, status |
| arbitex_http_request_duration_seconds | histogram | Request latency |
| arbitex_dlp_pipeline_requests_total | counter | DLP pipeline invocations |
| arbitex_dlp_pipeline_duration_seconds | histogram | DLP pipeline duration |
| arbitex_dlp_triggers_total | counter | DLP trigger events by action |
| arbitex_db_pool_available_connections | gauge | Available DB connections |
| arbitex_db_pool_max_connections | gauge | Max DB pool size |
| arbitex_group_spend_current_month_usd | gauge | Per-group MTD spend |
| arbitex_group_budget_limit_usd | gauge | Per-group budget limit |
| arbitex_outpost_mtls_cert_expiry_timestamp_seconds | gauge | Outpost cert expiry |
| arbitex_mfa_verifications_total | counter | MFA verifications by status |
| arbitex_deberta_queue_depth | gauge | DeBERTa inference queue depth |
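If you consume /metrics outside Prometheus, the text exposition format is straightforward to parse. A minimal sketch that ignores timestamps and assumes label values contain no spaces; `parse_metrics` is a hypothetical helper:

```python
def parse_metrics(text: str) -> dict:
    """Parse metric samples from Prometheus text exposition output.

    Keeps the full sample name (including any {label="..."} set) as the
    key. Comment lines (# HELP / # TYPE) and blank lines are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples
```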

Scrape configuration for Prometheus:

scrape_configs:
  - job_name: arbitex-platform
    static_configs:
      - targets: ['arbitex-platform:9090']
    scrape_interval: 15s
    metrics_path: /metrics

For Kubernetes, use a ServiceMonitor (Prometheus Operator):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arbitex-platform
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: arbitex-platform
  endpoints:
    - port: metrics
      interval: 15s

Kubernetes Probe Configuration

Recommended probe settings for platform deployments:

# In platform Pod spec
livenessProbe:
  httpGet:
    path: /api/health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /api/health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /api/health/startup
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 24  # Allow up to 2 min for startup
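The startup window these settings allow can be checked with simple arithmetic: initialDelaySeconds plus failureThreshold probe periods (10 + 24 × 5 = 130 seconds, roughly two minutes). A sketch, with `max_startup_window_s` as an illustrative helper:

```python
def max_startup_window_s(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    """Upper bound on how long Kubernetes waits before the startup probe
    fails the pod: initialDelaySeconds + failureThreshold * periodSeconds."""
    return initial_delay_s + failure_threshold * period_s
```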

Error Responses

| Status | Code | Description |
| --- | --- | --- |
| 401 | unauthorized | Missing or invalid token |
| 403 | forbidden | Token lacks admin scope |
| 404 | not_found | Component name not recognized |
| 503 | unhealthy | Platform is in a degraded state |