
Kill switch — operational guide

The kill switch is an emergency control, not a routine management tool. This guide covers the decision to activate it, how to assess the impact before and after, and how to confirm safe recovery when the incident is over.

For the reference documentation (API, field definitions, fallback chain interaction), see Kill switch.


Activate the kill switch when you need to stop all traffic to a provider or model immediately and cannot wait for the health monitor to react on its own.

| Scenario | Kill switch scope |
| --- | --- |
| Provider security incident — API key exposed, provider breach reported | Provider-wide (provider only, no model_id) |
| Model producing harmful or incorrect output | Model-specific (provider + model_id) |
| Provider billing anomaly — runaway spend on a specific model | Model-specific |
| Planned maintenance — provider scheduled downtime | Provider-wide or model-specific |
| Incident requiring immediate traffic cutoff for investigation | Provider-wide |

The kill switch is intentional and manual. Do not use it as a substitute for proper fallback chain configuration:

  • Provider is unreliable or slow — configure fallback chains and let the health monitor handle transient failures. The kill switch is for emergencies, not for degraded performance.
  • Cost optimization — use budget-based routing or policy engine ROUTE_TO rules to steer traffic away from expensive models.
  • Testing routing — use ROUTE_TO rules in the Policy Engine. Do not activate and immediately deactivate the kill switch in production for testing.
| | Kill switch | Health monitor |
| --- | --- | --- |
| Trigger | Manual — admin action | Automatic — failure threshold |
| Scope | Provider, model, or pair | (provider, model) pair only |
| Recovery | Manual — admin must re-enable | Automatic — after 300s lockout |
| Persistence | Survives restarts | Resets on restart |

Both can be active simultaneously. Re-enabling a pair through the kill switch does not override the health monitor — if the health monitor has also disengaged the pair, it must return to the Active state independently before traffic resumes.
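
The interaction can be sketched as two independent gates. `can_route` and the state names are illustrative stand-ins, not the gateway's actual internals:

```python
def can_route(kill_switch_enabled: bool, health_state: str) -> bool:
    """A (provider, model) pair serves traffic only when BOTH controls allow it."""
    return kill_switch_enabled and health_state == "Active"

assert can_route(True, "Active")        # both controls allow traffic
assert not can_route(True, "Lockout")   # kill switch re-enabled, monitor still locked out
assert not can_route(False, "Active")   # monitor healthy, kill switch still engaged
```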


Before activating, determine how the kill switch will affect traffic.

A provider-wide kill switch (provider field set, model_id omitted) disables all (provider, model) pairs for that provider. Every request that targets any model on the provider will fall through to the fallback chain.
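The scope rule can be sketched with a simplified representation — the `KillSwitch` dataclass and `is_blocked` helper below are hypothetical, for illustration only:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KillSwitch:
    provider: str
    model_id: Optional[str] = None  # None => provider-wide

def is_blocked(switch: KillSwitch, provider: str, model_id: str) -> bool:
    """A request matches a provider-wide switch for any model on the provider,
    but a model-specific switch only for the exact (provider, model) pair."""
    if switch.provider != provider:
        return False
    return switch.model_id is None or switch.model_id == model_id

provider_wide = KillSwitch("openai")             # no model_id: every pair blocked
assert is_blocked(provider_wide, "openai", "gpt-4o")
assert is_blocked(provider_wide, "openai", "gpt-4o-mini")
assert not is_blocked(provider_wide, "anthropic", "claude-3-5-sonnet")

model_specific = KillSwitch("openai", "gpt-4o")  # only this pair blocked
assert is_blocked(model_specific, "openai", "gpt-4o")
assert not is_blocked(model_specific, "openai", "gpt-4o-mini")
```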

Check before activating:

  1. Which models are active on this provider? Go to Settings > Providers > [Provider] and review the model catalog entries.
  2. Do all those models have fallback chains configured? Go to the Routing page and check fallback chains for each model. Models without fallbacks will return 503 provider_unavailable to callers.
  3. What is the current traffic volume? Check the routing metrics before activating so you have a baseline.
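
The second check can be automated as a sketch; `models_at_risk` and the chain structure are illustrative, not an actual gateway API:

```python
def models_at_risk(provider_models, fallback_chains):
    """Models on the provider with no fallback chain configured will return
    503 provider_unavailable to callers once the kill switch activates."""
    return [m for m in provider_models if not fallback_chains.get(m)]

chains = {"gpt-4o": ["claude-3-5-sonnet"], "gpt-4o-mini": []}
assert models_at_risk(["gpt-4o", "gpt-4o-mini"], chains) == ["gpt-4o-mini"]
```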

A model-specific kill switch (provider + model_id) affects only that (provider, model) pair.

Check before activating:

  1. Is this model used as a fallback for other models? A kill-switched fallback entry is silently skipped, not an error — but it removes a layer of redundancy. Check fallback chain configurations that list this model.
  2. Is this model the only option in any chain? If any chain has this model as its only entry and no further fallbacks, callers will receive 503 provider_unavailable after activation.
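
Both checks guard against the same resolution behavior, sketched below (the function name and error handling are illustrative):

```python
def resolve(chain, killed):
    """Walk a fallback chain of (provider, model) pairs: kill-switched entries
    are silently skipped; exhausting the chain means callers receive
    503 provider_unavailable."""
    for pair in chain:
        if pair not in killed:
            return pair
    raise RuntimeError("503 provider_unavailable")

chain = [("openai", "gpt-4o"), ("anthropic", "claude-3-5-sonnet")]
assert resolve(chain, {("openai", "gpt-4o")}) == ("anthropic", "claude-3-5-sonnet")

# A chain whose only entry is kill-switched leaves callers with the 503:
try:
    resolve([("openai", "gpt-4o")], {("openai", "gpt-4o")})
except RuntimeError as err:
    assert "provider_unavailable" in str(err)
```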

Verify you have authorization to activate the kill switch for this incident. The activation reason is required and is recorded permanently in the audit log. Use a reason that accurately describes the incident:

  • "Provider security incident — openai breach reported 2026-03-10, disabling pending investigation"
  • "Model producing harmful output in prod — disabling gpt-4o pending review"
  • "test" — too vague; not acceptable for production activations

Before activating, confirm that alternative providers are available to handle traffic:

```
# Check fallback chain for the primary model being disabled
GET https://api.arbitex.ai/api/providers/fallback/{model_id}
Authorization: Bearer $ADMIN_TOKEN
```

Verify the chain includes at least one healthy alternative. Check health status:

```
GET https://api.arbitex.ai/api/providers/{provider}/models/{model_id}/health
Authorization: Bearer $ADMIN_TOKEN
```

To activate the kill switch, POST to the admin endpoint:

```
POST https://api.arbitex.ai/api/admin/kill-switch
Authorization: Bearer arb_live_your-api-key-here
Content-Type: application/json

{
  "provider": "openai",
  "model_id": "gpt-4o",
  "enabled": false,
  "reason": "Provider security incident — disabling pending investigation (ticket INC-1234)"
}
```

For a provider-wide kill switch, omit model_id:

```
{
  "provider": "openai",
  "enabled": false,
  "reason": "Provider security incident — all openai traffic disabled pending review (ticket INC-1234)"
}
```

The switch takes effect immediately. No restart required.

Check the provider status in the admin portal (Settings > Providers > [Provider Name]). The kill-switched entry should show a red “Disabled” badge.

Verify that traffic is routing to fallbacks:

```
# Check audit log for recent traffic and which providers are serving it
GET https://api.arbitex.ai/api/admin/audit-log?limit=50
Authorization: Bearer $ADMIN_TOKEN
```

Look at the provider field in recent audit entries. If fallbacks are working, entries should show the fallback provider, not the disabled one. If you see 503 provider_unavailable entries, requests have no viable fallback — investigate and configure fallback chains if possible.


The audit log records every kill switch activation and every request affected by it:

```
GET https://api.arbitex.ai/api/admin/audit-log?action=kill_switch_activated&limit=10
```

Audit entries for kill switch activations include:

| Field | Value |
| --- | --- |
| action | kill_switch_activated |
| provider | Affected provider |
| model_id | Affected model, or null for provider-wide |
| user_id | Admin who activated the switch |
| reason | Reason string provided at activation |
| timestamp | Activation time |

Requests blocked by the kill switch produce audit entries with outcome: BLOCK and block_reason: kill_switch. Monitor these to understand traffic volume and identify callers that may need notification.

Identify teams affected by the kill switch. Check which API keys are generating traffic to the disabled provider:

```
GET https://api.arbitex.ai/api/admin/audit-log?provider=openai&limit=100&start_time=<incident_start>
```

Group by api_key_id or user_id to identify the heaviest affected callers. Notify them that traffic is being handled by fallback providers and set expectations for when the primary provider will return.
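
A sketch of that grouping, using illustrative audit entries (field names follow the audit table above; the values are made up):

```python
from collections import Counter

entries = [
    {"api_key_id": "key_alpha", "outcome": "BLOCK", "block_reason": "kill_switch"},
    {"api_key_id": "key_alpha", "outcome": "BLOCK", "block_reason": "kill_switch"},
    {"api_key_id": "key_beta",  "outcome": "BLOCK", "block_reason": "kill_switch"},
    {"api_key_id": "key_gamma", "outcome": "ALLOW", "block_reason": None},
]

# Keep only requests the kill switch blocked, then tally per caller.
blocked = [e for e in entries
           if e["outcome"] == "BLOCK" and e["block_reason"] == "kill_switch"]
by_caller = Counter(e["api_key_id"] for e in blocked)

# Heaviest affected callers first — these teams need notification.
assert by_caller.most_common(1) == [("key_alpha", 2)]
```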


Verify the provider is healthy before re-enabling:

  1. Check the provider’s official status page for confirmation that the incident is resolved.
  2. Test the provider health endpoint directly:
```
GET https://api.arbitex.ai/api/providers/openai/models/gpt-4o/health
Authorization: Bearer $ADMIN_TOKEN
```

A status: "healthy" response indicates the health monitor is reporting normal operation. If the health monitor still shows degraded or unavailable, wait until it returns to healthy before re-enabling the kill switch — re-enabling while the provider is degraded will route live traffic to an unstable provider.
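
That wait-until-healthy step can be sketched as a poll loop; `fetch_status` is a stand-in for the GET above, injected so the sketch stays self-contained:

```python
def wait_until_healthy(fetch_status, max_checks=5):
    """Poll the health endpoint and proceed only once the health monitor
    reports "healthy"; give up after max_checks attempts."""
    for _ in range(max_checks):
        if fetch_status() == "healthy":
            return True
    return False

# Simulated responses: two degraded checks, then recovery.
responses = iter(["degraded", "degraded", "healthy"])
assert wait_until_healthy(lambda: next(responses))
assert not wait_until_healthy(lambda: "unavailable")  # never recovers
```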

Once the provider is confirmed healthy, re-enable by sending the same request with enabled set to true:

```
POST https://api.arbitex.ai/api/admin/kill-switch
Authorization: Bearer arb_live_your-api-key-here
Content-Type: application/json

{
  "provider": "openai",
  "model_id": "gpt-4o",
  "enabled": true,
  "reason": "Provider incident resolved — re-enabling after verification (ticket INC-1234 closed)"
}
```

The re-enable reason is recorded in the audit log alongside the original activation.

After re-enabling:

  1. Verify traffic resumes — check the audit log for requests successfully routed to the re-enabled provider within 1–2 minutes.
  2. Verify health monitor state — confirm that the health monitor has returned the (provider, model) pair to Active state. If the health monitor disengaged the pair independently during the incident (in addition to the kill switch), the health monitor’s 300-second lockout must expire before the pair returns to normal routing.
  3. Audit log completeness — verify that the full incident is documented: activation entry, any requests blocked during the incident, and the re-enable entry. Export this record if an incident report is required.

For incidents involving multiple providers simultaneously (e.g., a shared infrastructure failure), activate kill switches in sequence:

```
# Disable provider 1
POST /api/admin/kill-switch { "provider": "openai", "enabled": false, "reason": "..." }
# Disable provider 2
POST /api/admin/kill-switch { "provider": "anthropic", "enabled": false, "reason": "..." }
```

Each activation produces its own audit entry. When recovering, re-enable in reverse order (re-enable the provider you’re most confident about first, verify traffic, then re-enable the next).
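
The reverse-order recovery can be sketched as follows (a hypothetical helper, for illustration):

```python
def recovery_order(activation_order):
    """Re-enable in reverse order of activation, verifying traffic after
    each step before re-enabling the next provider."""
    return list(reversed(activation_order))

assert recovery_order(["openai", "anthropic"]) == ["anthropic", "openai"]
```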


For scheduled maintenance where you have advance notice, communicate with affected teams before activating. Document the planned window in the reason string:

```
{
  "provider": "azure_openai",
  "enabled": false,
  "reason": "Planned maintenance window 2026-03-15 02:00–04:00 UTC — INC-5678"
}
```

Pre-configure fallback chains to route traffic to alternatives during the maintenance window so callers experience no service disruption.


  • Kill switch — API reference, field definitions, fallback chain interaction
  • Routing — fallback chains and health monitoring
  • Audit Log — querying kill switch events
  • Provider management — health monitor states and provider configuration