Kill switch — operational guide

The kill switch is an emergency control, not a routine management tool. This guide covers the decision to activate it, how to assess the impact before and after, and how to confirm safe recovery when the incident is over.

For the reference documentation (API, field definitions, fallback chain interaction), see the Kill switch reference page.

When to use the kill switch

Activate the kill switch when you need to stop all traffic to a provider or model immediately and cannot wait for the health monitor to react on its own.

Appropriate use cases

Scenario	Kill switch scope
Provider security incident — API key exposed, provider breach reported	Provider-wide (`provider` only, no `model_id`)
Model producing harmful or incorrect output	Model-specific (`provider` + `model_id`)
Provider billing anomaly — runaway spend on a specific model	Model-specific
Planned maintenance — provider scheduled downtime	Provider-wide or model-specific
Incident requiring immediate traffic cutoff for investigation	Provider-wide

When not to use the kill switch

The kill switch is a deliberate, manual control. Do not use it in place of proper routing and fallback chain configuration in the following situations:

Provider is unreliable or slow — configure fallback chains and let the health monitor handle transient failures. The kill switch is for emergencies, not for degraded performance.
Cost optimization — use budget-based routing or policy engine ROUTE_TO rules to steer traffic away from expensive models.
Testing routing — use ROUTE_TO rules in the Policy Engine. Do not activate and immediately deactivate the kill switch in production for testing.

Kill switch vs health monitor

	Kill switch	Health monitor
Trigger	Manual — admin action	Automatic — failure threshold
Scope	Provider, model, or pair	(provider, model) pair only
Recovery	Manual — admin must re-enable	Automatic — after 300s lockout
Persistence	Survives restarts	Resets on restart

Both can be active simultaneously. Clearing the kill switch (setting enabled to true) does not override the health monitor: if the health monitor has also disengaged the pair, it must return to the Active state on its own before traffic resumes.
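The interaction between the two controls can be read as a single predicate. The Python sketch below is illustrative only: the function name and the health state strings are assumptions made for this example, not part of the product API.

```python
def pair_receives_traffic(kill_switch_enabled: bool, health_state: str) -> bool:
    """Return True only when both controls allow traffic to a (provider, model) pair.

    kill_switch_enabled mirrors the `enabled` field of the kill switch request
    (False means the kill switch is active). health_state is a hypothetical
    string for the health monitor state; only "active" allows traffic.
    """
    return kill_switch_enabled and health_state.lower() == "active"


# Kill switch cleared while the health monitor is still in lockout.
still_blocked = not pair_receives_traffic(True, "locked_out")

# Kill switch cleared and health monitor back in the Active state.
flowing = pair_receives_traffic(True, "Active")
```

Neither control can force traffic on by itself, which is why recovery needs both checks.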

Assessing blast radius before activation

Before activating, determine how the kill switch will affect traffic.

Provider-wide kill switch

A provider-wide kill switch (provider field set, model_id omitted) disables all (provider, model) pairs for that provider. Every request that targets any model on the provider will fall through to the fallback chain.

Check before activating:

Which models are active on this provider? Go to Settings > Providers > [Provider] and review the model catalog entries.
Do all those models have fallback chains configured? Go to the Routing page and check fallback chains for each model. Requests to models without fallbacks will return 503 provider_unavailable to callers.
What is the current traffic volume? Check the routing metrics before activating so you have a baseline.
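The fallback check above can be sketched in Python once the model list and fallback chains have been collected. The data shapes here are assumptions: each chain is taken to be an ordered list of (provider, model_id) pairs keyed by the primary model. Fallback entries on the same provider are treated as unusable, since a provider-wide kill switch disables them too.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (provider, model_id); hypothetical representation


def models_left_without_fallback(
    provider: str,
    active_models: List[str],
    fallback_chains: Dict[str, List[Pair]],
) -> List[str]:
    """List models that would have no usable fallback after a provider-wide kill switch."""
    exposed = []
    for model in active_models:
        chain = fallback_chains.get(model, [])
        # Fallbacks hosted on the disabled provider are skipped as well.
        usable = [entry for entry in chain if entry[0] != provider]
        if not usable:
            exposed.append(model)
    return exposed


chains = {
    "gpt-4o": [("anthropic", "fallback-model")],   # usable fallback elsewhere
    "gpt-4o-mini": [("openai", "gpt-4o")],         # fallback on the same provider
}
exposed_models = models_left_without_fallback(
    "openai", ["gpt-4o", "gpt-4o-mini", "unlisted-model"], chains
)
```

Any model in the returned list is a candidate for a new fallback chain before activation.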

Model-specific kill switch

A model-specific kill switch (provider + model_id) affects only that (provider, model) pair.

Check before activating:

Is this model used as a fallback for other models? A kill-switched fallback entry is silently skipped, not an error — but it removes a layer of redundancy. Check fallback chain configurations that list this model.
Is this model the only option in any chain? If a chain lists this model as its only entry, callers will receive 503 provider_unavailable after activation.
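Both checks can be run over an exported set of chains. The following Python sketch assumes each chain is an ordered list of (provider, model_id) pairs under a chain name; that structure and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (provider, model_id); hypothetical representation


def classify_chains(target: Pair, chains: Dict[str, List[Pair]]) -> Dict[str, List[str]]:
    """Sort chains that contain the target pair by how badly they are affected."""
    result: Dict[str, List[str]] = {"sole_entry": [], "reduced_redundancy": []}
    for name, entries in chains.items():
        if target not in entries:
            continue  # chain is unaffected by this kill switch
        remaining = [entry for entry in entries if entry != target]
        # No remaining entries means callers would receive 503 provider_unavailable.
        key = "reduced_redundancy" if remaining else "sole_entry"
        result[key].append(name)
    return result


example_chains = {
    "chat-default": [("openai", "gpt-4o"), ("anthropic", "fallback-model")],
    "legacy": [("openai", "gpt-4o")],
    "batch": [("anthropic", "fallback-model")],
}
impact = classify_chains(("openai", "gpt-4o"), example_chains)
```

Chains under sole_entry need an alternative before activation; chains under reduced_redundancy keep working with one less layer.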

Activation procedure

Step 1 — confirm decision

Verify you have authorization to activate the kill switch for this incident. The activation reason is required and is recorded permanently in the audit log. Use a reason that accurately describes the incident:

✓ "Provider security incident — openai breach reported 2026-03-10, disabling pending investigation"
✓ "Model producing harmful output in prod — disabling gpt-4o pending review"
✗ "test" — too vague; not acceptable for production activations
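Teams that script activations may want to lint the reason before sending it. The Python sketch below is one possible check; the placeholder list and the minimum length are arbitrary assumptions, not rules enforced by the API.

```python
from typing import List

# Hypothetical list of reasons considered too vague for production.
PLACEHOLDER_REASONS = {"test", "testing", "tmp", "n/a", "todo"}


def check_reason(reason: str, min_length: int = 20) -> List[str]:
    """Return a list of problems with an activation reason (empty list means acceptable)."""
    problems = []
    text = reason.strip()
    if text.lower() in PLACEHOLDER_REASONS:
        problems.append("placeholder reason")
    if len(text) < min_length:
        problems.append(f"shorter than {min_length} characters")
    return problems


vague = check_reason("test")
specific = check_reason(
    "Model producing harmful output in prod - disabling gpt-4o pending review"
)
```

A non-empty result is a prompt to rewrite the reason, not a substitute for the authorization check.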

Step 2 — verify fallback chains

Before activating, confirm that alternative providers are available to handle traffic:

# Check fallback chain for the primary model being disabled
GET https://api.arbitex.ai/api/providers/fallback/{model_id}
Authorization: Bearer $ADMIN_TOKEN

Verify the chain includes at least one healthy alternative. Check health status:

GET https://api.arbitex.ai/api/providers/{provider}/models/{model_id}/health
Authorization: Bearer $ADMIN_TOKEN

Step 3 — activate

POST https://api.arbitex.ai/api/admin/kill-switch
Authorization: Bearer arb_live_your-api-key-here
Content-Type: application/json

{
  "provider": "openai",
  "model_id": "gpt-4o",
  "enabled": false,
  "reason": "Provider security incident — disabling pending investigation (ticket INC-1234)"
}

For a provider-wide kill switch, omit model_id:

{
  "provider": "openai",
  "enabled": false,
  "reason": "Provider security incident — all openai traffic disabled pending review (ticket INC-1234)"
}

The switch takes effect immediately. No restart required.
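For scripted activations, the two request bodies above differ only in whether model_id is present. The Python sketch below builds the request with the standard library and does not send it; the helper name is an assumption, and the token is whatever credential your deployment uses.

```python
import json
import urllib.request
from typing import Optional

KILL_SWITCH_URL = "https://api.arbitex.ai/api/admin/kill-switch"


def build_kill_switch_request(
    token: str,
    provider: str,
    reason: str,
    model_id: Optional[str] = None,
    enabled: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) a kill switch request; omit model_id for provider-wide scope."""
    if not reason.strip():
        raise ValueError("reason is required")
    payload = {"provider": provider, "enabled": enabled, "reason": reason}
    if model_id is not None:
        payload["model_id"] = model_id
    return urllib.request.Request(
        KILL_SWITCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_kill_switch_request(
    "example-token", "openai", "Provider security incident (ticket INC-1234)"
)
# To send: urllib.request.urlopen(request)
```

Keeping construction separate from sending makes it possible to review the exact payload before the switch is thrown.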

Step 4 — verify the switch is active

Check the provider status in the admin portal (Settings > Providers > [Provider Name]). The kill-switched entry should show a red “Disabled” badge.

Verify that traffic is routing to fallbacks:

# Check audit log for recent traffic and which providers are serving it
GET https://api.arbitex.ai/api/admin/audit-log?limit=50
Authorization: Bearer $ADMIN_TOKEN

Look at the provider field in recent audit entries. If fallbacks are working, entries should show the fallback provider, not the disabled one. If you see 503 provider_unavailable entries, requests have no viable fallback — investigate and configure fallback chains if possible.
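The same check can be run over the parsed audit log response. This Python sketch assumes the response is a list of entries with a provider field, and that unroutable requests carry an error field equal to provider_unavailable; the error field name is a guess and should be adjusted to the real schema.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_routing(entries: List[Dict[str, Any]], disabled_provider: str) -> Dict[str, Any]:
    """Count recent audit entries per provider and flag signs of failed fallback."""
    by_provider = Counter(entry.get("provider") for entry in entries)
    # "error" is a hypothetical field name for the 503 provider_unavailable outcome.
    unavailable = sum(1 for entry in entries if entry.get("error") == "provider_unavailable")
    return {
        "by_provider": dict(by_provider),
        "on_disabled_provider": by_provider.get(disabled_provider, 0),
        "provider_unavailable": unavailable,
    }


sample = [
    {"provider": "anthropic"},
    {"provider": "anthropic"},
    {"provider": "openai", "error": "provider_unavailable"},
]
summary = summarize_routing(sample, "openai")
```

A non-zero provider_unavailable count points to a chain with no viable fallback.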

During the incident

Monitor the audit trail

The audit log records every kill switch activation and every request affected by it:

GET https://api.arbitex.ai/api/admin/audit-log?action=kill_switch_activated&limit=10
Authorization: Bearer $ADMIN_TOKEN

Audit entries for kill switch activations include:

Field	Value
`action`	`kill_switch_activated`
`provider`	Affected provider
`model_id`	Affected model, or `null` for provider-wide
`user_id`	Admin who activated the switch
`reason`	Reason string provided at activation
`timestamp`	Activation time

Requests blocked by the kill switch produce audit entries with outcome: BLOCK and block_reason: kill_switch. Monitor these to understand traffic volume and identify callers that may need notification.

Communicate impact

Identify teams affected by the kill switch. Check which API keys are generating traffic to the disabled provider:

GET https://api.arbitex.ai/api/admin/audit-log?provider=openai&limit=100&start_time=<incident_start>
Authorization: Bearer $ADMIN_TOKEN

Group by api_key_id or user_id to identify the heaviest affected callers. Notify them that traffic is being handled by fallback providers and set expectations for when the primary provider will return.
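The grouping step can be done in a few lines of Python once the audit entries are parsed. This is a sketch that assumes the response is a list of dictionaries carrying api_key_id and user_id fields; the function name is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def heaviest_callers(entries: List[Dict[str, Any]], top_n: int = 5) -> List[Tuple[str, int]]:
    """Rank callers by number of audit entries, preferring api_key_id over user_id."""
    counts: Counter = Counter()
    for entry in entries:
        caller = entry.get("api_key_id") or entry.get("user_id") or "unknown"
        counts[caller] += 1
    return counts.most_common(top_n)


sample_entries = [
    {"api_key_id": "key-a"},
    {"api_key_id": "key-a"},
    {"api_key_id": "key-a"},
    {"user_id": "user-b"},
    {"user_id": "user-b"},
    {},
]
ranking = heaviest_callers(sample_entries, top_n=2)
```

The ranking gives a starting list of teams to notify first.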

Recovery and re-enable

Before re-enabling

Verify the provider is healthy before re-enabling:

Check the provider’s official status page for confirmation that the incident is resolved.
Test the provider health endpoint directly:

GET https://api.arbitex.ai/api/providers/openai/models/gpt-4o/health
Authorization: Bearer $ADMIN_TOKEN

A status: "healthy" response indicates that the health monitor is reporting normal operation. If the health monitor still shows degraded or unavailable, wait until it returns to healthy before re-enabling the provider. Re-enabling while the provider is degraded routes live traffic to an unstable provider.
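Waiting for a healthy status can be automated with a small polling loop. In the Python sketch below, fetch_status is a caller-supplied function that returns the status string from the health endpoint; the function names, attempt count, and delay are assumptions chosen for illustration.

```python
import time
from typing import Callable


def wait_until_healthy(
    fetch_status: Callable[[], str],
    attempts: int = 10,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the status is "healthy"; return False if it never gets there."""
    for attempt in range(attempts):
        if fetch_status() == "healthy":
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # no sleep after the final attempt
    return False


# Simulated sequence of health responses, with sleeping disabled for the demo.
statuses = iter(["degraded", "unavailable", "healthy"])
ready = wait_until_healthy(lambda: next(statuses), attempts=5, sleep=lambda _: None)
```

A False result means the re-enable should be postponed, not forced.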

Re-enable procedure

POST https://api.arbitex.ai/api/admin/kill-switch
Authorization: Bearer arb_live_your-api-key-here
Content-Type: application/json

{
  "provider": "openai",
  "model_id": "gpt-4o",
  "enabled": true,
  "reason": "Provider incident resolved — re-enabling after verification (ticket INC-1234 closed)"
}

The re-enable reason is recorded in the audit log alongside the original activation.

Post-incident verification

After re-enabling:

Verify traffic resumes — check the audit log for requests successfully routed to the re-enabled provider within 1–2 minutes.
Verify fallback chains — confirm that the health monitor has returned the (provider, model) pair to Active state. If the health monitor disengaged the pair independently during the incident (in addition to the kill switch), the health monitor’s 300-second lockout must expire before the pair returns to normal routing.
Audit log completeness — verify that the full incident is documented: activation entry, any requests blocked during the incident, and the re-enable entry. Export this record if an incident report is required.
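The timing in the health monitor check is simple arithmetic. This Python sketch estimates the earliest moment normal routing can resume, assuming the 300-second lockout is counted from the time the health monitor disengaged the pair and is not extended by further failures; both points are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HEALTH_LOCKOUT = timedelta(seconds=300)


def earliest_normal_routing(
    reenabled_at: datetime, disengaged_at: Optional[datetime] = None
) -> datetime:
    """Return the later of the re-enable time and the end of the health lockout."""
    if disengaged_at is None:
        return reenabled_at  # health monitor never disengaged the pair
    return max(reenabled_at, disengaged_at + HEALTH_LOCKOUT)


disengaged = datetime(2026, 3, 10, 12, 0, tzinfo=timezone.utc)
reenabled = disengaged + timedelta(minutes=2)
resume_at = earliest_normal_routing(reenabled, disengaged)
```

If traffic has not resumed by that time, check the health endpoint before assuming the re-enable failed.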

Multi-provider incident

For incidents involving multiple providers simultaneously (e.g., a shared infrastructure failure), activate kill switches in sequence:

# Disable provider 1
POST /api/admin/kill-switch { "provider": "openai", "enabled": false, "reason": "..." }

# Disable provider 2
POST /api/admin/kill-switch { "provider": "anthropic", "enabled": false, "reason": "..." }

Each activation produces its own audit entry. When recovering, re-enable one provider at a time: start with the provider you are most confident has recovered, verify traffic, then re-enable the next.
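A scripted version of this procedure might prepare the request bodies up front and decide the recovery order separately. The Python sketch below is illustrative; the helper names and the numeric confidence scores are hypothetical.

```python
from typing import Any, Dict, List


def activation_payloads(providers: List[str], reason: str) -> List[Dict[str, Any]]:
    """Build one provider-wide kill switch body per provider, in the order given."""
    return [{"provider": p, "enabled": False, "reason": reason} for p in providers]


def recovery_order(confidence: Dict[str, float]) -> List[str]:
    """Order providers for re-enabling, most trusted first."""
    return sorted(confidence, key=lambda provider: confidence[provider], reverse=True)


payloads = activation_payloads(
    ["openai", "anthropic"], "Shared infrastructure failure (ticket INC-1234)"
)
order = recovery_order({"openai": 0.4, "anthropic": 0.9})
```

Each payload is still sent as its own request, so every activation keeps a separate audit entry.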

Planned maintenance

For scheduled maintenance where you have advance notice, communicate with affected teams before activating. Document the planned window in the reason string:

{
  "provider": "azure_openai",
  "enabled": false,
  "reason": "Planned maintenance window 2026-03-15 02:00–04:00 UTC — INC-5678"
}

Pre-configure fallback chains to route traffic to alternatives during the maintenance window so callers experience no service disruption.