Kill switch — operational guide
The kill switch is an emergency control, not a routine management tool. This guide covers the decision to activate it, how to assess the impact before and after, and how to confirm safe recovery when the incident is over.
For the reference documentation (API, field definitions, fallback chain interaction), see Kill switch.
When to use the kill switch
Activate the kill switch when you need to stop all traffic to a provider or model immediately and cannot wait for the health monitor to react on its own.
Appropriate use cases
| Scenario | Kill switch scope |
|---|---|
| Provider security incident — API key exposed, provider breach reported | Provider-wide (provider only, no model_id) |
| Model producing harmful or incorrect output | Model-specific (provider + model_id) |
| Provider billing anomaly — runaway spend on a specific model | Model-specific |
| Planned maintenance — provider scheduled downtime | Provider-wide or model-specific |
| Incident requiring immediate traffic cutoff for investigation | Provider-wide |
When not to use the kill switch
The kill switch is intentional and manual. Do not use it as a substitute for proper fallback chain configuration:
- Provider is unreliable or slow — configure fallback chains and let the health monitor handle transient failures. The kill switch is for emergencies, not for degraded performance.
- Cost optimization — use budget-based routing or Policy Engine ROUTE_TO rules to steer traffic away from expensive models.
- Testing routing — use ROUTE_TO rules in the Policy Engine. Do not activate and immediately deactivate the kill switch in production for testing.
Kill switch vs health monitor
| | Kill switch | Health monitor |
|---|---|---|
| Trigger | Manual — admin action | Automatic — failure threshold |
| Scope | Provider, model, or pair | (provider, model) pair only |
| Recovery | Manual — admin must re-enable | Automatic — after 300s lockout |
| Persistence | Survives restarts | Resets on restart |
Both can be active simultaneously. Setting enabled back to true through the kill switch does not override the health monitor: if the health monitor has also disengaged the pair, it must return to Active state on its own before traffic resumes.
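The interaction between the two controls can be read as a simple conjunction: a pair receives traffic only when neither control is holding it back. The sketch below is illustrative only. The function name and the "Lockout" state string are assumptions made for this example; only the enabled field and the Active state come from this guide.

```python
def pair_receives_traffic(kill_switch_enabled: bool, health_state: str) -> bool:
    """Return True only when both controls allow traffic to the pair.

    kill_switch_enabled mirrors the `enabled` field of the kill switch
    (False means an admin has disabled the pair). health_state is the
    health monitor's view of the pair; "Active" is treated as the only
    state that routes traffic.
    """
    return kill_switch_enabled and health_state == "Active"


# Admin re-enabled the pair, but the health monitor has not recovered yet.
# "Lockout" is a hypothetical state name used for illustration.
print(pair_receives_traffic(True, "Lockout"))  # False

# Both controls agree, so traffic resumes.
print(pair_receives_traffic(True, "Active"))   # True
```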
Assessing blast radius before activation
Before activating, determine how the kill switch will affect traffic.
Provider-wide kill switch
A provider-wide kill switch (provider field set, model_id omitted) disables all (provider, model) pairs for that provider. Every request that targets any model on the provider will fall through to the fallback chain.
Check before activating:
- Which models are active on this provider? Go to Settings > Providers > [Provider] and review the model catalog entries.
- Do all those models have fallback chains configured? Go to the Routing page and check fallback chains for each model. Models without fallbacks will return 503 provider_unavailable to callers.
- What is the current traffic volume? Check the routing metrics before activating so you have a baseline.
Model-specific kill switch
A model-specific kill switch (provider + model_id) affects only that (provider, model) pair.
Check before activating:
- Is this model used as a fallback for other models? A kill-switched fallback entry is silently skipped, not an error — but it removes a layer of redundancy. Check fallback chain configurations that list this model.
- Is this model the only option in any chain? If any chain has this model as its only entry and no further fallbacks, callers will receive 503 provider_unavailable after activation.
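The blast-radius checks above can be approximated offline once the fallback chains have been exported. The sketch below assumes a hypothetical in-memory shape (a dict mapping each primary pair to its ordered fallback pairs) and hypothetical model names; it is not the format returned by the API. It also ignores health monitor state, so treat the result as a lower bound on affected callers.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # (provider, model_id)


def pairs_left_without_fallback(
    chains: Dict[Pair, List[Pair]],
    provider: str,
    model_id: Optional[str] = None,
) -> List[Pair]:
    """List primaries that would have no usable target after activation.

    chains maps each primary pair to its ordered fallback pairs
    (hypothetical shape). model_id=None models a provider-wide switch.
    """
    def disabled(pair: Pair) -> bool:
        return pair[0] == provider and (model_id is None or pair[1] == model_id)

    stranded = []
    for primary, fallbacks in chains.items():
        # A primary is stranded when it and every fallback are kill-switched.
        if all(disabled(p) for p in [primary, *fallbacks]):
            stranded.append(primary)
    return stranded


example_chains = {
    ("openai", "gpt-4o"): [("anthropic", "example-model")],
    ("openai", "gpt-4o-mini"): [("openai", "gpt-4o")],
    ("anthropic", "example-model"): [],
}

# Provider-wide: gpt-4o-mini only falls back to another openai model.
print(pairs_left_without_fallback(example_chains, "openai"))
# Model-specific: gpt-4o still has a fallback on another provider.
print(pairs_left_without_fallback(example_chains, "openai", "gpt-4o"))
```

Any pair this reports would be expected to return 503 provider_unavailable after activation, so it is worth adding a fallback for it first where the incident allows.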
Activation procedure
Step 1 — confirm decision
Verify you have authorization to activate the kill switch for this incident. The activation reason is required and is recorded permanently in the audit log. Use a reason that accurately describes the incident:
- ✓ "Provider security incident — openai breach reported 2026-03-10, disabling pending investigation"
- ✓ "Model producing harmful output in prod — disabling gpt-4o pending review"
- ✗ "test" — too vague; not acceptable for production activations
Step 2 — verify fallback chains
Before activating, confirm that alternative providers are available to handle traffic:

    # Check fallback chain for the primary model being disabled
    GET https://api.arbitex.ai/api/providers/fallback/{model_id}
    Authorization: Bearer $ADMIN_TOKEN

Verify the chain includes at least one healthy alternative. Check health status:

    GET https://api.arbitex.ai/api/providers/{provider}/models/{model_id}/health
    Authorization: Bearer $ADMIN_TOKEN

Step 3 — activate

    POST https://api.arbitex.ai/api/admin/kill-switch
    Authorization: Bearer arb_live_your-api-key-here
    Content-Type: application/json

    {
      "provider": "openai",
      "model_id": "gpt-4o",
      "enabled": false,
      "reason": "Provider security incident — disabling pending investigation (ticket INC-1234)"
    }

For a provider-wide kill switch, omit model_id:

    {
      "provider": "openai",
      "enabled": false,
      "reason": "Provider security incident — all openai traffic disabled pending review (ticket INC-1234)"
    }

The switch takes effect immediately. No restart required.
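When activations are scripted, it can help to build the request body in one place so that provider-wide and model-specific payloads stay consistent. The following is a minimal sketch, not an official client: the helper name and the list of rejected reasons are assumptions for illustration, and the check is a local convention that says nothing about how the server validates the reason field.

```python
import json
from typing import Optional

# Hypothetical local policy: reasons considered too vague to submit.
VAGUE_REASONS = {"test", "testing", "tmp", "n/a"}


def build_kill_switch_payload(
    provider: str,
    reason: str,
    model_id: Optional[str] = None,
    enabled: bool = False,
) -> dict:
    """Build the JSON body for POST /api/admin/kill-switch.

    model_id is left out entirely for a provider-wide switch,
    matching the provider-wide example in this guide.
    """
    cleaned = reason.strip()
    if not cleaned or cleaned.lower() in VAGUE_REASONS:
        raise ValueError("reason must describe the incident")
    payload = {"provider": provider, "enabled": enabled, "reason": cleaned}
    if model_id is not None:
        payload["model_id"] = model_id
    return payload


body = build_kill_switch_payload(
    "openai", "Provider security incident (ticket INC-1234)"
)
print(json.dumps(body, indent=2))
```

The same helper covers re-enabling by passing enabled=True, which keeps the reason check in place for both directions.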
Step 4 — verify the switch is active
Check the provider status in the admin portal (Settings > Providers > [Provider Name]). The kill-switched entry should show a red “Disabled” badge.
Verify that traffic is routing to fallbacks:

    # Check audit log for recent traffic and which providers are serving it
    GET https://api.arbitex.ai/api/admin/audit-log?limit=50
    Authorization: Bearer $ADMIN_TOKEN

Look at the provider field in recent audit entries. If fallbacks are working, entries should show the fallback provider, not the disabled one. If you see 503 provider_unavailable entries, requests have no viable fallback — investigate and configure fallback chains if possible.
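Reading fifty audit entries by eye is error-prone during an incident, so a small summary can help. The sketch below assumes the audit entries have already been parsed into dicts carrying the provider, outcome, and block_reason fields named in this guide; the "ALLOW" outcome value in the sample data is hypothetical, and the function treats anything not blocked by the kill switch as served.

```python
from collections import Counter
from typing import Iterable, Mapping


def summarize_routing(entries: Iterable[Mapping], disabled_provider: str) -> dict:
    """Summarize which providers served recent requests after activation."""
    served = Counter()
    blocked = 0
    for entry in entries:
        if entry.get("outcome") == "BLOCK" and entry.get("block_reason") == "kill_switch":
            blocked += 1
        else:
            served[entry.get("provider")] += 1
    return {
        "served_by": dict(served),
        "blocked_by_kill_switch": blocked,
        # True would suggest the switch is not being applied as expected.
        "disabled_provider_still_serving": served[disabled_provider] > 0,
    }


sample = [
    {"provider": "anthropic", "outcome": "ALLOW"},  # hypothetical outcome value
    {"provider": "anthropic", "outcome": "ALLOW"},
    {"provider": "openai", "outcome": "BLOCK", "block_reason": "kill_switch"},
]
print(summarize_routing(sample, "openai"))
```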
During the incident
Monitor the audit trail
The audit log records every kill switch activation and every request affected by it:

    GET https://api.arbitex.ai/api/admin/audit-log?action=kill_switch_activated&limit=10

Audit entries for kill switch activations include:
| Field | Value |
|---|---|
| action | kill_switch_activated |
| provider | Affected provider |
| model_id | Affected model, or null for provider-wide |
| user_id | Admin who activated the switch |
| reason | Reason string provided at activation |
| timestamp | Activation time |
Requests blocked by the kill switch produce audit entries with outcome: BLOCK and block_reason: kill_switch. Monitor these to understand traffic volume and identify callers that may need notification.
Communicate impact
Identify teams affected by the kill switch. Check which API keys are generating traffic to the disabled provider:

    GET https://api.arbitex.ai/api/admin/audit-log?provider=openai&limit=100&start_time=<incident_start>

Group by api_key_id or user_id to identify the heaviest affected callers. Notify them that traffic is being handled by fallback providers and set expectations for when the primary provider will return.
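The grouping step can be done with a counter once the audit response has been parsed. This is a rough sketch that assumes each entry is a dict containing api_key_id or user_id, as named above; the function name and the sample key values are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple


def heaviest_callers(
    entries: Iterable[Mapping],
    key: str = "api_key_id",
    top: int = 5,
) -> List[Tuple[str, int]]:
    """Rank callers by number of audit entries, heaviest first.

    Entries that lack the grouping key are skipped rather than
    counted under None.
    """
    counts = Counter(e[key] for e in entries if e.get(key) is not None)
    return counts.most_common(top)


sample = [
    {"api_key_id": "key_a", "user_id": "u1"},
    {"api_key_id": "key_a", "user_id": "u1"},
    {"api_key_id": "key_b", "user_id": "u2"},
    {"user_id": "u3"},  # no api_key_id, skipped when grouping by key
]
print(heaviest_callers(sample))
print(heaviest_callers(sample, key="user_id", top=1))
```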
Recovery and re-enable
Before re-enabling
Verify the provider is healthy before re-enabling:
- Check the provider’s official status page for confirmation that the incident is resolved.
- Test the provider health endpoint directly:

    GET https://api.arbitex.ai/api/providers/openai/models/gpt-4o/health
    Authorization: Bearer $ADMIN_TOKEN

A status: "healthy" response indicates the health monitor is reporting normal operation. If the health monitor still shows degraded or unavailable, wait until it returns to healthy before setting enabled back to true. Re-enabling while the provider is degraded will route live traffic to an unstable provider.
Re-enable procedure

    POST https://api.arbitex.ai/api/admin/kill-switch
    Authorization: Bearer arb_live_your-api-key-here
    Content-Type: application/json

    {
      "provider": "openai",
      "model_id": "gpt-4o",
      "enabled": true,
      "reason": "Provider incident resolved — re-enabling after verification (ticket INC-1234 closed)"
    }

The re-enable reason is recorded in the audit log alongside the original activation.
Post-incident verification
After re-enabling:
- Verify traffic resumes — check the audit log for requests successfully routed to the re-enabled provider within 1–2 minutes.
- Verify fallback chains — confirm that the health monitor has returned the (provider, model) pair to Active state. If the health monitor disengaged the pair independently during the incident (in addition to the kill switch), the health monitor’s 300-second lockout must expire before the pair returns to normal routing.
- Audit log completeness — verify that the full incident is documented: activation entry, any requests blocked during the incident, and the re-enable entry. Export this record if an incident report is required.
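The completeness check in the last item can be scripted against an exported audit log. The sketch below is an assumption-heavy illustration: kill_switch_activated, outcome, and block_reason come from this guide, but the action name recorded for a re-enable is not stated here, so kill_switch_deactivated is a hypothetical default that should be replaced with the real value.

```python
from typing import Iterable, Mapping


def incident_record_summary(
    entries: Iterable[Mapping],
    reenable_action: str = "kill_switch_deactivated",  # hypothetical action name
) -> dict:
    """Report whether the audit export covers the whole incident."""
    has_activation = False
    has_reenable = False
    blocked = 0
    for entry in entries:
        action = entry.get("action")
        if action == "kill_switch_activated":
            has_activation = True
        elif action == reenable_action:
            has_reenable = True
        if entry.get("outcome") == "BLOCK" and entry.get("block_reason") == "kill_switch":
            blocked += 1
    return {
        "has_activation": has_activation,
        "has_reenable": has_reenable,
        "blocked_requests": blocked,
        "complete": has_activation and has_reenable,
    }


export = [
    {"action": "kill_switch_activated", "provider": "openai"},
    {"outcome": "BLOCK", "block_reason": "kill_switch", "provider": "openai"},
    {"action": "kill_switch_deactivated", "provider": "openai"},
]
print(incident_record_summary(export))
```

A blocked_requests count of zero is not treated as incomplete, since an incident may see no traffic to the disabled pair.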
Multi-provider incident
For incidents involving multiple providers simultaneously (e.g., a shared infrastructure failure), activate kill switches in sequence:

    # Disable provider 1
    POST /api/admin/kill-switch
    { "provider": "openai", "enabled": false, "reason": "..." }

    # Disable provider 2
    POST /api/admin/kill-switch
    { "provider": "anthropic", "enabled": false, "reason": "..." }

Each activation produces its own audit entry. When recovering, re-enable providers one at a time rather than all at once: start with the provider you are most confident about, verify traffic, then re-enable the next.
Planned maintenance
For scheduled maintenance where you have advance notice, communicate with affected teams before activating. Document the planned window in the reason string:

    {
      "provider": "azure_openai",
      "enabled": false,
      "reason": "Planned maintenance window 2026-03-15 02:00–04:00 UTC — INC-5678"
    }

Pre-configure fallback chains to route traffic to alternatives during the maintenance window so callers experience no service disruption.
See also
- Kill switch — API reference, field definitions, fallback chain interaction
- Routing — fallback chains and health monitoring
- Audit Log — querying kill switch events
- Provider management — health monitor states and provider configuration