Routing
Arbitex Gateway routes AI requests to model providers. You control which providers and models are available through the model catalog, how requests are distributed across providers, what happens when a provider fails, how dollar budgets constrain routing decisions, and how to disable a provider or model immediately in an emergency.
Model catalog
The model catalog is the registry of all providers and models that Arbitex Gateway can route traffic to. Each entry in the catalog represents a (provider, model) pair with its associated metadata: context window size, capability flags, cost per token (input and output), and availability status.
How providers and models are registered
Providers are registered at the platform level with a dedicated adapter that handles authentication, request formatting, response parsing, and error normalization. Each provider adapter translates the gateway’s internal request format into the provider’s native API format, so your application code uses a single consistent interface regardless of which provider receives the request.
Models are registered within each provider. When a provider releases a new model, it becomes available in the catalog after the corresponding adapter version ships. The catalog is the authoritative source of available (provider, model) pairs — routing a request to a combination not in the catalog returns a 400 error.
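As an illustration, catalog validation amounts to a keyed lookup on (provider, model) pairs. The sketch below is hypothetical (the class names, fields, and prices are not the gateway's actual internals); it only shows the rule that an unregistered pair maps to a 400:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelEntry:
    context_window: int           # tokens
    input_cost_per_token: float   # USD, illustrative
    output_cost_per_token: float  # USD, illustrative
    available: bool

class Catalog:
    """In-memory registry of routable (provider, model) pairs."""
    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], ModelEntry] = {}

    def register(self, provider: str, model_id: str, entry: ModelEntry) -> None:
        self._entries[(provider, model_id)] = entry

    def validate(self, provider: str, model_id: str) -> int:
        # Routing a request to a combination not in the catalog returns a 400.
        return 200 if (provider, model_id) in self._entries else 400

catalog = Catalog()
catalog.register("openai", "gpt-4o",
                 ModelEntry(128_000, 2.5e-6, 1.0e-5, True))
```

Because the key is the full (provider, model) tuple, the same model name on two providers is two distinct catalog entries.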
Supported providers
Arbitex supports nine providers:
| Provider | Protocol | Notes |
|---|---|---|
| Anthropic | Native API | Claude model family |
| OpenAI | Native API | GPT model family |
| Google Gemini | Native API | Gemini model family |
| Azure OpenAI | Azure-specific API | Enterprise Azure deployments with custom endpoints |
| AWS Bedrock | AWS SDK | Multi-model access through AWS infrastructure |
| Groq | OpenAI-compatible | High-throughput inference on custom hardware |
| Mistral | Native API | Mistral model family |
| Cohere | Native API | Command model family |
| Ollama | OpenAI-compatible | Self-hosted open-source models |
Provider API keys are configured per organization and stored encrypted. Each provider has isolated credential storage — a key configured for one organization is not accessible to any other organization.
Bring your own endpoint
In addition to the built-in providers, you can configure custom endpoints that follow the OpenAI-compatible API format. This supports self-hosted models, fine-tuned deployments, and providers not yet included in the built-in adapter library.
Routing modes
Every request specifies a routing mode that determines how the gateway processes it:
```mermaid
flowchart LR
  subgraph Single["Single Mode"]
    S_Req["Request"] --> S_GW["Gateway"] --> S_PA["Provider A"] --> S_Resp["Response"]
  end
  subgraph Compare["Compare Mode"]
    C_Req["Request"] --> C_GW["Gateway"]
    C_GW --> C_PA["Provider A"] --> C_RA["Response A"]
    C_GW --> C_PB["Provider B"] --> C_RB["Response B"]
  end
  subgraph Summarize["Summarize Mode"]
    Z_Req["Request"] --> Z_GW["Gateway"]
    Z_GW --> Z_PA["Provider A"] --> Z_RA["Response A"]
    Z_GW --> Z_PB["Provider B"] --> Z_RB["Response B"]
    Z_RA --> Z_Sum["Summarizer"]
    Z_RB --> Z_Sum
    Z_Sum --> Z_Final["Final Response"]
  end
```
Single mode
The default mode. The gateway sends the request to one model on one provider and returns the response.
Use Single mode for standard conversational interactions where you want a single model’s response with the lowest possible latency.
Compare mode
The gateway sends the same request to two or more models in parallel and returns every model’s response to the caller.
```
User Request → Gateway → Provider A → Response A
                       → Provider B → Response B
```
Use Compare mode when evaluating model quality, testing prompt variations across providers, or giving users the ability to choose between responses.
Summarize mode
The gateway sends the request to multiple models, collects all responses, and then sends the combined responses to a designated summarization model. The caller receives a single synthesized response.
```
User Request → Gateway → Provider A → Response A ─┐
                       → Provider B → Response B ─┤→ Summarizer → Final Response
                       → Provider C → Response C ─┘
```
Use Summarize mode for high-stakes queries where you want consensus across models, or for research workflows that benefit from synthesized multi-model output.
Fallback chains
Each model in the catalog can have a fallback chain — an ordered list of alternative (provider, model) pairs that the gateway routes to if the primary model is unavailable or returning errors.
How failover works
- A request targets model A on provider X
- Provider X returns a 5xx error or the request times out
- The gateway’s health monitor records the failure for the (provider, model) pair
- The gateway immediately routes the request to the next entry in the fallback chain
- If that entry also fails, the gateway continues down the chain
- If the entire chain is exhausted, the request returns an error to the caller
Fallback chain traversal is transparent to the caller — the response format is identical regardless of which chain entry ultimately handled the request. The audit log records which provider and model served the request.
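The failover steps above can be sketched as a loop over the chain. `ProviderError`, `send`, and `record_failure` are stand-ins for the gateway's internals, not real APIs:

```python
class ProviderError(Exception):
    """Stand-in for a 5xx response or a timeout from a provider."""

def send_with_fallback(request, chain, send, record_failure):
    """chain: ordered list of (provider, model_id) pairs, primary first."""
    last_error = None
    for provider, model_id in chain:
        try:
            response = send(provider, model_id, request)
            # The audit log records which pair actually served the request.
            return response, (provider, model_id)
        except ProviderError as err:
            record_failure(provider, model_id)  # feeds the health monitor
            last_error = err                    # continue down the chain
    # Entire chain exhausted: surface an error to the caller.
    raise last_error or ProviderError("empty fallback chain")

# Usage: a primary that always fails over to the secondary.
def flaky_send(provider, model_id, request):
    if provider == "openai":
        raise ProviderError("503 from openai")
    return f"{provider}/{model_id}: ok"

failures = []
resp, served_by = send_with_fallback(
    "hello",
    [("openai", "gpt-4o"), ("anthropic", "claude-sonnet-4-20250514")],
    flaky_send,
    lambda p, m: failures.append((p, m)),
)
```

The caller sees one response in one format either way; only `served_by` (and the audit log) reveals which entry handled it.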
How to configure fallback chains
Fallback chains are managed through the admin API:
```http
PUT /api/providers/fallback/{model_id}
Content-Type: application/json

{
  "chain": [
    {"provider": "openai", "model_id": "gpt-4o"},
    {"provider": "anthropic", "model_id": "claude-sonnet-4-20250514"}
  ]
}
```

```http
GET /api/providers/fallback/{model_id}
```

```json
{
  "model_id": "gpt-4o",
  "chain": [
    {"provider": "openai", "model_id": "gpt-4o"},
    {"provider": "anthropic", "model_id": "claude-sonnet-4-20250514"}
  ]
}
```

Individual requests can override the stored fallback chain by specifying fallback_model and fallback_provider in the request body. This is useful for testing or for application workflows that require request-level control over failover behavior.
Health monitoring
The gateway monitors every (provider, model) pair in the catalog:
- Polling interval: every 30 seconds per (provider, model) pair
- Health endpoint: `GET /api/providers/{provider}/models/{model_id}/health`
- Status values: `healthy`, `degraded`, `unavailable`
Health data feeds the automatic health monitor. If polling detects consistent failures, the monitor can disengage a (provider, model) pair proactively, before user traffic is affected.
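A hypothetical sketch of the poller's bookkeeping, mapping consecutive failed probes onto the three documented status values (the thresholds below are assumptions, not documented values):

```python
class HealthTracker:
    """Tracks consecutive probe failures per (provider, model) pair."""
    def __init__(self, degraded_after: int = 1, unavailable_after: int = 3):
        self.degraded_after = degraded_after
        self.unavailable_after = unavailable_after
        self._failures: dict[tuple[str, str], int] = {}

    def record_probe(self, provider: str, model_id: str, ok: bool) -> str:
        # A successful probe resets the counter; a failure increments it.
        key = (provider, model_id)
        self._failures[key] = 0 if ok else self._failures.get(key, 0) + 1
        n = self._failures[key]
        if n >= self.unavailable_after:
            return "unavailable"
        if n >= self.degraded_after:
            return "degraded"
        return "healthy"

tracker = HealthTracker()
```

Keying on the (provider, model_id) tuple matches the isolation rule below: the same model name on two providers is tracked independently.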
Health monitor states
Each (provider, model) pair has an independent health monitor keyed by the (provider, model_id) tuple. A failing model on one provider does not affect the same model name on a different provider.
```mermaid
stateDiagram-v2
  [*] --> Active
  Active --> Disengaged : Failure threshold exceeded
  Disengaged --> Testing : Lockout period elapsed (300s)
  Testing --> Active : Test request succeeds
  Testing --> Disengaged : Test request fails

  Active : Normal operation
  Active : Requests routed normally
  Disengaged : Pair unavailable
  Disengaged : Requests skip to fallback
  Testing : One test request allowed
```

| State | Behavior |
|---|---|
| Active | Normal operation. Requests are routed to this (provider, model) pair. |
| Disengaged | The pair is unavailable. Requests skip this entry and proceed to the next in the fallback chain. |
| Testing | The lockout period has elapsed. The gateway routes one test request to the pair. A successful response returns the monitor to active; a failure re-disengages it. |
The default lockout duration is 5 minutes (300 seconds). Recovery is automatic: after the lockout period, the health monitor transitions to the testing state and allows a single test request through. If the test succeeds, the monitor returns to active and traffic resumes. If it fails, the lockout restarts.
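The Active → Disengaged → Testing cycle behaves like a circuit breaker. A minimal sketch with an injectable clock so the 300-second lockout is testable; the consecutive-failure threshold is an assumption (the source does not specify it):

```python
import time

class HealthMonitor:
    LOCKOUT_SECONDS = 300  # default lockout duration

    def __init__(self, failure_threshold: int = 3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.clock = clock
        self.state = "active"
        self.failures = 0
        self.disengaged_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "active":
            return True
        if self.state == "disengaged":
            if self.clock() - self.disengaged_at >= self.LOCKOUT_SECONDS:
                self.state = "testing"  # allow exactly one test request
                return True
            return False
        return False  # testing: the single test request is already in flight

    def record(self, success: bool) -> None:
        if success:
            self.state, self.failures = "active", 0
            return
        if self.state == "testing":
            # Failed test request: restart the lockout.
            self.state, self.disengaged_at = "disengaged", self.clock()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state, self.disengaged_at = "disengaged", self.clock()
```

Requests rejected by `allow_request` would skip straight to the next fallback chain entry.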
Budget-based routing
Dollar budget caps interact with routing decisions at the request level. Each organization can configure per-user or per-group token and cost budgets. When a budget is exhausted, the gateway’s routing behavior changes:
- Token budget exhausted: the gateway blocks the request before it reaches the model provider, returning an error indicating the budget has been reached
- Dollar budget exhausted: same behavior — the request is blocked, not routed to a cheaper provider
- Budget-aware routing (cost-optimized mode): when the routing mode is set to cost-optimized, the gateway selects the lowest-cost model in the catalog that meets the request’s capability requirements, within the remaining budget
Budget enforcement happens in the payload analysis stage, before policy evaluation. A request blocked by a budget cap does not consume provider tokens and does not produce a DLP finding — but it does produce an audit log entry with outcome: BLOCK and block_reason: budget_exceeded.
Budget configuration is managed through the admin API at the organization level. Per-user and per-group budget overrides are supported.
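Cost-optimized selection can be sketched as a filter-then-minimize over the catalog. The field names and prices below are illustrative, not the gateway's schema:

```python
def pick_cheapest(catalog, required_caps: set, est_input_tokens: int,
                  est_output_tokens: int, remaining_budget: float):
    """Return the cheapest capable (provider, model_id) within budget, or None."""
    candidates = []
    for (provider, model_id), entry in catalog.items():
        # Filter: the model must satisfy every required capability flag.
        if not required_caps.issubset(entry["capabilities"]):
            continue
        # Estimate cost from per-token prices in the catalog entry.
        cost = (est_input_tokens * entry["input_cost"] +
                est_output_tokens * entry["output_cost"])
        if cost <= remaining_budget:
            candidates.append((cost, provider, model_id))
    if not candidates:
        return None  # gateway would block with block_reason: budget_exceeded
    _, provider, model_id = min(candidates)
    return provider, model_id

catalog = {
    ("openai", "gpt-4o"): {"capabilities": {"chat", "vision"},
                           "input_cost": 2.5e-6, "output_cost": 1.0e-5},
    ("groq", "llama-3.1-8b"): {"capabilities": {"chat"},
                               "input_cost": 5e-8, "output_cost": 8e-8},
}
```

A chat-only request picks the cheap model; a vision request is forced onto the more capable one even though it costs more.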
Policy Engine ROUTE_TO action
The Policy Engine’s ROUTE_TO action can override the destination model for a request based on policy rule conditions — for example, routing requests from a specific user group to a lower-cost model tier, or routing requests that contain certain entity types to a more capable model.
ROUTE_TO is a terminal action: when a policy rule fires ROUTE_TO, policy evaluation stops and the gateway routes the request to the specified model or tier. This happens before the request reaches the provider, and the routing decision is recorded in the audit log.
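A sketch of the terminal-action semantics: evaluation walks the ordered rules, accumulating non-terminal actions, and stops at the first matching ROUTE_TO. The rule and request shapes here are hypothetical:

```python
def evaluate(rules, request):
    """Return (route_target_or_None, non-terminal actions fired before it)."""
    actions = []
    for rule in rules:
        if not rule["condition"](request):
            continue
        if rule["action"] == "ROUTE_TO":
            # Terminal: stop evaluating and override the destination.
            return rule["target"], actions
        actions.append(rule["action"])
    return None, actions

rules = [
    {"condition": lambda r: "ssn" in r["entities"], "action": "REDACT"},
    {"condition": lambda r: r["group"] == "interns", "action": "ROUTE_TO",
     "target": ("groq", "llama-3.1-8b")},
    {"condition": lambda r: True, "action": "LOG"},  # not reached for interns
]
target, actions = evaluate(rules, {"entities": [], "group": "interns"})
```

Because ROUTE_TO fired, the trailing LOG rule was never evaluated for this request.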
For full details on writing ROUTE_TO rules, see Policy Engine overview.
Kill switch
For full details, see the dedicated Kill switch reference page.
The kill switch provides immediate, manual control to disable a provider or model. When activated:
- All requests to the disabled (provider, model) pair are immediately blocked
- Fallback chains skip the disabled entry automatically
- The kill switch state is persisted and survives gateway restarts
- The audit log records the activation with the identity of the admin who triggered it
The kill switch is available through both the admin API and the admin portal:
```http
POST /api/admin/kill-switch
Content-Type: application/json

{
  "provider": "openai",
  "model_id": "gpt-4o",
  "enabled": false,
  "reason": "Provider security incident — disabling pending investigation"
}
```

The admin portal provides a visual toggle on the provider management page with a confirmation dialog that requires a reason string before activation.
Important: The kill switch bypasses fallback chains for the disabled entry. If gpt-4o is kill-switched and a fallback chain lists gpt-4o as a secondary entry, the gateway skips it and proceeds to the next entry in the chain. Plan your fallback chains with the assumption that any single entry may be kill-switched at any time.
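The skip behavior can be sketched as filtering kill-switched pairs out of the effective chain before traversal (names are illustrative):

```python
def effective_chain(chain, kill_switched: set):
    """Drop disabled (provider, model_id) pairs; the rest traverse in order."""
    return [pair for pair in chain if pair not in kill_switched]

chain = [("openai", "gpt-4o"), ("anthropic", "claude-sonnet-4-20250514")]
print(effective_chain(chain, {("openai", "gpt-4o")}))
# → [('anthropic', 'claude-sonnet-4-20250514')]
```

If every entry in a chain is kill-switched, the effective chain is empty and the request fails, which is why chains should assume any single entry may be disabled at any time.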
See also
- DLP Overview — how the 3-tier DLP pipeline inspects requests after routing decisions
- Audit Log — how routing decisions, fallback events, and kill switch activations are logged
- Policy Engine overview — how the ROUTE_TO action interacts with routing
- Credential Intelligence — detecting leaked credentials in routed traffic