Routing

Arbitex Gateway routes AI requests to model providers. You control which providers and models are available through the model catalog, how requests are distributed across providers, what happens when a provider fails, how dollar budgets constrain routing decisions, and how to disable a provider or model immediately in an emergency.

Model catalog

The model catalog is the registry of all providers and models that Arbitex Gateway can route traffic to. Each entry in the catalog represents a (provider, model) pair with its associated metadata: context window size, capability flags, cost per token (input and output), and availability status.

Providers are registered at the platform level with a dedicated adapter that handles authentication, request formatting, response parsing, and error normalization. Each provider adapter translates the gateway’s internal request format into the provider’s native API format, so your application code uses a single consistent interface regardless of which provider receives the request.

Models are registered within each provider. When a provider releases a new model, it becomes available in the catalog after the corresponding adapter version ships. The catalog is the authoritative source of available (provider, model) pairs — routing a request to a combination not in the catalog returns a 400 error.
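The catalog check described above amounts to a set-membership test. The following is a minimal sketch, not the gateway's actual implementation; all names are illustrative:

```python
# Minimal sketch of catalog validation: the catalog is the authoritative
# set of (provider, model) pairs, and any other combination is rejected
# with a 400. Names are illustrative, not Arbitex Gateway internals.

CATALOG = {
    ("openai", "gpt-4o"),
    ("anthropic", "claude-sonnet-4-20250514"),
}

def validate_route(provider: str, model_id: str) -> tuple[int, str]:
    """Return an (http_status, message) pair for a routing request."""
    if (provider, model_id) not in CATALOG:
        return 400, f"unknown (provider, model) pair: ({provider}, {model_id})"
    return 200, "ok"
```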

Arbitex supports nine providers:

| Provider | Protocol | Notes |
| --- | --- | --- |
| Anthropic | Native API | Claude model family |
| OpenAI | Native API | GPT model family |
| Google Gemini | Native API | Gemini model family |
| Azure OpenAI | Azure-specific API | Enterprise Azure deployments with custom endpoints |
| AWS Bedrock | AWS SDK | Multi-model access through AWS infrastructure |
| Groq | OpenAI-compatible | High-throughput inference on custom hardware |
| Mistral | Native API | Mistral model family |
| Cohere | Native API | Command model family |
| Ollama | OpenAI-compatible | Self-hosted open-source models |

Provider API keys are configured per organization and stored encrypted. Each provider has isolated credential storage — a key configured for one organization is not accessible to any other organization.

In addition to the built-in providers, you can configure custom endpoints that follow the OpenAI-compatible API format. This supports self-hosted models, fine-tuned deployments, and providers not yet included in the built-in adapter library.
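A custom endpoint registration might look like the following. This is a hypothetical shape; every field name here is an assumption, so consult the admin API for the real schema:

```python
# Hypothetical shape of a custom OpenAI-compatible endpoint registration.
# All field names are illustrative assumptions, not the real schema.
custom_endpoint = {
    "name": "internal-llama",
    "base_url": "https://llm.internal.example.com/v1",  # must speak the OpenAI-compatible API
    "api_key_ref": "secret/internal-llama",             # resolved from encrypted credential storage
    "models": ["llama-3-70b-instruct"],
}
```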

Routing modes

Every request specifies a routing mode that determines how the gateway processes it:

```mermaid
flowchart LR
  subgraph Single["Single Mode"]
    S_Req["Request"] --> S_GW["Gateway"] --> S_PA["Provider A"] --> S_Resp["Response"]
  end
  subgraph Compare["Compare Mode"]
    C_Req["Request"] --> C_GW["Gateway"]
    C_GW --> C_PA["Provider A"] --> C_RA["Response A"]
    C_GW --> C_PB["Provider B"] --> C_RB["Response B"]
  end
  subgraph Summarize["Summarize Mode"]
    Z_Req["Request"] --> Z_GW["Gateway"]
    Z_GW --> Z_PA["Provider A"] --> Z_RA["Response A"]
    Z_GW --> Z_PB["Provider B"] --> Z_RB["Response B"]
    Z_RA --> Z_Sum["Summarizer"]
    Z_RB --> Z_Sum
    Z_Sum --> Z_Final["Final Response"]
  end
```

Single mode

The default mode. The gateway sends the request to one model on one provider and returns the response.

Use Single mode for standard conversational interactions where you want a single model’s response with the lowest possible latency.

Compare mode

The gateway sends the same request to two or more models in parallel and returns every model's response to the caller.

User Request → Gateway → Provider A → Response A
                       → Provider B → Response B

Use Compare mode when evaluating model quality, testing prompt variations across providers, or giving users the ability to choose between responses.

Summarize mode

The gateway sends the request to multiple models, collects all responses, and then sends the combined responses to a designated summarization model. The caller receives a single synthesized response.

User Request → Gateway → Provider A → Response A ─┐
                       → Provider B → Response B ─┤→ Summarizer → Final Response
                       → Provider C → Response C ─┘

Use Summarize mode for high-stakes queries where you want consensus across models, or for research workflows that benefit from synthesized multi-model output.
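The Summarize fan-out can be sketched as follows. This is a minimal sketch under stated assumptions: `call_model` is a stand-in for a real provider call, and the prompt wording fed to the summarizer is illustrative:

```python
# Minimal sketch of Summarize mode: fan the prompt out to several
# (provider, model) targets, then feed all responses to a summarizer
# model. call_model() is a stand-in, not a real provider adapter.

def call_model(provider: str, model_id: str, prompt: str) -> str:
    # Stand-in for a real provider call.
    return f"[{provider}/{model_id}] answer to: {prompt}"

def summarize_mode(prompt: str, targets: list[tuple[str, str]],
                   summarizer: tuple[str, str]) -> str:
    # Fan out: one call per target, collecting every response.
    responses = [call_model(p, m, prompt) for p, m in targets]
    combined = "\n\n".join(
        f"Response {i + 1}:\n{r}" for i, r in enumerate(responses)
    )
    # Fan in: the summarizer sees all responses and produces one answer.
    summary_prompt = (
        "Synthesize a single answer from the following model responses:\n\n"
        + combined
    )
    return call_model(*summarizer, summary_prompt)
```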

Fallback chains

Each model in the catalog can have a fallback chain — an ordered list of alternative (provider, model) pairs that the gateway routes to if the primary model is unavailable or returning errors.

A failover proceeds as follows:

  1. A request targets model A on provider X
  2. Provider X returns a 5xx error or the request times out
  3. The gateway’s health monitor records the failure for the (provider, model) pair
  4. The gateway immediately routes the request to the next entry in the fallback chain
  5. If that entry also fails, the gateway continues down the chain
  6. If the entire chain is exhausted, the request returns an error to the caller

Fallback chain traversal is transparent to the caller — the response format is identical regardless of which chain entry ultimately handled the request. The audit log records which provider and model served the request.
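The traversal described above can be sketched in a few lines. This is a minimal sketch, with the health-monitor bookkeeping reduced to a failure log; names are illustrative, not gateway internals:

```python
# Minimal sketch of fallback-chain traversal: try each (provider,
# model_id) entry in order, record each failure, and raise only when
# the whole chain is exhausted.

class ChainExhausted(Exception):
    pass

def route_with_fallback(request, chain, send, failure_log):
    """Try each (provider, model_id) entry in order; raise if all fail."""
    for provider, model_id in chain:
        try:
            return send(provider, model_id, request)
        except Exception:
            # The health monitor records the failure for this pair.
            failure_log.append((provider, model_id))
    raise ChainExhausted("all fallback entries failed")
```

The caller never sees which entry answered; only the audit log (here, `failure_log` plus the returned value) carries that information.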

Fallback chains are managed through the admin API:

```
PUT /api/providers/fallback/{model_id}
Content-Type: application/json

{
  "chain": [
    {"provider": "openai", "model_id": "gpt-4o"},
    {"provider": "anthropic", "model_id": "claude-sonnet-4-20250514"}
  ]
}
```
```
GET /api/providers/fallback/{model_id}
```

The response echoes the stored chain:

```
{
  "model_id": "gpt-4o",
  "chain": [
    {"provider": "openai", "model_id": "gpt-4o"},
    {"provider": "anthropic", "model_id": "claude-sonnet-4-20250514"}
  ]
}
```

Individual requests can override the stored fallback chain by specifying fallback_model and fallback_provider in the request body. This is useful for testing or for application workflows that require request-level control over failover behavior.
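The override precedence can be sketched as below. Note one assumption: whether a request-level override replaces the stored chain or merely prepends to it is not specified here, so this sketch has it replace the chain. The `fallback_provider` and `fallback_model` field names come from the text above:

```python
# Sketch of request-level fallback override: if the request body carries
# both fallback_provider and fallback_model, that pair is used instead
# of the stored chain. (Replace-vs-prepend is an assumption.)

def effective_chain(stored_chain, request_body):
    override_provider = request_body.get("fallback_provider")
    override_model = request_body.get("fallback_model")
    if override_provider and override_model:
        return [(override_provider, override_model)]
    return list(stored_chain)
```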

Health monitoring

The gateway monitors every (provider, model) pair in the catalog:

  • Polling interval: every 30 seconds per (provider, model) pair
  • Health endpoint: GET /api/providers/{provider}/models/{model_id}/health
  • Status values: healthy, degraded, unavailable

Health data feeds the automatic health monitor. If health polling detects consistent failures, the health monitor can disengage a provider proactively before user traffic is affected.

Each (provider, model) pair has an independent health monitor keyed by the (provider, model_id) tuple. A failing model on one provider does not affect the same model name on a different provider.

```mermaid
stateDiagram-v2
  [*] --> Active
  Active --> Disengaged : Failure threshold exceeded
  Disengaged --> Testing : Lockout period elapsed (300s)
  Testing --> Active : Test request succeeds
  Testing --> Disengaged : Test request fails
  Active : Normal operation
  Active : Requests routed normally
  Disengaged : Pair unavailable
  Disengaged : Requests skip to fallback
  Testing : One test request allowed
```
| State | Behavior |
| --- | --- |
| Active | Normal operation. Requests are routed to this (provider, model) pair. |
| Disengaged | The pair is unavailable. Requests skip this entry and proceed to the next in the fallback chain. |
| Testing | The lockout period has elapsed. The gateway routes one test request to the pair. A successful response returns the monitor to Active; a failure re-disengages it. |

The default lockout duration is 5 minutes (300 seconds). Recovery is automatic: after the lockout period, the health monitor transitions to Testing and allows a single test request through. If the test succeeds, the monitor returns to Active and traffic resumes. If it fails, the lockout restarts.
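The state cycle can be sketched as a small class. The 300-second lockout matches the stated default; the failure threshold and clock injection are illustrative assumptions:

```python
import time

# Minimal sketch of the Active -> Disengaged -> Testing cycle. The
# 300 s lockout is the documented default; the failure threshold is an
# illustrative assumption.

LOCKOUT_SECONDS = 300

class HealthMonitor:
    def __init__(self, failure_threshold=3, now=time.monotonic):
        self.state = "active"
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.disengaged_at = None
        self.now = now  # injectable clock, for testing

    def allow_request(self) -> bool:
        if self.state == "active":
            return True
        if self.state == "disengaged":
            if self.now() - self.disengaged_at >= LOCKOUT_SECONDS:
                self.state = "testing"  # one test request allowed
                return True
            return False
        return False  # testing: the single test request is already in flight

    def record_success(self):
        self.state = "active"
        self.failures = 0

    def record_failure(self):
        # A failure while testing, or crossing the threshold while
        # active, (re-)disengages the pair and restarts the lockout.
        if self.state == "testing" or self.failures + 1 >= self.failure_threshold:
            self.state = "disengaged"
            self.disengaged_at = self.now()
            self.failures = 0
        else:
            self.failures += 1
```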

Budget caps

Dollar budget caps interact with routing decisions at the request level. Each organization can configure per-user or per-group token and cost budgets. When a budget is exhausted, the gateway’s routing behavior changes:

  • Token budget exhausted: the gateway blocks the request before it reaches the model provider, returning an error indicating the budget has been reached
  • Dollar budget exhausted: same behavior — the request is blocked, not routed to a cheaper provider
  • Budget-aware routing (cost-optimized mode): when the routing mode is set to cost-optimized, the gateway selects the lowest-cost model in the catalog that meets the request’s capability requirements, within the remaining budget
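Cost-optimized selection can be sketched as a filter-then-minimize over the catalog. All catalog fields, prices, and the cost-estimation formula here are illustrative assumptions:

```python
# Sketch of cost-optimized selection: keep only catalog entries that
# satisfy the required capabilities, drop any whose estimated cost
# exceeds the remaining budget, then pick the cheapest. Fields and
# prices are illustrative, not real catalog data.

CATALOG = [
    {"provider": "openai", "model_id": "gpt-4o",
     "caps": {"vision"}, "usd_per_1k_in": 0.0025, "usd_per_1k_out": 0.0100},
    {"provider": "groq", "model_id": "llama-3.1-8b-instant",
     "caps": set(), "usd_per_1k_in": 0.00005, "usd_per_1k_out": 0.00008},
]

def pick_cost_optimized(catalog, required_caps, est_in_tokens,
                        est_out_tokens, remaining_budget_usd):
    def est_cost(entry):
        return ((est_in_tokens / 1000) * entry["usd_per_1k_in"]
                + (est_out_tokens / 1000) * entry["usd_per_1k_out"])

    candidates = [e for e in catalog if required_caps <= e["caps"]]
    affordable = [e for e in candidates if est_cost(e) <= remaining_budget_usd]
    if not affordable:
        return None  # blocked: nothing fits the remaining budget
    return min(affordable, key=est_cost)
```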

Budget enforcement happens in the payload analysis stage, before policy evaluation. A request blocked by a budget cap does not consume provider tokens and does not produce a DLP finding — but it does produce an audit log entry with outcome: BLOCK and block_reason: budget_exceeded.

Budget configuration is managed through the admin API at the organization level. Per-user and per-group budget overrides are supported.

Policy-driven routing

The Policy Engine’s ROUTE_TO action can override the destination model for a request based on policy rule conditions — for example, routing requests from a specific user group to a lower-cost model tier, or routing requests that contain certain entity types to a more capable model.

ROUTE_TO is a terminal action: when a policy rule fires ROUTE_TO, policy evaluation stops and the gateway routes the request to the specified model or tier. This happens before the request reaches the provider, and the routing decision is recorded in the audit log.
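The terminal semantics can be sketched as an ordered scan that stops at the first ROUTE_TO hit. The rule shape below is an illustrative assumption, not the Policy Engine's real schema:

```python
# Sketch of ROUTE_TO as a terminal action: rules are evaluated in order,
# and evaluation stops at the first rule that fires ROUTE_TO. Rule and
# action shapes are illustrative assumptions.

def evaluate(rules, request):
    """Return the (provider, model_id) destination, or None for the default route."""
    for rule in rules:
        if rule["condition"](request):
            if rule["action"] == "ROUTE_TO":
                return rule["target"]  # terminal: stop evaluating
            # non-terminal actions (e.g. logging) would be applied here
    return None
```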

For full details on writing ROUTE_TO rules, see Policy Engine overview.

Kill switch

For full details, see the dedicated Kill switch reference page.

The kill switch provides immediate, manual control to disable a provider or model. When activated:

  • All requests to the disabled (provider, model) pair are immediately blocked
  • Fallback chains skip the disabled entry automatically
  • The kill switch state is persisted and survives gateway restarts
  • The audit log records the activation with the identity of the admin who triggered it

The kill switch is available through both the admin API and the admin portal:

```
POST /api/admin/kill-switch
Content-Type: application/json

{
  "provider": "openai",
  "model_id": "gpt-4o",
  "enabled": false,
  "reason": "Provider security incident — disabling pending investigation"
}
```

The admin portal provides a visual toggle on the provider management page with a confirmation dialog that requires a reason string before activation.

Important: The kill switch bypasses fallback chains for the disabled entry. If gpt-4o is kill-switched and a fallback chain lists gpt-4o as a secondary entry, the gateway skips it and proceeds to the next entry in the chain. Plan your fallback chains with the assumption that any single entry may be kill-switched at any time.
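The skip behavior can be sketched as a filter over the chain. Names are illustrative, not gateway internals:

```python
# Sketch of kill-switch-aware traversal: any kill-switched (provider,
# model_id) entry is skipped outright, and the gateway proceeds to the
# next entry in the fallback chain.

def next_available(chain, kill_switched):
    """Yield chain entries that are not kill-switched."""
    for entry in chain:
        if entry in kill_switched:
            continue  # disabled entry: skip it entirely
        yield entry
```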

Related

  • DLP Overview — how the 3-tier DLP pipeline inspects requests after routing decisions
  • Audit Log — how routing decisions, fallback events, and kill switch activations are logged
  • Policy Engine overview — how the ROUTE_TO action interacts with routing
  • Credential Intelligence — detecting leaked credentials in routed traffic