Routing

Arbitex Gateway routes AI requests to model providers. You control which providers and models are available through the model catalog, how requests are distributed across providers, what happens when a provider fails, how dollar budgets constrain routing decisions, and how to disable a provider or model immediately in an emergency.

Model catalog

The model catalog is the registry of all providers and models that Arbitex Gateway can route traffic to. Each entry in the catalog represents a (provider, model) pair with its associated metadata: context window size, capability flags, cost per token (input and output), and availability status.

Providers are registered at the platform level with a dedicated adapter that handles authentication, request formatting, response parsing, and error normalization. Each provider adapter translates the gateway’s internal request format into the provider’s native API format, so your application code uses a single consistent interface regardless of which provider receives the request.

Models are registered within each provider. When a provider releases a new model, it becomes available in the catalog after the corresponding adapter version ships. The catalog is the authoritative source of available (provider, model) pairs — routing a request to a combination not in the catalog returns a 400 error.
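The catalog check described above amounts to a set-membership test. The following is a minimal sketch, not the gateway's actual implementation; all names are illustrative:

```python
# Minimal sketch of catalog validation: the catalog is the authoritative
# set of (provider, model) pairs, and any other combination is rejected
# with a 400. Names are illustrative, not Arbitex Gateway internals.

CATALOG = {
    ("openai", "gpt-4o"),
    ("anthropic", "claude-sonnet-4-20250514"),
}

def validate_route(provider: str, model_id: str) -> tuple[int, str]:
    """Return an (http_status, message) pair for a routing request."""
    if (provider, model_id) not in CATALOG:
        return 400, f"unknown (provider, model) pair: ({provider}, {model_id})"
    return 200, "ok"
```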

Arbitex supports nine providers:

| Provider | Protocol | Notes |
| --- | --- | --- |
| Anthropic | Native API | Claude model family |
| OpenAI | Native API | GPT model family |
| Google Gemini | Native API | Gemini model family |
| Azure OpenAI | Azure-specific API | Enterprise Azure deployments with custom endpoints |
| AWS Bedrock | AWS SDK | Multi-model access through AWS infrastructure |
| Groq | OpenAI-compatible | High-throughput inference on custom hardware |
| Mistral | Native API | Mistral model family |
| Cohere | Native API | Command model family |
| Ollama | OpenAI-compatible | Self-hosted open-source models |

Provider API keys are configured per organization and stored encrypted. Each provider has isolated credential storage — a key configured for one organization is not accessible to any other organization.

In addition to the built-in providers, you can configure custom endpoints that follow the OpenAI-compatible API format. This supports self-hosted models, fine-tuned deployments, and providers not yet included in the built-in adapter library.
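A custom endpoint registration might look like the following. This is a hypothetical shape; every field name here is an assumption, so consult the admin API for the real schema:

```python
# Hypothetical shape of a custom OpenAI-compatible endpoint registration.
# All field names are illustrative assumptions, not the real schema.
custom_endpoint = {
    "name": "internal-llama",
    "base_url": "https://llm.internal.example.com/v1",  # must speak the OpenAI-compatible API
    "api_key_ref": "secret/internal-llama",             # resolved from encrypted credential storage
    "models": ["llama-3-70b-instruct"],
}
```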

Routing modes

Every request specifies a routing mode that determines how the gateway processes it:

```mermaid
flowchart LR
  subgraph Single["Single Mode"]
    S_Req["Request"] --> S_GW["Gateway"] --> S_PA["Provider A"] --> S_Resp["Response"]
  end
  subgraph Compare["Compare Mode"]
    C_Req["Request"] --> C_GW["Gateway"]
    C_GW --> C_PA["Provider A"] --> C_RA["Response A"]
    C_GW --> C_PB["Provider B"] --> C_RB["Response B"]
  end
  subgraph Summarize["Summarize Mode"]
    Z_Req["Request"] --> Z_GW["Gateway"]
    Z_GW --> Z_PA["Provider A"] --> Z_RA["Response A"]
    Z_GW --> Z_PB["Provider B"] --> Z_RB["Response B"]
    Z_RA --> Z_Sum["Summarizer"]
    Z_RB --> Z_Sum
    Z_Sum --> Z_Final["Final Response"]
  end
```

Single mode

The default mode. The gateway sends the request to one model on one provider and returns the response.

Use Single mode for standard conversational interactions where you want a single model’s response with the lowest possible latency.

Compare mode

The gateway sends the same request to two or more models in parallel and returns every model's response to the caller.

User Request → Gateway → Provider A → Response A
                       → Provider B → Response B

Use Compare mode when evaluating model quality, testing prompt variations across providers, or giving users the ability to choose between responses.

Summarize mode

The gateway sends the request to multiple models, collects all responses, and then sends the combined responses to a designated summarization model. The caller receives a single synthesized response.

User Request → Gateway → Provider A → Response A ─┐
                       → Provider B → Response B ─┤→ Summarizer → Final Response
                       → Provider C → Response C ─┘

Use Summarize mode for high-stakes queries where you want consensus across models, or for research workflows that benefit from synthesized multi-model output.
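The Summarize fan-out can be sketched as follows. This is a minimal sketch under stated assumptions: `call_model` is a stand-in for a real provider call, and the prompt wording fed to the summarizer is illustrative:

```python
# Minimal sketch of Summarize mode: fan the prompt out to several
# (provider, model) targets, then feed all responses to a summarizer
# model. call_model() is a stand-in, not a real provider adapter.

def call_model(provider: str, model_id: str, prompt: str) -> str:
    # Stand-in for a real provider call.
    return f"[{provider}/{model_id}] answer to: {prompt}"

def summarize_mode(prompt: str, targets: list[tuple[str, str]],
                   summarizer: tuple[str, str]) -> str:
    # Fan out: one call per target, collecting every response.
    responses = [call_model(p, m, prompt) for p, m in targets]
    combined = "\n\n".join(
        f"Response {i + 1}:\n{r}" for i, r in enumerate(responses)
    )
    # Fan in: the summarizer sees all responses and produces one answer.
    summary_prompt = (
        "Synthesize a single answer from the following model responses:\n\n"
        + combined
    )
    return call_model(*summarizer, summary_prompt)
```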

Fallback chains

Each model in the catalog can have a fallback chain — an ordered list of alternative (provider, model) pairs that the gateway routes to if the primary model is unavailable or returning errors.

A failover proceeds as follows:

  1. A request targets model A on provider X
  2. Provider X returns a 5xx error or the request times out
  3. The gateway’s health monitor records the failure for the (provider, model) pair
  4. The gateway immediately routes the request to the next entry in the fallback chain
  5. If that entry also fails, the gateway continues down the chain
  6. If the entire chain is exhausted, the request returns an error to the caller

Fallback chain traversal is transparent to the caller — the response format is identical regardless of which chain entry ultimately handled the request. The audit log records which provider and model served the request.
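The traversal described above can be sketched in a few lines. This is a minimal sketch, with the health-monitor bookkeeping reduced to a failure log; names are illustrative, not gateway internals:

```python
# Minimal sketch of fallback-chain traversal: try each (provider,
# model_id) entry in order, record each failure, and raise only when
# the whole chain is exhausted.

class ChainExhausted(Exception):
    pass

def route_with_fallback(request, chain, send, failure_log):
    """Try each (provider, model_id) entry in order; raise if all fail."""
    for provider, model_id in chain:
        try:
            return send(provider, model_id, request)
        except Exception:
            # The health monitor records the failure for this pair.
            failure_log.append((provider, model_id))
    raise ChainExhausted("all fallback entries failed")
```

The caller never sees which entry answered; only the audit log (here, `failure_log` plus the returned value) carries that information.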

Fallback chains are managed through the admin API:

```
PUT /api/providers/fallback/{model_id}
Content-Type: application/json

{
  "chain": [
    {"provider": "openai", "model_id": "gpt-4o"},
    {"provider": "anthropic", "model_id": "claude-sonnet-4-20250514"}
  ]
}
```
```
GET /api/providers/fallback/{model_id}
```

The response echoes the stored chain:

```
{
  "model_id": "gpt-4o",
  "chain": [
    {"provider": "openai", "model_id": "gpt-4o"},
    {"provider": "anthropic", "model_id": "claude-sonnet-4-20250514"}
  ]
}
```

Individual requests can override the stored fallback chain by specifying fallback_model and fallback_provider in the request body. This is useful for testing or for application workflows that require request-level control over failover behavior.
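The override precedence can be sketched as below. Note one assumption: whether a request-level override replaces the stored chain or merely prepends to it is not specified here, so this sketch has it replace the chain. The `fallback_provider` and `fallback_model` field names come from the text above:

```python
# Sketch of request-level fallback override: if the request body carries
# both fallback_provider and fallback_model, that pair is used instead
# of the stored chain. (Replace-vs-prepend is an assumption.)

def effective_chain(stored_chain, request_body):
    override_provider = request_body.get("fallback_provider")
    override_model = request_body.get("fallback_model")
    if override_provider and override_model:
        return [(override_provider, override_model)]
    return list(stored_chain)
```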

Health monitoring

The gateway monitors every (provider, model) pair in the catalog:

  • Polling interval: every 30 seconds per (provider, model) pair
  • Health endpoint: GET /api/providers/{provider}/models/{model_id}/health
  • Status values: healthy, degraded, unavailable

Health data feeds the automatic health monitor. If health polling detects consistent failures, the health monitor can disengage a provider proactively before user traffic is affected.

Each (provider, model) pair has an independent health monitor keyed by the (provider, model_id) tuple. A failing model on one provider does not affect the same model name on a different provider.

```mermaid
stateDiagram-v2
  [*] --> Active
  Active --> Disengaged : Failure threshold exceeded
  Disengaged --> Testing : Lockout period elapsed (300s)
  Testing --> Active : Test request succeeds
  Testing --> Disengaged : Test request fails
  Active : Normal operation
  Active : Requests routed normally
  Disengaged : Pair unavailable
  Disengaged : Requests skip to fallback
  Testing : One test request allowed
```
| State | Behavior |
| --- | --- |
| Active | Normal operation. Requests are routed to this (provider, model) pair. |
| Disengaged | The pair is unavailable. Requests skip this entry and proceed to the next in the fallback chain. |
| Testing | The lockout period has elapsed. The gateway routes one test request to the pair. A successful response returns the monitor to Active; a failure re-disengages it. |

The default lockout duration is 5 minutes (300 seconds). Recovery is automatic: after the lockout period, the health monitor transitions to Testing and allows a single test request through. If the test succeeds, the monitor returns to Active and traffic resumes. If it fails, the lockout restarts.
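The state cycle can be sketched as a small class. The 300-second lockout matches the stated default; the failure threshold and clock injection are illustrative assumptions:

```python
import time

# Minimal sketch of the Active -> Disengaged -> Testing cycle. The
# 300 s lockout is the documented default; the failure threshold is an
# illustrative assumption.

LOCKOUT_SECONDS = 300

class HealthMonitor:
    def __init__(self, failure_threshold=3, now=time.monotonic):
        self.state = "active"
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.disengaged_at = None
        self.now = now  # injectable clock, for testing

    def allow_request(self) -> bool:
        if self.state == "active":
            return True
        if self.state == "disengaged":
            if self.now() - self.disengaged_at >= LOCKOUT_SECONDS:
                self.state = "testing"  # one test request allowed
                return True
            return False
        return False  # testing: the single test request is already in flight

    def record_success(self):
        self.state = "active"
        self.failures = 0

    def record_failure(self):
        # A failure while testing, or crossing the threshold while
        # active, (re-)disengages the pair and restarts the lockout.
        if self.state == "testing" or self.failures + 1 >= self.failure_threshold:
            self.state = "disengaged"
            self.disengaged_at = self.now()
            self.failures = 0
        else:
            self.failures += 1
```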

Budget caps

Dollar budget caps interact with routing decisions at the request level. Each organization can configure per-user or per-group token and cost budgets. When a budget is exhausted, the gateway’s routing behavior changes:

  • Token budget exhausted: the gateway blocks the request before it reaches the model provider, returning an error indicating the budget has been reached
  • Dollar budget exhausted: same behavior — the request is blocked, not routed to a cheaper provider
  • Budget-aware routing (cost-optimized mode): when the routing mode is set to cost-optimized, the gateway selects the lowest-cost model in the catalog that meets the request’s capability requirements, within the remaining budget
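Cost-optimized selection can be sketched as a filter-then-minimize over the catalog. All catalog fields, prices, and the cost-estimation formula here are illustrative assumptions:

```python
# Sketch of cost-optimized selection: keep only catalog entries that
# satisfy the required capabilities, drop any whose estimated cost
# exceeds the remaining budget, then pick the cheapest. Fields and
# prices are illustrative, not real catalog data.

CATALOG = [
    {"provider": "openai", "model_id": "gpt-4o",
     "caps": {"vision"}, "usd_per_1k_in": 0.0025, "usd_per_1k_out": 0.0100},
    {"provider": "groq", "model_id": "llama-3.1-8b-instant",
     "caps": set(), "usd_per_1k_in": 0.00005, "usd_per_1k_out": 0.00008},
]

def pick_cost_optimized(catalog, required_caps, est_in_tokens,
                        est_out_tokens, remaining_budget_usd):
    def est_cost(entry):
        return ((est_in_tokens / 1000) * entry["usd_per_1k_in"]
                + (est_out_tokens / 1000) * entry["usd_per_1k_out"])

    candidates = [e for e in catalog if required_caps <= e["caps"]]
    affordable = [e for e in candidates if est_cost(e) <= remaining_budget_usd]
    if not affordable:
        return None  # blocked: nothing fits the remaining budget
    return min(affordable, key=est_cost)
```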

Budget enforcement happens in the payload analysis stage, before policy evaluation. A request blocked by a budget cap does not consume provider tokens and does not produce a DLP finding — but it does produce an audit log entry with outcome: BLOCK and block_reason: budget_exceeded.

Budget configuration is managed through the admin API at the organization level. Per-user and per-group budget overrides are supported.

Policy-driven routing

The Policy Engine’s ROUTE_TO action can override the destination model for a request based on policy rule conditions — for example, routing requests from a specific user group to a lower-cost model tier, or routing requests that contain certain entity types to a more capable model.

ROUTE_TO is a terminal action: when a policy rule fires ROUTE_TO, policy evaluation stops and the gateway routes the request to the specified model or tier. This happens before the request reaches the provider, and the routing decision is recorded in the audit log.
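The terminal semantics can be sketched as an ordered scan that stops at the first ROUTE_TO hit. The rule shape below is an illustrative assumption, not the Policy Engine's real schema:

```python
# Sketch of ROUTE_TO as a terminal action: rules are evaluated in order,
# and evaluation stops at the first rule that fires ROUTE_TO. Rule and
# action shapes are illustrative assumptions.

def evaluate(rules, request):
    """Return the (provider, model_id) destination, or None for the default route."""
    for rule in rules:
        if rule["condition"](request):
            if rule["action"] == "ROUTE_TO":
                return rule["target"]  # terminal: stop evaluating
            # non-terminal actions (e.g. logging) would be applied here
    return None
```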

For full details on writing ROUTE_TO rules, see Policy Engine overview.

Kill switch

For full details, see the dedicated Kill switch reference page.

The kill switch provides immediate, manual control to disable a provider or model. When activated:

  • All requests to the disabled (provider, model) pair are immediately blocked
  • Fallback chains skip the disabled entry automatically
  • The kill switch state is persisted and survives gateway restarts
  • The audit log records the activation with the identity of the admin who triggered it

The kill switch is available through both the admin API and the admin portal:

```
POST /api/admin/kill-switch
Content-Type: application/json

{
  "provider": "openai",
  "model_id": "gpt-4o",
  "enabled": false,
  "reason": "Provider security incident — disabling pending investigation"
}
```

The admin portal provides a visual toggle on the provider management page with a confirmation dialog that requires a reason string before activation.

Important: The kill switch bypasses fallback chains for the disabled entry. If gpt-4o is kill-switched and a fallback chain lists gpt-4o as a secondary entry, the gateway skips it and proceeds to the next entry in the chain. Plan your fallback chains with the assumption that any single entry may be kill-switched at any time.
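The skip behavior can be sketched as a filter over the chain. Names are illustrative, not gateway internals:

```python
# Sketch of kill-switch-aware traversal: any kill-switched (provider,
# model_id) entry is skipped outright, and the gateway proceeds to the
# next entry in the fallback chain.

def next_available(chain, kill_switched):
    """Yield chain entries that are not kill-switched."""
    for entry in chain:
        if entry in kill_switched:
            continue  # disabled entry: skip it entirely
        yield entry
```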

Related

  • DLP Overview — how the 3-tier DLP pipeline inspects requests after routing decisions
  • Audit Log — how routing decisions, fallback events, and kill switch activations are logged
  • Policy Engine overview — how the ROUTE_TO action interacts with routing
  • Credential Intelligence — detecting leaked credentials in routed traffic