Model routing configuration

Model routing determines which provider and model serves each AI request. When the primary model is unavailable or degraded, the gateway routes requests to the next available model in the fallback chain. This guide covers how to configure fallback chains, interpret latency thresholds, and read health-based routing state.


How routing works

Every request specifies a target model. The gateway resolves the request against the following routing layers in order:

  1. Policy routing — if an active policy rule has a ROUTE_TO action matching the request, the request is directed to the policy-specified model or tier. See Policy Engine testing for simulation patterns.
  2. Primary model — if no policy routing applies, the request is forwarded to the provider hosting the requested model.
  3. Fallback chain — if the primary provider fails health checks (circuit breaker open) or returns a 5xx error, the gateway attempts each fallback in priority order.

If all models in the fallback chain are unavailable, the request returns a 503 Service Unavailable error.
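The layered resolution above can be sketched as follows. All names here (the policy rule shape, the health flags) are illustrative assumptions, not the gateway's actual internals, and for brevity the sketch keys provider health by model ID:

```python
def resolve_route(request, policy_rules, providers, fallback_chain):
    # 1. Policy routing: an active ROUTE_TO rule wins outright.
    for rule in policy_rules:
        if rule.get("action") == "ROUTE_TO" and rule["matches"](request):
            return rule["target_model"]
    # 2. Primary model: forward to it if its provider is healthy.
    primary = request["model"]
    if providers.get(primary, {}).get("healthy"):
        return primary
    # 3. Fallback chain: try entries in ascending priority order.
    for entry in sorted(fallback_chain, key=lambda e: e["priority"]):
        if providers.get(entry["model_id"], {}).get("healthy"):
            return entry["model_id"]
    # 4. Everything is down: surface a 503 to the caller.
    raise RuntimeError("503 Service Unavailable")
```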


Configuring fallback chains

Navigate to Admin → Routing → Fallback Chains to manage per-model fallback chains.

Each model has an independent fallback chain — an ordered list of alternative models to try when the primary fails. The gateway attempts fallbacks in priority order (priority 1 first, then 2, and so on).

A fallback is triggered when:

  • The primary provider’s circuit breaker is open (3 or more consecutive failures).
  • The primary provider returns a 5xx response code.
  • The primary provider’s p95 latency exceeds the configured threshold.
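The three trigger conditions can be expressed as a single check. This is an illustrative sketch; the argument names and the idea of a per-model p95 threshold parameter are assumptions, not the gateway's internal API:

```python
def should_fallback(circuit_open, status_code, p95_ms, p95_threshold_ms):
    if circuit_open:                   # breaker open after 3+ consecutive failures
        return True
    if 500 <= status_code <= 599:      # provider returned a 5xx response
        return True
    if p95_ms > p95_threshold_ms:      # p95 latency over the configured threshold
        return True
    return False
```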
To add a fallback:

  1. Navigate to Admin → Routing → Fallback Chains.
  2. Locate the model by searching by name or filtering by provider.
  3. Click the model row to expand it.
  4. In the Select a model to add as fallback dropdown, choose the alternative model and provider.
  5. Click Add.
  6. Reorder entries using the up/down arrows. Lower priority numbers are tried first.
  7. Click Save Changes to persist.

The save button appears only when unsaved changes exist. Navigating away without saving discards changes.

To remove a fallback:

  1. Expand the model row.
  2. Click Remove next to the fallback entry.
  3. Click Save Changes.

Each fallback entry has a priority number displayed in the leftmost column. Lower numbers are tried first (priority 1 before priority 2). Use the ▲/▼ buttons to move entries up or down. Priority numbers are automatically renumbered when entries are added, removed, or reordered.
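The automatic renumbering amounts to rewriting priorities as a dense 1..N sequence in display order. A minimal sketch (the entry shape mirrors the API payload shown later in this guide):

```python
def renumber(fallbacks):
    """Rewrite priorities as 1..N, preserving the current relative order."""
    ordered = sorted(fallbacks, key=lambda e: e["priority"])
    for i, entry in enumerate(ordered, start=1):
        entry["priority"] = i
    return ordered
```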

Fallback chain configuration is also accessible via the API:

GET /api/providers/fallback/{model_id}
PUT /api/providers/fallback/{model_id}

PUT request body:

{
  "model_id": "claude-3-5-sonnet-20241022",
  "fallbacks": [
    {
      "model_id": "gpt-4o",
      "provider_name": "openai",
      "priority": 1
    },
    {
      "model_id": "gemini-1.5-pro",
      "provider_name": "google",
      "priority": 2
    }
  ]
}

Fallback entries are evaluated in ascending priority order. Set fallbacks to an empty array to remove all fallbacks for a model.
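A PUT call can be assembled with the Python standard library. This is a sketch only: the base URL and bearer-token auth are assumptions about your deployment, so check your gateway's actual auth scheme:

```python
import json
from urllib.request import Request

def build_put_request(base_url, model_id, fallbacks, token):
    """Build a PUT request for /api/providers/fallback/{model_id}."""
    body = json.dumps({"model_id": model_id, "fallbacks": fallbacks}).encode()
    return Request(
        f"{base_url}/api/providers/fallback/{model_id}",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Pass the returned request to `urllib.request.urlopen` (or any HTTP client) to execute it.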


Monitoring latency

Navigate to Admin → Monitoring → Latency to view per-model latency percentiles.

The latency monitor shows the following metrics for each model, aggregated over the selected time window:

| Metric | Description |
| --- | --- |
| p50 | 50th percentile (median) response time |
| p95 | 95th percentile — the latency that 95% of requests complete within |
| p99 | 99th percentile — the latency that 99% of requests complete within |
| Avg | Mean response time across all requests |
| Requests | Total request count in the selected window |
| Trend | Direction vs. the previous equivalent window |

Use the time window selector to change the aggregation period:

| Window | Description |
| --- | --- |
| 1 Hour | Last 60 minutes |
| 24 Hours | Last 24 hours |
| 7 Days | Last 7 days |
| 30 Days | Last 30 days |

Click Refresh to reload the current window’s data.

The status indicator next to each p50 and p95 value reflects the following thresholds:

| Status | p50 / p95 value |
| --- | --- |
| Healthy (green) | < 200 ms |
| Warning (yellow) | 200–499 ms |
| Critical (red) | ≥ 500 ms |

p99 and avg values are displayed without status indicators.
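The thresholds above reduce to a small lookup. The cutoffs are taken directly from the table; the function name is illustrative:

```python
def latency_status(ms):
    """Classify a p50 or p95 value against the documented thresholds."""
    if ms < 200:
        return "healthy"   # green
    if ms < 500:
        return "warning"   # yellow, 200-499 ms
    return "critical"      # red, >= 500 ms
```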

The Trend column shows the direction of change compared to the previous equivalent window:

| Indicator | Meaning |
| --- | --- |
| ↑ (red) | Latency increased — degrading |
| ↓ (green) | Latency decreased — improving |
| — (grey) | No significant change |

A red upward trend in the p95 column for a provider signals developing latency degradation that may soon exceed the configured threshold and trigger fallback routing.
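One way to sketch the trend computation: compare the current window against the previous equivalent window, treating small moves as noise. The 5% "no significant change" band is an assumption; the product may use a different cutoff:

```python
def trend(current_ms, previous_ms, band=0.05):
    """Return the trend indicator for a latency metric across two windows."""
    if previous_ms == 0:
        return "—"
    change = (current_ms - previous_ms) / previous_ms
    if change > band:
        return "↑"   # degrading
    if change < -band:
        return "↓"   # improving
    return "—"       # no significant change
```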


Circuit breakers and health scores

The gateway uses a circuit breaker per provider to detect and isolate failing providers automatically.

| State | Meaning |
| --- | --- |
| Closed | Provider is healthy. Requests are forwarded normally. |
| Open | Provider has failed health checks. Requests are routed to the fallback chain. |
| Half-open | Gateway is testing the provider with a single request. If it succeeds, the circuit closes. If it fails, the circuit remains open. |

The Health score (0.0–1.0) reflects the provider’s recent reliability:

  • A score of 1.0 means no failures in the current window.
  • A score approaching 0.0 means most recent requests have failed.

The health score is derived from the failure rate across the rolling request window. A score below ~0.5 typically means the circuit breaker has opened or is about to open.

The breaker tracks two failure metrics:

  • Failure count — consecutive failures since the last successful request. The circuit opens at 3 consecutive failures.
  • Failure rate — proportion of failed requests in the sliding window.
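The health score described above is simply the success rate over a rolling window. A minimal sketch; the window size of 20 requests is an illustrative assumption:

```python
from collections import deque

class HealthTracker:
    """Track a provider's health score over a rolling request window."""

    def __init__(self, window=20):
        self.results = deque(maxlen=window)  # True = success, False = failure

    def record(self, success):
        self.results.append(success)

    def score(self):
        # 1.0 with no recorded failures; approaches 0.0 as failures dominate.
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)
```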

Recovery is automatic. After the circuit opens, the gateway checks the provider at intervals. When 5 consecutive health checks pass, the circuit closes and the provider re-enters normal routing.
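The open/close behavior can be sketched as a small state machine using the thresholds from this section (open after 3 consecutive request failures, close after 5 consecutive passing health checks). The half-open probing step is omitted here for brevity, and the class shape is an assumption, not the gateway's implementation:

```python
class CircuitBreaker:
    OPEN_AFTER = 3    # consecutive request failures that open the circuit
    CLOSE_AFTER = 5   # consecutive passing health checks that close it

    def __init__(self):
        self.state = "closed"
        self.failures = 0
        self.passes = 0

    def record_request(self, success):
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.OPEN_AFTER:
            self.state = "open"

    def record_health_check(self, passed):
        if self.state != "open":
            return
        if not passed:
            self.passes = 0
            return
        self.passes += 1
        if self.passes >= self.CLOSE_AFTER:
            self.state = "closed"
            self.failures = 0
            self.passes = 0
```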


Provider filtering in the fallback manager


The fallback chain manager supports filtering to find models quickly:

  • Search — filter by model name, model ID, or provider name.
  • Provider filter — dropdown to show only models from a specific provider.
  • Click Clear Filters to reset both filters.

The total number of matching models is shown next to the filter controls.
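The filtering behavior amounts to a case-insensitive substring search combined with an optional provider filter. An illustrative sketch, assuming model records shaped like the API payload earlier in this guide:

```python
def filter_models(models, search="", provider=None):
    """Filter models by search text (name, ID, or provider) and provider."""
    s = search.lower()
    out = []
    for m in models:
        if provider and m["provider_name"] != provider:
            continue
        if s and not any(
            s in m[field].lower()
            for field in ("name", "model_id", "provider_name")
        ):
            continue
        out.append(m)
    return out
```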


Best practices

When designing fallback chains, consider:

Provider diversity — configure fallbacks across different providers (e.g., Anthropic primary → OpenAI fallback) rather than multiple models from the same provider. A provider-wide outage would otherwise exhaust the entire chain.

Capability parity — fallback models should support the same capabilities required by your workload (streaming, function calling, context length). Routing to a less capable model may cause application-level errors even when the model responds successfully.

Cost implications — fallback models may have different token pricing. Review provider costs in Admin → Providers → Model Catalog before configuring fallback chains across price tiers.

Latency expectations — a fallback to a higher-latency provider may affect user-facing response times. Monitor the latency window after configuring new chains to confirm acceptable performance.