Model routing configuration

Model routing determines which provider and model serves each AI request. When the primary model is unavailable or degraded, the gateway routes requests to the next available model in the fallback chain. This guide covers how to configure fallback chains, interpret latency thresholds, and read health-based routing state.


How routing works

Every request specifies a target model. The gateway resolves the request against the following routing layers in order:

  1. Policy routing — if an active policy rule has a ROUTE_TO action matching the request, the request is directed to the policy-specified model or tier. See Policy Engine testing for simulation patterns.
  2. Primary model — if no policy routing applies, the request is forwarded to the provider hosting the requested model.
  3. Fallback chain — if the primary provider fails health checks (circuit breaker open) or returns a 5xx error, the gateway attempts each fallback in priority order.

If all models in the fallback chain are unavailable, the request returns a 503 Service Unavailable error.
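The layered resolution above can be sketched as follows. All names here (the policy rule shape, the health flags) are illustrative assumptions, not the gateway's actual internals, and for brevity the sketch keys provider health by model ID:

```python
def resolve_route(request, policy_rules, providers, fallback_chain):
    # 1. Policy routing: an active ROUTE_TO rule wins outright.
    for rule in policy_rules:
        if rule.get("action") == "ROUTE_TO" and rule["matches"](request):
            return rule["target_model"]
    # 2. Primary model: forward to it if its provider is healthy.
    primary = request["model"]
    if providers.get(primary, {}).get("healthy"):
        return primary
    # 3. Fallback chain: try entries in ascending priority order.
    for entry in sorted(fallback_chain, key=lambda e: e["priority"]):
        if providers.get(entry["model_id"], {}).get("healthy"):
            return entry["model_id"]
    # 4. Everything is down: surface a 503 to the caller.
    raise RuntimeError("503 Service Unavailable")
```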


Configuring fallback chains

Navigate to Admin → Routing → Fallback Chains to manage per-model fallback chains.

Each model has an independent fallback chain — an ordered list of alternative models to try when the primary fails. The gateway attempts fallbacks in priority order (priority 1 first, then 2, and so on).

A fallback is triggered when:

  • The primary provider’s circuit breaker is open (3 or more consecutive failures).
  • The primary provider returns a 5xx response code.
  • The primary provider’s p95 latency exceeds the configured threshold.
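The three trigger conditions can be expressed as a single check. This is an illustrative sketch; the argument names and the idea of a per-model p95 threshold parameter are assumptions, not the gateway's internal API:

```python
def should_fallback(circuit_open, status_code, p95_ms, p95_threshold_ms):
    if circuit_open:                   # breaker open after 3+ consecutive failures
        return True
    if 500 <= status_code <= 599:      # provider returned a 5xx response
        return True
    if p95_ms > p95_threshold_ms:      # p95 latency over the configured threshold
        return True
    return False
```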
To add a fallback:

  1. Navigate to Admin → Routing → Fallback Chains.
  2. Locate the model by searching by name or filtering by provider.
  3. Click the model row to expand it.
  4. In the Select a model to add as fallback dropdown, choose the alternative model and provider.
  5. Click Add.
  6. Reorder entries using the up/down arrows. Lower priority numbers are tried first.
  7. Click Save Changes to persist.

The save button appears only when unsaved changes exist. Navigating away without saving discards changes.

To remove a fallback:

  1. Expand the model row.
  2. Click Remove next to the fallback entry.
  3. Click Save Changes.

Each fallback entry has a priority number displayed in the leftmost column. Lower numbers are tried first (priority 1 before priority 2). Use the ▲/▼ buttons to move entries up or down. Priority numbers are automatically renumbered when entries are added, removed, or reordered.
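The automatic renumbering amounts to rewriting priorities as a dense 1..N sequence in display order. A minimal sketch (the entry shape mirrors the API payload shown later in this guide):

```python
def renumber(fallbacks):
    """Rewrite priorities as 1..N, preserving the current relative order."""
    ordered = sorted(fallbacks, key=lambda e: e["priority"])
    for i, entry in enumerate(ordered, start=1):
        entry["priority"] = i
    return ordered
```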

Fallback chain configuration is also accessible via the API:

GET /api/providers/fallback/{model_id}
PUT /api/providers/fallback/{model_id}

PUT request body:

{
  "model_id": "claude-3-5-sonnet-20241022",
  "fallbacks": [
    {
      "model_id": "gpt-4o",
      "provider_name": "openai",
      "priority": 1
    },
    {
      "model_id": "gemini-1.5-pro",
      "provider_name": "google",
      "priority": 2
    }
  ]
}

Fallback entries are evaluated in ascending priority order. Set fallbacks to an empty array to remove all fallbacks for a model.
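A PUT call can be assembled with the Python standard library. This is a sketch only: the base URL and bearer-token auth are assumptions about your deployment, so check your gateway's actual auth scheme:

```python
import json
from urllib.request import Request

def build_put_request(base_url, model_id, fallbacks, token):
    """Build a PUT request for /api/providers/fallback/{model_id}."""
    body = json.dumps({"model_id": model_id, "fallbacks": fallbacks}).encode()
    return Request(
        f"{base_url}/api/providers/fallback/{model_id}",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Pass the returned request to `urllib.request.urlopen` (or any HTTP client) to execute it.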


Monitoring latency

Navigate to Admin → Monitoring → Latency to view per-model latency percentiles.

The latency monitor shows the following metrics for each model, aggregated over the selected time window:

| Metric | Description |
| --- | --- |
| p50 | 50th percentile (median) response time |
| p95 | 95th percentile — the latency that 95% of requests complete within |
| p99 | 99th percentile — the latency that 99% of requests complete within |
| Avg | Mean response time across all requests |
| Requests | Total request count in the selected window |
| Trend | Direction vs. the previous equivalent window |

Use the time window selector to change the aggregation period:

| Window | Description |
| --- | --- |
| 1 Hour | Last 60 minutes |
| 24 Hours | Last 24 hours |
| 7 Days | Last 7 days |
| 30 Days | Last 30 days |

Click Refresh to reload the current window’s data.

The status indicator next to each p50 and p95 value reflects the following thresholds:

| Status | p50 / p95 value |
| --- | --- |
| Healthy (green) | < 200 ms |
| Warning (yellow) | 200–499 ms |
| Critical (red) | ≥ 500 ms |

p99 and avg values are displayed without status indicators.
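The thresholds above reduce to a small lookup. The cutoffs are taken directly from the table; the function name is illustrative:

```python
def latency_status(ms):
    """Classify a p50 or p95 value against the documented thresholds."""
    if ms < 200:
        return "healthy"   # green
    if ms < 500:
        return "warning"   # yellow, 200-499 ms
    return "critical"      # red, >= 500 ms
```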

The Trend column shows the direction of change compared to the previous equivalent window:

| Indicator | Meaning |
| --- | --- |
| ↑ (red) | Latency increased — degrading |
| ↓ (green) | Latency decreased — improving |
| — (grey) | No significant change |

A red upward trend in the p95 column for a provider signals developing latency degradation that may soon exceed the configured threshold and trigger fallback routing.
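One way to sketch the trend computation: compare the current window against the previous equivalent window, treating small moves as noise. The 5% "no significant change" band is an assumption; the product may use a different cutoff:

```python
def trend(current_ms, previous_ms, band=0.05):
    """Return the trend indicator for a latency metric across two windows."""
    if previous_ms == 0:
        return "—"
    change = (current_ms - previous_ms) / previous_ms
    if change > band:
        return "↑"   # degrading
    if change < -band:
        return "↓"   # improving
    return "—"       # no significant change
```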


Circuit breakers and health scores

The gateway uses a circuit breaker per provider to detect and isolate failing providers automatically.

| State | Meaning |
| --- | --- |
| Closed | Provider is healthy. Requests are forwarded normally. |
| Open | Provider has failed health checks. Requests are routed to the fallback chain. |
| Half-open | Gateway is testing the provider with a single request. If it succeeds, the circuit closes. If it fails, the circuit remains open. |

The Health score (0.0–1.0) reflects the provider’s recent reliability:

  • A score of 1.0 means no failures in the current window.
  • A score approaching 0.0 means most recent requests have failed.

The health score is derived from the failure rate across the rolling request window. A score below ~0.5 typically means the circuit breaker has opened or is about to open.

The breaker tracks two failure metrics:

  • Failure count — consecutive failures since the last successful request. The circuit opens at 3 consecutive failures.
  • Failure rate — proportion of failed requests in the sliding window.
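The health score described above is simply the success rate over a rolling window. A minimal sketch; the window size of 20 requests is an illustrative assumption:

```python
from collections import deque

class HealthTracker:
    """Track a provider's health score over a rolling request window."""

    def __init__(self, window=20):
        self.results = deque(maxlen=window)  # True = success, False = failure

    def record(self, success):
        self.results.append(success)

    def score(self):
        # 1.0 with no recorded failures; approaches 0.0 as failures dominate.
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)
```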

Recovery is automatic. After the circuit opens, the gateway checks the provider at intervals. When 5 consecutive health checks pass, the circuit closes and the provider re-enters normal routing.
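The open/close behavior can be sketched as a small state machine using the thresholds from this section (open after 3 consecutive request failures, close after 5 consecutive passing health checks). The half-open probing step is omitted here for brevity, and the class shape is an assumption, not the gateway's implementation:

```python
class CircuitBreaker:
    OPEN_AFTER = 3    # consecutive request failures that open the circuit
    CLOSE_AFTER = 5   # consecutive passing health checks that close it

    def __init__(self):
        self.state = "closed"
        self.failures = 0
        self.passes = 0

    def record_request(self, success):
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.OPEN_AFTER:
            self.state = "open"

    def record_health_check(self, passed):
        if self.state != "open":
            return
        if not passed:
            self.passes = 0
            return
        self.passes += 1
        if self.passes >= self.CLOSE_AFTER:
            self.state = "closed"
            self.failures = 0
            self.passes = 0
```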


Provider filtering in the fallback manager


The fallback chain manager supports filtering to find models quickly:

  • Search — filter by model name, model ID, or provider name.
  • Provider filter — dropdown to show only models from a specific provider.
  • Click Clear Filters to reset both filters.

The total number of matching models is shown next to the filter controls.
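The filtering behavior amounts to a case-insensitive substring search combined with an optional provider filter. An illustrative sketch, assuming model records shaped like the API payload earlier in this guide:

```python
def filter_models(models, search="", provider=None):
    """Filter models by search text (name, ID, or provider) and provider."""
    s = search.lower()
    out = []
    for m in models:
        if provider and m["provider_name"] != provider:
            continue
        if s and not any(
            s in m[field].lower()
            for field in ("name", "model_id", "provider_name")
        ):
            continue
        out.append(m)
    return out
```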


Best practices

When designing fallback chains, consider:

Provider diversity — configure fallbacks across different providers (e.g., Anthropic primary → OpenAI fallback) rather than multiple models from the same provider. A provider-wide outage would otherwise exhaust the entire chain.

Capability parity — fallback models should support the same capabilities required by your workload (streaming, function calling, context length). Routing to a less capable model may cause application-level errors even when the model responds successfully.

Cost implications — fallback models may have different token pricing. Review provider costs in Admin → Providers → Model Catalog before configuring fallback chains across price tiers.

Latency expectations — a fallback to a higher-latency provider may affect user-facing response times. Monitor the latency window after configuring new chains to confirm acceptable performance.