# Model routing configuration
Model routing determines which provider and model serves each AI request. When the primary model is unavailable or degraded, the gateway routes requests to the next available model in the fallback chain. This guide covers how to configure fallback chains, interpret latency thresholds, and read health-based routing state.
## How routing works

Every request specifies a target model. The gateway resolves the request against the following routing layers in order:
- Policy routing — if an active policy rule has a `ROUTE_TO` action matching the request, the request is directed to the policy-specified model or tier. See Policy Engine testing for simulation patterns.
- Primary model — if no policy routing applies, the request is forwarded to the provider hosting the requested model.
- Fallback chain — if the primary provider fails health checks (circuit breaker open) or returns a 5xx error, the gateway attempts each fallback in priority order.
If all models in the fallback chain are unavailable, the request returns a 503 Service Unavailable error.
## Fallback chain configuration

Navigate to Admin → Routing → Fallback Chains to manage per-model fallback chains.
### How fallback chains work

Each model has an independent fallback chain — an ordered list of alternative models to try when the primary fails. The gateway attempts fallbacks in priority order (priority 1 first, then 2, and so on).
A fallback is triggered when:
- The primary provider’s circuit breaker is open (3 or more consecutive failures).
- The primary provider returns a 5xx response code.
- The primary provider’s p95 latency exceeds the configured threshold.
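Those three trigger conditions amount to a single boolean check. A minimal sketch (the function name and parameters are hypothetical):

```python
def should_fail_over(circuit_open: bool, status_code: int,
                     p95_ms: float, threshold_ms: float) -> bool:
    """True when any documented fallback trigger applies."""
    return (
        circuit_open                   # 3+ consecutive failures opened the circuit
        or 500 <= status_code <= 599   # provider returned a 5xx
        or p95_ms > threshold_ms       # p95 latency above the configured threshold
    )
```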
### Adding a fallback chain

- Navigate to Admin → Routing → Fallback Chains.
- Locate the model by searching by name or filtering by provider.
- Click the model row to expand it.
- In the Select a model to add as fallback dropdown, choose the alternative model and provider.
- Click Add.
- Reorder entries using the up/down arrows. Lower priority numbers are tried first.
- Click Save Changes to persist.
The save button appears only when unsaved changes exist. Navigating away without saving discards changes.
### Removing a fallback entry

- Expand the model row.
- Click Remove next to the fallback entry.
- Click Save Changes.
### Reordering fallback priority

Each fallback entry has a priority number displayed in the leftmost column. Lower numbers are tried first (priority 1 before priority 2). Use the ▲/▼ buttons to move entries up or down. Priority numbers are automatically renumbered when entries are added, removed, or reordered.
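The automatic renumbering can be pictured as sorting by current priority and reassigning 1..n. A sketch, assuming entries shaped like the API's fallback objects:

```python
def renumber(fallbacks: list[dict]) -> list[dict]:
    """Reassign priorities 1..n after entries are added, removed, or reordered."""
    ordered = sorted(fallbacks, key=lambda fb: fb["priority"])
    for i, fb in enumerate(ordered, start=1):
        fb["priority"] = i
    return ordered
```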
Fallback chain configuration is also accessible via the API:
```
GET /api/providers/fallback/{model_id}
PUT /api/providers/fallback/{model_id}
```

PUT request body:

```json
{
  "model_id": "claude-3-5-sonnet-20241022",
  "fallbacks": [
    { "model_id": "gpt-4o", "provider_name": "openai", "priority": 1 },
    { "model_id": "gemini-1.5-pro", "provider_name": "google", "priority": 2 }
  ]
}
```

Fallback entries are evaluated in ascending priority order. Set `fallbacks` to an empty array to remove all fallbacks for a model.
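As a client-side sketch, the PUT body can be assembled from an ordered chain, with priorities derived from list position. The helper is hypothetical; only the payload shape comes from this page.

```python
import json

def build_fallback_payload(model_id: str, chain: list[tuple[str, str]]) -> str:
    """Build the PUT body from an ordered (model_id, provider_name) chain.

    Priorities are assigned from list order: the first entry gets priority 1.
    """
    return json.dumps({
        "model_id": model_id,
        "fallbacks": [
            {"model_id": m, "provider_name": p, "priority": i}
            for i, (m, p) in enumerate(chain, start=1)
        ],
    })
```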
## Latency monitoring

Navigate to Admin → Monitoring → Latency to view per-model latency percentiles.
### Latency metrics

The latency monitor shows the following metrics for each model, aggregated over the selected time window:
| Metric | Description |
|---|---|
| p50 | 50th percentile (median) response time |
| p95 | 95th percentile — the latency that 95% of requests complete within |
| p99 | 99th percentile — the latency that 99% of requests complete within |
| Avg | Mean response time across all requests |
| Requests | Total request count in the selected window |
| Trend | Direction vs. the previous equivalent window |
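The percentile columns can be reproduced from raw response times. A nearest-rank sketch follows; the gateway's exact estimator is not documented, so treat the method as an assumption.

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample value such that at least
    pct% of requests complete within it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]
```

With 100 samples of 1..100 ms, `percentile(samples, 95)` returns 95.0, matching the table's reading of p95.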
### Time windows

Use the time window selector to change the aggregation period:
| Window | Description |
|---|---|
| 1 Hour | Last 60 minutes |
| 24 Hours | Last 24 hours |
| 7 Days | Last 7 days |
| 30 Days | Last 30 days |
Click Refresh to reload the current window’s data.
### Latency status thresholds

The status indicator next to each p50 and p95 value reflects the following thresholds:
| Status | p50 / p95 value |
|---|---|
| Healthy (green) | < 200 ms |
| Warning (yellow) | 200–499 ms |
| Critical (red) | ≥ 500 ms |
p99 and avg values are displayed without status indicators.
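The threshold table maps directly to a small classifier (the function name is hypothetical):

```python
def latency_status(value_ms: float) -> str:
    """Map a p50 or p95 value to its status indicator."""
    if value_ms < 200:
        return "healthy"   # green
    if value_ms < 500:
        return "warning"   # yellow, 200-499 ms
    return "critical"      # red, >= 500 ms
```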
### Trend indicators

The Trend column shows the direction of change compared to the previous equivalent window:
| Indicator | Meaning |
|---|---|
| ↑ (red) | Latency increased — degrading |
| ↓ (green) | Latency decreased — improving |
| — (grey) | No significant change |
A red upward trend on the p95 column for a provider indicates emerging latency that may trigger fallback routing.
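The trend computation can be sketched as a relative comparison between windows. The ±5% significance band below is an assumption; the page does not state the exact cutoff.

```python
def trend(current_ms: float, previous_ms: float, band: float = 0.05) -> str:
    """Compare the current window's latency to the previous equivalent window."""
    if previous_ms == 0:
        return "—"
    change = (current_ms - previous_ms) / previous_ms
    if change > band:
        return "↑"   # degrading
    if change < -band:
        return "↓"   # improving
    return "—"       # no significant change
```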
## Health-based routing

The gateway uses a circuit breaker per provider to detect and isolate failing providers automatically.
### Circuit breaker states

| State | Meaning |
|---|---|
| Closed | Provider is healthy. Requests are forwarded normally. |
| Open | Provider has failed health checks. Requests are routed to the fallback chain. |
| Half-open | Gateway is testing the provider with a single request. If it succeeds, the circuit closes. If it fails, the circuit remains open. |
### Health score

The Health score (0.0–1.0) reflects the provider’s recent reliability:
- A score of 1.0 means no failures in the current window.
- A score approaching 0.0 means most recent requests have failed.
The health score is derived from the failure rate across the rolling request window. A score below ~0.5 typically means the circuit breaker has opened or is about to open.
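Under that description, the score is simply the success rate over the rolling window. A sketch, assuming a boolean outcome per request:

```python
from collections import deque

def health_score(outcomes: deque) -> float:
    """Success rate over the rolling request window (True = request succeeded).

    An empty window is treated as healthy, matching a score of 1.0 for
    'no failures in the current window'.
    """
    if not outcomes:
        return 1.0
    return sum(outcomes) / len(outcomes)
```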
### Failure count and failure rate

- Failure count — consecutive failures since the last successful request. The circuit opens at 3 consecutive failures.
- Failure rate — proportion of failed requests in the sliding window.
Recovery is automatic. After the circuit opens, the gateway checks the provider at intervals. When 5 consecutive health checks pass, the circuit closes and the provider re-enters normal routing.
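The full lifecycle, open after 3 consecutive failures, half-open while probing, closed again after 5 consecutive passing health checks, can be sketched as a small state machine. The two thresholds come from this page; the class and method names are illustrative.

```python
class CircuitBreaker:
    """Minimal sketch of the documented per-provider circuit breaker."""

    OPEN_AFTER_FAILURES = 3      # consecutive request failures that open the circuit
    CLOSE_AFTER_SUCCESSES = 5    # consecutive passing health checks that close it

    def __init__(self) -> None:
        self.state = "closed"
        self._failures = 0
        self._probe_successes = 0

    def record_request(self, ok: bool) -> None:
        """Track consecutive failures while traffic flows normally."""
        if ok:
            self._failures = 0
        else:
            self._failures += 1
            if self._failures >= self.OPEN_AFTER_FAILURES:
                self.state = "open"
                self._probe_successes = 0

    def record_health_check(self, ok: bool) -> None:
        """Periodic probe issued while the circuit is not closed."""
        if self.state == "closed":
            return
        if ok:
            self.state = "half-open"
            self._probe_successes += 1
            if self._probe_successes >= self.CLOSE_AFTER_SUCCESSES:
                self.state = "closed"
                self._failures = 0
                self._probe_successes = 0
        else:
            self.state = "open"
            self._probe_successes = 0
```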
## Provider filtering in the fallback manager

The fallback chain manager supports filtering to find models quickly:
- Search — filter by model name, model ID, or provider name.
- Provider filter — dropdown to show only models from a specific provider.
- Click Clear Filters to reset both filters.
The total number of matching models is shown next to the filter controls.
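The search and provider filters compose as two predicates. A sketch, assuming model records carry `name`, `model_id`, and `provider_name` fields:

```python
def filter_models(models: list[dict], search: str = "", provider: str = "") -> list[dict]:
    """Case-insensitive search over name, ID, and provider, plus an exact
    provider filter; empty arguments match everything."""
    q = search.lower()
    return [
        m for m in models
        if (not q or q in m["name"].lower() or q in m["model_id"].lower()
            or q in m["provider_name"].lower())
        and (not provider or m["provider_name"] == provider)
    ]
```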
## Planning fallback chains

When designing fallback chains, consider:
Provider diversity — configure fallbacks across different providers (e.g., Anthropic primary → OpenAI fallback) rather than multiple models from the same provider. A provider-wide outage would otherwise exhaust the entire chain.
Capability parity — fallback models should support the same capabilities required by your workload (streaming, function calling, context length). Routing to a less capable model may cause application-level errors even when the model responds successfully.
Cost implications — fallback models may have different token pricing. Review provider costs in Admin → Providers → Model Catalog before configuring fallback chains across price tiers.
Latency expectations — a fallback to a higher-latency provider may affect user-facing response times. Monitor the latency window after configuring new chains to confirm acceptable performance.
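The provider-diversity guideline can be checked mechanically, for example by flagging fallback entries that share the primary's provider. This is a hypothetical helper, not a gateway feature:

```python
def check_provider_diversity(primary_provider: str, fallbacks: list[dict]) -> list[str]:
    """Return a warning for each fallback entry hosted by the primary's provider,
    since a provider-wide outage would take those entries down with the primary."""
    return [
        f"priority {fb['priority']}: {fb['model_id']} shares provider "
        f"{primary_provider!r} with the primary"
        for fb in fallbacks
        if fb["provider_name"] == primary_provider
    ]
```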
## See also

- Provider management — configure providers and API credentials
- Policy Engine testing — simulate `ROUTE_TO` policy rules
- Kill switch — disable all model traffic immediately
- Portal operations — read-only routing views available to end users