ADR-004: Redis for CSRF State (Not Server Memory)

Status: Accepted
Date: 2026-03
Deciders: Platform team (platform-0043, T557)

Context

OAuth 2.0 authorization code flows and SAML SP-initiated login both require the server to generate and later verify a random state / RelayState parameter. This parameter is a CSRF mitigation: the server stores the expected value at login initiation and checks it on callback, rejecting any callback that does not present the matching value.

The naive implementation stores state tokens in the memory of the server process (a Python dict or equivalent). This works for a single-replica deployment, but breaks in any horizontally scaled environment:

A callback request routed to a different replica than the one that initiated the login will not find the state token in memory.
The result is authentication failures (“invalid CSRF state”) in production under load balancer round-robin.
In-memory state is lost on process restart — any in-flight OAuth/SAML flows fail after a rolling deploy.

Additionally, a plain in-memory store has no built-in TTL mechanism, so state tokens from logins that were started but never completed accumulate indefinitely.
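The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the platform's code: the class name InMemoryStateStore and the two "replica" objects are hypothetical stand-ins for two server processes behind a load balancer.

```python
import secrets


class InMemoryStateStore:
    """Per-process state store backed by a plain dict (the naive approach)."""

    def __init__(self) -> None:
        self._states: dict[str, bool] = {}

    def issue(self) -> str:
        state = secrets.token_urlsafe(32)
        self._states[state] = True  # stays here until consumed: no TTL
        return state

    def consume(self, state: str) -> bool:
        # pop() makes the token single-use, but only within this process.
        return self._states.pop(state, None) is not None


# Each replica holds its own, unshared dict.
replica_a = InMemoryStateStore()
replica_b = InMemoryStateStore()

state = replica_a.issue()                  # login initiated on replica A
callback_on_b = replica_b.consume(state)   # callback routed to replica B
callback_on_a = replica_a.consume(state)   # the same callback, had it reached A

print(callback_on_b)  # False: replica B never saw this state
print(callback_on_a)  # True
```

The token is valid, yet whether the login succeeds depends entirely on which replica the load balancer picks for the callback.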

Decision

OAuth and SAML CSRF state tokens are stored in Redis, not in server memory.

Key format: oauth:state:{state} and saml:state:{nonce}
TTL: 600 seconds (10 minutes) — after which the key is automatically deleted by Redis
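Single-use enforcement: Tokens are retrieved and deleted atomically using a Redis pipeline (GET + DEL in one round-trip). The atomicity depends on the pipeline being transactional (MULTI/EXEC): pipelining by itself only batches commands and does not stop another client's command from running between the GET and the DEL. If two requests race for the same state, only one sees the stored value; the second receives None and is rejected.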
Fail-closed: If Redis is unavailable when a state must be stored or consumed, the store raises an error and the OAuth/SAML flow is rejected with HTTP 503 rather than silently allowing the flow to proceed unsafely.
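The four points above could be combined roughly as follows. This is a sketch under stated assumptions, not the actual implementation: RedisStateStore, StateStoreUnavailable, and callback_status are hypothetical names, the client is assumed to behave like redis-py's redis.Redis, the HTTP 400 for a rejected state is an assumption (the decision only specifies 503 for Redis failures), and the broad except clause would normally be narrowed to redis.exceptions.RedisError.

```python
import secrets
from typing import Any, Optional

STATE_TTL_SECONDS = 600  # 10 minutes, as stated in the decision


class StateStoreUnavailable(Exception):
    """Raised when the backing store cannot be reached (maps to HTTP 503)."""


class RedisStateStore:
    def __init__(self, client: Any, prefix: str = "oauth:state:") -> None:
        # Assumption: `client` exposes set() and pipeline() like redis.Redis.
        self._client = client
        self._prefix = prefix

    def issue(self, payload: str = "1") -> str:
        state = secrets.token_urlsafe(32)
        try:
            # ex= sets the TTL, so Redis deletes the key on its own.
            self._client.set(self._prefix + state, payload, ex=STATE_TTL_SECONDS)
        except Exception as exc:  # assumption: narrow to RedisError in practice
            raise StateStoreUnavailable("cannot store CSRF state") from exc
        return state

    def consume(self, state: str) -> Optional[Any]:
        key = self._prefix + state
        try:
            # transaction=True wraps the queued commands in MULTI/EXEC,
            # so no other client can run a command between GET and DEL.
            pipe = self._client.pipeline(transaction=True)
            pipe.get(key)
            pipe.delete(key)
            value, _deleted = pipe.execute()
        except Exception as exc:  # assumption: narrow to RedisError in practice
            raise StateStoreUnavailable("cannot consume CSRF state") from exc
        return value  # None: unknown, expired, or already used


def callback_status(store: RedisStateStore, state: str) -> int:
    """Hypothetical handler fragment: map store outcomes to HTTP statuses."""
    try:
        value = store.consume(state)
    except StateStoreUnavailable:
        return 503  # fail closed: never proceed without a verified state
    return 200 if value is not None else 400  # 400 is an assumed choice
```

With redis-py, the client could be created with redis.Redis.from_url(os.environ["REDIS_URL"]), and a SAML store would pass prefix="saml:state:". On Redis 6.2 or later, the single GETDEL command is an alternative to the transactional pipeline; that is a substitution, not what this ADR specifies.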

Consequences

Positive:

Works correctly in multi-replica deployments — any replica can validate the state token regardless of which replica initiated the login.
Automatic expiry via Redis TTL prevents state accumulation.
Atomic single-use enforcement (pipeline GET+DEL) prevents replay attacks.
State survives process restarts — rolling deploys no longer cause in-flight login failures.
Consistent with other Redis usage in the platform (session store, rate limiting).

Negative / trade-offs:

Adds a hard dependency on Redis for OAuth and SAML login. If Redis is unavailable, login is unavailable. Previously, login was resilient to Redis downtime (at the cost of correctness in multi-replica setups).
The fail-closed design means that even brief Redis outages or failovers interrupt login. Redis must be configured with appropriate HA (Redis Sentinel or Redis Cluster) for production.
Google OAuth and SAML SSO are enabled only when REDIS_URL is set. If REDIS_URL is missing, these features are disabled and a warning is logged.

Configuration:

REDIS_URL — Redis connection URL (required for OAuth and SAML state stores).
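The "disabled with a logged warning" behavior might look like the following at startup. This is an illustrative sketch only: the function name sso_redis_url and the warning text are hypothetical, and treating a whitespace-only value as missing is an assumption.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger(__name__)


def sso_redis_url(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return REDIS_URL, or None (after a warning) if SSO must stay disabled."""
    url = env.get("REDIS_URL", "").strip()
    if not url:
        logger.warning(
            "REDIS_URL is not set; Google OAuth and SAML SSO are disabled"
        )
        return None
    return url
```

A caller would register the OAuth and SAML routes only when this returns a URL, so a missing setting surfaces as absent features rather than as login failures at callback time.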