# Security Hardening — Epic M Features

Epic M (platform-0043) shipped five security hardening features that strengthen the Arbitex Platform’s secrets management, CSRF protection, DLP reliability, mTLS chain verification, and SCIM token isolation. This guide explains what each feature does, how to configure it, and how to verify it is working correctly.
## 1. Azure Key Vault Secrets Backend

### What it does

By default the platform reads sensitive configuration values (database URLs, JWT keys, etc.) from environment variables. The pluggable secrets backend lets you source these values from Azure Key Vault instead, keeping production secrets out of container environment variables and Kubernetes manifests.
Three backends are supported:

| Backend | Description |
|---|---|
| `env` (default) | Reads from environment variables — existing behaviour, no change required. |
| `vault` / `azure_keyvault` | Reads from Azure Key Vault. Both values select the same backend. |
| `file` | Reads from a JSON file at `SECRETS_FILE` — useful for local development. |
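The table above can be sketched as a simple dispatch function. This is a minimal illustration under stated assumptions, not the platform's actual resolver: `get_secret` is a hypothetical name, and the Key Vault branch is stubbed out (the real path would use `azure.keyvault.secrets.SecretClient` with `DefaultAzureCredential`).

```python
import json
import os

def get_secret(name: str) -> str:
    """Illustrative dispatch over SECRETS_BACKEND (hypothetical helper)."""
    backend = os.environ.get("SECRETS_BACKEND", "env")
    if backend == "env":
        # Existing behaviour: read straight from the environment.
        return os.environ[name]
    if backend in ("vault", "azure_keyvault"):
        # Stub: the real backend fetches from Azure Key Vault via
        # SecretClient(vault_url, DefaultAzureCredential()).
        raise NotImplementedError("Azure Key Vault lookup omitted in this sketch")
    if backend == "file":
        # Local development: secrets live in a JSON file.
        with open(os.environ["SECRETS_FILE"]) as f:
            return json.load(f)[name]
    raise ValueError(f"unknown SECRETS_BACKEND: {backend}")
```
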
### Configuration

```bash
# Select the Azure Key Vault backend
SECRETS_BACKEND=vault

# Vault URL (one of the two forms)
AZURE_VAULT_URL=https://myvault.vault.azure.net
# or
AZURE_KEYVAULT_URL=https://myvault.vault.azure.net
```

Authentication uses `DefaultAzureCredential`, which automatically picks up:

- Azure Managed Identity (AKS pod identity / workload identity — recommended for production)
- `AZURE_CLIENT_ID` + `AZURE_TENANT_ID` + `AZURE_CLIENT_SECRET` environment variables (service principal)
- Azure CLI credentials (`az login` — local development)

Secret name translation: Azure Key Vault names cannot contain underscores. The backend automatically translates Python-style names: `jwt_secret_key` → `jwt-secret-key`. No manual renaming is required.
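The translation itself is a single substitution. A one-function sketch (the helper name `to_vault_name` is hypothetical):

```python
def to_vault_name(name: str) -> str:
    # Azure Key Vault secret names allow only alphanumerics and dashes,
    # so Python-style underscores are mapped to dashes before lookup.
    return name.replace("_", "-")

# to_vault_name("jwt_secret_key") -> "jwt-secret-key"
```
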
### Startup behaviour

On startup the platform probes the vault with a connectivity check. If the vault URL is missing or unreachable:

- The application fails fast with a `RuntimeError` — it will not start.
- This prevents silent startup with empty or default secrets.
### Verification

Check the platform startup logs for:

```
Azure Key Vault backend initialised for https://myvault.vault.azure.net
```

If the vault is unreachable you will see:

```
Azure Key Vault at https://myvault.vault.azure.net is unreachable at startup: ...
```

## 2. CSRF Protection — OAuth and SAML State in Redis
### What it does

OAuth 2.0 and SAML SP-initiated login flows both use a random state / RelayState parameter to prevent CSRF attacks. These one-time tokens are now stored in Redis rather than server memory. This provides:

- Correctness in multi-replica deployments — server-memory CSRF state breaks when requests are routed to different replicas. Redis provides a shared, consistent store.
- Automatic expiry — tokens expire after 10 minutes via Redis TTL. Expired tokens are cleaned up automatically.
- Single-use enforcement — each token is retrieved and deleted atomically in a Redis pipeline (`GET` + `DEL`). Replay attacks are rejected because the key is gone after first use.
- Fail-closed — if Redis is unavailable, the store raises an error and the OAuth/SAML flow is rejected with HTTP 503. The platform will not fall back to a less-secure mechanism.
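The single-use pattern can be sketched as below. The class and method names are hypothetical, not the platform's actual implementation; `client` stands in for a redis-py `Redis` instance, whose `pipeline()` wraps the two commands in a MULTI/EXEC transaction by default.

```python
import secrets

STATE_TTL = 600  # seconds (10 minutes), matching the documented TTL

class CsrfStateStore:
    """Sketch of a single-use CSRF state store (hypothetical class name).

    `client` is any Redis-like object exposing setex() and a pipeline()
    with get()/delete()/execute(), e.g. redis.Redis from redis-py.
    """

    def __init__(self, client, prefix: str = "oauth:state:"):
        self.client = client
        self.prefix = prefix

    def issue(self) -> str:
        token = secrets.token_urlsafe(32)
        # Store with a TTL so abandoned login attempts expire automatically.
        self.client.setex(self.prefix + token, STATE_TTL, "1")
        return token

    def consume(self, token: str) -> bool:
        # GET + DEL in one pipeline: the key is gone after first use,
        # so a replayed token fails this check.
        pipe = self.client.pipeline()
        pipe.get(self.prefix + token)
        pipe.delete(self.prefix + token)
        value, _ = pipe.execute()
        return value is not None
```
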
### Configuration

No additional configuration is required beyond setting `REDIS_URL`. Both OAuth and SAML CSRF stores initialise automatically from the application settings at startup.

```bash
REDIS_URL=redis://redis:6379/0
```

| Variable | Required | Description |
|---|---|---|
| `REDIS_URL` | Yes | Redis connection URL. Required for Google OAuth and SAML login flows. |

Key format:

- OAuth states: `oauth:state:{state_token}`
- SAML nonces: `saml:state:{nonce}`

TTL: 600 seconds (10 minutes) for both OAuth and SAML state tokens.
### Verification

After a successful OAuth or SAML login attempt, confirm in Redis that no stale keys remain:

```bash
redis-cli keys "oauth:state:*"   # should be empty after login completes
redis-cli keys "saml:state:*"    # should be empty after login completes
```

If Redis is not configured (`REDIS_URL` empty), the platform logs:

```
REDIS_URL is not set — OAuth CSRF state store is disabled. Google OAuth will be unavailable until Redis is configured.
```

Google OAuth and SAML login will return HTTP 503 until Redis is configured.
## 3. DLP Inference Fail-Closed Mode

### What it does

The Platform hosts GPU-accelerated DLP inference microservices (NER and DeBERTa contextual classification). When a microservice is unavailable (crash, OOM, slow startup), the platform must decide whether to allow the request through unscanned or block it.
Fail-closed (`closed`, default): When the inference microservice is unavailable, the DLP check is treated as a block — the request is denied rather than passed unscanned. This is the safe production default; it prevents sensitive data from reaching AI providers when DLP is degraded.

Fail-open (`open`): Allows requests through when the inference microservice is unavailable, without DLP scanning. This matches the legacy behaviour and is suitable only for environments where availability takes precedence over data protection.
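A sketch of the fail-mode decision, assuming a hypothetical `run_inference` callable standing in for the GPU microservice client and a hypothetical `DLPUnavailableError` exception; the platform's real client classes are `MicroserviceNERDetector` and `DeBERTaValidatorClient`:

```python
import os

class DLPUnavailableError(Exception):
    """Raised when DLP must block because inference is down (hypothetical name)."""

def scan_or_fail(text: str, run_inference) -> dict:
    """Illustrative fail-mode handling around an inference call.

    Any connection error from the microservice is handled according to
    DLP_INFERENCE_FAIL_MODE: block (closed, default) or pass unscanned (open).
    """
    mode = os.environ.get("DLP_INFERENCE_FAIL_MODE", "closed")
    try:
        return run_inference(text)
    except ConnectionError:
        if mode == "open":
            # Legacy behaviour: let the request through unscanned.
            return {"entities": [], "scanned": False}
        # Default: fail closed. The caller turns this into a blocked request.
        raise DLPUnavailableError("DLP inference unavailable — failing closed")
```
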
### Configuration

```bash
DLP_INFERENCE_FAIL_MODE=closed   # default — block on GPU microservice unavailability
# or
DLP_INFERENCE_FAIL_MODE=open     # legacy — allow unscanned (not recommended for production)
```

| Variable | Default | Values | Description |
|---|---|---|---|
| `DLP_INFERENCE_FAIL_MODE` | `closed` | `closed`, `open` | Controls `MicroserviceNERDetector` and `DeBERTaValidatorClient` behaviour when the inference microservice is unreachable. |
### Verification

With `DLP_INFERENCE_FAIL_MODE=closed`, when the NER microservice is stopped, requests that would trigger DLP scanning should receive an error response rather than passing through. Check the platform logs for:

```
DLP inference unavailable — failing closed (blocking request)
```

## 4. mTLS Full Chain Verification
### What it does

The Platform internal API routes (`/v1/internal/`) use mTLS to authenticate inbound calls from Hybrid Outposts and other internal services. Previously, verification required specifying a root CA certificate path and optionally an intermediate certificate path as separate files.
Full chain verification (`MTLS_CA_BUNDLE`) allows a single PEM file containing the complete CA chain (root + optional intermediates concatenated) to be used for trust anchoring. When set:

- The bundle is parsed into a trusted CA pool.
- Leaf certificate chains are verified against the pool.
- `MTLS_CA_BUNDLE` takes precedence over the legacy `CLOUD_CA_CERT_PATH` + `CLOUD_CA_INTERMEDIATE_PATH` configuration.
This simplifies certificate rotation (update one bundle file) and supports complex PKI hierarchies (multi-level intermediate chains).
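For illustration, splitting a bundle into its individual certificates can be sketched in a few lines of Python. The helper name is hypothetical; the real verifier additionally parses each PEM block into an X.509 trust pool and validates the leaf's chain against it.

```python
def split_pem_bundle(pem_text: str) -> list[str]:
    """Split a concatenated PEM bundle into individual certificate blocks.

    Sketch only: returns each BEGIN/END CERTIFICATE block as a string.
    """
    certs, current, inside = [], [], False
    for line in pem_text.splitlines():
        stripped = line.strip()
        if stripped == "-----BEGIN CERTIFICATE-----":
            inside, current = True, [line]
        elif stripped == "-----END CERTIFICATE-----" and inside:
            current.append(line)
            certs.append("\n".join(current))
            inside = False
        elif inside:
            current.append(line)
    return certs
```
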
### Configuration

```bash
# New — full chain bundle (recommended)
MTLS_CA_BUNDLE=/etc/arbitex/certs/ca-bundle.pem

# Legacy — separate root and optional intermediate (still supported)
CLOUD_CA_CERT_PATH=/etc/arbitex/certs/root-ca.pem
CLOUD_CA_INTERMEDIATE_PATH=/etc/arbitex/certs/intermediate-ca.pem   # optional
```

| Variable | Default | Description |
|---|---|---|
| `MTLS_CA_BUNDLE` | `""` | Path to a PEM file containing one or more CA certificates (root + intermediates). Takes precedence over `CLOUD_CA_CERT_PATH` when set. |
| `CLOUD_CA_CERT_PATH` | `""` | Path to root CA PEM (legacy). Used when `MTLS_CA_BUNDLE` is not set. |
| `CLOUD_CA_INTERMEDIATE_PATH` | `""` | Path to intermediate CA PEM (legacy, optional). Used when `MTLS_CA_BUNDLE` is not set. |
Development note: If neither `MTLS_CA_BUNDLE` nor `CLOUD_CA_CERT_PATH` is set, internal endpoints are unavailable (HTTP 503) unless the development bypass is explicitly enabled via internal support configuration. Never disable mTLS for production deployments.
### Verification

After setting `MTLS_CA_BUNDLE`, confirm an outpost heartbeat is accepted:

```bash
curl --cert outpost.pem --key outpost.key --cacert ca-bundle.pem \
  https://platform/v1/orgs/{org_id}/outposts/{outpost_id}/heartbeat \
  -H "Content-Type: application/json" -d '{"version":"0.1.0","uptime":60,...}'
# Expect HTTP 200
```

## 5. SCIM Per-Org Tokens
### What it does

SCIM 2.0 provisioning (for syncing users and groups from an IdP like Okta or Azure AD) previously used a global bearer token shared across all organisations. A compromised token could expose provisioning access for every tenant.

Per-org SCIM tokens give each organisation an independently rotatable bearer token stored as a bcrypt hash. Key properties:

- The raw token is returned once at rotation time and cannot be recovered from the database.
- Only the bcrypt hash is persisted (`org_scim_tokens` table).
- One active token per org at a time — the active token has `rotated_at IS NULL`.
- Previous tokens are preserved with a `rotated_at` timestamp for audit history.
- Rotating one org’s token has zero effect on any other org.
### CRUD API

Rotate (generate new) token:

```
POST /v1/scim/orgs/{org_id}/token/rotate
Authorization: Bearer <admin-user-JWT>
```

Response (raw token shown once):

```json
{
  "token": "arb_scim_<random>",
  "created_at": "2026-03-12T14:00:00Z"
}
```

List tokens (history):

```
GET /v1/scim/orgs/{org_id}/tokens
Authorization: Bearer <admin-user-JWT>
```

Returns all token records (active + historical). The active token has `rotated_at: null`.

Delete all tokens (emergency revocation):

```
DELETE /v1/scim/orgs/{org_id}/tokens
Authorization: Bearer <admin-user-JWT>
```

Deactivates all tokens for the org. SCIM provisioning is suspended until a new token is rotated.
### Migration from global token

If you previously used the global `SCIM_BEARER_TOKEN` environment variable:

1. In the Admin Portal, navigate to Settings → SCIM for each org.
2. Click Rotate Token to generate a per-org token.
3. Update your IdP configuration (Okta, Azure AD, etc.) with the new token.
4. Repeat for each org.
5. After all orgs have been migrated, remove `SCIM_BEARER_TOKEN` from your platform environment.
The platform accepts per-org tokens independently of the global token. Both can coexist during migration.
### Verification

After rotating a token, the SCIM provisioning endpoint should accept requests authenticated with the new token:

```bash
curl https://platform/v1/scim/v2/Users \
  -H "Authorization: Bearer arb_scim_<your-token>"
# Expect HTTP 200 with user list
```

Requests using the old (pre-rotation) token should receive HTTP 401.