# Security Hardening — Epic M Features

Epic M (platform-0043) shipped five security hardening features that strengthen the Arbitex Platform’s secrets management, CSRF protection, DLP reliability, mTLS chain verification, and SCIM token isolation. This guide explains what each feature does, how to configure it, and how to verify it is working correctly.
## 1. Azure Key Vault Secrets Backend

### What it does

By default the platform reads sensitive configuration values (database URLs, JWT keys, etc.) from environment variables. The pluggable secrets backend lets you source these values from Azure Key Vault instead, keeping production secrets out of container environment variables and Kubernetes manifests.
Three backends are supported:

| Backend | Description |
|---|---|
| `env` (default) | Reads from environment variables — existing behaviour, no change required. |
| `vault` / `azure_keyvault` | Reads from Azure Key Vault. Both values select the same backend. |
| `file` | Reads from a JSON file at `SECRETS_FILE` — useful for local development. |
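The table above can be sketched as a simple dispatch function. This is a minimal illustration under stated assumptions, not the platform's actual resolver: `get_secret` is a hypothetical name, and the Key Vault branch is stubbed out (the real path would use `azure.keyvault.secrets.SecretClient` with `DefaultAzureCredential`).

```python
import json
import os

def get_secret(name: str) -> str:
    """Illustrative dispatch over SECRETS_BACKEND (hypothetical helper)."""
    backend = os.environ.get("SECRETS_BACKEND", "env")
    if backend == "env":
        # Existing behaviour: read straight from the environment.
        return os.environ[name]
    if backend in ("vault", "azure_keyvault"):
        # Stub: the real backend fetches from Azure Key Vault via
        # SecretClient(vault_url, DefaultAzureCredential()).
        raise NotImplementedError("Azure Key Vault lookup omitted in this sketch")
    if backend == "file":
        # Local development: secrets live in a JSON file.
        with open(os.environ["SECRETS_FILE"]) as f:
            return json.load(f)[name]
    raise ValueError(f"unknown SECRETS_BACKEND: {backend}")
```
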
### Configuration

```bash
# Select the Azure Key Vault backend
SECRETS_BACKEND=vault

# Vault URL (one of the two forms)
AZURE_VAULT_URL=https://myvault.vault.azure.net
# or
AZURE_KEYVAULT_URL=https://myvault.vault.azure.net
```

Authentication uses `DefaultAzureCredential`, which automatically picks up:

- Azure Managed Identity (AKS pod identity / workload identity — recommended for production)
- `AZURE_CLIENT_ID` + `AZURE_TENANT_ID` + `AZURE_CLIENT_SECRET` environment variables (service principal)
- Azure CLI credentials (`az login` — local development)

Secret name translation: Azure Key Vault names cannot contain underscores. The backend automatically translates Python-style names: `jwt_secret_key` → `jwt-secret-key`. No manual renaming is required.
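The translation itself is a single substitution. A one-function sketch (the helper name `to_vault_name` is hypothetical):

```python
def to_vault_name(name: str) -> str:
    # Azure Key Vault secret names allow only alphanumerics and dashes,
    # so Python-style underscores are mapped to dashes before lookup.
    return name.replace("_", "-")

# to_vault_name("jwt_secret_key") -> "jwt-secret-key"
```
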
### Startup behaviour

On startup the platform probes the vault with a connectivity check. If the vault URL is missing or unreachable:

- The application fails fast with a `RuntimeError` — it will not start.
- This prevents silent startup with empty or default secrets.
### Verification

Check the platform startup logs for:

```
Azure Key Vault backend initialised for https://myvault.vault.azure.net
```

If the vault is unreachable you will see:

```
Azure Key Vault at https://myvault.vault.azure.net is unreachable at startup: ...
```

## 2. CSRF Protection — OAuth and SAML State in Redis
### What it does

OAuth 2.0 and SAML SP-initiated login flows both use a random state / RelayState parameter to prevent CSRF attacks. These one-time tokens are now stored in Redis rather than server memory. This provides:

- Correctness in multi-replica deployments — server-memory CSRF state breaks when requests are routed to different replicas. Redis provides a shared, consistent store.
- Automatic expiry — tokens expire after 10 minutes via Redis TTL. Expired tokens are cleaned up automatically.
- Single-use enforcement — each token is retrieved and deleted atomically in a Redis pipeline (`GET` + `DEL`). Replay attacks are rejected because the key is gone after first use.
- Fail-closed — if Redis is unavailable, the store raises an error and the OAuth/SAML flow is rejected with HTTP 503. The platform will not fall back to a less-secure mechanism.
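The single-use pattern can be sketched as below. The class and method names are hypothetical, not the platform's actual implementation; `client` stands in for a redis-py `Redis` instance, whose `pipeline()` wraps the two commands in a MULTI/EXEC transaction by default.

```python
import secrets

STATE_TTL = 600  # seconds (10 minutes), matching the documented TTL

class CsrfStateStore:
    """Sketch of a single-use CSRF state store (hypothetical class name).

    `client` is any Redis-like object exposing setex() and a pipeline()
    with get()/delete()/execute(), e.g. redis.Redis from redis-py.
    """

    def __init__(self, client, prefix: str = "oauth:state:"):
        self.client = client
        self.prefix = prefix

    def issue(self) -> str:
        token = secrets.token_urlsafe(32)
        # Store with a TTL so abandoned login attempts expire automatically.
        self.client.setex(self.prefix + token, STATE_TTL, "1")
        return token

    def consume(self, token: str) -> bool:
        # GET + DEL in one pipeline: the key is gone after first use,
        # so a replayed token fails this check.
        pipe = self.client.pipeline()
        pipe.get(self.prefix + token)
        pipe.delete(self.prefix + token)
        value, _ = pipe.execute()
        return value is not None
```
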
### Configuration

No additional configuration is required beyond setting `REDIS_URL`. Both OAuth and SAML CSRF stores initialise automatically from the application settings at startup.

```bash
REDIS_URL=redis://redis:6379/0
```

| Variable | Required | Description |
|---|---|---|
| `REDIS_URL` | Yes | Redis connection URL. Required for Google OAuth and SAML login flows. |

Key format:

- OAuth states: `oauth:state:{state_token}`
- SAML nonces: `saml:state:{nonce}`

TTL: 600 seconds (10 minutes) for both OAuth and SAML state tokens.
### Verification

After a successful OAuth or SAML login attempt, confirm in Redis that no stale keys remain:

```bash
redis-cli keys "oauth:state:*"   # should be empty after login completes
redis-cli keys "saml:state:*"    # should be empty after login completes
```

If Redis is not configured (`REDIS_URL` empty), the platform logs:

```
REDIS_URL is not set — OAuth CSRF state store is disabled. Google OAuth will be unavailable until Redis is configured.
```

Google OAuth and SAML login will return HTTP 503 until Redis is configured.
## 3. DLP Inference Fail-Closed Mode

### What it does

The Platform hosts GPU-accelerated DLP inference microservices (NER and DeBERTa contextual classification). When a microservice is unavailable (crash, OOM, slow startup), the platform must decide whether to allow the request through unscanned or block it.
Fail-closed (`closed`, default): When the inference microservice is unavailable, the DLP check is treated as a block — the request is denied rather than passed unscanned. This is the safe production default; it prevents sensitive data from reaching AI providers when DLP is degraded.

Fail-open (`open`): Allows requests through when the inference microservice is unavailable, without DLP scanning. This matches the legacy behaviour and is suitable only for environments where availability takes precedence over data protection.
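A sketch of the fail-mode decision, assuming a hypothetical `run_inference` callable standing in for the GPU microservice client and a hypothetical `DLPUnavailableError` exception; the platform's real client classes are `MicroserviceNERDetector` and `DeBERTaValidatorClient`:

```python
import os

class DLPUnavailableError(Exception):
    """Raised when DLP must block because inference is down (hypothetical name)."""

def scan_or_fail(text: str, run_inference) -> dict:
    """Illustrative fail-mode handling around an inference call.

    Any connection error from the microservice is handled according to
    DLP_INFERENCE_FAIL_MODE: block (closed, default) or pass unscanned (open).
    """
    mode = os.environ.get("DLP_INFERENCE_FAIL_MODE", "closed")
    try:
        return run_inference(text)
    except ConnectionError:
        if mode == "open":
            # Legacy behaviour: let the request through unscanned.
            return {"entities": [], "scanned": False}
        # Default: fail closed. The caller turns this into a blocked request.
        raise DLPUnavailableError("DLP inference unavailable — failing closed")
```
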
### Configuration

```bash
DLP_INFERENCE_FAIL_MODE=closed   # default — block on GPU microservice unavailability
# or
DLP_INFERENCE_FAIL_MODE=open     # legacy — allow unscanned (not recommended for production)
```

| Variable | Default | Values | Description |
|---|---|---|---|
| `DLP_INFERENCE_FAIL_MODE` | `closed` | `closed`, `open` | Controls `MicroserviceNERDetector` and `DeBERTaValidatorClient` behaviour when the inference microservice is unreachable. |
### Verification

With `DLP_INFERENCE_FAIL_MODE=closed`, when the NER microservice is stopped, requests that would trigger DLP scanning should receive an error response rather than passing through. Check the platform logs for:

```
DLP inference unavailable — failing closed (blocking request)
```

## 4. mTLS Full Chain Verification
### What it does

The Platform internal API routes (`/v1/internal/`) use mTLS to authenticate inbound calls from Hybrid Outposts and other internal services. Previously, verification required specifying a root CA certificate path and optionally an intermediate certificate path as separate files.
Full chain verification (`MTLS_CA_BUNDLE`) allows a single PEM file containing the complete CA chain (root + optional intermediates concatenated) to be used for trust anchoring. When set:

- The bundle is parsed into a trusted CA pool.
- Leaf certificate chains are verified against the pool.
- `MTLS_CA_BUNDLE` takes precedence over the legacy `CLOUD_CA_CERT_PATH` + `CLOUD_CA_INTERMEDIATE_PATH` configuration.
This simplifies certificate rotation (update one bundle file) and supports complex PKI hierarchies (multi-level intermediate chains).
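For illustration, splitting a bundle into its individual certificates can be sketched in a few lines of Python. The helper name is hypothetical; the real verifier additionally parses each PEM block into an X.509 trust pool and validates the leaf's chain against it.

```python
def split_pem_bundle(pem_text: str) -> list[str]:
    """Split a concatenated PEM bundle into individual certificate blocks.

    Sketch only: returns each BEGIN/END CERTIFICATE block as a string.
    """
    certs, current, inside = [], [], False
    for line in pem_text.splitlines():
        stripped = line.strip()
        if stripped == "-----BEGIN CERTIFICATE-----":
            inside, current = True, [line]
        elif stripped == "-----END CERTIFICATE-----" and inside:
            current.append(line)
            certs.append("\n".join(current))
            inside = False
        elif inside:
            current.append(line)
    return certs
```
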
### Configuration

```bash
# New — full chain bundle (recommended)
MTLS_CA_BUNDLE=/etc/arbitex/certs/ca-bundle.pem

# Legacy — separate root and optional intermediate (still supported)
CLOUD_CA_CERT_PATH=/etc/arbitex/certs/root-ca.pem
CLOUD_CA_INTERMEDIATE_PATH=/etc/arbitex/certs/intermediate-ca.pem   # optional
```

| Variable | Default | Description |
|---|---|---|
| `MTLS_CA_BUNDLE` | `""` | Path to a PEM file containing one or more CA certificates (root + intermediates). Takes precedence over `CLOUD_CA_CERT_PATH` when set. |
| `CLOUD_CA_CERT_PATH` | `""` | Path to root CA PEM (legacy). Used when `MTLS_CA_BUNDLE` is not set. |
| `CLOUD_CA_INTERMEDIATE_PATH` | `""` | Path to intermediate CA PEM (legacy, optional). Used when `MTLS_CA_BUNDLE` is not set. |
Development note: If neither `MTLS_CA_BUNDLE` nor `CLOUD_CA_CERT_PATH` is set, internal endpoints are unavailable (HTTP 503) unless the development bypass is explicitly enabled via internal support configuration. Never disable mTLS for production deployments.
### Verification

After setting `MTLS_CA_BUNDLE`, confirm an outpost heartbeat is accepted:

```bash
curl --cert outpost.pem --key outpost.key --cacert ca-bundle.pem \
  https://platform/v1/orgs/{org_id}/outposts/{outpost_id}/heartbeat \
  -H "Content-Type: application/json" -d '{"version":"0.1.0","uptime":60,...}'
# Expect HTTP 200
```

## 5. SCIM Per-Org Tokens
### What it does

SCIM 2.0 provisioning (for syncing users and groups from an IdP like Okta or Azure AD) previously used a global bearer token shared across all organisations. A compromised token could expose provisioning access for every tenant.

Per-org SCIM tokens give each organisation an independently rotatable bearer token stored as a bcrypt hash. Key properties:

- The raw token is returned once at rotation time and cannot be recovered from the database.
- Only the bcrypt hash is persisted (`org_scim_tokens` table).
- One active token per org at a time — the active token has `rotated_at IS NULL`.
- Previous tokens are preserved with a `rotated_at` timestamp for audit history.
- Rotating one org’s token has zero effect on any other org.
### CRUD API

Rotate (generate new) token:

```
POST /v1/scim/orgs/{org_id}/token/rotate
Authorization: Bearer <admin-user-JWT>
```

Response (raw token shown once):

```json
{
  "token": "arb_scim_<random>",
  "created_at": "2026-03-12T14:00:00Z"
}
```

List tokens (history):

```
GET /v1/scim/orgs/{org_id}/tokens
Authorization: Bearer <admin-user-JWT>
```

Returns all token records (active + historical). The active token has `rotated_at: null`.

Delete all tokens (emergency revocation):

```
DELETE /v1/scim/orgs/{org_id}/tokens
Authorization: Bearer <admin-user-JWT>
```

Deactivates all tokens for the org. SCIM provisioning is suspended until a new token is rotated.
### Migration from global token

If you previously used the global `SCIM_BEARER_TOKEN` environment variable:

1. In the Admin Portal, navigate to Settings → SCIM for each org.
2. Click Rotate Token to generate a per-org token.
3. Update your IdP configuration (Okta, Azure AD, etc.) with the new token.
4. Repeat for each org.
5. After all orgs have been migrated, remove `SCIM_BEARER_TOKEN` from your platform environment.
The platform accepts per-org tokens independently of the global token. Both can coexist during migration.
### Verification

After rotating a token, the SCIM provisioning endpoint should accept requests authenticated with the new token:

```bash
curl https://platform/v1/scim/v2/Users \
  -H "Authorization: Bearer arb_scim_<your-token>"
# Expect HTTP 200 with user list
```

Requests using the old (pre-rotation) token should receive HTTP 401.