# Disaster recovery runbook
This runbook covers disaster recovery for the Arbitex platform. Unless a stricter severity applies (see the decision tree below), procedures target an RTO of 2 h and an RPO of 2 h, as set by the product owner.
For routine database migrations and rollbacks, see the Alembic rollback guide in the platform docs (`docs/ops/alembic-rollback.md`).
## Severity decision tree
Use this tree to determine the correct recovery path.

```text
Incident detected
├─ Data store unavailable?
│  ├─ PostgreSQL → §1 PostgreSQL recovery
│  ├─ Redis → §4 Redis recovery
│  └─ PVC (audit buffer / policy cache) → §3 PVC snapshot restore
├─ AKS cluster unreachable?
│  └─ §2 AKS cluster restore
└─ Multiple stores affected?
   └─ Follow sections in order: §1 → §3 → §4 → §2
```

| Severity | Definition | RTO | RPO | Escalation |
|---|---|---|---|---|
| P0 | Total platform outage — all tenants affected | 1 h | 1 h | Page on-call SRE + engineering lead |
| P1 | Single data store down or degraded — partial tenant impact | 2 h | 2 h | Page on-call SRE |
| P2 | Non-critical component degraded — no tenant data loss risk | 4 h | 4 h | Slack alert, next business day |
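For teams that script their paging flow, the tree above reduces to a simple lookup. The sketch below is illustrative only; the store identifiers are assumptions, not values emitted by any Arbitex tooling:

```bash
# Map the failing store to the runbook section to follow (sketch only)
route_incident() {
  case "$1" in
    postgresql) echo "§1 PostgreSQL recovery" ;;
    redis)      echo "§4 Redis recovery" ;;
    pvc)        echo "§3 PVC snapshot restore" ;;
    aks)        echo "§2 AKS cluster restore" ;;
    multiple)   echo "§1 → §3 → §4 → §2" ;;
    *)          echo "unknown — escalate to on-call SRE" ;;
  esac
}

route_incident postgresql   # prints the section to open first
```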
## 1. PostgreSQL recovery

### 1a. Backup schedule
| Method | Frequency | Retention | RPO coverage |
|---|---|---|---|
| pg_dump CronJob (Helm) | Every 2 h | 7 days (84 snapshots) | 2 h |
| Azure Backup for PostgreSQL Flexible Server | Continuous WAL + daily full | 35 days | Minutes (PITR) |
| Manual pre-migration snapshot | Before each Alembic migration | Until next migration verified | Point-in-time |
### 1b. Point-in-time recovery (PITR) — Azure Backup
Use PITR when you need to recover to a specific moment (e.g., data corruption detected).
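The target timestamp must be UTC in ISO-8601 form. If the corruption was spotted in local time, one way to derive a target a few minutes before it (a sketch assuming GNU `date`; on macOS install coreutils and use `gdate`) is:

```bash
# Detection time in the local zone; restore to 5 minutes before it, in UTC
DETECTED="2026-03-10 15:30:00 CET"
TARGET=$(date -u -d "$DETECTED - 5 minutes" +%Y-%m-%dT%H:%M:%SZ)
echo "$TARGET"   # 2026-03-10T14:25:00Z
```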
```bash
# 1. Identify the target timestamp (UTC)
TARGET="2026-03-10T14:30:00Z"

# 2. Restore via Azure CLI
az postgres flexible-server restore \
  --resource-group arbitex-prod-rg \
  --name arbitex-db-restored \
  --source-server arbitex-db \
  --restore-point-in-time "$TARGET"

# 3. Verify restored data
psql -h arbitex-db-restored.postgres.database.azure.com \
  -U arbitex_admin -d arbitex \
  -c "SELECT count(*) FROM audit_events WHERE created_at > '$TARGET';"

# 4. Update platform connection string
kubectl -n arbitex set env deployment/arbitex-platform \
  DATABASE_URL="postgresql://arbitex_admin:${DB_PASSWORD}@arbitex-db-restored.postgres.database.azure.com:5432/arbitex?sslmode=require"

# 5. Restart platform pods
kubectl -n arbitex rollout restart deployment/arbitex-platform

# 6. Verify Alembic migration state matches
kubectl -n arbitex exec deploy/arbitex-platform -- \
  python -m alembic current
```

### 1c. pg_dump / pg_restore
Use when Azure Backup is unavailable or for cross-region restores.
```bash
# Restore from latest CronJob backup
LATEST=$(kubectl -n arbitex get pvc arbitex-db-backup -o jsonpath='{.metadata.annotations.latest-dump}')

# Copy dump from PVC
kubectl -n arbitex cp \
  arbitex-db-backup-pod:/backups/"$LATEST" \
  /tmp/arbitex-restore.sql.gz

# Restore to target database
gunzip -c /tmp/arbitex-restore.sql.gz | \
  psql -h "$DB_HOST" -U arbitex_admin -d arbitex_restored

# Verify row counts on critical tables
psql -h "$DB_HOST" -U arbitex_admin -d arbitex_restored -c "
  SELECT 'tenants' AS tbl, count(*) FROM tenants
  UNION ALL SELECT 'audit_events', count(*) FROM audit_events
  UNION ALL SELECT 'policy_rules', count(*) FROM policy_rules;"
```

### 1d. Post-recovery checklist
- `alembic current` matches expected head revision
- Tenant count matches pre-incident baseline
- Audit event chain HMAC verification passes (`/v1/admin/audit/verify`)
- All API health endpoints return 200
- Update DNS / connection string if restored to new server
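The HMAC check in the list above walks the audit chain link by link. Conceptually it behaves like the sketch below (illustrative only, with a demo key and inline payloads; this is not the platform's actual implementation):

```bash
# Each event's MAC covers the previous MAC plus the event payload,
# so tampering with any event breaks every link after it.
KEY="demo-key"   # illustrative; the real key lives in Key Vault
prev=""
for payload in '{"id":1}' '{"id":2}' '{"id":3}'; do
  prev=$(printf '%s%s' "$prev" "$payload" | openssl dgst -sha256 -hmac "$KEY" | awk '{print $2}')
done
echo "chain head: $prev"
```

A verifier recomputes this walk over the stored events and compares the final MAC against the recorded chain head.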
## 2. AKS cluster restore

### 2a. Prerequisites
- Azure Backup for AKS enabled on the cluster
- Velero or Azure Backup extension installed
### 2b. Restore procedure

```bash
# 1. List available backups
az dataprotection backup-instance list \
  --resource-group arbitex-prod-rg \
  --vault-name arbitex-backup-vault \
  --query "[?contains(name,'aks')]"

# 2. Trigger restore
az dataprotection backup-instance restore trigger \
  --resource-group arbitex-prod-rg \
  --vault-name arbitex-backup-vault \
  --backup-instance-name arbitex-aks-backup \
  --restore-request-object restore-config.json

# 3. Verify cluster health
kubectl get nodes
kubectl -n arbitex get pods

# 4. Verify critical deployments
kubectl -n arbitex rollout status deployment/arbitex-platform
kubectl -n arbitex rollout status deployment/arbitex-cloud

# 5. Re-apply any secrets not captured by backup
kubectl -n arbitex get secret arbitex-secrets -o yaml | grep -c 'data:'
```

### 2c. Alternative — fresh cluster + Helm redeploy
If AKS backup is unavailable or corrupted:
```bash
# 1. Create new AKS cluster
az aks create --resource-group arbitex-prod-rg \
  --name arbitex-aks-restored \
  --node-count 3 --generate-ssh-keys

# 2. Get credentials
az aks get-credentials --resource-group arbitex-prod-rg \
  --name arbitex-aks-restored

# 3. Re-apply secrets from Key Vault
./scripts/sync-keyvault-secrets.sh

# 4. Helm install
helm install arbitex ./charts/arbitex-platform \
  -f values-prod.yaml \
  --namespace arbitex --create-namespace \
  --atomic --timeout 10m

# 5. Restore database (§1) and verify
```

## 3. PVC snapshot restore
### 3a. Identify affected PVCs
```bash
kubectl -n arbitex get pvc
# Expected PVCs:
#   arbitex-db-data      — PostgreSQL data
#   arbitex-audit-buffer — Audit event buffer
#   arbitex-redis-data   — Redis AOF/RDB
```

### 3b. Restore from Azure Disk snapshot
```bash
# 1. List snapshots
az snapshot list --resource-group arbitex-aks-nodes-rg \
  --query "[?contains(name,'arbitex')]" \
  --output table

# 2. Create disk from snapshot
az disk create --resource-group arbitex-aks-nodes-rg \
  --name arbitex-db-data-restored \
  --source arbitex-db-data-snap-20260310

# 3. Scale down the workload
kubectl -n arbitex scale deployment/arbitex-platform --replicas=0

# 4. Delete old PVC and PV (data already lost)
kubectl -n arbitex delete pvc arbitex-db-data
kubectl delete pv <old-pv-name>

# 5. Create new PV pointing to restored disk, then PVC
kubectl apply -f restored-pv.yaml
kubectl apply -f restored-pvc.yaml

# 6. Scale back up
kubectl -n arbitex scale deployment/arbitex-platform --replicas=3
```

### 3c. Post-restore verification
- Pod mounts PVC successfully (`kubectl describe pod`)
- Data integrity check for the specific store (SQL count, HMAC chain, etc.)
- No `CrashLoopBackOff` on dependent pods
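The `restored-pv.yaml` applied in step 5 of §3b is not shown elsewhere in this runbook. A minimal sketch, assuming the Azure Disk CSI driver and the `managed-csi` storage class (the subscription ID, capacity, and names are placeholders to adapt):

```yaml
# restored-pv.yaml — static PV bound to the disk created from the snapshot
apiVersion: v1
kind: PersistentVolume
metadata:
  name: arbitex-db-data-restored
spec:
  capacity:
    storage: 50Gi                        # match the restored disk size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain  # keep the disk even if the PV is deleted
  storageClassName: managed-csi
  csi:
    driver: disk.csi.azure.com
    volumeHandle: /subscriptions/<sub-id>/resourceGroups/arbitex-aks-nodes-rg/providers/Microsoft.Compute/disks/arbitex-db-data-restored
```

The matching `restored-pvc.yaml` then claims this PV explicitly via `volumeName: arbitex-db-data-restored`.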
## 4. Redis recovery

### 4a. Backup schedule
| Method | Frequency | Retention |
|---|---|---|
| Redis RDB snapshot | Every 1 h | 24 snapshots |
| Redis AOF (append-only file) | Continuous | Current session |
### 4b. Restore from RDB
```bash
# 1. Scale down Redis
kubectl -n arbitex scale statefulset/arbitex-redis --replicas=0

# 2. Copy RDB file onto the data PVC
#    (kubectl cp needs a running pod: with Redis scaled to 0, mount
#    arbitex-redis-data in a temporary helper pod and copy into it)
kubectl -n arbitex cp /tmp/dump.rdb arbitex-redis-0:/data/dump.rdb

# 3. Scale up — Redis loads /data/dump.rdb at startup
kubectl -n arbitex scale statefulset/arbitex-redis --replicas=1

# 4. Verify
kubectl -n arbitex exec arbitex-redis-0 -- redis-cli INFO keyspace
```

### 4c. Impact of Redis loss
Redis stores ephemeral caches (rate limits, session tokens, bloom filter state). Full loss means:
- Active sessions invalidated — users must re-authenticate
- Rate limit counters reset — brief window of no rate limiting
- Bloom filter rebuilds on next startup (may take 30–60 s)
Redis loss does not cause data loss. All persistent state is in PostgreSQL.
## 5. Recovery validation

After any recovery, run the full validation suite:
```bash
# Health endpoints
curl -sf https://api.arbitex.io/health | jq .
curl -sf https://api.arbitex.io/v1/admin/audit/verify | jq .

# Smoke tests
kubectl -n arbitex exec deploy/arbitex-platform -- \
  python -m pytest tests/smoke/ -x --timeout=60

# Audit chain integrity
curl -sf https://api.arbitex.io/v1/admin/audit/verify \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.chain_valid'
```

## Recovery timeline target
| Phase | Target | Cumulative |
|---|---|---|
| Detection + triage | 15 min | 15 min |
| Restore data store(s) | 60 min | 1 h 15 min |
| Verify + smoke test | 30 min | 1 h 45 min |
| DNS / traffic cutover | 15 min | 2 h |
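The cumulative column is a running sum of the phase targets; the four phases together consume exactly the 2 h RTO budget, with no slack (a quick check):

```bash
# Phase targets in minutes: detection, restore, verify, cutover
total=0
for phase in 15 60 30 15; do
  total=$((total + phase))
done
echo "total: ${total} min"   # 120 min = 2 h
```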
## 6. Backup CronJob reference

Add to Helm values to enable automated pg_dump:
```yaml
backup:
  enabled: true
  schedule: "0 */2 * * *"   # every 2 hours
  retention:
    count: 84               # 7 days × 12 per day
  storage:
    pvc: arbitex-db-backup
    size: 50Gi
  command: |
    pg_dump -Fc -Z 6 \
      -h "$DATABASE_HOST" -U "$DATABASE_USER" "$DATABASE_NAME" \
      > /backups/arbitex-$(date +%Y%m%d-%H%M%S).dump
```

PVC snapshot policy (Azure):
```bash
az disk snapshot-policy create \
  --resource-group arbitex-aks-nodes-rg \
  --name arbitex-pvc-snap-policy \
  --schedule "every-2h" \
  --retention-count 84
```
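Both retention counts above come from the same arithmetic: one backup every 2 hours is 12 per day, kept for a 7-day coverage window:

```bash
# Derive the retention count from the backup interval and coverage window
INTERVAL_HOURS=2
COVERAGE_DAYS=7
RETAIN=$(( (24 / INTERVAL_HOURS) * COVERAGE_DAYS ))
echo "$RETAIN"   # 84
```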