# Disaster recovery runbook
This runbook covers disaster recovery for the Arbitex platform. Unless a stricter severity applies (see the decision tree below), procedures target an RTO of 2 h and an RPO of 2 h, as set by the product owner.
For routine database migrations and rollbacks, see the Alembic rollback guide in the platform docs (`docs/ops/alembic-rollback.md`).
## Severity decision tree
Use this tree to determine the correct recovery path.

```text
Incident detected
├─ Data store unavailable?
│  ├─ PostgreSQL → §1 PostgreSQL recovery
│  ├─ Redis → §4 Redis recovery
│  └─ PVC (audit buffer / policy cache) → §3 PVC snapshot restore
├─ AKS cluster unreachable?
│  └─ §2 AKS cluster restore
└─ Multiple stores affected?
   └─ Follow sections in order: §1 → §3 → §4 → §2
```

| Severity | Definition | RTO | RPO | Escalation |
|---|---|---|---|---|
| P0 | Total platform outage — all tenants affected | 1 h | 1 h | Page on-call SRE + engineering lead |
| P1 | Single data store down or degraded — partial tenant impact | 2 h | 2 h | Page on-call SRE |
| P2 | Non-critical component degraded — no tenant data loss risk | 4 h | 4 h | Slack alert, next business day |
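For teams that script their paging flow, the tree above reduces to a simple lookup. The sketch below is illustrative only; the store identifiers are assumptions, not values emitted by any Arbitex tooling:

```bash
# Map the failing store to the runbook section to follow (sketch only)
route_incident() {
  case "$1" in
    postgresql) echo "§1 PostgreSQL recovery" ;;
    redis)      echo "§4 Redis recovery" ;;
    pvc)        echo "§3 PVC snapshot restore" ;;
    aks)        echo "§2 AKS cluster restore" ;;
    multiple)   echo "§1 → §3 → §4 → §2" ;;
    *)          echo "unknown — escalate to on-call SRE" ;;
  esac
}

route_incident postgresql   # prints the section to open first
```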
## 1. PostgreSQL recovery

### 1a. Backup schedule
| Method | Frequency | Retention | RPO coverage |
|---|---|---|---|
| pg_dump CronJob (Helm) | Every 2 h | 7 days (84 snapshots) | 2 h |
| Azure Backup for PostgreSQL Flexible Server | Continuous WAL + daily full | 35 days | Minutes (PITR) |
| Manual pre-migration snapshot | Before each Alembic migration | Until next migration verified | Point-in-time |
### 1b. Point-in-time recovery (PITR) — Azure Backup
Use PITR when you need to recover to a specific moment (e.g., data corruption detected).
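The target timestamp must be UTC in ISO-8601 form. If the corruption was spotted in local time, one way to derive a target a few minutes before it (a sketch assuming GNU `date`; on macOS install coreutils and use `gdate`) is:

```bash
# Detection time in the local zone; restore to 5 minutes before it, in UTC
DETECTED="2026-03-10 15:30:00 CET"
TARGET=$(date -u -d "$DETECTED - 5 minutes" +%Y-%m-%dT%H:%M:%SZ)
echo "$TARGET"   # 2026-03-10T14:25:00Z
```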
```bash
# 1. Identify the target timestamp (UTC)
TARGET="2026-03-10T14:30:00Z"

# 2. Restore via Azure CLI
az postgres flexible-server restore \
  --resource-group arbitex-prod-rg \
  --name arbitex-db-restored \
  --source-server arbitex-db \
  --restore-point-in-time "$TARGET"

# 3. Verify restored data
psql -h arbitex-db-restored.postgres.database.azure.com \
  -U arbitex_admin -d arbitex \
  -c "SELECT count(*) FROM audit_events WHERE created_at > '$TARGET';"

# 4. Update platform connection string
kubectl -n arbitex set env deployment/arbitex-platform \
  DATABASE_URL="postgresql://arbitex_admin:${DB_PASSWORD}@arbitex-db-restored.postgres.database.azure.com:5432/arbitex?sslmode=require"

# 5. Restart platform pods
kubectl -n arbitex rollout restart deployment/arbitex-platform

# 6. Verify Alembic migration state matches
kubectl -n arbitex exec deploy/arbitex-platform -- \
  python -m alembic current
```

### 1c. pg_dump / pg_restore
Use when Azure Backup is unavailable or for cross-region restores.
```bash
# Restore from latest CronJob backup
LATEST=$(kubectl -n arbitex get pvc arbitex-db-backup -o jsonpath='{.metadata.annotations.latest-dump}')

# Copy dump from PVC
kubectl -n arbitex cp \
  arbitex-db-backup-pod:/backups/"$LATEST" \
  /tmp/arbitex-restore.sql.gz

# Restore to target database
gunzip -c /tmp/arbitex-restore.sql.gz | \
  psql -h "$DB_HOST" -U arbitex_admin -d arbitex_restored

# Verify row counts on critical tables
psql -h "$DB_HOST" -U arbitex_admin -d arbitex_restored -c "
  SELECT 'tenants' AS tbl, count(*) FROM tenants
  UNION ALL SELECT 'audit_events', count(*) FROM audit_events
  UNION ALL SELECT 'policy_rules', count(*) FROM policy_rules;"
```

### 1d. Post-recovery checklist
- `alembic current` matches expected head revision
- Tenant count matches pre-incident baseline
- Audit event chain HMAC verification passes (`/v1/admin/audit/verify`)
- All API health endpoints return 200
- Update DNS / connection string if restored to new server
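The HMAC check in the list above walks the audit chain link by link. Conceptually it behaves like the sketch below (illustrative only, with a demo key and inline payloads; this is not the platform's actual implementation):

```bash
# Each event's MAC covers the previous MAC plus the event payload,
# so tampering with any event breaks every link after it.
KEY="demo-key"   # illustrative; the real key lives in Key Vault
prev=""
for payload in '{"id":1}' '{"id":2}' '{"id":3}'; do
  prev=$(printf '%s%s' "$prev" "$payload" | openssl dgst -sha256 -hmac "$KEY" | awk '{print $2}')
done
echo "chain head: $prev"
```

A verifier recomputes this walk over the stored events and compares the final MAC against the recorded chain head.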
## 2. AKS cluster restore

### 2a. Prerequisites
- Azure Backup for AKS enabled on the cluster
- Velero or Azure Backup extension installed
### 2b. Restore procedure

```bash
# 1. List available backups
az dataprotection backup-instance list \
  --resource-group arbitex-prod-rg \
  --vault-name arbitex-backup-vault \
  --query "[?contains(name,'aks')]"

# 2. Trigger restore
az dataprotection backup-instance restore trigger \
  --resource-group arbitex-prod-rg \
  --vault-name arbitex-backup-vault \
  --backup-instance-name arbitex-aks-backup \
  --restore-request-object restore-config.json

# 3. Verify cluster health
kubectl get nodes
kubectl -n arbitex get pods

# 4. Verify critical deployments
kubectl -n arbitex rollout status deployment/arbitex-platform
kubectl -n arbitex rollout status deployment/arbitex-cloud

# 5. Re-apply any secrets not captured by backup
kubectl -n arbitex get secret arbitex-secrets -o yaml | grep -c 'data:'
```

### 2c. Alternative — fresh cluster + Helm redeploy
If AKS backup is unavailable or corrupted:
```bash
# 1. Create new AKS cluster
az aks create --resource-group arbitex-prod-rg \
  --name arbitex-aks-restored \
  --node-count 3 --generate-ssh-keys

# 2. Get credentials
az aks get-credentials --resource-group arbitex-prod-rg \
  --name arbitex-aks-restored

# 3. Re-apply secrets from Key Vault
./scripts/sync-keyvault-secrets.sh

# 4. Helm install
helm install arbitex ./charts/arbitex-platform \
  -f values-prod.yaml \
  --namespace arbitex --create-namespace \
  --atomic --timeout 10m

# 5. Restore database (§1) and verify
```

## 3. PVC snapshot restore
### 3a. Identify affected PVCs
```bash
kubectl -n arbitex get pvc
# Expected PVCs:
#   arbitex-db-data      — PostgreSQL data
#   arbitex-audit-buffer — Audit event buffer
#   arbitex-redis-data   — Redis AOF/RDB
```

### 3b. Restore from Azure Disk snapshot
```bash
# 1. List snapshots
az snapshot list --resource-group arbitex-aks-nodes-rg \
  --query "[?contains(name,'arbitex')]" \
  --output table

# 2. Create disk from snapshot
az disk create --resource-group arbitex-aks-nodes-rg \
  --name arbitex-db-data-restored \
  --source arbitex-db-data-snap-20260310

# 3. Scale down the workload
kubectl -n arbitex scale deployment/arbitex-platform --replicas=0

# 4. Delete old PVC and PV (data already lost)
kubectl -n arbitex delete pvc arbitex-db-data
kubectl delete pv <old-pv-name>

# 5. Create new PV pointing to restored disk, then PVC
kubectl apply -f restored-pv.yaml
kubectl apply -f restored-pvc.yaml

# 6. Scale back up
kubectl -n arbitex scale deployment/arbitex-platform --replicas=3
```

### 3c. Post-restore verification
- Pod mounts PVC successfully (`kubectl describe pod`)
- Data integrity check for the specific store (SQL count, HMAC chain, etc.)
- No `CrashLoopBackOff` on dependent pods
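The `restored-pv.yaml` applied in step 5 of §3b is not shown elsewhere in this runbook. A minimal sketch, assuming the Azure Disk CSI driver and the `managed-csi` storage class (the subscription ID, capacity, and names are placeholders to adapt):

```yaml
# restored-pv.yaml — static PV bound to the disk created from the snapshot
apiVersion: v1
kind: PersistentVolume
metadata:
  name: arbitex-db-data-restored
spec:
  capacity:
    storage: 50Gi                        # match the restored disk size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain  # keep the disk even if the PV is deleted
  storageClassName: managed-csi
  csi:
    driver: disk.csi.azure.com
    volumeHandle: /subscriptions/<sub-id>/resourceGroups/arbitex-aks-nodes-rg/providers/Microsoft.Compute/disks/arbitex-db-data-restored
```

The matching `restored-pvc.yaml` then claims this PV explicitly via `volumeName: arbitex-db-data-restored`.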
## 4. Redis recovery

### 4a. Backup schedule
| Method | Frequency | Retention |
|---|---|---|
| Redis RDB snapshot | Every 1 h | 24 snapshots |
| Redis AOF (append-only file) | Continuous | Current session |
### 4b. Restore from RDB
```bash
# 1. Scale down Redis
kubectl -n arbitex scale statefulset/arbitex-redis --replicas=0

# 2. Copy RDB file onto the data PVC
#    (kubectl cp needs a running pod: with Redis scaled to 0, mount
#    arbitex-redis-data in a temporary helper pod and copy into it)
kubectl -n arbitex cp /tmp/dump.rdb arbitex-redis-0:/data/dump.rdb

# 3. Scale up — Redis loads /data/dump.rdb at startup
kubectl -n arbitex scale statefulset/arbitex-redis --replicas=1

# 4. Verify
kubectl -n arbitex exec arbitex-redis-0 -- redis-cli INFO keyspace
```

### 4c. Impact of Redis loss
Redis stores ephemeral caches (rate limits, session tokens, bloom filter state). Full loss means:
- Active sessions invalidated — users must re-authenticate
- Rate limit counters reset — brief window of no rate limiting
- Bloom filter rebuilds on next startup (may take 30–60 s)
Redis loss does not cause data loss. All persistent state is in PostgreSQL.
## 5. Recovery validation

After any recovery, run the full validation suite:
```bash
# Health endpoints
curl -sf https://api.arbitex.io/health | jq .
curl -sf https://api.arbitex.io/v1/admin/audit/verify | jq .

# Smoke tests
kubectl -n arbitex exec deploy/arbitex-platform -- \
  python -m pytest tests/smoke/ -x --timeout=60

# Audit chain integrity
curl -sf https://api.arbitex.io/v1/admin/audit/verify \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.chain_valid'
```

## Recovery timeline target
| Phase | Target | Cumulative |
|---|---|---|
| Detection + triage | 15 min | 15 min |
| Restore data store(s) | 60 min | 1 h 15 min |
| Verify + smoke test | 30 min | 1 h 45 min |
| DNS / traffic cutover | 15 min | 2 h |
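The cumulative column is a running sum of the phase targets; the four phases together consume exactly the 2 h RTO budget, with no slack (a quick check):

```bash
# Phase targets in minutes: detection, restore, verify, cutover
total=0
for phase in 15 60 30 15; do
  total=$((total + phase))
done
echo "total: ${total} min"   # 120 min = 2 h
```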
## 6. Backup CronJob reference

Add to Helm values to enable automated pg_dump:
```yaml
backup:
  enabled: true
  schedule: "0 */2 * * *"   # every 2 hours
  retention:
    count: 84               # 7 days × 12 per day
  storage:
    pvc: arbitex-db-backup
    size: 50Gi
  command: |
    pg_dump -Fc -Z 6 \
      -h "$DATABASE_HOST" -U "$DATABASE_USER" "$DATABASE_NAME" \
      > /backups/arbitex-$(date +%Y%m%d-%H%M%S).dump
```

PVC snapshot policy (Azure):
```bash
az disk snapshot-policy create \
  --resource-group arbitex-aks-nodes-rg \
  --name arbitex-pvc-snap-policy \
  --schedule "every-2h" \
  --retention-count 84
```
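Both retention counts above come from the same arithmetic: one backup every 2 hours is 12 per day, kept for a 7-day coverage window:

```bash
# Derive the retention count from the backup interval and coverage window
INTERVAL_HOURS=2
COVERAGE_DAYS=7
RETAIN=$(( (24 / INTERVAL_HOURS) * COVERAGE_DAYS ))
echo "$RETAIN"   # 84
```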