
Disaster recovery runbook

This runbook covers disaster recovery for the Arbitex platform. All procedures target RTO 2 h / RPO 2 h as set by the PO; per-severity targets are listed in the severity table below.

For routine database migrations and rollback, see the Alembic rollback runbook in the platform repo (docs/ops/alembic-rollback.md).


Decision tree

Use this tree to determine the correct recovery path.

```
Incident detected
├─ Data store unavailable?
│  ├─ PostgreSQL → §1 PostgreSQL recovery
│  ├─ Redis → §4 Redis recovery
│  └─ PVC (audit buffer / policy cache) → §3 PVC snapshot restore
├─ AKS cluster unreachable?
│  └─ §2 AKS cluster restore
└─ Multiple stores affected?
   └─ Follow sections in order: §1 → §3 → §4 → §2
```
| Severity | Definition | RTO | RPO | Escalation |
|---|---|---|---|---|
| P0 | Total platform outage — all tenants affected | 1 h | 1 h | Page on-call SRE + engineering lead |
| P1 | Single data store down or degraded — partial tenant impact | 2 h | 2 h | Page on-call SRE |
| P2 | Non-critical component degraded — no tenant data loss risk | 4 h | 4 h | Slack alert, next business day |

1. PostgreSQL recovery

| Method | Frequency | Retention | RPO coverage |
|---|---|---|---|
| pg_dump CronJob (Helm) | Every 2 h | 7 days (84 snapshots) | 2 h |
| Azure Backup for PostgreSQL Flexible Server | Continuous WAL + daily full | 35 days | Minutes (PITR) |
| Manual pre-migration snapshot | Before each Alembic migration | Until next migration verified | Point-in-time |
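The retention figure in the table follows directly from the schedule; a quick arithmetic check:

```shell
# Dumps run every 2 h and are retained for 7 days:
per_day=$(( 24 / 2 ))      # 12 dumps per day
total=$(( per_day * 7 ))   # snapshots retained over the window
echo "$total"              # prints 84
```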

1b. Point-in-time recovery (PITR) — Azure Backup


Use PITR when you need to recover to a specific moment (e.g., data corruption detected).

```shell
# 1. Identify the target timestamp (UTC)
TARGET="2026-03-10T14:30:00Z"

# 2. Restore via Azure CLI
az postgres flexible-server restore \
  --resource-group arbitex-prod-rg \
  --name arbitex-db-restored \
  --source-server arbitex-db \
  --restore-point-in-time "$TARGET"

# 3. Verify restored data (should return 0: nothing newer than the restore point)
psql -h arbitex-db-restored.postgres.database.azure.com \
  -U arbitex_admin -d arbitex \
  -c "SELECT count(*) FROM audit_events WHERE created_at > '$TARGET';"

# 4. Update platform connection string
kubectl -n arbitex set env deployment/arbitex-platform \
  DATABASE_URL="postgresql://arbitex_admin:${DB_PASSWORD}@arbitex-db-restored.postgres.database.azure.com:5432/arbitex?sslmode=require"

# 5. Restart platform pods
kubectl -n arbitex rollout restart deployment/arbitex-platform

# 6. Verify Alembic migration state matches
kubectl -n arbitex exec deploy/arbitex-platform -- \
  python -m alembic current
```
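Before triggering the restore, it is worth checking that the target timestamp parses and lies in the past. A minimal guard sketch; it assumes GNU `date`, and `NOW` is fixed here for illustration (in practice use `date -u +%Y-%m-%dT%H:%M:%SZ`):

```shell
TARGET="2026-03-10T14:30:00Z"
NOW="2026-03-10T16:00:00Z"
t=$(date -u -d "$TARGET" +%s) || { echo "unparseable target"; exit 1; }
n=$(date -u -d "$NOW" +%s)
if [ "$t" -lt "$n" ]; then
  echo "target ok"
else
  echo "refusing: target is not in the past"
fi
```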

Use this method when Azure Backup is unavailable, or for cross-region restores.

```shell
# Restore from latest CronJob backup
LATEST=$(kubectl -n arbitex get pvc arbitex-db-backup -o jsonpath='{.metadata.annotations.latest-dump}')

# Copy dump from PVC
kubectl -n arbitex cp \
  arbitex-db-backup-pod:/backups/"$LATEST" \
  /tmp/arbitex-restore.sql.gz

# Restore to target database
gunzip -c /tmp/arbitex-restore.sql.gz | \
  psql -h "$DB_HOST" -U arbitex_admin -d arbitex_restored

# Verify row counts on critical tables
psql -h "$DB_HOST" -U arbitex_admin -d arbitex_restored -c "
SELECT 'tenants' AS tbl, count(*) FROM tenants
UNION ALL
SELECT 'audit_events', count(*) FROM audit_events
UNION ALL
SELECT 'policy_rules', count(*) FROM policy_rules;
"
```
  • Alembic current matches expected head revision
  • Tenant count matches pre-incident baseline
  • Audit event chain HMAC verification passes (/v1/admin/audit/verify)
  • All API health endpoints return 200
  • Update DNS / connection string if restored to new server

2. AKS cluster restore

Prerequisites:

  • Azure Backup for AKS enabled on the cluster
  • Velero or Azure Backup extension installed
```shell
# 1. List available backups
az dataprotection backup-instance list \
  --resource-group arbitex-prod-rg \
  --vault-name arbitex-backup-vault \
  --query "[?contains(name,'aks')]"

# 2. Trigger restore
az dataprotection backup-instance restore trigger \
  --resource-group arbitex-prod-rg \
  --vault-name arbitex-backup-vault \
  --backup-instance-name arbitex-aks-backup \
  --restore-request-object restore-config.json

# 3. Verify cluster health
kubectl get nodes
kubectl -n arbitex get pods

# 4. Verify critical deployments
kubectl -n arbitex rollout status deployment/arbitex-platform
kubectl -n arbitex rollout status deployment/arbitex-cloud

# 5. Confirm secrets survived the restore; re-apply from Key Vault any that did not
kubectl -n arbitex get secret arbitex-secrets -o yaml | grep -c 'data:'
```
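Step 5 only confirms one secret exists; a slightly stronger check is to compare the secrets present against an expected list. A sketch; the expected names below are assumptions, substitute the real list for your release:

```shell
expected="arbitex-secrets arbitex-db-credentials"   # hypothetical list
present="arbitex-secrets"   # in practice: kubectl -n arbitex get secrets -o name | sed 's|secret/||'
for s in $expected; do
  case " $present " in
    *" $s "*) echo "ok: $s" ;;
    *)        echo "MISSING: $s" ;;
  esac
done
```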

2c. Alternative — fresh cluster + Helm redeploy


If AKS backup is unavailable or corrupted:

```shell
# 1. Create new AKS cluster
az aks create --resource-group arbitex-prod-rg \
  --name arbitex-aks-restored \
  --node-count 3 --generate-ssh-keys

# 2. Get credentials
az aks get-credentials --resource-group arbitex-prod-rg \
  --name arbitex-aks-restored

# 3. Re-apply secrets from Key Vault
./scripts/sync-keyvault-secrets.sh

# 4. Helm install
helm install arbitex ./charts/arbitex-platform \
  -f values-prod.yaml \
  --namespace arbitex --create-namespace \
  --atomic --timeout 10m

# 5. Restore database (§1) and verify
```

3. PVC snapshot restore

```shell
kubectl -n arbitex get pvc
# Expected PVCs:
#   arbitex-db-data      — PostgreSQL data
#   arbitex-audit-buffer — Audit event buffer
#   arbitex-redis-data   — Redis AOF/RDB
```
```shell
# 1. List snapshots
az snapshot list --resource-group arbitex-aks-nodes-rg \
  --query "[?contains(name,'arbitex')]" \
  --output table

# 2. Create disk from snapshot
az disk create --resource-group arbitex-aks-nodes-rg \
  --name arbitex-db-data-restored \
  --source arbitex-db-data-snap-20260310

# 3. Scale down the workload
kubectl -n arbitex scale deployment/arbitex-platform --replicas=0

# 4. Delete the old PVC and PV (their data is already lost)
kubectl -n arbitex delete pvc arbitex-db-data
kubectl delete pv <old-pv-name>

# 5. Create a new PV pointing at the restored disk, then a matching PVC
kubectl apply -f restored-pv.yaml
kubectl apply -f restored-pvc.yaml

# 6. Scale back up
kubectl -n arbitex scale deployment/arbitex-platform --replicas=3
```
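Step 5 references restored-pv.yaml without showing it. A hedged sketch of what a static PV bound to the restored disk can look like with the Azure disk CSI driver; the subscription ID, size, and storage class are placeholders to adapt to your environment:

```yaml
# restored-pv.yaml (illustrative sketch only)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: arbitex-db-data-restored
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-csi
  csi:
    driver: disk.csi.azure.com
    fsType: ext4
    volumeHandle: /subscriptions/<subscription-id>/resourceGroups/arbitex-aks-nodes-rg/providers/Microsoft.Compute/disks/arbitex-db-data-restored
```

The matching restored-pvc.yaml then claims this PV by name (`volumeName: arbitex-db-data-restored`) with the same storage class and size.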
  • Pod mounts PVC successfully (kubectl describe pod)
  • Data integrity check for the specific store (SQL count, HMAC chain, etc.)
  • No CrashLoopBackOff on dependent pods

4. Redis recovery

| Method | Frequency | Retention |
|---|---|---|
| Redis RDB snapshot | Every 1 h | 24 snapshots |
| Redis AOF (append-only file) | Continuous | Current session |
```shell
# 1. Scale down Redis
kubectl -n arbitex scale statefulset/arbitex-redis --replicas=0

# 2. Copy the RDB file onto the data PVC
#    (kubectl cp needs a running pod that mounts /data; use a temporary
#    helper pod if the StatefulSet is fully scaled down)
kubectl -n arbitex cp /tmp/dump.rdb arbitex-redis-0:/data/dump.rdb

# 3. Scale up
kubectl -n arbitex scale statefulset/arbitex-redis --replicas=1

# 4. Verify
kubectl -n arbitex exec arbitex-redis-0 -- redis-cli INFO keyspace
```
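The `INFO keyspace` output in step 4 can be checked mechanically. A pure-shell sketch; the sample line stands in for the real `redis-cli` output, and the key count is illustrative:

```shell
info="db0:keys=1523,expires=10,avg_ttl=0"   # sample `redis-cli INFO keyspace` line
keys=${info#*keys=}    # strip up to and including "keys="
keys=${keys%%,*}       # strip from the first comma on
if [ "$keys" -gt 0 ]; then
  echo "restore loaded $keys keys"
else
  echo "keyspace empty: restore failed"
fi
```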

Redis stores ephemeral caches (rate limits, session tokens, bloom filter state). Full loss means:

  • Active sessions invalidated — users must re-authenticate
  • Rate limit counters reset — brief window of no rate limiting
  • Bloom filter rebuilds on next startup (may take 30–60 s)

Redis loss does not cause data loss. All persistent state is in PostgreSQL.


After any recovery, run the full validation suite:

```shell
# Health endpoints
curl -sf https://api.arbitex.io/health | jq .
curl -sf https://api.arbitex.io/v1/admin/audit/verify | jq .

# Smoke tests
kubectl -n arbitex exec deploy/arbitex-platform -- \
  python -m pytest tests/smoke/ -x --timeout=60

# Audit chain integrity
curl -sf https://api.arbitex.io/v1/admin/audit/verify \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.chain_valid'
```
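The cutover decision can be gated on that `chain_valid` field. A sketch using a sample response body in place of the real curl output; the jq-free string match is deliberately simple and is brittle against reformatted JSON:

```shell
resp='{"chain_valid": true}'   # sample /v1/admin/audit/verify body
if printf '%s' "$resp" | grep -q '"chain_valid": true'; then
  echo "audit chain OK: proceed to cutover"
else
  echo "audit chain BROKEN: halt and escalate"
fi
```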
| Phase | Target | Cumulative |
|---|---|---|
| Detection + triage | 15 min | 15 min |
| Restore data store(s) | 60 min | 1 h 15 min |
| Verify + smoke test | 30 min | 1 h 45 min |
| DNS / traffic cutover | 15 min | 2 h |
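The cumulative column is a running sum of the per-phase targets; a quick check that the phases fit within the 2 h RTO:

```shell
total=$(( 15 + 60 + 30 + 15 ))   # minutes across all four phases
echo "${total} min"              # 120 min = 2 h, the P1 RTO
```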

Add the following to the Helm values to enable the automated pg_dump CronJob:

```yaml
backup:
  enabled: true
  schedule: "0 */2 * * *"   # every 2 hours
  retention:
    count: 84               # 7 days × 12 per day
  storage:
    pvc: arbitex-db-backup
    size: 50Gi
  command: |
    # Plain-format dump piped through gzip, so the §1 restore path
    # (gunzip -c … | psql) works as written
    pg_dump -h "$DATABASE_HOST" -U "$DATABASE_USER" "$DATABASE_NAME" \
      | gzip -6 > /backups/arbitex-$(date +%Y%m%d-%H%M%S).sql.gz
```

PVC snapshot policy (Azure):

```shell
az disk snapshot-policy create \
  --resource-group arbitex-aks-nodes-rg \
  --name arbitex-pvc-snap-policy \
  --schedule "every-2h" \
  --retention-count 84
```