Outpost PVC recovery and dead-letter replay
This runbook covers recovery from outpost PVC (Persistent Volume Claim) failures. The outpost uses three persistent storage areas:
| PVC path | Purpose | Loss impact |
|---|---|---|
| policy_cache/ | Cached policy bundle from management plane | Outpost falls back to default policy until re-sync |
| audit_buffer/ | HMAC-chained audit events awaiting sync | Unsynced audit events lost (RPO risk) |
| dead_letter/ | Failed SIEM delivery batches as JSONL | SIEM gaps until replayed |
1. Policy bundle recovery
1a. Connected mode (management plane reachable)
When the outpost has mTLS connectivity to the platform management plane, the policy cache automatically re-syncs every 60 seconds.
    # 1. Verify outpost pod is running
    kubectl -n arbitex-outpost get pods

    # 2. Check policy sync status
    kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
      cat /data/policy_cache/policy_bundle.json | jq '.metadata.version'

    # 3. If PVC was recreated, the sync worker will repopulate automatically
    # Watch logs for confirmation:
    kubectl -n arbitex-outpost logs deploy/arbitex-outpost -f | grep "policy_sync"

    # Expected log line:
    # INFO policy_sync: Bundle refreshed, version=<N>, rules=<count>
If sync fails (certificate expired, network issue):
    # Check mTLS connectivity
    # (run through sh -c so $PLATFORM_MANAGEMENT_URL expands inside the pod, not in the local shell)
    kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
      sh -c 'curl -sf --cert /certs/client.pem --key /certs/client-key.pem \
        "https://$PLATFORM_MANAGEMENT_URL/v1/internal/policy-bundle"' | jq '.metadata'

    # Common issues:
    # - PLATFORM_MANAGEMENT_URL not set or unreachable
    # - Client certificate expired → rotate via cert-manager or manually
    # - CA bundle mismatch → verify CLOUD_CA_CERT_PATH
1b. Air-gap mode (no management plane)
In air-gapped deployments, the outpost bootstraps from a bundled default policy.
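As a rough illustration of the air-gap bootstrap rule and of a pre-flight check worth running before copying a bundle into the PVC, the sketch below models two things: when the bundled default applies (no cached bundle and no `PLATFORM_MANAGEMENT_URL`), and how to confirm that a bundle file parses and carries `metadata.version`. The function names are hypothetical and are not part of the outpost codebase; the bundle layout is assumed from the `jq '.metadata.version'` check used in this runbook.

```python
import json
from typing import Optional


def uses_default_bundle(cached_bundle_exists: bool, management_url: Optional[str]) -> bool:
    """True when the bundled default policy would apply, per the rule in this section."""
    # Assumption: an empty PLATFORM_MANAGEMENT_URL is treated the same as an unset one.
    return not cached_bundle_exists and not management_url


def bundle_version(bundle_text: str):
    """Return metadata.version from a bundle, mirroring the jq check in this runbook."""
    bundle = json.loads(bundle_text)  # raises ValueError on malformed JSON
    try:
        return bundle["metadata"]["version"]
    except (KeyError, TypeError) as exc:
        raise ValueError("bundle has no metadata.version field") from exc
```

Running `bundle_version(open("new-policy-bundle.json").read())` on the transfer machine before the copy step catches a truncated or malformed file before the outpost is restarted against it.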
    # 1. The outpost uses scripts/default-policy-bundle.json on first boot
    #    when no cached bundle exists and PLATFORM_MANAGEMENT_URL is unset

    # 2. To update policy in air-gap mode, copy a new bundle into the PVC:
    kubectl -n arbitex-outpost cp \
      new-policy-bundle.json \
      arbitex-outpost-0:/data/policy_cache/policy_bundle.json

    # 3. Restart the outpost to pick up the new bundle
    kubectl -n arbitex-outpost rollout restart deployment/arbitex-outpost

    # 4. Verify
    kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
      cat /data/policy_cache/policy_bundle.json | jq '.metadata.version'
Generating an offline policy bundle:
    # On a machine with management plane access:
    curl -sf --cert client.pem --key client-key.pem \
      https://platform.arbitex.io/v1/internal/policy-bundle \
      -o policy-bundle-export.json

    # Transfer to air-gapped environment via approved media
    # Then apply as shown above
2. Audit buffer recovery
The audit buffer stores HMAC-SHA256 chained JSONL records. The background sync worker POSTs batches to the platform endpoint /v1/internal/outpost-audit-sync.
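To make the idea of an HMAC-chained JSONL log concrete, here is a minimal sketch of one way such a chain can be built and verified. The record layout (`event`, `prev_hmac`, `hmac`), the all-zero genesis value, and the canonical JSON encoding are assumptions chosen for illustration; the outpost's actual on-disk format and its `verify_chain` helper may differ.

```python
import hashlib
import hmac
import json
from typing import List, Tuple

GENESIS = "0" * 64  # assumed prev_hmac value for the first record


def sign_event(key: bytes, prev_hmac: str, event: dict) -> dict:
    """Build a record whose HMAC covers the previous record's HMAC plus this event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    mac = hmac.new(key, (prev_hmac + payload).encode("utf-8"), hashlib.sha256).hexdigest()
    return {"event": event, "prev_hmac": prev_hmac, "hmac": mac}


def verify_lines(key: bytes, lines: List[str]) -> Tuple[bool, int]:
    """Return (valid, number of records verified before the first failure)."""
    prev = GENESIS
    count = 0
    for line in lines:
        record = json.loads(line)
        expected = sign_event(key, prev, record["event"])["hmac"]
        linked = record["prev_hmac"] == prev
        if not linked or not hmac.compare_digest(record["hmac"], expected):
            return False, count
        prev = record["hmac"]
        count += 1
    return True, count
```

Because each HMAC covers its predecessor, editing or removing a record in the middle of the file invalidates that record and everything after it. This is why a verification failure is reported together with the count of records that were still intact.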
2a. Verify buffer state
    # Check buffer size
    kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
      wc -l /data/audit_buffer/events.jsonl

    # Check HMAC chain integrity
    kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
      python -c "from outpost.audit.chain import verify_chain; \
        result = verify_chain('/data/audit_buffer/events.jsonl'); \
        print(f'Valid: {result.valid}, Events: {result.count}')"
2b. Drain buffer after PVC recovery
If the PVC was lost and recreated, the buffer is empty. Any events generated between the last successful sync and the PVC loss are not recoverable from the outpost. Check the platform for the last synced event to identify the gap.
    # On the platform side, find the last outpost sync timestamp
    # (the URL is quoted so the shell does not treat & as a background operator)
    curl -sf "https://api.arbitex.io/v1/admin/audit?source=outpost&limit=1&order=desc" \
      -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.[0].timestamp'
Events that the outpost generated before the PVC loss but never synced are lost. Document the gap window in the incident report.
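For the incident report, the gap window can be computed from the last synced timestamp and the time the PVC was lost. The sketch below assumes both values are ISO 8601 strings that carry a timezone (for example a trailing `Z`); the platform's actual timestamp format is not specified here, and `audit_gap` is a hypothetical helper name.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; older Python versions reject a trailing 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def audit_gap(last_synced: str, pvc_lost_at: str) -> dict:
    """Return the window in which audit events may have been lost."""
    start = parse_ts(last_synced)
    end = parse_ts(pvc_lost_at)
    if end < start:
        raise ValueError("PVC loss time precedes the last synced event")
    return {
        "gap_start": start.isoformat(),
        "gap_end": end.isoformat(),
        "gap_seconds": (end - start).total_seconds(),
    }
```

The returned window is an upper bound on the loss: events are only missing if the outpost actually generated traffic during that interval.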
2c. Force buffer drain
To force an immediate sync of buffered events:
    # Trigger manual sync
    kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
      python -c "from outpost.audit.sync import AuditSyncWorker; \
        worker = AuditSyncWorker(); \
        result = worker.flush(); \
        print(f'Synced: {result.synced}, Failed: {result.failed}')"
2d. Buffer ring rotation
The audit buffer has a maximum of 100,000 entries (configurable via max_audit_buffer_entries). When full, the oldest entries are rotated out. Ensure the sync worker is running to prevent data loss from rotation.
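The rotation behaviour can be pictured with a bounded deque. This is only an in-memory model of the idea, not the outpost's implementation: the class name and the `dropped` counter are hypothetical, and the real buffer is a JSONL file on the PVC.

```python
from collections import deque


class RingBuffer:
    """Bounded buffer that discards the oldest entry once it is full."""

    def __init__(self, max_entries: int = 100_000):
        self._entries = deque(maxlen=max_entries)
        self.dropped = 0  # entries rotated out before they were drained

    def append(self, event) -> None:
        if len(self._entries) == self._entries.maxlen:
            self.dropped += 1  # the deque is about to evict its oldest entry
        self._entries.append(event)

    def drain(self) -> list:
        """Hand all buffered entries to the caller and empty the buffer."""
        batch = list(self._entries)
        self._entries.clear()
        return batch
```

In this model, a non-zero `dropped` count means events were rotated out before a sync could pick them up, which is the condition the sync worker is meant to prevent.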
3. Dead-letter replay
Failed SIEM delivery batches are stored as JSONL files in dead_letter/. Each file represents a batch that failed to deliver to the configured SIEM endpoint (Splunk HEC or syslog).
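Since each batch is a JSONL file, it can be worth confirming that every line parses before replaying it, because a batch interrupted mid-write could end in a truncated line. The helper below is a hypothetical sketch for that check; it makes no assumption about the event schema beyond one JSON value per line.

```python
import json
from typing import Iterable, List, Tuple


def validate_batch(lines: Iterable[str]) -> Tuple[int, List[int]]:
    """Return (count of valid JSON lines, 1-based line numbers that failed to parse)."""
    valid = 0
    bad_lines = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad_lines.append(number)
        else:
            valid += 1
    return valid, bad_lines
```

Passing an open file handle, for example `validate_batch(open(path))`, works because file objects iterate line by line.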
3a. Inspect dead-letter queue
    # List dead-letter files
    kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
      ls -la /data/dead_letter/

    # Sample output:
    # -rw-r--r-- 1 app app 245K Mar 10 14:22 batch-20260310-142200.jsonl
    # -rw-r--r-- 1 app app 189K Mar 10 14:35 batch-20260310-143500.jsonl

    # Inspect a file
    kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
      head -3 /data/dead_letter/batch-20260310-142200.jsonl | jq .
3b. Replay against Splunk HEC
    # Replay a single batch
    # (read line by line; word-splitting $(cat ...) would break events that contain spaces)
    kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
      bash -c 'while IFS= read -r line; do
        curl -sf -X POST "$SIEM_DIRECT_SPLUNK_URL" \
          -H "Authorization: Splunk $SIEM_DIRECT_SPLUNK_TOKEN" \
          -H "Content-Type: application/json" \
          -d "$line"
      done < /data/dead_letter/batch-20260310-142200.jsonl'

    # Replay all dead-letter files
    # (a file is renamed to .replayed only if every event in it was accepted)
    kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
      bash -c 'for f in /data/dead_letter/*.jsonl; do
        echo "Replaying $f ..."
        ok=1
        while IFS= read -r line; do
          curl -sf -X POST "$SIEM_DIRECT_SPLUNK_URL" \
            -H "Authorization: Splunk $SIEM_DIRECT_SPLUNK_TOKEN" \
            -H "Content-Type: application/json" \
            -d "$line" || ok=0
        done < "$f"
        if [ "$ok" -eq 1 ]; then
          echo "Done: $f"
          mv "$f" "$f.replayed"
        else
          echo "Failed: $f (left in place for retry)"
        fi
      done'
3c. Replay against syslog
    kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
      bash -c 'for f in /data/dead_letter/*.jsonl; do
        echo "Replaying $f to syslog ..."
        while IFS= read -r line; do
          logger -n "$SIEM_DIRECT_SYSLOG_HOST" \
            -P "$SIEM_DIRECT_SYSLOG_PORT" \
            --rfc5424 -p local0.info "$line"
        done < "$f"
        mv "$f" "$f.replayed"
      done'
3d. Air-gap SIEM replay
In air-gapped environments, dead-letter files must be exported and replayed from a machine with SIEM access.
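On the SIEM-connected machine, a small script can replay a batch while keeping count of failures, so that a partially delivered file is not mistaken for a completed one. The sketch below is one possible shape, using only the standard library: `replay_batch` and `make_hec_sender` are hypothetical names, and the HEC URL and token are supplied by the operator. The header format follows the `Authorization: Splunk <token>` convention used by the curl commands in this runbook.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Tuple


def make_hec_sender(url: str, token: str, timeout: float = 10.0) -> Callable[[str], bool]:
    """Return a function that POSTs one JSON line to a Splunk HEC endpoint."""
    def send(line: str) -> bool:
        request = urllib.request.Request(
            url,
            data=line.encode("utf-8"),
            headers={
                "Authorization": f"Splunk {token}",
                "Content-Type": "application/json",
            },
            method="POST",
        )
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return 200 <= response.status < 300
        except (urllib.error.URLError, OSError):
            return False  # HTTP errors and network failures both count as failed
    return send


def replay_batch(lines: Iterable[str], send: Callable[[str], bool]) -> Tuple[int, int]:
    """Send each non-empty line; return (sent, failed)."""
    sent = failed = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if send(line):
            sent += 1
        else:
            failed += 1
    return sent, failed
```

Keeping the sender as a parameter lets the same loop be dry-run with a stub before pointing it at the real endpoint. A file would be treated as replayed only when `failed` is zero.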
    # 1. Export dead-letter files from outpost
    kubectl -n arbitex-outpost cp \
      arbitex-outpost-0:/data/dead_letter/ \
      /tmp/dead-letter-export/

    # 2. Transfer to SIEM-connected machine via approved media

    # 3. Replay from the connected machine
    for f in /tmp/dead-letter-export/*.jsonl; do
      while IFS= read -r line; do
        curl -sf -X POST "$SPLUNK_HEC_URL" \
          -H "Authorization: Splunk $SPLUNK_TOKEN" \
          -H "Content-Type: application/json" \
          -d "$line"
      done < "$f"
      echo "Replayed: $f"
    done
3e. Cleanup
After successful replay, remove processed files:
    # Run through sh -c so the *.replayed glob expands inside the pod, not in the local shell
    kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
      sh -c 'rm /data/dead_letter/*.replayed'
4. Full PVC loss recovery checklist
Use this checklist when a CSI driver failure or storage incident causes complete PVC loss.
- Triage: Identify which PVCs were lost (kubectl get pvc -n arbitex-outpost)
- Policy cache: Will auto-recover via sync (connected) or needs manual copy (air-gap); see §1
- Audit buffer: Check platform for last synced event timestamp; see §2b
- Document gap: Record the time window of potential audit event loss
- Dead-letter: If dead-letter PVC is lost, those SIEM batches are unrecoverable; document in incident report
- Verify HMAC key: Confirm AUDIT_HMAC_KEY is still set (outpost fails fast without it)
- Restart outpost: kubectl -n arbitex-outpost rollout restart deployment/arbitex-outpost
- Monitor logs: Watch for successful policy sync and audit buffer creation
- Replay: If dead-letter files were backed up, replay per §3
- Incident report: File report documenting PVC loss cause, data gap window, and recovery actions
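The triage step in the checklist can be partly automated by parsing the output of `kubectl get pvc -n arbitex-outpost -o json`. The sketch below reports claims that are missing or not in the `Bound` phase. The expected claim names are supplied by the caller because this runbook does not list them, and `pvc_report` is a hypothetical helper name.

```python
import json
from typing import Dict, Iterable, List


def pvc_report(kubectl_json: str, expected_names: Iterable[str]) -> Dict[str, List[str]]:
    """Classify expected PVCs as missing or not Bound, from `kubectl get pvc -o json` output."""
    items = json.loads(kubectl_json).get("items", [])
    phases = {
        item["metadata"]["name"]: item.get("status", {}).get("phase", "Unknown")
        for item in items
    }
    expected = list(expected_names)
    return {
        "missing": [name for name in expected if name not in phases],
        "not_bound": [
            f"{name} ({phases[name]})"
            for name in expected
            if name in phases and phases[name] != "Bound"
        ],
    }
```

A claim that was lost and then recreated will show as `Bound` again even though it is empty, so this check complements the content checks in §1 and §2 rather than replacing them.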