
Outpost PVC recovery and dead-letter replay

This runbook covers recovery from outpost PVC (Persistent Volume Claim) failures. The outpost uses three persistent storage areas:

| PVC path | Purpose | Loss impact |
| --- | --- | --- |
| policy_cache/ | Cached policy bundle from management plane | Outpost falls back to default policy until re-sync |
| audit_buffer/ | HMAC-chained audit events awaiting sync | Unsynced audit events lost (RPO risk) |
| dead_letter/ | Failed SIEM delivery batches as JSONL | SIEM gaps until replayed |

1a. Connected mode (management plane reachable)

When the outpost has mTLS connectivity to the platform management plane, the policy cache automatically re-syncs every 60 seconds.

```sh
# 1. Verify the outpost pod is running
kubectl -n arbitex-outpost get pods

# 2. Check the cached policy bundle version
kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
  cat /data/policy_cache/policy_bundle.json | jq '.metadata.version'

# 3. If the PVC was recreated, the sync worker repopulates it automatically.
#    Watch the logs for confirmation:
kubectl -n arbitex-outpost logs deploy/arbitex-outpost -f | grep "policy_sync"

# Expected log line:
# INFO policy_sync: Bundle refreshed, version=<N>, rules=<count>
```

If sync fails (certificate expired, network issue):

```sh
# Check mTLS connectivity from inside the pod. Wrap the command in sh -c
# so $PLATFORM_MANAGEMENT_URL expands in the pod's environment, not on
# your workstation.
kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
  sh -c 'curl -sf --cert /certs/client.pem --key /certs/client-key.pem \
    "https://$PLATFORM_MANAGEMENT_URL/v1/internal/policy-bundle"' | jq '.metadata'

# Common issues:
# - PLATFORM_MANAGEMENT_URL not set or unreachable
# - Client certificate expired → rotate via cert-manager or manually
# - CA bundle mismatch → verify CLOUD_CA_CERT_PATH
```

1b. Air-gapped mode (management plane unreachable)

In air-gapped deployments, the outpost bootstraps from a bundled default policy.

```sh
# 1. On first boot, when no cached bundle exists and PLATFORM_MANAGEMENT_URL
#    is unset, the outpost uses scripts/default-policy-bundle.json
# 2. To update policy in air-gap mode, copy a new bundle into the PVC:
kubectl -n arbitex-outpost cp \
  new-policy-bundle.json \
  arbitex-outpost-0:/data/policy_cache/policy_bundle.json

# 3. Restart the outpost to pick up the new bundle
kubectl -n arbitex-outpost rollout restart deployment/arbitex-outpost

# 4. Verify
kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
  cat /data/policy_cache/policy_bundle.json | jq '.metadata.version'
```

Generating an offline policy bundle:

```sh
# On a machine with management plane access:
curl -sf --cert client.pem --key client-key.pem \
  https://platform.arbitex.io/v1/internal/policy-bundle \
  -o policy-bundle-export.json

# Transfer to the air-gapped environment via approved media,
# then apply as shown above.
```
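An integrity check on the exported file is good practice for the media transfer. This is standard `sha256sum` usage, not an outpost-specific requirement:

```sh
# Record a digest before export; verify it after import on the
# air-gapped side. (General good practice; not outpost-specific.)
sha256sum policy-bundle-export.json > policy-bundle-export.json.sha256
# ...after the transfer:
sha256sum -c policy-bundle-export.json.sha256
```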

2a. Audit buffer status and integrity

The audit buffer stores HMAC-SHA256-chained JSONL records. The background sync worker POSTs batches to the platform endpoint /v1/internal/outpost-audit-sync.

```sh
# Check buffer size
kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
  wc -l /data/audit_buffer/events.jsonl

# Check HMAC chain integrity
kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
  python -c "
from outpost.audit.chain import verify_chain
result = verify_chain('/data/audit_buffer/events.jsonl')
print(f'Valid: {result.valid}, Events: {result.count}')
"
```

2b. Identifying the audit gap after PVC loss

If the PVC was lost and recreated, the buffer starts empty. Any events generated between the last successful sync and the PVC loss are not recoverable from the outpost. Query the platform for the last synced event to bound the gap.

```sh
# On the platform side, find the last outpost sync timestamp
# (quote the URL so the shell does not treat & as a job separator)
curl -sf "https://api.arbitex.io/v1/admin/audit?source=outpost&limit=1&order=desc" \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.[0].timestamp'
```

Document the gap window in the incident report.
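To quantify the gap for the report, subtract the last-synced timestamp from the PVC loss time. The timestamps below are illustrative placeholders, not real data:

```python
from datetime import datetime

# Illustrative values only: take last_synced from the admin API query
# above, and pvc_lost_at from the storage incident timeline.
last_synced = datetime.fromisoformat("2026-03-10T14:05:00+00:00")
pvc_lost_at = datetime.fromisoformat("2026-03-10T14:22:00+00:00")

gap = pvc_lost_at - last_synced
print(f"Potential audit loss window: {gap}")  # Potential audit loss window: 0:17:00
```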

To force an immediate sync of buffered events:

```sh
# Trigger a manual sync
kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
  python -c "
from outpost.audit.sync import AuditSyncWorker
worker = AuditSyncWorker()
result = worker.flush()
print(f'Synced: {result.synced}, Failed: {result.failed}')
"
```

The audit buffer has a maximum of 100,000 entries (configurable via max_audit_buffer_entries). When full, the oldest entries are rotated out. Ensure the sync worker is running to prevent data loss from rotation.
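The rotation semantics can be illustrated with a bounded buffer. This is a sketch of the behavior only; the real buffer persists JSONL on the PVC and the default cap is 100,000:

```python
from collections import deque

# Bounded buffer: once full, each append silently drops the oldest entry.
# This is exactly the data-loss mode a healthy sync worker prevents.
MAX_AUDIT_BUFFER_ENTRIES = 5  # illustrative; the real default is 100,000

buf = deque(maxlen=MAX_AUDIT_BUFFER_ENTRIES)
for seq in range(8):
    buf.append({"seq": seq})

print([e["seq"] for e in buf])  # [3, 4, 5, 6, 7] — entries 0-2 rotated out
```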


3. Dead-letter replay

Failed SIEM delivery batches are stored as JSONL files in dead_letter/. Each file represents a batch that failed to deliver to the configured SIEM endpoint (Splunk HEC or syslog).

```sh
# List dead-letter files
kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
  ls -la /data/dead_letter/

# Sample output:
# -rw-r--r-- 1 app app 245K Mar 10 14:22 batch-20260310-142200.jsonl
# -rw-r--r-- 1 app app 189K Mar 10 14:35 batch-20260310-143500.jsonl

# Inspect a file
kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
  head -3 /data/dead_letter/batch-20260310-142200.jsonl | jq .
```
```sh
# Replay a single batch. Read line-by-line with read -r: a for-loop over
# $(cat ...) would word-split JSON records on embedded whitespace.
kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
  bash -c '
    while IFS= read -r line; do
      curl -sf -X POST "$SIEM_DIRECT_SPLUNK_URL" \
        -H "Authorization: Splunk $SIEM_DIRECT_SPLUNK_TOKEN" \
        -H "Content-Type: application/json" \
        -d "$line"
    done < /data/dead_letter/batch-20260310-142200.jsonl
  '

# Replay all dead-letter files
kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
  bash -c '
    for f in /data/dead_letter/*.jsonl; do
      echo "Replaying $f ..."
      while IFS= read -r line; do
        curl -sf -X POST "$SIEM_DIRECT_SPLUNK_URL" \
          -H "Authorization: Splunk $SIEM_DIRECT_SPLUNK_TOKEN" \
          -H "Content-Type: application/json" \
          -d "$line"
      done < "$f"
      echo "Done: $f"
      mv "$f" "$f.replayed"
    done
  '
```
For syslog SIEM targets, replay with logger instead:

```sh
kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
  bash -c '
    for f in /data/dead_letter/*.jsonl; do
      echo "Replaying $f to syslog ..."
      while IFS= read -r line; do
        logger -n "$SIEM_DIRECT_SYSLOG_HOST" \
          -P "$SIEM_DIRECT_SYSLOG_PORT" \
          --rfc5424 -p local0.info "$line"
      done < "$f"
      mv "$f" "$f.replayed"
    done
  '
```

In air-gapped environments, dead-letter files must be exported and replayed from a machine with SIEM access.

```sh
# 1. Export dead-letter files from the outpost
kubectl -n arbitex-outpost cp \
  arbitex-outpost-0:/data/dead_letter/ \
  /tmp/dead-letter-export/

# 2. Transfer to a SIEM-connected machine via approved media

# 3. Replay from the connected machine
for f in /tmp/dead-letter-export/*.jsonl; do
  while IFS= read -r line; do
    curl -sf -X POST "$SPLUNK_HEC_URL" \
      -H "Authorization: Splunk $SPLUNK_TOKEN" \
      -H "Content-Type: application/json" \
      -d "$line"
  done < "$f"
  echo "Replayed: $f"
done
```

After successful replay, remove processed files:

```sh
# Wrap in a shell so the glob expands inside the pod; kubectl exec runs
# the command directly and rm would look for a literal "*.replayed" file.
kubectl -n arbitex-outpost exec deploy/arbitex-outpost -- \
  bash -c 'rm /data/dead_letter/*.replayed'
```

4. Complete PVC loss checklist

Use this checklist when a CSI driver failure or other storage incident causes complete PVC loss.

  • Triage: Identify which PVCs were lost (kubectl get pvc -n arbitex-outpost)
  • Policy cache: Will auto-recover via sync (connected) or needs manual copy (air-gap) — see §1
  • Audit buffer: Check platform for last synced event timestamp — see §2b
  • Document gap: Record the time window of potential audit event loss
  • Dead-letter: If dead-letter PVC is lost, those SIEM batches are unrecoverable; document in incident report
  • Verify HMAC key: Confirm AUDIT_HMAC_KEY is still set (outpost fails fast without it)
  • Restart outpost: kubectl -n arbitex-outpost rollout restart deployment/arbitex-outpost
  • Monitor logs: Watch for successful policy sync and audit buffer creation
  • Replay: If dead-letter files were backed up, replay per §3
  • Incident report: File report documenting PVC loss cause, data gap window, and recovery actions