# Alert Configuration & Monitoring
This guide covers the full observability alerting stack for Arbitex AI Gateway: Prometheus alert rules, Grafana dashboard setup, notification channel configuration (PagerDuty, Slack, email), escalation policies, and custom alert creation.
## Overview

The Arbitex platform exposes a Prometheus `/metrics` endpoint on port 9090 (platform) and port 9091 (outpost). Alerts are evaluated by Prometheus and routed by Alertmanager to notification channels based on severity and team ownership.
Key alert domains:
- DLP pipeline — latency, error rates, and NER/DeBERTa service health
- Budget enforcement — per-group spend thresholds and overage events
- TLS certificate expiry — platform and outpost mTLS certs
- Infrastructure — database connections, Redis availability, queue depth
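To make the two endpoints concrete, a minimal Prometheus scrape configuration might look like the following sketch. The job names and target hostnames are illustrative assumptions, not shipped defaults; adjust them to your deployment.

```yaml
scrape_configs:
  - job_name: arbitex-platform
    static_configs:
      - targets: ['platform:9090']   # platform /metrics endpoint
  - job_name: arbitex-outpost
    static_configs:
      - targets: ['outpost:9091']    # outpost /metrics endpoint
```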
## Prometheus Alert Rules

Alert rules are defined in YAML and loaded by Prometheus via the `rule_files` configuration key. Platform-0047 ships four production rule groups.
### Installing Alert Rule Files
Section titled “Installing Alert Rule Files”Mount alert rule files into your Prometheus container:
```yaml
prometheus:
  image: prom/prometheus:v2.51.0
  volumes:
    - ./prometheus/rules:/etc/prometheus/rules
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--web.enable-lifecycle'
```

Reference the rules from `prometheus.yml`:
```yaml
rule_files:
  - /etc/prometheus/rules/dlp_alerts.yml
  - /etc/prometheus/rules/budget_alerts.yml
  - /etc/prometheus/rules/cert_alerts.yml
  - /etc/prometheus/rules/infra_alerts.yml
```

For Kubernetes deployments, store rules in a ConfigMap:
```sh
kubectl create configmap prometheus-rules \
  --from-file=rules/ \
  -n monitoring
```

Reference the ConfigMap in your Prometheus Operator PrometheusRule CRD or mount it as a volume.
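For Prometheus Operator users, the same rules can be delivered as a `PrometheusRule` resource instead of a raw ConfigMap. A minimal sketch with one rule from this guide; the `release` label must match your operator's `ruleSelector`, which varies per installation:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: arbitex-dlp-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumption: matches your ruleSelector
spec:
  groups:
    - name: dlp_latency
      rules:
        - alert: DLPHighLatencyP95
          expr: histogram_quantile(0.95, rate(arbitex_dlp_pipeline_duration_seconds_bucket[5m])) > 0.5
          for: 5m
          labels:
            severity: warning
```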
### DLP Latency Alerts (dlp_alerts.yml)
```yaml
groups:
  - name: dlp_latency
    interval: 30s
    rules:
      # P95 DLP pipeline latency exceeds 500ms over 5 minutes
      - alert: DLPHighLatencyP95
        expr: |
          histogram_quantile(0.95,
            rate(arbitex_dlp_pipeline_duration_seconds_bucket[5m])
          ) > 0.5
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "DLP P95 latency above 500ms"
          description: >
            DLP pipeline P95 latency is {{ $value | humanizeDuration }} on
            instance {{ $labels.instance }}. Check NER service and DeBERTa
            worker health.
          runbook_url: "https://docs.arbitex.io/runbooks/dlp-latency"

      # P99 DLP pipeline latency exceeds 2s — critical escalation
      - alert: DLPCriticalLatencyP99
        expr: |
          histogram_quantile(0.99,
            rate(arbitex_dlp_pipeline_duration_seconds_bucket[5m])
          ) > 2.0
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "DLP P99 latency critical (>2s)"
          description: >
            DLP P99 latency is {{ $value | humanizeDuration }}. Immediate
            investigation required — requests may be timing out.

      # DeBERTa worker queue depth
      - alert: DeBERTaQueueDepthHigh
        expr: arbitex_deberta_queue_depth > 50
        for: 3m
        labels:
          severity: warning
          team: ml
        annotations:
          summary: "DeBERTa worker queue backing up"
          description: >
            DeBERTa inference queue depth is {{ $value }} items on
            {{ $labels.instance }}. Consider scaling worker replicas.
```
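The P95/P99 rules above rely on `histogram_quantile`, which estimates a quantile by linear interpolation inside the bucket where the target rank falls. A simplified Python sketch of that estimation (the bucket bounds and counts are made-up sample data, and Prometheus's real implementation handles more edge cases):

```python
# Simplified model of histogram_quantile: find the cumulative bucket
# containing rank q * total, then linearly interpolate within it.

def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted ascending,
    ending with the +Inf bucket that holds the total observation count."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into +Inf
            # linear interpolation inside this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Sample data: 500 requests, most under 250ms
buckets = [(0.1, 300), (0.25, 450), (0.5, 480), (1.0, 498), (float("inf"), 500)]
p95 = histogram_quantile(0.95, buckets)
print(f"estimated P95 = {p95:.3f}s; fires DLPHighLatencyP95: {p95 > 0.5}")
```

With this sample distribution the estimate lands around 0.46s, just under the 0.5s threshold, which illustrates why the alert compares the interpolated quantile rather than a raw bucket bound.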
### DLP Error Rate Alerts

```yaml
groups:
  - name: dlp_errors
    interval: 30s
    rules:
      # DLP pipeline error rate > 1% over 5 minutes
      - alert: DLPHighErrorRate
        expr: |
          rate(arbitex_dlp_pipeline_errors_total[5m])
            / rate(arbitex_dlp_pipeline_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "DLP error rate above 1%"
          description: >
            DLP error rate is {{ $value | humanizePercentage }} on
            {{ $labels.instance }}. Check logs for pattern match failures
            or NER service timeouts.

      # Complete DLP pipeline failure (error rate > 10%)
      - alert: DLPPipelineFailure
        expr: |
          rate(arbitex_dlp_pipeline_errors_total[5m])
            / rate(arbitex_dlp_pipeline_requests_total[5m]) > 0.10
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "DLP pipeline near-total failure"
          description: >
            DLP error rate is {{ $value | humanizePercentage }}. Platform
            may be serving requests without DLP protection. Check DLP
            service pod status immediately.

      # NER service unavailable
      - alert: NERServiceDown
        expr: up{job="arbitex-ner"} == 0
        for: 1m
        labels:
          severity: critical
          team: ml
        annotations:
          summary: "NER service is down"
          description: "NER service on {{ $labels.instance }} has been down for 1 minute."
```
### Budget Threshold Alerts (budget_alerts.yml)

```yaml
groups:
  - name: budget_enforcement
    interval: 60s
    rules:
      # Group approaching budget limit (80% of monthly)
      - alert: GroupBudgetWarning
        expr: |
          arbitex_group_spend_current_month_usd
            / arbitex_group_budget_limit_usd > 0.80
        for: 5m
        labels:
          severity: warning
          team: finance
        annotations:
          summary: "Group {{ $labels.group_id }} at 80% budget"
          description: >
            Group {{ $labels.group_id }} ({{ $labels.group_name }}) has
            consumed {{ $value | humanizePercentage }} of its monthly
            budget (${{ $labels.limit_usd }}).

      # Group exceeded budget
      - alert: GroupBudgetExceeded
        expr: |
          arbitex_group_spend_current_month_usd
            / arbitex_group_budget_limit_usd >= 1.0
        for: 0m
        labels:
          severity: critical
          team: finance
        annotations:
          summary: "Group {{ $labels.group_id }} budget exceeded"
          description: >
            Group {{ $labels.group_id }} has exceeded its monthly budget.
            Traffic may be throttled per enforcement policy.

      # Platform-wide spend spike (>200% of 7-day rolling average)
      - alert: PlatformSpendSpike
        expr: |
          rate(arbitex_platform_spend_usd_total[1h])
            / rate(arbitex_platform_spend_usd_total[7d] offset 1h) > 2.0
        for: 15m
        labels:
          severity: warning
          team: finance
        annotations:
          summary: "Platform spend rate anomaly detected"
          description: >
            Hourly spend rate is {{ $value }}x the 7-day average. Possible
            cost runaway event — check active groups.
```
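The two group-level rules reduce to a single ratio check: spend over limit crossing 0.80 and then 1.0. A small Python sketch of that threshold logic (figures are illustrative; the real enforcement is done by the PromQL above):

```python
# Mirror of the GroupBudgetWarning / GroupBudgetExceeded thresholds.

def budget_alert(spend_usd, limit_usd):
    ratio = spend_usd / limit_usd
    if ratio >= 1.0:
        return "GroupBudgetExceeded"   # critical, fires immediately (for: 0m)
    if ratio > 0.80:
        return "GroupBudgetWarning"    # warning after 5m sustained
    return None

print(budget_alert(850.0, 1000.0))   # 85% of budget -> GroupBudgetWarning
print(budget_alert(1200.0, 1000.0))  # over budget   -> GroupBudgetExceeded
```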
### Certificate Expiry Alerts (cert_alerts.yml)

```yaml
groups:
  - name: cert_expiry
    interval: 6h
    rules:
      # mTLS cert expiring within 30 days
      - alert: OutpostMTLSCertExpiringSoon
        expr: |
          (arbitex_outpost_mtls_cert_expiry_timestamp_seconds - time())
            / 86400 < 30
        for: 0m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Outpost mTLS cert expires in < 30 days"
          description: >
            Outpost {{ $labels.outpost_id }} mTLS certificate expires in
            {{ $value | humanizeDuration }}. Rotate via
            PUT /api/admin/outposts/{id}/rotate-cert.
          runbook_url: "https://docs.arbitex.io/runbooks/cert-rotation"

      # mTLS cert expiring within 7 days — critical
      - alert: OutpostMTLSCertCritical
        expr: |
          (arbitex_outpost_mtls_cert_expiry_timestamp_seconds - time())
            / 86400 < 7
        for: 0m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Outpost mTLS cert expires in < 7 days"
          description: >
            URGENT: Outpost {{ $labels.outpost_id }} mTLS certificate
            expires in {{ $value | humanizeDuration }}.

      # Platform TLS cert (API gateway)
      - alert: PlatformTLSCertExpiringSoon
        expr: |
          (arbitex_platform_tls_cert_expiry_timestamp_seconds - time())
            / 86400 < 30
        for: 0m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Platform TLS cert expires in < 30 days"
          description: >
            Platform API TLS certificate expires in
            {{ $value | humanizeDuration }}. Renew via your certificate
            authority or ACME provider.
```
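All three expressions share one shape: `(expiry_timestamp - time()) / 86400` compared against a day threshold. A Python sketch of that arithmetic, useful for sanity-checking a given expiry timestamp by hand (timestamps below are sample values):

```python
import time

# Mirror of the cert-expiry expressions: days remaining until expiry,
# compared against the 7- and 30-day thresholds.

def cert_alert(expiry_ts, now=None):
    days_left = (expiry_ts - (now if now is not None else time.time())) / 86400
    if days_left < 7:
        return "OutpostMTLSCertCritical"
    if days_left < 30:
        return "OutpostMTLSCertExpiringSoon"
    return None

now = 1_700_000_000  # fixed sample "current time" for reproducibility
print(cert_alert(now + 5 * 86400, now))   # 5 days left  -> critical
print(cert_alert(now + 20 * 86400, now))  # 20 days left -> warning
print(cert_alert(now + 90 * 86400, now))  # 90 days left -> no alert
```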
### Infrastructure Alerts (infra_alerts.yml)

```yaml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      # PostgreSQL connection pool exhaustion
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          arbitex_db_pool_available_connections
            / arbitex_db_pool_max_connections < 0.10
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Database connection pool below 10%"
          description: >
            Only {{ $value | humanizePercentage }} of DB connections are
            available on {{ $labels.instance }}.

      # Redis unavailable
      - alert: RedisUnavailable
        expr: up{job="arbitex-redis"} == 0
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Redis is unreachable"
          description: >
            Redis on {{ $labels.instance }} has been unreachable for
            1 minute. Auth sessions and rate limiting may be degraded.

      # Request queue depth
      - alert: RequestQueueDepthHigh
        expr: arbitex_request_queue_depth > 1000
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Request queue depth high"
          description: >
            Request queue depth is {{ $value }} items. Consider scaling
            platform replicas.
```
## Grafana Dashboard Setup

The platform ships six pre-built Grafana dashboards (platform-0046). Import them via the Grafana API or UI.
### Dashboard Inventory

| Dashboard | UID | Description |
|---|---|---|
| Platform Overview | arbitex-overview | Request rate, latency, error rate, active connections |
| DLP Analytics | arbitex-dlp | Pipeline stages, trigger rates, action distribution |
| Budget & Cost | arbitex-budget | Per-group spend, model cost breakdown, burn rate |
| Outpost Health | arbitex-outpost | Per-outpost request forwarding, latency, cert status |
| Model Performance | arbitex-models | Provider latency P50/P95/P99, error rates by model |
| Security & Audit | arbitex-security | Auth events, MFA usage, policy violations, SIEM events |
### Importing Dashboards

Via Grafana UI:
- Navigate to Dashboards → Import
- Upload the JSON file or paste the dashboard JSON
- Set the data source to your Prometheus instance
- Click Import
Via Grafana API:
```sh
# Import a single dashboard
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -d @dashboards/arbitex-overview.json \
  http://localhost:3000/api/dashboards/import
```
```sh
# Bulk import all dashboards
for f in dashboards/*.json; do
  curl -s -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
    -d @"$f" \
    http://localhost:3000/api/dashboards/import
done
```

Via Helm (recommended for Kubernetes):
```yaml
# values.yaml for kube-prometheus-stack
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: arbitex
          orgId: 1
          folder: Arbitex
          type: file
          options:
            path: /var/lib/grafana/dashboards/arbitex
  dashboardsConfigMaps:
    arbitex: arbitex-grafana-dashboards
```

Create the ConfigMap from the dashboard directory:
```sh
kubectl create configmap arbitex-grafana-dashboards \
  --from-file=dashboards/ \
  -n monitoring
```
### Key Dashboard Panels

Platform Overview panel queries:
```promql
# Request rate
rate(arbitex_http_requests_total[5m])

# P95 response time
histogram_quantile(0.95, rate(arbitex_http_request_duration_seconds_bucket[5m]))

# Error rate
rate(arbitex_http_requests_total{status=~"5.."}[5m])
  / rate(arbitex_http_requests_total[5m])
```

DLP Analytics panel queries:
```promql
# DLP trigger rate by action
rate(arbitex_dlp_triggers_total[5m])

# Actions by type (block/redact/allow/require_approval)
sum by (action) (rate(arbitex_dlp_triggers_total[5m]))

# NER entity detection rate
rate(arbitex_ner_entities_detected_total[5m])
```
## Notification Channels

Alertmanager routes alerts to notification channels based on severity and label matchers.
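The routing behavior configured below hinges on first-match evaluation and the `continue` flag: a matching route claims the alert, and `continue: true` lets later sibling routes also fire. A toy Python model of that logic (greatly simplified; real Alertmanager walks a nested route tree with grouping and timers):

```python
# Toy first-match routing with `continue`, mirroring the route list
# configured in this section.

ROUTES = [
    # (label matchers, receiver, continue)
    ({"severity": "critical"}, "pagerduty-critical", True),
    ({"team": "finance"}, "slack-finance", True),
    ({"team": "ml"}, "slack-ml", False),
    ({"severity": "warning"}, "slack-platform", False),
]

def route(labels, routes=ROUTES, default="default-receiver"):
    receivers = []
    for matchers, receiver, cont in routes:
        if all(labels.get(k) == v for k, v in matchers.items()):
            receivers.append(receiver)
            if not cont:
                break
    return receivers or [default]

# A critical finance alert both pages and posts to the finance channel.
print(route({"severity": "critical", "team": "finance"}))
# → ['pagerduty-critical', 'slack-finance']
```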
### Alertmanager Base Configuration

```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: '${SMTP_PASSWORD}'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts → PagerDuty
    - match:
        severity: critical
      receiver: pagerduty-critical
      group_wait: 0s
      repeat_interval: 1h
      continue: true

    # Finance team alerts
    - match:
        team: finance
      receiver: slack-finance
      continue: true

    # ML team alerts
    - match:
        team: ml
      receiver: slack-ml

    # Warning alerts → Slack platform channel
    - match:
        severity: warning
      receiver: slack-platform
```
### PagerDuty Configuration

```yaml
receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'
        send_resolved: true
        severity: '{{ if eq .GroupLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          description: '{{ .CommonAnnotations.description }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'
        client: 'Arbitex Alertmanager'
        client_url: '{{ template "arbitex.alertmanagerURL" . }}'
```

Required secrets:
```sh
# Create Alertmanager secret in Kubernetes
kubectl create secret generic alertmanager-secrets \
  --from-literal=PAGERDUTY_ROUTING_KEY=your-routing-key \
  --from-literal=SMTP_PASSWORD=your-smtp-password \
  -n monitoring
```
### Slack Configuration

```yaml
receivers:
  - name: slack-platform
    slack_configs:
      - api_url: '${SLACK_PLATFORM_WEBHOOK_URL}'
        send_resolved: true
        channel: '#platform-alerts'
        title: '{{ template "slack.arbitex.title" . }}'
        text: '{{ template "slack.arbitex.text" . }}'
        color: '{{ if eq .Status "firing" }}{{ if eq .GroupLabels.severity "critical" }}danger{{ else }}warning{{ end }}{{ else }}good{{ end }}'
        actions:
          - type: button
            text: 'View in Grafana'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Silence Alert'
            url: '{{ template "arbitex.silenceURL" . }}'

  - name: slack-finance
    slack_configs:
      - api_url: '${SLACK_FINANCE_WEBHOOK_URL}'
        send_resolved: true
        channel: '#finance-alerts'
        title: 'Budget Alert: {{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: slack-ml
    slack_configs:
      - api_url: '${SLACK_ML_WEBHOOK_URL}'
        send_resolved: true
        channel: '#ml-alerts'
```

Slack alert template (`/etc/alertmanager/templates/slack.tmpl`):
```
{{ define "slack.arbitex.title" -}}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{- end }}

{{ define "slack.arbitex.text" -}}
{{ range .Alerts -}}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
{{ if .Annotations.runbook_url -}}
*Runbook:* {{ .Annotations.runbook_url }}
{{- end }}
{{ end -}}
{{- end }}
```
### Email Configuration

```yaml
receivers:
  - name: email-oncall
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
        headers:
          subject: '[Arbitex Alert] {{ .GroupLabels.alertname }} - {{ .Status | toUpper }}'
        html: '{{ template "email.arbitex.html" . }}'
        require_tls: true

  - name: default-receiver
    email_configs:
      - to: 'platform-team@example.com'
        send_resolved: true
```
## Escalation Policies

Define multi-tier escalation in Alertmanager routing with `repeat_interval` staggering.
### Standard Escalation Matrix

| Severity | Initial Notify | Escalate After | Escalate To |
|---|---|---|---|
| info | Slack channel | — | — |
| warning | Slack channel | 2 hours unresolved | On-call Slack DM |
| critical | PagerDuty + Slack | 30 min unresolved | PagerDuty escalation policy |
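If you need the matrix above in machine-readable form (for example, to drive a paging bot or to test that routing matches policy), it can be encoded as data. A sketch with hypothetical channel names of our own choosing:

```python
# The escalation matrix as data. Channel identifiers are illustrative.

ESCALATION = {
    "info":     {"initial": "slack", "escalate_after_min": None, "escalate_to": None},
    "warning":  {"initial": "slack", "escalate_after_min": 120,  "escalate_to": "oncall-dm"},
    "critical": {"initial": "pagerduty+slack", "escalate_after_min": 30,
                 "escalate_to": "pagerduty-escalation"},
}

def next_step(severity, minutes_unresolved):
    """Return where a still-unresolved alert should be sent next."""
    policy = ESCALATION[severity]
    after = policy["escalate_after_min"]
    if after is not None and minutes_unresolved >= after:
        return policy["escalate_to"]
    return policy["initial"]

print(next_step("critical", 45))  # past 30 min -> escalation target
```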
### Alertmanager Escalation Routing
Section titled “Alertmanager Escalation Routing”route: routes: # Critical → PagerDuty immediately - match: severity: critical receiver: pagerduty-critical group_wait: 0s repeat_interval: 30m
# Warning → Slack, escalate via inhibit rule after 2h - match: severity: warning receiver: slack-platform group_wait: 1m repeat_interval: 2h
# Inhibit rules — suppress warning if critical fires for same alertinhibit_rules: - source_match: severity: critical target_match: severity: warning equal: - alertname - instancePagerDuty Escalation Policy
### PagerDuty Escalation Policy

In PagerDuty, create an escalation policy attached to the Arbitex service:
- Level 1 (0 min): On-call engineer (15-min acknowledgement window)
- Level 2 (15 min): Secondary on-call + team lead
- Level 3 (30 min): Engineering manager + PagerDuty incident declared
Configure the service in PagerDuty:
- Escalation policy: Arbitex Platform On-Call
- Alert grouping: by alert name + instance
- Auto-resolve: after Alertmanager sends a resolved notification
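The inhibit rule configured above has simple semantics: a firing critical alert mutes a warning alert that carries the same `alertname` and `instance` labels. A Python sketch of that check (label dicts are sample data):

```python
# Mirror of the inhibit rule: critical source mutes a warning target
# when the `equal` labels match.

def is_inhibited(target, firing_alerts, equal=("alertname", "instance")):
    if target.get("severity") != "warning":
        return False
    return any(
        src.get("severity") == "critical"
        and all(src.get(k) == target.get(k) for k in equal)
        for src in firing_alerts
    )

firing = [{"alertname": "DLPHighLatencyP95", "instance": "platform-0",
           "severity": "critical"}]
print(is_inhibited({"alertname": "DLPHighLatencyP95", "instance": "platform-0",
                    "severity": "warning"}, firing))
```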
## Creating Custom Alerts

### Custom Alert Template

```yaml
groups:
  - name: custom_alerts
    interval: 60s
    rules:
      - alert: MyCustomAlert
        # PromQL expression — the alert fires for each series it returns
        expr: <your_metric_expression> <comparison_operator> <threshold>
        # Duration the condition must be true before firing
        for: <duration>
        labels:
          severity: warning|critical
          team: platform|ml|finance
          # Add any custom labels for routing
        annotations:
          summary: "<one-line summary>"
          description: >
            <Detailed description with {{ $value }} for the metric value
            and {{ $labels.instance }} for label references.>
          runbook_url: "<URL to runbook>"
```
### Custom Metric Example: Per-Model Error Rate

```yaml
- alert: ModelHighErrorRate
  # Aggregate both sides so the status="error" subset divides cleanly
  # against the all-statuses total for each model/provider pair
  expr: |
    sum by (model_id, provider)
      (rate(arbitex_model_requests_total{status="error"}[5m]))
      / sum by (model_id, provider)
        (rate(arbitex_model_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Model {{ $labels.model_id }} error rate above 5%"
    description: >
      Model {{ $labels.model_id }} (provider: {{ $labels.provider }})
      has an error rate of {{ $value | humanizePercentage }} over the
      last 5 minutes.
```
### Validating Alert Rules

Before deploying, validate rule syntax using promtool:
```sh
# promtool is bundled with the Prometheus binary
promtool check rules /etc/prometheus/rules/*.yml

# Expected output for valid rules:
#   Checking /etc/prometheus/rules/dlp_alerts.yml
#   SUCCESS: 8 rules found
```
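Beyond syntax checks, promtool can unit-test alerting behavior against synthetic series. A sketch for the `NERServiceDown` rule, assuming `dlp_alerts.yml` sits next to the test file (the instance name is illustrative); run it with `promtool test rules test_rules.yml`:

```yaml
# test_rules.yml
rule_files:
  - dlp_alerts.yml

evaluation_interval: 30s

tests:
  - interval: 30s
    input_series:
      # up == 0 for the whole window: the NER target is down
      - series: 'up{job="arbitex-ner", instance="ner-0:9100"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 2m
        alertname: NERServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: ml
              job: arbitex-ner
              instance: ner-0:9100
            exp_annotations:
              summary: "NER service is down"
              description: "NER service on ner-0:9100 has been down for 1 minute."
```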
```sh
# Hot-reload Prometheus after updating rules
# (requires the --web.enable-lifecycle flag set earlier)
curl -X POST http://localhost:9090/-/reload
```
### Alert Silencing

Temporarily silence alerts during maintenance windows:
```sh
# Create a silence via the Alertmanager API
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "DLPHighLatencyP95", "isRegex": false}
    ],
    "startsAt": "2026-03-12T00:00:00Z",
    "endsAt": "2026-03-12T04:00:00Z",
    "createdBy": "admin",
    "comment": "Scheduled maintenance window"
  }' \
  http://localhost:9093/api/v2/silences
```
## Operational Runbooks

### DLP Latency Runbook
When `DLPHighLatencyP95` fires:
- Check DLP service pod status: `kubectl get pods -l app=arbitex-dlp -n arbitex`
- Check NER service health: `kubectl logs -l app=arbitex-ner -n arbitex --tail=50`
- Check DeBERTa worker queue: `kubectl exec -it deploy/arbitex-platform -- python -m arbitex.tools.queue_status`
- If the NER pod is crash-looping, scale the NER deployment: `kubectl scale deploy/arbitex-ner --replicas=3 -n arbitex`
- If DeBERTa is the bottleneck, scale workers: `kubectl scale deploy/arbitex-deberta-worker --replicas=2 -n arbitex`
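After scaling, you can confirm that P95 has recovered by querying Prometheus's standard instant-query API (`/api/v1/query`). A sketch that builds the query URL and parses the response shape offline; the base URL and the sample payload are illustrative:

```python
import json
from urllib.parse import urlencode

# The same P95 expression used by the DLPHighLatencyP95 rule.
PROMQL = ('histogram_quantile(0.95, '
          'rate(arbitex_dlp_pipeline_duration_seconds_bucket[5m]))')

def query_url(base="http://localhost:9090"):
    """Build the instant-query URL for the P95 expression."""
    return f"{base}/api/v1/query?" + urlencode({"query": PROMQL})

def parse_p95(body):
    """Parse an instant-vector response into {instance: p95_seconds}."""
    data = json.loads(body)
    return {
        str(r["metric"].get("instance", "")): float(r["value"][1])
        for r in data["data"]["result"]
    }

# Canned sample response in the standard /api/v1/query vector format
sample = ('{"status":"success","data":{"resultType":"vector","result":'
          '[{"metric":{"instance":"platform-0:9090"},'
          '"value":[1710000000,"0.212"]}]}}')
p95s = parse_p95(sample)
print(p95s)  # a P95 of 0.212s is back under the 0.5s threshold
```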
### Certificate Rotation Runbook
When `OutpostMTLSCertCritical` fires:
```sh
# Rotate the outpost mTLS certificate
curl -X PUT \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://api.arbitex.example.com/api/admin/outposts/$OUTPOST_ID/rotate-cert
```
```sh
# Verify the new cert expiry
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://api.arbitex.example.com/api/admin/outposts/$OUTPOST_ID \
  | jq '.cert_expiry'
```