
Alert Configuration & Monitoring

This guide covers the full observability alerting stack for Arbitex AI Gateway: Prometheus alert rules, Grafana dashboard setup, notification channel configuration (PagerDuty, Slack, email), escalation policies, and custom alert creation.

The Arbitex platform exposes a Prometheus /metrics endpoint on port 9090 (platform) and port 9091 (outpost). Alert rules are evaluated by Prometheus, and firing alerts are routed by Alertmanager to notification channels based on severity and team ownership.
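To collect both endpoints, point Prometheus at the two ports. A minimal scrape-config sketch follows; the job names and hostnames here are illustrative, not mandated by the platform:

```yaml
scrape_configs:
  # Platform metrics endpoint (port 9090)
  - job_name: arbitex-platform
    static_configs:
      - targets: ['platform.internal:9090']
  # Outpost metrics endpoint (port 9091)
  - job_name: arbitex-outpost
    static_configs:
      - targets: ['outpost.internal:9091']
```

Both jobs use the default metrics path of /metrics, so no metrics_path override is needed.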

Key alert domains:

  • DLP pipeline — latency, error rates, and NER/DeBERTa service health
  • Budget enforcement — per-group spend thresholds and overage events
  • TLS certificate expiry — platform and outpost mTLS certs
  • Infrastructure — database connections, Redis availability, queue depth

Alert rules are defined in YAML and loaded by Prometheus via the rule_files configuration key. Platform-0047 ships four production rule files.

Mount alert rule files into your Prometheus container:

docker-compose.yml
prometheus:
  image: prom/prometheus:v2.51.0
  volumes:
    - ./prometheus/rules:/etc/prometheus/rules
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--web.enable-lifecycle'

Reference rules from prometheus.yml:

prometheus.yml
rule_files:
  - /etc/prometheus/rules/dlp_alerts.yml
  - /etc/prometheus/rules/budget_alerts.yml
  - /etc/prometheus/rules/cert_alerts.yml
  - /etc/prometheus/rules/infra_alerts.yml

For Kubernetes deployments, store rules in a ConfigMap:

kubectl create configmap prometheus-rules \
  --from-file=rules/ \
  -n monitoring

With the Prometheus Operator, declare the rules as a PrometheusRule resource instead; otherwise mount the ConfigMap into the Prometheus container as a volume.
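For Prometheus Operator users, the sketch below wraps one of the rules from this guide in a PrometheusRule resource. The `release: prometheus` label is an assumption; it must match whatever ruleSelector your Prometheus instance is configured with:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: arbitex-dlp-alerts
  namespace: monitoring
  labels:
    release: prometheus   # must match your Prometheus ruleSelector
spec:
  groups:
    - name: dlp_latency
      rules:
        - alert: DeBERTaQueueDepthHigh
          expr: arbitex_deberta_queue_depth > 50
          for: 3m
          labels:
            severity: warning
            team: ml
```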

DLP Alerts (dlp_alerts.yml)

groups:
  - name: dlp_latency
    interval: 30s
    rules:
      # P95 DLP pipeline latency exceeds 500ms over 5 minutes
      - alert: DLPHighLatencyP95
        expr: |
          histogram_quantile(0.95,
            rate(arbitex_dlp_pipeline_duration_seconds_bucket[5m])
          ) > 0.5
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "DLP P95 latency above 500ms"
          description: >
            DLP pipeline P95 latency is {{ $value | humanizeDuration }} on
            instance {{ $labels.instance }}. Check NER service and DeBERTa
            worker health.
          runbook_url: "https://docs.arbitex.io/runbooks/dlp-latency"

      # P99 DLP pipeline latency exceeds 2s — critical escalation
      - alert: DLPCriticalLatencyP99
        expr: |
          histogram_quantile(0.99,
            rate(arbitex_dlp_pipeline_duration_seconds_bucket[5m])
          ) > 2.0
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "DLP P99 latency critical (>2s)"
          description: >
            DLP P99 latency is {{ $value | humanizeDuration }}. Immediate
            investigation required — requests may be timing out.

      # DeBERTa worker queue depth
      - alert: DeBERTaQueueDepthHigh
        expr: arbitex_deberta_queue_depth > 50
        for: 3m
        labels:
          severity: warning
          team: ml
        annotations:
          summary: "DeBERTa worker queue backing up"
          description: >
            DeBERTa inference queue depth is {{ $value }} items on
            {{ $labels.instance }}. Consider scaling worker replicas.
groups:
  - name: dlp_errors
    interval: 30s
    rules:
      # DLP pipeline error rate > 1% over 5 minutes
      - alert: DLPHighErrorRate
        expr: |
          rate(arbitex_dlp_pipeline_errors_total[5m])
          /
          rate(arbitex_dlp_pipeline_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "DLP error rate above 1%"
          description: >
            DLP error rate is {{ $value | humanizePercentage }} on
            {{ $labels.instance }}. Check logs for pattern match failures
            or NER service timeouts.

      # Complete DLP pipeline failure (error rate > 10%)
      - alert: DLPPipelineFailure
        expr: |
          rate(arbitex_dlp_pipeline_errors_total[5m])
          /
          rate(arbitex_dlp_pipeline_requests_total[5m]) > 0.10
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "DLP pipeline near-total failure"
          description: >
            DLP error rate is {{ $value | humanizePercentage }}.
            Platform may be serving requests without DLP protection.
            Check DLP service pod status immediately.

      # NER service unavailable
      - alert: NERServiceDown
        expr: up{job="arbitex-ner"} == 0
        for: 1m
        labels:
          severity: critical
          team: ml
        annotations:
          summary: "NER service is down"
          description: "NER service on {{ $labels.instance }} has been down for 1 minute."

Budget Threshold Alerts (budget_alerts.yml)

groups:
  - name: budget_enforcement
    interval: 60s
    rules:
      # Group approaching budget limit (80% of monthly)
      - alert: GroupBudgetWarning
        expr: |
          arbitex_group_spend_current_month_usd
          /
          arbitex_group_budget_limit_usd > 0.80
        for: 5m
        labels:
          severity: warning
          team: finance
        annotations:
          summary: "Group {{ $labels.group_id }} at 80% budget"
          description: >
            Group {{ $labels.group_id }} ({{ $labels.group_name }}) has
            consumed {{ $value | humanizePercentage }} of its monthly
            budget (${{ $labels.limit_usd }}).

      # Group exceeded budget
      - alert: GroupBudgetExceeded
        expr: |
          arbitex_group_spend_current_month_usd
          /
          arbitex_group_budget_limit_usd >= 1.0
        for: 0m
        labels:
          severity: critical
          team: finance
        annotations:
          summary: "Group {{ $labels.group_id }} budget exceeded"
          description: >
            Group {{ $labels.group_id }} has exceeded its monthly budget.
            Traffic may be throttled per enforcement policy.

      # Platform-wide spend spike (>200% of 7-day rolling average)
      - alert: PlatformSpendSpike
        expr: |
          rate(arbitex_platform_spend_usd_total[1h])
          /
          rate(arbitex_platform_spend_usd_total[7d] offset 1h) > 2.0
        for: 15m
        labels:
          severity: warning
          team: finance
        annotations:
          summary: "Platform spend rate anomaly detected"
          description: >
            Hourly spend rate is {{ $value }}x the 7-day average.
            Possible cost runaway event — check active groups.

Certificate Expiry Alerts (cert_alerts.yml)

groups:
  - name: cert_expiry
    interval: 6h
    rules:
      # mTLS cert expiring within 30 days
      - alert: OutpostMTLSCertExpiringSoon
        expr: |
          (arbitex_outpost_mtls_cert_expiry_timestamp_seconds - time())
          / 86400 < 30
        for: 0m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Outpost mTLS cert expires in < 30 days"
          description: >
            Outpost {{ $labels.outpost_id }} mTLS certificate expires in
            {{ $value | printf "%.0f" }} days. Rotate via
            PUT /api/admin/outposts/{id}/rotate-cert.
          runbook_url: "https://docs.arbitex.io/runbooks/cert-rotation"

      # mTLS cert expiring within 7 days — critical
      - alert: OutpostMTLSCertCritical
        expr: |
          (arbitex_outpost_mtls_cert_expiry_timestamp_seconds - time())
          / 86400 < 7
        for: 0m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Outpost mTLS cert expires in < 7 days"
          description: >
            URGENT: Outpost {{ $labels.outpost_id }} mTLS certificate
            expires in {{ $value | printf "%.0f" }} days.

      # Platform TLS cert (API gateway)
      - alert: PlatformTLSCertExpiringSoon
        expr: |
          (arbitex_platform_tls_cert_expiry_timestamp_seconds - time())
          / 86400 < 30
        for: 0m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Platform TLS cert expires in < 30 days"
          description: >
            Platform API TLS certificate expires in {{ $value | printf "%.0f" }} days.
            Renew via your certificate authority or ACME provider.
Infrastructure Alerts (infra_alerts.yml)

groups:
  - name: infrastructure
    interval: 30s
    rules:
      # PostgreSQL connection pool exhaustion
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          arbitex_db_pool_available_connections
          / arbitex_db_pool_max_connections < 0.10
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Database connection pool below 10%"
          description: >
            Only {{ $value | humanizePercentage }} of DB connections are
            available on {{ $labels.instance }}.

      # Redis unavailable
      - alert: RedisUnavailable
        expr: up{job="arbitex-redis"} == 0
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Redis is unreachable"
          description: >
            Redis on {{ $labels.instance }} has been unreachable for 1
            minute. Auth sessions and rate limiting may be degraded.

      # Request queue depth
      - alert: RequestQueueDepthHigh
        expr: arbitex_request_queue_depth > 1000
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Request queue depth high"
          description: >
            Request queue depth is {{ $value }} items. Consider scaling
            platform replicas.
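A common companion to the infrastructure group is a dead man's switch: an always-firing alert that proves the Prometheus → Alertmanager → notification path is alive. This is a standard community pattern, not something the Arbitex rule files ship by default; the group and label values below are suggestions:

```yaml
groups:
  - name: meta
    rules:
      # Always fires; route it to a heartbeat receiver.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: info
          team: platform
        annotations:
          summary: "Alerting pipeline heartbeat"
          description: >
            This alert fires continuously. If the heartbeat receiver
            stops seeing it, the alerting pipeline itself is broken.
```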

The platform ships six pre-built Grafana dashboards (platform-0046). Import them via the Grafana API or UI.

Dashboard         | UID              | Description
------------------|------------------|-------------------------------------------------------
Platform Overview | arbitex-overview | Request rate, latency, error rate, active connections
DLP Analytics     | arbitex-dlp      | Pipeline stages, trigger rates, action distribution
Budget & Cost     | arbitex-budget   | Per-group spend, model cost breakdown, burn rate
Outpost Health    | arbitex-outpost  | Per-outpost request forwarding, latency, cert status
Model Performance | arbitex-models   | Provider latency P50/P95/P99, error rates by model
Security & Audit  | arbitex-security | Auth events, MFA usage, policy violations, SIEM events

Via Grafana UI:

  1. Navigate to Dashboards → Import
  2. Upload the JSON file or paste the dashboard JSON
  3. Set the data source to your Prometheus instance
  4. Click Import

Via Grafana API:

# Import a single dashboard
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -d @dashboards/arbitex-overview.json \
  http://localhost:3000/api/dashboards/import

# Bulk import all dashboards
for f in dashboards/*.json; do
  curl -s -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
    -d @"$f" \
    http://localhost:3000/api/dashboards/import
done

Via Helm (recommended for Kubernetes):

# values.yaml for kube-prometheus-stack
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: arbitex
          orgId: 1
          folder: Arbitex
          type: file
          options:
            path: /var/lib/grafana/dashboards/arbitex
  dashboardsConfigMaps:
    arbitex: arbitex-grafana-dashboards

Create the ConfigMap from the dashboard directory:

kubectl create configmap arbitex-grafana-dashboards \
  --from-file=dashboards/ \
  -n monitoring

Platform Overview panel queries:

# Request rate
rate(arbitex_http_requests_total[5m])
# P95 response time
histogram_quantile(0.95, rate(arbitex_http_request_duration_seconds_bucket[5m]))
# Error rate
rate(arbitex_http_requests_total{status=~"5.."}[5m])
/ rate(arbitex_http_requests_total[5m])

DLP Analytics panel queries:

# DLP trigger rate by action
rate(arbitex_dlp_triggers_total[5m])
# Actions by type (block/redact/allow/require_approval)
sum by (action) (rate(arbitex_dlp_triggers_total[5m]))
# NER entity detection rate
rate(arbitex_ner_entities_detected_total[5m])
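On large installations these panel queries can be precomputed with Prometheus recording rules so dashboards load from cheap pre-aggregated series. The rule names below follow the conventional level:metric:operations pattern but are my own suggestion, not names the platform defines:

```yaml
groups:
  - name: arbitex_overview_recording
    interval: 30s
    rules:
      # Precomputed request rate
      - record: job:arbitex_http_requests:rate5m
        expr: rate(arbitex_http_requests_total[5m])
      # Precomputed P95 response time
      - record: job:arbitex_http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            rate(arbitex_http_request_duration_seconds_bucket[5m]))
      # Precomputed error ratio
      - record: job:arbitex_http_errors:ratio_rate5m
        expr: |
          rate(arbitex_http_requests_total{status=~"5.."}[5m])
          / rate(arbitex_http_requests_total[5m])
```

Point the Grafana panels at the recorded series instead of the raw expressions.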

Alertmanager routes alerts to notification channels based on severity and label matchers.

alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: '${SMTP_PASSWORD}'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts → PagerDuty
    - match:
        severity: critical
      receiver: pagerduty-critical
      group_wait: 0s
      repeat_interval: 1h
      continue: true
    # Finance team alerts
    - match:
        team: finance
      receiver: slack-finance
      continue: true
    # ML team alerts
    - match:
        team: ml
      receiver: slack-ml
    # Warning alerts → Slack platform channel
    - match:
        severity: warning
      receiver: slack-platform

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'
        send_resolved: true
        severity: '{{ if eq .GroupLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          description: '{{ .CommonAnnotations.description }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'
        client: 'Arbitex Alertmanager'
        client_url: '{{ template "arbitex.alertmanagerURL" . }}'

Required secrets:

# Create Alertmanager secret in Kubernetes
kubectl create secret generic alertmanager-secrets \
  --from-literal=PAGERDUTY_ROUTING_KEY=your-routing-key \
  --from-literal=SMTP_PASSWORD=your-smtp-password \
  -n monitoring
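Note that a stock Alertmanager binary does not expand ${VAR} placeholders in its config file; the Prometheus Operator injects secrets for you, but outside Kubernetes you must render the template before starting the process. A minimal sketch using sed (the .tpl filename is illustrative):

```shell
# Stock Alertmanager does not expand ${VAR} placeholders in its config,
# so render secrets into the file before starting the process.
export SMTP_PASSWORD='s3cret'
printf "smtp_auth_password: '\${SMTP_PASSWORD}'\n" > alertmanager.yml.tpl
sed -e "s|\${SMTP_PASSWORD}|${SMTP_PASSWORD}|g" \
  alertmanager.yml.tpl > alertmanager.yml
cat alertmanager.yml   # -> smtp_auth_password: 's3cret'
```

Tools like envsubst work equally well; the point is that the rendered file, not the template, is what Alertmanager must load.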
receivers:
  - name: slack-platform
    slack_configs:
      - api_url: '${SLACK_PLATFORM_WEBHOOK_URL}'
        send_resolved: true
        channel: '#platform-alerts'
        title: '{{ template "slack.arbitex.title" . }}'
        text: '{{ template "slack.arbitex.text" . }}'
        color: '{{ if eq .Status "firing" }}{{ if eq .GroupLabels.severity "critical" }}danger{{ else }}warning{{ end }}{{ else }}good{{ end }}'
        actions:
          - type: button
            text: 'View in Grafana'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Silence Alert'
            url: '{{ template "arbitex.silenceURL" . }}'
  - name: slack-finance
    slack_configs:
      - api_url: '${SLACK_FINANCE_WEBHOOK_URL}'
        send_resolved: true
        channel: '#finance-alerts'
        title: 'Budget Alert: {{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: slack-ml
    slack_configs:
      - api_url: '${SLACK_ML_WEBHOOK_URL}'
        send_resolved: true
        channel: '#ml-alerts'

Slack alert template (/etc/alertmanager/templates/slack.tmpl):

{{ define "slack.arbitex.title" -}}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ .GroupLabels.alertname }}
{{- end }}
{{ define "slack.arbitex.text" -}}
{{ range .Alerts -}}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
{{ if .Annotations.runbook_url -}}
*Runbook:* {{ .Annotations.runbook_url }}
{{- end }}
{{ end -}}
{{- end }}
receivers:
  - name: email-oncall
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
        headers:
          subject: '[Arbitex Alert] {{ .GroupLabels.alertname }} - {{ .Status | toUpper }}'
        html: '{{ template "email.arbitex.html" . }}'
        require_tls: true
  - name: default-receiver
    email_configs:
      - to: 'platform-team@example.com'
        send_resolved: true

Define multi-tier escalation in Alertmanager routing with repeat_interval staggering.

Severity | Initial Notify    | Escalate After     | Escalate To
---------|-------------------|--------------------|----------------------------
info     | Slack channel     | n/a                | n/a
warning  | Slack channel     | 2 hours unresolved | On-call Slack DM
critical | PagerDuty + Slack | 30 min unresolved  | PagerDuty escalation policy
route:
  routes:
    # Critical → PagerDuty immediately
    - match:
        severity: critical
      receiver: pagerduty-critical
      group_wait: 0s
      repeat_interval: 30m
    # Warning → Slack, escalate via inhibit rule after 2h
    - match:
        severity: warning
      receiver: slack-platform
      group_wait: 1m
      repeat_interval: 2h

# Inhibit rules — suppress warning if critical fires for same alert
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal:
      - alertname
      - instance

In PagerDuty, create an escalation policy attached to the Arbitex service:

  1. Level 1 (0 min): On-call engineer (15-min acknowledgement window)
  2. Level 2 (15 min): Secondary on-call + team lead
  3. Level 3 (30 min): Engineering manager + PagerDuty incident declared

Configure the service in PagerDuty:

  • Escalation policy: Arbitex Platform On-Call
  • Alert grouping: By alert name + instance
  • Auto-resolve: After Alertmanager sends resolved notification
Custom alert rules follow the standard Prometheus structure. Use this skeleton as a starting point:

groups:
  - name: custom_alerts
    interval: 60s
    rules:
      - alert: MyCustomAlert
        # PromQL expression; the alert fires for each time series the expression returns
        expr: <your_metric_expression> <comparison_operator> <threshold>
        # Duration the condition must be true before firing
        for: <duration>
        labels:
          severity: warning|critical
          team: platform|ml|finance
          # Add any custom labels for routing
        annotations:
          summary: "<one-line summary>"
          description: >
            <Detailed description with {{ $value }} for the metric value
            and {{ $labels.instance }} for label references.>
          runbook_url: "<URL to runbook>"

Custom Metric Example: Per-Model Error Rate

- alert: ModelHighErrorRate
  expr: |
    rate(arbitex_model_requests_total{status="error"}[5m])
    /
    rate(arbitex_model_requests_total[5m]) > 0.05
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Model {{ $labels.model_id }} error rate above 5%"
    description: >
      Model {{ $labels.model_id }} (provider: {{ $labels.provider }})
      has an error rate of {{ $value | humanizePercentage }} over the
      last 5 minutes.

Before deploying, validate rule syntax using promtool:

# promtool ships with the Prometheus release tarball
promtool check rules /etc/prometheus/rules/*.yml

# Expected output for valid rules:
#   Checking /etc/prometheus/rules/dlp_alerts.yml
#   SUCCESS: 8 rules found

# Hot-reload Prometheus after updating rules (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
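Beyond syntax checking, promtool can unit-test alert behavior against synthetic series. A sketch for DLPHighLatencyP95 follows; the test filename, instance label, and sample values are illustrative. All observations land in the (0.5, 1] bucket, so the interpolated P95 sits well above the 0.5s threshold and the alert should fire after its 5m hold:

```yaml
# dlp_alert_tests.yml — run with: promtool test rules dlp_alert_tests.yml
rule_files:
  - dlp_alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'arbitex_dlp_pipeline_duration_seconds_bucket{le="0.5", instance="p0"}'
        values: '0x15'
      - series: 'arbitex_dlp_pipeline_duration_seconds_bucket{le="1.0", instance="p0"}'
        values: '0+10x15'
      - series: 'arbitex_dlp_pipeline_duration_seconds_bucket{le="+Inf", instance="p0"}'
        values: '0+10x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: DLPHighLatencyP95
        exp_alerts:
          - exp_labels:
              severity: warning
              team: platform
              instance: p0
```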

Temporarily silence alerts during maintenance windows:

# Create a silence via the Alertmanager API
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "DLPHighLatencyP95", "isRegex": false}
    ],
    "startsAt": "2026-03-12T00:00:00Z",
    "endsAt": "2026-03-12T04:00:00Z",
    "createdBy": "admin",
    "comment": "Scheduled maintenance window"
  }' \
  http://localhost:9093/api/v2/silences
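Rather than hardcoding timestamps, the window can be computed at silence-creation time. A sketch assuming GNU date (the -d flag is not portable to BSD/macOS date):

```shell
# Build an RFC 3339 UTC window for a 4-hour maintenance silence
STARTS_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)
ENDS_AT=$(date -u -d '+4 hours' +%Y-%m-%dT%H:%M:%SZ)
echo "startsAt=$STARTS_AT endsAt=$ENDS_AT"
```

Substitute $STARTS_AT and $ENDS_AT into the startsAt/endsAt fields of the JSON payload.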

When DLPHighLatencyP95 fires:

  1. Check DLP service pod status: kubectl get pods -l app=arbitex-dlp -n arbitex
  2. Check NER service health: kubectl logs -l app=arbitex-ner -n arbitex --tail=50
  3. Check DeBERTa worker queue: kubectl exec -it deploy/arbitex-platform -- python -m arbitex.tools.queue_status
  4. If NER pod is crash-looping, scale the NER deployment: kubectl scale deploy/arbitex-ner --replicas=3 -n arbitex
  5. If DeBERTa is the bottleneck, scale workers: kubectl scale deploy/arbitex-deberta-worker --replicas=2 -n arbitex

When OutpostMTLSCertCritical fires:

# Rotate the outpost mTLS certificate
curl -X PUT \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://api.arbitex.example.com/api/admin/outposts/$OUTPOST_ID/rotate-cert

# Verify the new cert expiry
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://api.arbitex.example.com/api/admin/outposts/$OUTPOST_ID \
  | jq '.cert_expiry'