# Alert Configuration & Monitoring
This guide covers the full observability alerting stack for Arbitex AI Gateway: Prometheus alert rules, Grafana dashboard setup, notification channel configuration (PagerDuty, Slack, email), escalation policies, and custom alert creation.
## Overview

The Arbitex platform exposes a Prometheus `/metrics` endpoint on port 9090 (platform) and port 9091 (outpost). Alerts are evaluated by Prometheus and routed by Alertmanager to notification channels based on severity and team ownership.
Key alert domains:
- DLP pipeline — latency, error rates, and NER/DeBERTa service health
- Budget enforcement — per-group spend thresholds and overage events
- TLS certificate expiry — platform and outpost mTLS certs
- Infrastructure — database connections, Redis availability, queue depth
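To make the two endpoints concrete, a minimal Prometheus scrape configuration might look like the following sketch. The job names and target hostnames are illustrative assumptions, not shipped defaults; adjust them to your deployment.

```yaml
scrape_configs:
  - job_name: arbitex-platform
    static_configs:
      - targets: ['platform:9090']   # platform /metrics endpoint
  - job_name: arbitex-outpost
    static_configs:
      - targets: ['outpost:9091']    # outpost /metrics endpoint
```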
## Prometheus Alert Rules

Alert rules are defined in YAML and loaded by Prometheus via the `rule_files` configuration key. Platform-0047 ships four production rule groups.
### Installing Alert Rule Files
Section titled “Installing Alert Rule Files”Mount alert rule files into your Prometheus container:
```yaml
prometheus:
  image: prom/prometheus:v2.51.0
  volumes:
    - ./prometheus/rules:/etc/prometheus/rules
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--web.enable-lifecycle'
```

Reference the rules from `prometheus.yml`:
```yaml
rule_files:
  - /etc/prometheus/rules/dlp_alerts.yml
  - /etc/prometheus/rules/budget_alerts.yml
  - /etc/prometheus/rules/cert_alerts.yml
  - /etc/prometheus/rules/infra_alerts.yml
```

For Kubernetes deployments, store rules in a ConfigMap:
```sh
kubectl create configmap prometheus-rules \
  --from-file=rules/ \
  -n monitoring
```

Reference the ConfigMap in your Prometheus Operator PrometheusRule CRD or mount it as a volume.
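For Prometheus Operator users, the same rules can be delivered as a `PrometheusRule` resource instead of a raw ConfigMap. A minimal sketch with one rule from this guide; the `release` label must match your operator's `ruleSelector`, which varies per installation:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: arbitex-dlp-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumption: matches your ruleSelector
spec:
  groups:
    - name: dlp_latency
      rules:
        - alert: DLPHighLatencyP95
          expr: histogram_quantile(0.95, rate(arbitex_dlp_pipeline_duration_seconds_bucket[5m])) > 0.5
          for: 5m
          labels:
            severity: warning
```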
### DLP Latency Alerts (dlp_alerts.yml)
```yaml
groups:
  - name: dlp_latency
    interval: 30s
    rules:
      # P95 DLP pipeline latency exceeds 500ms over 5 minutes
      - alert: DLPHighLatencyP95
        expr: |
          histogram_quantile(0.95,
            rate(arbitex_dlp_pipeline_duration_seconds_bucket[5m])
          ) > 0.5
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "DLP P95 latency above 500ms"
          description: >
            DLP pipeline P95 latency is {{ $value | humanizeDuration }} on
            instance {{ $labels.instance }}. Check NER service and DeBERTa
            worker health.
          runbook_url: "https://docs.arbitex.io/runbooks/dlp-latency"

      # P99 DLP pipeline latency exceeds 2s — critical escalation
      - alert: DLPCriticalLatencyP99
        expr: |
          histogram_quantile(0.99,
            rate(arbitex_dlp_pipeline_duration_seconds_bucket[5m])
          ) > 2.0
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "DLP P99 latency critical (>2s)"
          description: >
            DLP P99 latency is {{ $value | humanizeDuration }}. Immediate
            investigation required — requests may be timing out.

      # DeBERTa worker queue depth
      - alert: DeBERTaQueueDepthHigh
        expr: arbitex_deberta_queue_depth > 50
        for: 3m
        labels:
          severity: warning
          team: ml
        annotations:
          summary: "DeBERTa worker queue backing up"
          description: >
            DeBERTa inference queue depth is {{ $value }} items on
            {{ $labels.instance }}. Consider scaling worker replicas.
```
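The P95/P99 rules above rely on `histogram_quantile`, which estimates a quantile by linear interpolation inside the bucket where the target rank falls. A simplified Python sketch of that estimation (the bucket bounds and counts are made-up sample data, and Prometheus's real implementation handles more edge cases):

```python
# Simplified model of histogram_quantile: find the cumulative bucket
# containing rank q * total, then linearly interpolate within it.

def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted ascending,
    ending with the +Inf bucket that holds the total observation count."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into +Inf
            # linear interpolation inside this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Sample data: 500 requests, most under 250ms
buckets = [(0.1, 300), (0.25, 450), (0.5, 480), (1.0, 498), (float("inf"), 500)]
p95 = histogram_quantile(0.95, buckets)
print(f"estimated P95 = {p95:.3f}s; fires DLPHighLatencyP95: {p95 > 0.5}")
```

With this sample distribution the estimate lands around 0.46s, just under the 0.5s threshold, which illustrates why the alert compares the interpolated quantile rather than a raw bucket bound.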
### DLP Error Rate Alerts

```yaml
groups:
  - name: dlp_errors
    interval: 30s
    rules:
      # DLP pipeline error rate > 1% over 5 minutes
      - alert: DLPHighErrorRate
        expr: |
          rate(arbitex_dlp_pipeline_errors_total[5m])
            / rate(arbitex_dlp_pipeline_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "DLP error rate above 1%"
          description: >
            DLP error rate is {{ $value | humanizePercentage }} on
            {{ $labels.instance }}. Check logs for pattern match failures
            or NER service timeouts.

      # Complete DLP pipeline failure (error rate > 10%)
      - alert: DLPPipelineFailure
        expr: |
          rate(arbitex_dlp_pipeline_errors_total[5m])
            / rate(arbitex_dlp_pipeline_requests_total[5m]) > 0.10
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "DLP pipeline near-total failure"
          description: >
            DLP error rate is {{ $value | humanizePercentage }}. Platform
            may be serving requests without DLP protection. Check DLP
            service pod status immediately.

      # NER service unavailable
      - alert: NERServiceDown
        expr: up{job="arbitex-ner"} == 0
        for: 1m
        labels:
          severity: critical
          team: ml
        annotations:
          summary: "NER service is down"
          description: "NER service on {{ $labels.instance }} has been down for 1 minute."
```
### Budget Threshold Alerts (budget_alerts.yml)

```yaml
groups:
  - name: budget_enforcement
    interval: 60s
    rules:
      # Group approaching budget limit (80% of monthly)
      - alert: GroupBudgetWarning
        expr: |
          arbitex_group_spend_current_month_usd
            / arbitex_group_budget_limit_usd > 0.80
        for: 5m
        labels:
          severity: warning
          team: finance
        annotations:
          summary: "Group {{ $labels.group_id }} at 80% budget"
          description: >
            Group {{ $labels.group_id }} ({{ $labels.group_name }}) has
            consumed {{ $value | humanizePercentage }} of its monthly
            budget (${{ $labels.limit_usd }}).

      # Group exceeded budget
      - alert: GroupBudgetExceeded
        expr: |
          arbitex_group_spend_current_month_usd
            / arbitex_group_budget_limit_usd >= 1.0
        for: 0m
        labels:
          severity: critical
          team: finance
        annotations:
          summary: "Group {{ $labels.group_id }} budget exceeded"
          description: >
            Group {{ $labels.group_id }} has exceeded its monthly budget.
            Traffic may be throttled per enforcement policy.

      # Platform-wide spend spike (>200% of 7-day rolling average)
      - alert: PlatformSpendSpike
        expr: |
          rate(arbitex_platform_spend_usd_total[1h])
            / rate(arbitex_platform_spend_usd_total[7d] offset 1h) > 2.0
        for: 15m
        labels:
          severity: warning
          team: finance
        annotations:
          summary: "Platform spend rate anomaly detected"
          description: >
            Hourly spend rate is {{ $value }}x the 7-day average. Possible
            cost runaway event — check active groups.
```
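The two group-level rules reduce to a single ratio check: spend over limit crossing 0.80 and then 1.0. A small Python sketch of that threshold logic (figures are illustrative; the real enforcement is done by the PromQL above):

```python
# Mirror of the GroupBudgetWarning / GroupBudgetExceeded thresholds.

def budget_alert(spend_usd, limit_usd):
    ratio = spend_usd / limit_usd
    if ratio >= 1.0:
        return "GroupBudgetExceeded"   # critical, fires immediately (for: 0m)
    if ratio > 0.80:
        return "GroupBudgetWarning"    # warning after 5m sustained
    return None

print(budget_alert(850.0, 1000.0))   # 85% of budget -> GroupBudgetWarning
print(budget_alert(1200.0, 1000.0))  # over budget   -> GroupBudgetExceeded
```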
### Certificate Expiry Alerts (cert_alerts.yml)

```yaml
groups:
  - name: cert_expiry
    interval: 6h
    rules:
      # mTLS cert expiring within 30 days
      - alert: OutpostMTLSCertExpiringSoon
        expr: |
          (arbitex_outpost_mtls_cert_expiry_timestamp_seconds - time())
            / 86400 < 30
        for: 0m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Outpost mTLS cert expires in < 30 days"
          description: >
            Outpost {{ $labels.outpost_id }} mTLS certificate expires in
            {{ $value | humanizeDuration }}. Rotate via
            PUT /api/admin/outposts/{id}/rotate-cert.
          runbook_url: "https://docs.arbitex.io/runbooks/cert-rotation"

      # mTLS cert expiring within 7 days — critical
      - alert: OutpostMTLSCertCritical
        expr: |
          (arbitex_outpost_mtls_cert_expiry_timestamp_seconds - time())
            / 86400 < 7
        for: 0m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Outpost mTLS cert expires in < 7 days"
          description: >
            URGENT: Outpost {{ $labels.outpost_id }} mTLS certificate
            expires in {{ $value | humanizeDuration }}.

      # Platform TLS cert (API gateway)
      - alert: PlatformTLSCertExpiringSoon
        expr: |
          (arbitex_platform_tls_cert_expiry_timestamp_seconds - time())
            / 86400 < 30
        for: 0m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Platform TLS cert expires in < 30 days"
          description: >
            Platform API TLS certificate expires in
            {{ $value | humanizeDuration }}. Renew via your certificate
            authority or ACME provider.
```
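All three expressions share one shape: `(expiry_timestamp - time()) / 86400` compared against a day threshold. A Python sketch of that arithmetic, useful for sanity-checking a given expiry timestamp by hand (timestamps below are sample values):

```python
import time

# Mirror of the cert-expiry expressions: days remaining until expiry,
# compared against the 7- and 30-day thresholds.

def cert_alert(expiry_ts, now=None):
    days_left = (expiry_ts - (now if now is not None else time.time())) / 86400
    if days_left < 7:
        return "OutpostMTLSCertCritical"
    if days_left < 30:
        return "OutpostMTLSCertExpiringSoon"
    return None

now = 1_700_000_000  # fixed sample "current time" for reproducibility
print(cert_alert(now + 5 * 86400, now))   # 5 days left  -> critical
print(cert_alert(now + 20 * 86400, now))  # 20 days left -> warning
print(cert_alert(now + 90 * 86400, now))  # 90 days left -> no alert
```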
### Infrastructure Alerts (infra_alerts.yml)

```yaml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      # PostgreSQL connection pool exhaustion
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          arbitex_db_pool_available_connections
            / arbitex_db_pool_max_connections < 0.10
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Database connection pool below 10%"
          description: >
            Only {{ $value | humanizePercentage }} of DB connections are
            available on {{ $labels.instance }}.

      # Redis unavailable
      - alert: RedisUnavailable
        expr: up{job="arbitex-redis"} == 0
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Redis is unreachable"
          description: >
            Redis on {{ $labels.instance }} has been unreachable for
            1 minute. Auth sessions and rate limiting may be degraded.

      # Request queue depth
      - alert: RequestQueueDepthHigh
        expr: arbitex_request_queue_depth > 1000
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Request queue depth high"
          description: >
            Request queue depth is {{ $value }} items. Consider scaling
            platform replicas.
```
## Grafana Dashboard Setup

The platform ships six pre-built Grafana dashboards (platform-0046). Import them via the Grafana API or UI.
### Dashboard Inventory

| Dashboard | UID | Description |
|---|---|---|
| Platform Overview | arbitex-overview | Request rate, latency, error rate, active connections |
| DLP Analytics | arbitex-dlp | Pipeline stages, trigger rates, action distribution |
| Budget & Cost | arbitex-budget | Per-group spend, model cost breakdown, burn rate |
| Outpost Health | arbitex-outpost | Per-outpost request forwarding, latency, cert status |
| Model Performance | arbitex-models | Provider latency P50/P95/P99, error rates by model |
| Security & Audit | arbitex-security | Auth events, MFA usage, policy violations, SIEM events |
### Importing Dashboards

Via Grafana UI:
- Navigate to Dashboards → Import
- Upload the JSON file or paste the dashboard JSON
- Set the data source to your Prometheus instance
- Click Import
Via Grafana API:
```sh
# Import a single dashboard
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -d @dashboards/arbitex-overview.json \
  http://localhost:3000/api/dashboards/import
```
```sh
# Bulk import all dashboards
for f in dashboards/*.json; do
  curl -s -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
    -d @"$f" \
    http://localhost:3000/api/dashboards/import
done
```

Via Helm (recommended for Kubernetes):
```yaml
# values.yaml for kube-prometheus-stack
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: arbitex
          orgId: 1
          folder: Arbitex
          type: file
          options:
            path: /var/lib/grafana/dashboards/arbitex
  dashboardsConfigMaps:
    arbitex: arbitex-grafana-dashboards
```

Create the ConfigMap from the dashboard directory:
```sh
kubectl create configmap arbitex-grafana-dashboards \
  --from-file=dashboards/ \
  -n monitoring
```
### Key Dashboard Panels

Platform Overview panel queries:
```promql
# Request rate
rate(arbitex_http_requests_total[5m])

# P95 response time
histogram_quantile(0.95, rate(arbitex_http_request_duration_seconds_bucket[5m]))

# Error rate
rate(arbitex_http_requests_total{status=~"5.."}[5m])
  / rate(arbitex_http_requests_total[5m])
```

DLP Analytics panel queries:
```promql
# DLP trigger rate by action
rate(arbitex_dlp_triggers_total[5m])

# Actions by type (block/redact/allow/require_approval)
sum by (action) (rate(arbitex_dlp_triggers_total[5m]))

# NER entity detection rate
rate(arbitex_ner_entities_detected_total[5m])
```
## Notification Channels

Alertmanager routes alerts to notification channels based on severity and label matchers.
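The routing behavior configured below hinges on first-match evaluation and the `continue` flag: a matching route claims the alert, and `continue: true` lets later sibling routes also fire. A toy Python model of that logic (greatly simplified; real Alertmanager walks a nested route tree with grouping and timers):

```python
# Toy first-match routing with `continue`, mirroring the route list
# configured in this section.

ROUTES = [
    # (label matchers, receiver, continue)
    ({"severity": "critical"}, "pagerduty-critical", True),
    ({"team": "finance"}, "slack-finance", True),
    ({"team": "ml"}, "slack-ml", False),
    ({"severity": "warning"}, "slack-platform", False),
]

def route(labels, routes=ROUTES, default="default-receiver"):
    receivers = []
    for matchers, receiver, cont in routes:
        if all(labels.get(k) == v for k, v in matchers.items()):
            receivers.append(receiver)
            if not cont:
                break
    return receivers or [default]

# A critical finance alert both pages and posts to the finance channel.
print(route({"severity": "critical", "team": "finance"}))
# → ['pagerduty-critical', 'slack-finance']
```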
### Alertmanager Base Configuration

```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: '${SMTP_PASSWORD}'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts → PagerDuty
    - match:
        severity: critical
      receiver: pagerduty-critical
      group_wait: 0s
      repeat_interval: 1h
      continue: true

    # Finance team alerts
    - match:
        team: finance
      receiver: slack-finance
      continue: true

    # ML team alerts
    - match:
        team: ml
      receiver: slack-ml

    # Warning alerts → Slack platform channel
    - match:
        severity: warning
      receiver: slack-platform
```
### PagerDuty Configuration

```yaml
receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'
        send_resolved: true
        severity: '{{ if eq .GroupLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          description: '{{ .CommonAnnotations.description }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'
        client: 'Arbitex Alertmanager'
        client_url: '{{ template "arbitex.alertmanagerURL" . }}'
```

Required secrets:
```sh
# Create Alertmanager secret in Kubernetes
kubectl create secret generic alertmanager-secrets \
  --from-literal=PAGERDUTY_ROUTING_KEY=your-routing-key \
  --from-literal=SMTP_PASSWORD=your-smtp-password \
  -n monitoring
```
### Slack Configuration

```yaml
receivers:
  - name: slack-platform
    slack_configs:
      - api_url: '${SLACK_PLATFORM_WEBHOOK_URL}'
        send_resolved: true
        channel: '#platform-alerts'
        title: '{{ template "slack.arbitex.title" . }}'
        text: '{{ template "slack.arbitex.text" . }}'
        color: '{{ if eq .Status "firing" }}{{ if eq .GroupLabels.severity "critical" }}danger{{ else }}warning{{ end }}{{ else }}good{{ end }}'
        actions:
          - type: button
            text: 'View in Grafana'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Silence Alert'
            url: '{{ template "arbitex.silenceURL" . }}'

  - name: slack-finance
    slack_configs:
      - api_url: '${SLACK_FINANCE_WEBHOOK_URL}'
        send_resolved: true
        channel: '#finance-alerts'
        title: 'Budget Alert: {{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: slack-ml
    slack_configs:
      - api_url: '${SLACK_ML_WEBHOOK_URL}'
        send_resolved: true
        channel: '#ml-alerts'
```

Slack alert template (`/etc/alertmanager/templates/slack.tmpl`):
```
{{ define "slack.arbitex.title" -}}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{- end }}

{{ define "slack.arbitex.text" -}}
{{ range .Alerts -}}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
{{ if .Annotations.runbook_url -}}
*Runbook:* {{ .Annotations.runbook_url }}
{{- end }}
{{ end -}}
{{- end }}
```
### Email Configuration

```yaml
receivers:
  - name: email-oncall
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
        headers:
          subject: '[Arbitex Alert] {{ .GroupLabels.alertname }} - {{ .Status | toUpper }}'
        html: '{{ template "email.arbitex.html" . }}'
        require_tls: true

  - name: default-receiver
    email_configs:
      - to: 'platform-team@example.com'
        send_resolved: true
```
## Escalation Policies

Define multi-tier escalation in Alertmanager routing with `repeat_interval` staggering.
### Standard Escalation Matrix

| Severity | Initial Notify | Escalate After | Escalate To |
|---|---|---|---|
| info | Slack channel | — | — |
| warning | Slack channel | 2 hours unresolved | On-call Slack DM |
| critical | PagerDuty + Slack | 30 min unresolved | PagerDuty escalation policy |
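If you need the matrix above in machine-readable form (for example, to drive a paging bot or to test that routing matches policy), it can be encoded as data. A sketch with hypothetical channel names of our own choosing:

```python
# The escalation matrix as data. Channel identifiers are illustrative.

ESCALATION = {
    "info":     {"initial": "slack", "escalate_after_min": None, "escalate_to": None},
    "warning":  {"initial": "slack", "escalate_after_min": 120,  "escalate_to": "oncall-dm"},
    "critical": {"initial": "pagerduty+slack", "escalate_after_min": 30,
                 "escalate_to": "pagerduty-escalation"},
}

def next_step(severity, minutes_unresolved):
    """Return where a still-unresolved alert should be sent next."""
    policy = ESCALATION[severity]
    after = policy["escalate_after_min"]
    if after is not None and minutes_unresolved >= after:
        return policy["escalate_to"]
    return policy["initial"]

print(next_step("critical", 45))  # past 30 min -> escalation target
```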
### Alertmanager Escalation Routing
Section titled “Alertmanager Escalation Routing”route: routes: # Critical → PagerDuty immediately - match: severity: critical receiver: pagerduty-critical group_wait: 0s repeat_interval: 30m
# Warning → Slack, escalate via inhibit rule after 2h - match: severity: warning receiver: slack-platform group_wait: 1m repeat_interval: 2h
# Inhibit rules — suppress warning if critical fires for same alertinhibit_rules: - source_match: severity: critical target_match: severity: warning equal: - alertname - instancePagerDuty Escalation Policy
### PagerDuty Escalation Policy

In PagerDuty, create an escalation policy attached to the Arbitex service:
- Level 1 (0 min): On-call engineer (15-min acknowledgement window)
- Level 2 (15 min): Secondary on-call + team lead
- Level 3 (30 min): Engineering manager + PagerDuty incident declared
Configure the service in PagerDuty:
- Escalation policy: Arbitex Platform On-Call
- Alert grouping: by alert name + instance
- Auto-resolve: after Alertmanager sends a resolved notification
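The inhibit rule configured above has simple semantics: a firing critical alert mutes a warning alert that carries the same `alertname` and `instance` labels. A Python sketch of that check (label dicts are sample data):

```python
# Mirror of the inhibit rule: critical source mutes a warning target
# when the `equal` labels match.

def is_inhibited(target, firing_alerts, equal=("alertname", "instance")):
    if target.get("severity") != "warning":
        return False
    return any(
        src.get("severity") == "critical"
        and all(src.get(k) == target.get(k) for k in equal)
        for src in firing_alerts
    )

firing = [{"alertname": "DLPHighLatencyP95", "instance": "platform-0",
           "severity": "critical"}]
print(is_inhibited({"alertname": "DLPHighLatencyP95", "instance": "platform-0",
                    "severity": "warning"}, firing))
```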
## Creating Custom Alerts

### Custom Alert Template

```yaml
groups:
  - name: custom_alerts
    interval: 60s
    rules:
      - alert: MyCustomAlert
        # PromQL expression — the alert fires for each series it returns
        expr: <your_metric_expression> <comparison_operator> <threshold>
        # Duration the condition must be true before firing
        for: <duration>
        labels:
          severity: warning|critical
          team: platform|ml|finance
          # Add any custom labels for routing
        annotations:
          summary: "<one-line summary>"
          description: >
            <Detailed description with {{ $value }} for the metric value
            and {{ $labels.instance }} for label references.>
          runbook_url: "<URL to runbook>"
```
### Custom Metric Example: Per-Model Error Rate

```yaml
- alert: ModelHighErrorRate
  # Aggregate both sides so the status="error" subset divides cleanly
  # against the all-statuses total for each model/provider pair
  expr: |
    sum by (model_id, provider)
      (rate(arbitex_model_requests_total{status="error"}[5m]))
      / sum by (model_id, provider)
        (rate(arbitex_model_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Model {{ $labels.model_id }} error rate above 5%"
    description: >
      Model {{ $labels.model_id }} (provider: {{ $labels.provider }})
      has an error rate of {{ $value | humanizePercentage }} over the
      last 5 minutes.
```
### Validating Alert Rules

Before deploying, validate rule syntax using promtool:
```sh
# promtool is bundled with the Prometheus binary
promtool check rules /etc/prometheus/rules/*.yml

# Expected output for valid rules:
#   Checking /etc/prometheus/rules/dlp_alerts.yml
#   SUCCESS: 8 rules found
```
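Beyond syntax checks, promtool can unit-test alerting behavior against synthetic series. A sketch for the `NERServiceDown` rule, assuming `dlp_alerts.yml` sits next to the test file (the instance name is illustrative); run it with `promtool test rules test_rules.yml`:

```yaml
# test_rules.yml
rule_files:
  - dlp_alerts.yml

evaluation_interval: 30s

tests:
  - interval: 30s
    input_series:
      # up == 0 for the whole window: the NER target is down
      - series: 'up{job="arbitex-ner", instance="ner-0:9100"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 2m
        alertname: NERServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: ml
              job: arbitex-ner
              instance: ner-0:9100
            exp_annotations:
              summary: "NER service is down"
              description: "NER service on ner-0:9100 has been down for 1 minute."
```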
```sh
# Hot-reload Prometheus after updating rules
# (requires the --web.enable-lifecycle flag set earlier)
curl -X POST http://localhost:9090/-/reload
```
### Alert Silencing

Temporarily silence alerts during maintenance windows:
```sh
# Create a silence via the Alertmanager API
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "DLPHighLatencyP95", "isRegex": false}
    ],
    "startsAt": "2026-03-12T00:00:00Z",
    "endsAt": "2026-03-12T04:00:00Z",
    "createdBy": "admin",
    "comment": "Scheduled maintenance window"
  }' \
  http://localhost:9093/api/v2/silences
```
## Operational Runbooks

### DLP Latency Runbook
When `DLPHighLatencyP95` fires:
- Check DLP service pod status: `kubectl get pods -l app=arbitex-dlp -n arbitex`
- Check NER service health: `kubectl logs -l app=arbitex-ner -n arbitex --tail=50`
- Check DeBERTa worker queue: `kubectl exec -it deploy/arbitex-platform -- python -m arbitex.tools.queue_status`
- If the NER pod is crash-looping, scale the NER deployment: `kubectl scale deploy/arbitex-ner --replicas=3 -n arbitex`
- If DeBERTa is the bottleneck, scale workers: `kubectl scale deploy/arbitex-deberta-worker --replicas=2 -n arbitex`
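After scaling, you can confirm that P95 has recovered by querying Prometheus's standard instant-query API (`/api/v1/query`). A sketch that builds the query URL and parses the response shape offline; the base URL and the sample payload are illustrative:

```python
import json
from urllib.parse import urlencode

# The same P95 expression used by the DLPHighLatencyP95 rule.
PROMQL = ('histogram_quantile(0.95, '
          'rate(arbitex_dlp_pipeline_duration_seconds_bucket[5m]))')

def query_url(base="http://localhost:9090"):
    """Build the instant-query URL for the P95 expression."""
    return f"{base}/api/v1/query?" + urlencode({"query": PROMQL})

def parse_p95(body):
    """Parse an instant-vector response into {instance: p95_seconds}."""
    data = json.loads(body)
    return {
        str(r["metric"].get("instance", "")): float(r["value"][1])
        for r in data["data"]["result"]
    }

# Canned sample response in the standard /api/v1/query vector format
sample = ('{"status":"success","data":{"resultType":"vector","result":'
          '[{"metric":{"instance":"platform-0:9090"},'
          '"value":[1710000000,"0.212"]}]}}')
p95s = parse_p95(sample)
print(p95s)  # a P95 of 0.212s is back under the 0.5s threshold
```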
### Certificate Rotation Runbook
When `OutpostMTLSCertCritical` fires:
```sh
# Rotate the outpost mTLS certificate
curl -X PUT \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://api.arbitex.example.com/api/admin/outposts/$OUTPOST_ID/rotate-cert
```
```sh
# Verify the new cert expiry
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://api.arbitex.example.com/api/admin/outposts/$OUTPOST_ID \
  | jq '.cert_expiry'
```