
Prometheus alerting rules reference

Arbitex ships eight Prometheus alerting rules in `deploy/prometheus/alert-rules.yml` under the group `arbitex.platform`. Load this file into your Prometheus instance using the `rule_files` directive or a Kubernetes `PrometheusRule` CRD.


Add to `prometheus.yml`:

```yaml
rule_files:
  - /etc/prometheus/rules/alert-rules.yml
```

Copy `deploy/prometheus/alert-rules.yml` to `/etc/prometheus/rules/` on your Prometheus host and reload (requires Prometheus to be running with `--web.enable-lifecycle`):

```shell
curl -X POST http://localhost:9090/-/reload
```

Create a `PrometheusRule` resource:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: arbitex-platform
  namespace: monitoring
  labels:
    prometheus: kube-prometheus # match your Prometheus selector
spec:
  groups:
    # paste the contents of alert-rules.yml groups array here
```
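For orientation, a single rule inside the `groups` array has the following shape. This sketch reconstructs the `DLPScanLatencyHigh` rule from the fields documented on this page; the `annotations` block is illustrative, and the shipped `alert-rules.yml` is authoritative:

```yaml
groups:
  - name: arbitex.platform
    rules:
      - alert: DLPScanLatencyHigh
        expr: histogram_quantile(0.95, sum(rate(dlp_scan_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 DLP scan duration above 500 ms" # illustrative annotation
```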

### DLPScanLatencyHigh

| Field | Value |
| --- | --- |
| Severity | `warning` |
| Fires when | p95 DLP scan duration > 500 ms for 5 consecutive minutes |
| Expression | `histogram_quantile(0.95, sum(rate(dlp_scan_duration_seconds_bucket[5m])) by (le)) > 0.5` |
| Pending duration | `for: 5m` |

The 95th-percentile DLP scan duration has exceeded 500 ms. This typically indicates NER microservice (GLiNER) saturation or GPU resource pressure.

Recommended actions:

  1. Check the NER microservice pod health: `kubectl get pods -n arbitex | grep ner`
  2. Review GPU utilization on NER nodes (check DCGM exporter metrics or cloud GPU dashboards)
  3. If running Tier 3 (DeBERTa NLI) scans, evaluate whether the DeBERTa threshold can be raised to reduce Tier 3 invocations
  4. Consider horizontal scaling of the NER deployment
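The p95 value in the expression is not a stored metric; `histogram_quantile` interpolates it from cumulative histogram buckets. A minimal Python sketch of that interpolation, using hypothetical bucket data (Prometheus's real implementation handles further edge cases, such as non-monotonic buckets):

```python
def histogram_quantile(q, buckets):
    """Approximate Prometheus's histogram_quantile().

    buckets: sorted list of (upper_bound, cumulative_count),
    ending with (float("inf"), total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # quantile falls in the +Inf bucket: return the last finite bound
                return prev_bound
            # linear interpolation inside the bucket holding the target rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# hypothetical cumulative bucket counts for dlp_scan_duration_seconds over 5m
buckets = [(0.1, 50.0), (0.25, 80.0), (0.5, 90.0), (1.0, 100.0), (float("inf"), 100.0)]
p95 = histogram_quantile(0.95, buckets)
```

With these numbers the target rank lands in the 0.5–1.0 s bucket and interpolates to 0.75 s, above the 0.5 s threshold, so the alert would enter pending.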

### ProviderErrorRateHigh

| Field | Value |
| --- | --- |
| Severity | `critical` |
| Fires when | Error rate for any single provider > 5% over 5 minutes |
| Expression | `(sum(rate(provider_errors_total[5m])) by (provider) / sum(rate(provider_requests_total[5m])) by (provider)) > 0.05` |
| Pending duration | `for: 5m` |
| Labels | `provider` identifies the affected provider |

A specific LLM provider is returning errors on more than 5% of requests. The provider label in the alert identifies which provider is affected (e.g., anthropic, openai, azure_openai).

Recommended actions:

  1. Check the provider’s status page for incidents
  2. Verify the provider API key is valid and has not expired
  3. Review circuit-breaker state in the Arbitex admin UI (Admin → Routing)
  4. If the error rate is sustained, enable provider failover routing to an alternative provider
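The PromQL expression divides per-provider error and request rates label-by-label. The same computation as a Python sketch, with hypothetical counter increases over the 5-minute window:

```python
def provider_error_rates(errors, requests, threshold=0.05):
    """Return providers whose error ratio over the window exceeds threshold.

    errors / requests: dicts of per-provider counter increases over 5m.
    """
    firing = {}
    for provider, total in requests.items():
        if total == 0:
            continue  # no traffic: the PromQL ratio yields no sample, so no alert
        ratio = errors.get(provider, 0) / total
        if ratio > threshold:
            firing[provider] = ratio
    return firing

# hypothetical window deltas
errors = {"anthropic": 12, "openai": 1}
requests = {"anthropic": 150, "openai": 900}
firing = provider_error_rates(errors, requests)  # anthropic at 8% exceeds 5%
```

Note the zero-traffic guard: in PromQL, dividing by a zero rate produces no finite sample, so an idle provider never fires this alert.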

### BudgetThreshold80

| Field | Value |
| --- | --- |
| Severity | `warning` |
| Fires when | Any organization's budget utilization ratio > 0.80 (80%), fires immediately |
| Expression | `budget_utilization_ratio > 0.80` |
| Pending duration | `for: 0m` (immediate) |
| Labels | `org_id` identifies the affected organization |

An organization has consumed more than 80% of its current billing-period budget.

Recommended actions:

  1. Review usage trends in the Usage & Billing Grafana dashboard
  2. Contact the tenant to discuss usage patterns or raise the budget limit
  3. Consider setting a soft rate limit to slow consumption if the tenant is approaching exhaustion

### BudgetThreshold95

| Field | Value |
| --- | --- |
| Severity | `critical` |
| Fires when | Any organization's budget utilization ratio > 0.95 (95%), fires immediately |
| Expression | `budget_utilization_ratio > 0.95` |
| Pending duration | `for: 0m` (immediate) |
| Labels | `org_id` identifies the affected organization |

An organization has consumed more than 95% of its billing-period budget. Service interruption may occur if the budget is fully exhausted.

Recommended actions:

  1. Raise the organization’s budget limit immediately if warranted
  2. If the usage is unexpected (possible API key leak or runaway automation), suspend the API key via Admin → API Keys
  3. Review the Credential Intelligence dashboard for leaked key indicators
  4. Notify the tenant — this fires in real time
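The two budget alerts check the same gauge against two thresholds. A small sketch of the mapping, using the threshold values from the rules above:

```python
def budget_severity(utilization):
    """Map a budget_utilization_ratio sample to the highest severity that fires."""
    if utilization > 0.95:
        return "critical"  # BudgetThreshold95
    if utilization > 0.80:
        return "warning"   # BudgetThreshold80
    return None            # below both thresholds: no alert

severity = budget_severity(0.85)  # warning range
```

Note that Prometheus evaluates the two rules independently, so at 96% utilization both the warning and the critical alert are active; the sketch returns only the higher severity.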

### CertExpiryNear

| Field | Value |
| --- | --- |
| Severity | `warning` |
| Fires when | Any TLS certificate expires in less than 30 days, fires immediately |
| Expression | `tls_cert_expiry_seconds < (30 * 24 * 3600)` |
| Pending duration | `for: 0m` (immediate) |
| Labels | `domain` identifies the affected domain |

A TLS certificate is expiring within 30 days.

Recommended actions:

  1. Trigger certificate renewal via your ACME client (cert-manager, certbot, or cloud provider ACM/Let’s Encrypt)
  2. Verify the renewed certificate is deployed and the old one replaced
  3. For Outpost deployments using self-managed certificates, follow the Air-Gap Operations guide certificate rotation procedure
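The threshold `30 * 24 * 3600` is simply 30 days expressed in seconds (2,592,000). A sketch of the check, assuming (as the expression implies) that `tls_cert_expiry_seconds` reports seconds remaining until the certificate's notAfter time:

```python
import time

THIRTY_DAYS = 30 * 24 * 3600  # 2_592_000 seconds

def cert_alert(not_after_epoch, now=None):
    """Return remaining days if the CertExpiryNear condition holds, else None."""
    now = time.time() if now is None else now
    remaining = not_after_epoch - now  # assumed meaning of tls_cert_expiry_seconds
    if remaining < THIRTY_DAYS:
        return remaining / 86400  # days left, for the alert message
    return None

# hypothetical: certificate expires 10 days after the reference time
days_left = cert_alert(1_000_000 + 10 * 86400, now=1_000_000)
```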

### AuditChainBreak

| Field | Value |
| --- | --- |
| Severity | `critical` |
| Fires when | Any HMAC audit chain verification failure occurs within a 5-minute window, fires immediately |
| Expression | `increase(audit_chain_breaks_total[5m]) > 0` |
| Pending duration | `for: 0m` (immediate) |

One or more audit log HMAC chain verification failures have been recorded. This may indicate log tampering or an interrupted key rotation.

This alert is a compliance-critical indicator. Treat any firing as a potential integrity incident.

Recommended actions:

  1. Immediately freeze the audit log to prevent further writes if tampering is suspected
  2. Run the audit log verification procedure: see Audit Log Verification
  3. Check recent key rotation activity — an incomplete rotation can produce false chain breaks
  4. Preserve log artifacts for forensic review before any remediation
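The exact record layout of the Arbitex audit chain is not specified on this page. As an illustration of the general technique, here is a minimal HMAC hash-chain verifier in Python; the record format and key handling are assumptions:

```python
import hashlib
import hmac

def chain_mac(key, prev_mac, payload):
    # each record's MAC covers the previous MAC plus its own payload,
    # so altering any record invalidates every link after it
    return hmac.new(key, prev_mac + payload, hashlib.sha256).hexdigest()

def verify_chain(key, records):
    """records: list of (payload_bytes, mac_hex). Return the index of the
    first broken link, or None if the chain is intact."""
    prev = b""
    for i, (payload, mac) in enumerate(records):
        expected = chain_mac(key, prev, payload)
        if not hmac.compare_digest(expected, mac):
            return i
        prev = expected.encode()
    return None

# build a hypothetical three-record chain
key = b"audit-signing-key"  # placeholder; real keys come from a KMS
records, prev = [], b""
for payload in (b"login alice", b"export report", b"delete doc"):
    mac = chain_mac(key, prev, payload)
    records.append((payload, mac))
    prev = mac.encode()

intact = verify_chain(key, records)          # None: chain verifies
records[1] = (b"export EVERYTHING", records[1][1])  # tamper with record 1
first_break = verify_chain(key, records)     # 1: break detected at that index
```

An interrupted key rotation looks identical to tampering under this scheme: records MACed with the new key fail verification against the old one, which is why step 3 above checks rotation activity before assuming an integrity incident.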

### InferenceTimeout

| Field | Value |
| --- | --- |
| Severity | `warning` |
| Fires when | p99 inference latency for `/api/chat*` endpoints > 30 s over 5 minutes |
| Expression | `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{handler=~"/api/chat.*"}[5m])) by (le)) > 30` |
| Pending duration | `for: 5m` |

The 99th-percentile inference request duration has exceeded 30 seconds. This indicates severely degraded end-to-end response times for users, likely caused by slow provider responses, stream back-pressure, or connection pool exhaustion.

Recommended actions:

  1. Check provider latency in the Provider Performance Grafana dashboard
  2. Review provider circuit-breaker state and connection pool metrics
  3. Check for stream back-pressure issues — long-running streaming requests can inflate p99 latency
  4. If a specific provider is causing the delay, route to an alternative via Admin → Routing

### ErrorRateSpike

| Field | Value |
| --- | --- |
| Severity | `critical` |
| Fires when | Platform-wide HTTP 5xx error rate > 1% over 5 minutes |
| Expression | `(sum(rate(http_request_errors_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > 0.01` |
| Pending duration | `for: 5m` |

The platform is returning 5xx responses on more than 1% of all requests. This indicates a systemic error — a deployment issue, dependency failure, or infrastructure problem.

Recommended actions:

  1. Check application logs for exception traces and error patterns
  2. Review recent deployments — consider rollback if errors began after a deploy
  3. Check database connectivity and query performance
  4. Verify all external dependencies (provider APIs, Redis, PostgreSQL) are healthy
  5. Review the System Health Grafana dashboard for correlated latency spikes

### Summary

| Alert | Severity | Threshold | Pending | Key label |
| --- | --- | --- | --- | --- |
| DLPScanLatencyHigh | warning | p95 DLP > 500 ms | 5 m | |
| ProviderErrorRateHigh | critical | provider errors > 5% | 5 m | `provider` |
| BudgetThreshold80 | warning | budget > 80% | immediate | `org_id` |
| BudgetThreshold95 | critical | budget > 95% | immediate | `org_id` |
| CertExpiryNear | warning | cert < 30 d remaining | immediate | `domain` |
| AuditChainBreak | critical | any chain break | immediate | |
| InferenceTimeout | warning | p99 inference > 30 s | 5 m | |
| ErrorRateSpike | critical | 5xx > 1% | 5 m | |