# Prometheus alerting rules reference
Arbitex ships eight Prometheus alerting rules in `deploy/prometheus/alert-rules.yml` under the group `arbitex.platform`. Load this file into your Prometheus instance using the `rule_files` directive or a Kubernetes `PrometheusRule` CRD.
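To catch syntax errors before loading, the rule file can be linted with `promtool` (shipped with Prometheus); the path below assumes you are running from the repository root:

```shell
promtool check rules deploy/prometheus/alert-rules.yml
```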
## Loading the rules

### Standalone Prometheus

Add to `prometheus.yml`:

```yaml
rule_files:
  - /etc/prometheus/rules/alert-rules.yml
```

Copy `deploy/prometheus/alert-rules.yml` to `/etc/prometheus/rules/` on your Prometheus host and reload:
```shell
curl -X POST http://localhost:9090/-/reload
```

### Kubernetes with Prometheus Operator

Create a `PrometheusRule` resource:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: arbitex-platform
  namespace: monitoring
  labels:
    prometheus: kube-prometheus # match your Prometheus selector
spec:
  groups:
    # paste the contents of alert-rules.yml groups array here
```

## Alert rules
### DLPScanLatencyHigh

| Field | Value |
|---|---|
| Severity | warning |
| Fires when | p95 DLP scan duration > 500 ms for 5 consecutive minutes |
| Expression | `histogram_quantile(0.95, sum(rate(dlp_scan_duration_seconds_bucket[5m])) by (le)) > 0.5` |
| Pending duration | `for: 5m` |
The 95th-percentile DLP scan duration has exceeded 500 ms. This typically indicates NER microservice (GLiNER) saturation or GPU resource pressure.
Recommended actions:
- Check the NER microservice pod health:

  ```shell
  kubectl get pods -n arbitex | grep ner
  ```

- Review GPU utilization on NER nodes (check DCGM exporter metrics or cloud GPU dashboards)
- If running Tier 3 (DeBERTa NLI) scans, evaluate whether the DeBERTa threshold can be raised to reduce Tier 3 invocations
- Consider horizontal scaling of the NER deployment
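Putting the table above together, the corresponding entry in `alert-rules.yml` should look roughly like the following sketch (the annotation text here is illustrative, not the shipped wording):

```yaml
- alert: DLPScanLatencyHigh
  expr: histogram_quantile(0.95, sum(rate(dlp_scan_duration_seconds_bucket[5m])) by (le)) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "p95 DLP scan latency above 500 ms for 5 minutes"
```

The other seven rules follow the same structure, with the expressions, `for:` durations, and severities listed in their respective tables.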
### ProviderErrorRateHigh

| Field | Value |
|---|---|
| Severity | critical |
| Fires when | Error rate for any single provider > 5% over 5 minutes |
| Expression | `(sum(rate(provider_errors_total[5m])) by (provider) / sum(rate(provider_requests_total[5m])) by (provider)) > 0.05` |
| Pending duration | `for: 5m` |
| Labels | `provider` label identifies the affected provider |

A specific LLM provider is returning errors on more than 5% of requests. The `provider` label in the alert identifies which provider is affected (e.g., `anthropic`, `openai`, `azure_openai`).
Recommended actions:
- Check the provider’s status page for incidents
- Verify the provider API key is valid and has not expired
- Review circuit-breaker state in the Arbitex admin UI (Admin → Routing)
- If the error rate is sustained, enable provider failover routing to an alternative provider
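To inspect the current error ratio for every provider at once, run the alert expression without its threshold in the Prometheus console:

```promql
sum(rate(provider_errors_total[5m])) by (provider)
  / sum(rate(provider_requests_total[5m])) by (provider)
```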
### BudgetThreshold80

| Field | Value |
|---|---|
| Severity | warning |
| Fires when | Any organization’s budget utilization ratio > 0.80 (80%) — fires immediately |
| Expression | `budget_utilization_ratio > 0.80` |
| Pending duration | `for: 0m` (immediate) |
| Labels | `org_id` label identifies the affected organization |
An organization has consumed more than 80% of its current billing-period budget.
Recommended actions:
- Review usage trends in the Usage & Billing Grafana dashboard
- Contact the tenant to discuss usage patterns or raise the budget limit
- Consider setting a soft rate limit to slow consumption if the tenant is approaching exhaustion
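To see which organizations are closest to their limits, query the utilization metric directly in the Prometheus console, e.g.:

```promql
topk(10, budget_utilization_ratio)
```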
### BudgetThreshold95

| Field | Value |
|---|---|
| Severity | critical |
| Fires when | Any organization’s budget utilization ratio > 0.95 (95%) — fires immediately |
| Expression | `budget_utilization_ratio > 0.95` |
| Pending duration | `for: 0m` (immediate) |
| Labels | `org_id` label identifies the affected organization |
An organization has consumed more than 95% of its billing-period budget. Service interruption may occur if the budget is fully exhausted.
Recommended actions:
- Raise the organization’s budget limit immediately if warranted
- If the usage is unexpected (possible API key leak or runaway automation), suspend the API key via Admin → API Keys
- Review the Credential Intelligence dashboard for leaked key indicators
- Notify the tenant — this fires in real time
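Because this alert warrants immediate human attention, you may want a dedicated Alertmanager route for it. A minimal sketch, assuming a `billing-pager` receiver is defined elsewhere in your Alertmanager config (the receiver name is hypothetical):

```yaml
route:
  routes:
    - matchers:
        - alertname = "BudgetThreshold95"
      receiver: billing-pager # hypothetical receiver name
```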
### CertExpiryNear

| Field | Value |
|---|---|
| Severity | warning |
| Fires when | Any TLS certificate expires in less than 30 days — fires immediately |
| Expression | `tls_cert_expiry_seconds < (30 * 24 * 3600)` |
| Pending duration | `for: 0m` (immediate) |
| Labels | `domain` label identifies the affected domain |
A TLS certificate is expiring within 30 days.
Recommended actions:
- Trigger certificate renewal via your ACME client (cert-manager, certbot, or cloud provider ACM/Let’s Encrypt)
- Verify the renewed certificate is deployed and the old one replaced
- For Outpost deployments using self-managed certificates, follow the Air-Gap Operations guide certificate rotation procedure
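To list certificates by days remaining (worst first), a quick console query over the same metric:

```promql
sort(tls_cert_expiry_seconds / 86400)
```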
### AuditChainBreak

| Field | Value |
|---|---|
| Severity | critical |
| Fires when | Any HMAC audit chain verification failure occurs within a 5-minute window — fires immediately |
| Expression | `increase(audit_chain_breaks_total[5m]) > 0` |
| Pending duration | `for: 0m` (immediate) |
One or more audit log HMAC chain verification failures have been recorded. This may indicate log tampering or an interrupted key rotation.
This alert is a compliance-critical indicator. Treat any firing as a potential integrity incident.
Recommended actions:
- Immediately freeze the audit log to prevent further writes if tampering is suspected
- Run the audit log verification procedure: see Audit Log Verification
- Check recent key rotation activity — an incomplete rotation can produce false chain breaks
- Preserve log artifacts for forensic review before any remediation
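To scope the incident window, widen the range on the alert's counter, for example over the last 24 hours:

```promql
increase(audit_chain_breaks_total[24h])
```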
### InferenceTimeout

| Field | Value |
|---|---|
| Severity | warning |
| Fires when | p99 inference latency for `/api/chat*` endpoints > 30 s over 5 minutes |
| Expression | `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{handler=~"/api/chat.*"}[5m])) by (le)) > 30` |
| Pending duration | `for: 5m` |
The 99th-percentile inference request duration has exceeded 30 seconds. This indicates severely degraded end-to-end response times for users, likely caused by slow provider responses, stream back-pressure, or connection pool exhaustion.
Recommended actions:
- Check provider latency in the Provider Performance Grafana dashboard
- Review provider circuit-breaker state and connection pool metrics
- Check for stream back-pressure issues — long-running streaming requests can inflate p99 latency
- If a specific provider is causing the delay, route to an alternative via Admin → Routing
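To see which chat endpoint is slow, group the same histogram by `handler` as well as `le`:

```promql
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{handler=~"/api/chat.*"}[5m])) by (le, handler))
```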
### ErrorRateSpike

| Field | Value |
|---|---|
| Severity | critical |
| Fires when | Platform-wide HTTP 5xx error rate > 1% over 5 minutes |
| Expression | `(sum(rate(http_request_errors_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > 0.01` |
| Pending duration | `for: 5m` |
The platform is returning 5xx responses on more than 1% of all requests. This indicates a systemic error — a deployment issue, dependency failure, or infrastructure problem.
Recommended actions:
- Check application logs for exception traces and error patterns
- Review recent deployments — consider rollback if errors began after a deploy
- Check database connectivity and query performance
- Verify all external dependencies (provider APIs, Redis, PostgreSQL) are healthy
- Review the System Health Grafana dashboard for correlated latency spikes
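For a quick breakdown of which status codes dominate the errors, query the same counter grouped by `status`:

```promql
topk(5, sum(rate(http_request_errors_total{status=~"5.."}[5m])) by (status))
```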
## Alert summary table

| Alert | Severity | Threshold | Pending | Key label |
|---|---|---|---|---|
| DLPScanLatencyHigh | warning | p95 DLP > 500 ms | 5 m | — |
| ProviderErrorRateHigh | critical | provider errors > 5% | 5 m | `provider` |
| BudgetThreshold80 | warning | budget > 80% | immediate | `org_id` |
| BudgetThreshold95 | critical | budget > 95% | immediate | `org_id` |
| CertExpiryNear | warning | cert < 30 d remaining | immediate | `domain` |
| AuditChainBreak | critical | any chain break | immediate | — |
| InferenceTimeout | warning | p99 inference > 30 s | 5 m | — |
| ErrorRateSpike | critical | 5xx > 1% | 5 m | — |
## See also

- Grafana dashboard catalog — dashboards that visualize the same metrics
- OpenTelemetry SDK integration — OTel tracing and metrics export
- Audit Log Verification — procedure for investigating AuditChainBreak alerts