
Prometheus alerting rules reference

Arbitex ships eight Prometheus alerting rules in `deploy/prometheus/alert-rules.yml` under the group `arbitex.platform`. Load this file into your Prometheus instance using the `rule_files` directive or a Kubernetes `PrometheusRule` CRD.


Add to `prometheus.yml`:

```yaml
rule_files:
  - /etc/prometheus/rules/alert-rules.yml
```

Copy `deploy/prometheus/alert-rules.yml` to `/etc/prometheus/rules/` on your Prometheus host and reload (requires Prometheus to be running with `--web.enable-lifecycle`):

```shell
curl -X POST http://localhost:9090/-/reload
```

Create a `PrometheusRule` resource:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: arbitex-platform
  namespace: monitoring
  labels:
    prometheus: kube-prometheus # match your Prometheus selector
spec:
  groups:
    # paste the contents of alert-rules.yml groups array here
```
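For orientation, a single rule inside the `groups` array has the following shape. This sketch reconstructs the `DLPScanLatencyHigh` rule from the fields documented on this page; the `annotations` block is illustrative, and the shipped `alert-rules.yml` is authoritative:

```yaml
groups:
  - name: arbitex.platform
    rules:
      - alert: DLPScanLatencyHigh
        expr: histogram_quantile(0.95, sum(rate(dlp_scan_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 DLP scan duration above 500 ms" # illustrative annotation
```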

### DLPScanLatencyHigh

| Field | Value |
| --- | --- |
| Severity | `warning` |
| Fires when | p95 DLP scan duration > 500 ms for 5 consecutive minutes |
| Expression | `histogram_quantile(0.95, sum(rate(dlp_scan_duration_seconds_bucket[5m])) by (le)) > 0.5` |
| Pending duration | `for: 5m` |

The 95th-percentile DLP scan duration has exceeded 500 ms. This typically indicates NER microservice (GLiNER) saturation or GPU resource pressure.

Recommended actions:

  1. Check the NER microservice pod health: `kubectl get pods -n arbitex | grep ner`
  2. Review GPU utilization on NER nodes (check DCGM exporter metrics or cloud GPU dashboards)
  3. If running Tier 3 (DeBERTa NLI) scans, evaluate whether the DeBERTa threshold can be raised to reduce Tier 3 invocations
  4. Consider horizontal scaling of the NER deployment
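The p95 value in the expression is not a stored metric; `histogram_quantile` interpolates it from cumulative histogram buckets. A minimal Python sketch of that interpolation, using hypothetical bucket data (Prometheus's real implementation handles further edge cases, such as non-monotonic buckets):

```python
def histogram_quantile(q, buckets):
    """Approximate Prometheus's histogram_quantile().

    buckets: sorted list of (upper_bound, cumulative_count),
    ending with (float("inf"), total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # quantile falls in the +Inf bucket: return the last finite bound
                return prev_bound
            # linear interpolation inside the bucket holding the target rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# hypothetical cumulative bucket counts for dlp_scan_duration_seconds over 5m
buckets = [(0.1, 50.0), (0.25, 80.0), (0.5, 90.0), (1.0, 100.0), (float("inf"), 100.0)]
p95 = histogram_quantile(0.95, buckets)
```

With these numbers the target rank lands in the 0.5–1.0 s bucket and interpolates to 0.75 s, above the 0.5 s threshold, so the alert would enter pending.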

### ProviderErrorRateHigh

| Field | Value |
| --- | --- |
| Severity | `critical` |
| Fires when | Error rate for any single provider > 5% over 5 minutes |
| Expression | `(sum(rate(provider_errors_total[5m])) by (provider) / sum(rate(provider_requests_total[5m])) by (provider)) > 0.05` |
| Pending duration | `for: 5m` |
| Labels | `provider` identifies the affected provider |

A specific LLM provider is returning errors on more than 5% of requests. The provider label in the alert identifies which provider is affected (e.g., anthropic, openai, azure_openai).

Recommended actions:

  1. Check the provider’s status page for incidents
  2. Verify the provider API key is valid and has not expired
  3. Review circuit-breaker state in the Arbitex admin UI (Admin → Routing)
  4. If the error rate is sustained, enable provider failover routing to an alternative provider
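The PromQL expression divides per-provider error and request rates label-by-label. The same computation as a Python sketch, with hypothetical counter increases over the 5-minute window:

```python
def provider_error_rates(errors, requests, threshold=0.05):
    """Return providers whose error ratio over the window exceeds threshold.

    errors / requests: dicts of per-provider counter increases over 5m.
    """
    firing = {}
    for provider, total in requests.items():
        if total == 0:
            continue  # no traffic: the PromQL ratio yields no sample, so no alert
        ratio = errors.get(provider, 0) / total
        if ratio > threshold:
            firing[provider] = ratio
    return firing

# hypothetical window deltas
errors = {"anthropic": 12, "openai": 1}
requests = {"anthropic": 150, "openai": 900}
firing = provider_error_rates(errors, requests)  # anthropic at 8% exceeds 5%
```

Note the zero-traffic guard: in PromQL, dividing by a zero rate produces no finite sample, so an idle provider never fires this alert.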

### BudgetThreshold80

| Field | Value |
| --- | --- |
| Severity | `warning` |
| Fires when | Any organization's budget utilization ratio > 0.80 (80%), fires immediately |
| Expression | `budget_utilization_ratio > 0.80` |
| Pending duration | `for: 0m` (immediate) |
| Labels | `org_id` identifies the affected organization |

An organization has consumed more than 80% of its current billing-period budget.

Recommended actions:

  1. Review usage trends in the Usage & Billing Grafana dashboard
  2. Contact the tenant to discuss usage patterns or raise the budget limit
  3. Consider setting a soft rate limit to slow consumption if the tenant is approaching exhaustion

### BudgetThreshold95

| Field | Value |
| --- | --- |
| Severity | `critical` |
| Fires when | Any organization's budget utilization ratio > 0.95 (95%), fires immediately |
| Expression | `budget_utilization_ratio > 0.95` |
| Pending duration | `for: 0m` (immediate) |
| Labels | `org_id` identifies the affected organization |

An organization has consumed more than 95% of its billing-period budget. Service interruption may occur if the budget is fully exhausted.

Recommended actions:

  1. Raise the organization’s budget limit immediately if warranted
  2. If the usage is unexpected (possible API key leak or runaway automation), suspend the API key via Admin → API Keys
  3. Review the Credential Intelligence dashboard for leaked key indicators
  4. Notify the tenant — this fires in real time
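The two budget alerts check the same gauge against two thresholds. A small sketch of the mapping, using the threshold values from the rules above:

```python
def budget_severity(utilization):
    """Map a budget_utilization_ratio sample to the highest severity that fires."""
    if utilization > 0.95:
        return "critical"  # BudgetThreshold95
    if utilization > 0.80:
        return "warning"   # BudgetThreshold80
    return None            # below both thresholds: no alert

severity = budget_severity(0.85)  # warning range
```

Note that Prometheus evaluates the two rules independently, so at 96% utilization both the warning and the critical alert are active; the sketch returns only the higher severity.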

### CertExpiryNear

| Field | Value |
| --- | --- |
| Severity | `warning` |
| Fires when | Any TLS certificate expires in less than 30 days, fires immediately |
| Expression | `tls_cert_expiry_seconds < (30 * 24 * 3600)` |
| Pending duration | `for: 0m` (immediate) |
| Labels | `domain` identifies the affected domain |

A TLS certificate is expiring within 30 days.

Recommended actions:

  1. Trigger certificate renewal via your ACME client (cert-manager, certbot, or cloud provider ACM/Let’s Encrypt)
  2. Verify the renewed certificate is deployed and the old one replaced
  3. For Outpost deployments using self-managed certificates, follow the Air-Gap Operations guide certificate rotation procedure
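The threshold `30 * 24 * 3600` is simply 30 days expressed in seconds (2,592,000). A sketch of the check, assuming (as the expression implies) that `tls_cert_expiry_seconds` reports seconds remaining until the certificate's notAfter time:

```python
import time

THIRTY_DAYS = 30 * 24 * 3600  # 2_592_000 seconds

def cert_alert(not_after_epoch, now=None):
    """Return remaining days if the CertExpiryNear condition holds, else None."""
    now = time.time() if now is None else now
    remaining = not_after_epoch - now  # assumed meaning of tls_cert_expiry_seconds
    if remaining < THIRTY_DAYS:
        return remaining / 86400  # days left, for the alert message
    return None

# hypothetical: certificate expires 10 days after the reference time
days_left = cert_alert(1_000_000 + 10 * 86400, now=1_000_000)
```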

### AuditChainBreak

| Field | Value |
| --- | --- |
| Severity | `critical` |
| Fires when | Any HMAC audit chain verification failure occurs within a 5-minute window, fires immediately |
| Expression | `increase(audit_chain_breaks_total[5m]) > 0` |
| Pending duration | `for: 0m` (immediate) |

One or more audit log HMAC chain verification failures have been recorded. This may indicate log tampering or an interrupted key rotation.

This alert is a compliance-critical indicator. Treat any firing as a potential integrity incident.

Recommended actions:

  1. Immediately freeze the audit log to prevent further writes if tampering is suspected
  2. Run the audit log verification procedure: see Audit Log Verification
  3. Check recent key rotation activity — an incomplete rotation can produce false chain breaks
  4. Preserve log artifacts for forensic review before any remediation
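The exact record layout of the Arbitex audit chain is not specified on this page. As an illustration of the general technique, here is a minimal HMAC hash-chain verifier in Python; the record format and key handling are assumptions:

```python
import hashlib
import hmac

def chain_mac(key, prev_mac, payload):
    # each record's MAC covers the previous MAC plus its own payload,
    # so altering any record invalidates every link after it
    return hmac.new(key, prev_mac + payload, hashlib.sha256).hexdigest()

def verify_chain(key, records):
    """records: list of (payload_bytes, mac_hex). Return the index of the
    first broken link, or None if the chain is intact."""
    prev = b""
    for i, (payload, mac) in enumerate(records):
        expected = chain_mac(key, prev, payload)
        if not hmac.compare_digest(expected, mac):
            return i
        prev = expected.encode()
    return None

# build a hypothetical three-record chain
key = b"audit-signing-key"  # placeholder; real keys come from a KMS
records, prev = [], b""
for payload in (b"login alice", b"export report", b"delete doc"):
    mac = chain_mac(key, prev, payload)
    records.append((payload, mac))
    prev = mac.encode()

intact = verify_chain(key, records)          # None: chain verifies
records[1] = (b"export EVERYTHING", records[1][1])  # tamper with record 1
first_break = verify_chain(key, records)     # 1: break detected at that index
```

An interrupted key rotation looks identical to tampering under this scheme: records MACed with the new key fail verification against the old one, which is why step 3 above checks rotation activity before assuming an integrity incident.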

### InferenceTimeout

| Field | Value |
| --- | --- |
| Severity | `warning` |
| Fires when | p99 inference latency for `/api/chat*` endpoints > 30 s over 5 minutes |
| Expression | `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{handler=~"/api/chat.*"}[5m])) by (le)) > 30` |
| Pending duration | `for: 5m` |

The 99th-percentile inference request duration has exceeded 30 seconds. This indicates severely degraded end-to-end response times for users, likely caused by slow provider responses, stream back-pressure, or connection pool exhaustion.

Recommended actions:

  1. Check provider latency in the Provider Performance Grafana dashboard
  2. Review provider circuit-breaker state and connection pool metrics
  3. Check for stream back-pressure issues — long-running streaming requests can inflate p99 latency
  4. If a specific provider is causing the delay, route to an alternative via Admin → Routing

### ErrorRateSpike

| Field | Value |
| --- | --- |
| Severity | `critical` |
| Fires when | Platform-wide HTTP 5xx error rate > 1% over 5 minutes |
| Expression | `(sum(rate(http_request_errors_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > 0.01` |
| Pending duration | `for: 5m` |

The platform is returning 5xx responses on more than 1% of all requests. This indicates a systemic error — a deployment issue, dependency failure, or infrastructure problem.

Recommended actions:

  1. Check application logs for exception traces and error patterns
  2. Review recent deployments — consider rollback if errors began after a deploy
  3. Check database connectivity and query performance
  4. Verify all external dependencies (provider APIs, Redis, PostgreSQL) are healthy
  5. Review the System Health Grafana dashboard for correlated latency spikes

### Summary

| Alert | Severity | Threshold | Pending | Key label |
| --- | --- | --- | --- | --- |
| DLPScanLatencyHigh | warning | p95 DLP > 500 ms | 5 m | |
| ProviderErrorRateHigh | critical | provider errors > 5% | 5 m | `provider` |
| BudgetThreshold80 | warning | budget > 80% | immediate | `org_id` |
| BudgetThreshold95 | critical | budget > 95% | immediate | `org_id` |
| CertExpiryNear | warning | cert < 30 d remaining | immediate | `domain` |
| AuditChainBreak | critical | any chain break | immediate | |
| InferenceTimeout | warning | p99 inference > 30 s | 5 m | |
| ErrorRateSpike | critical | 5xx > 1% | 5 m | |