# Grafana dashboard catalog
Arbitex ships six Grafana dashboard JSON files in deploy/grafana/. Each dashboard targets a Prometheus data source and auto-refreshes every 30 seconds. All dashboards require Grafana 10.0.0 or later.
## Importing dashboards

1. In Grafana, navigate to Dashboards → Import.
2. Upload the JSON file from `deploy/grafana/` or paste its contents.
3. When prompted, select your Prometheus data source for the `DS_PROMETHEUS` input.
4. Click Import.
Each dashboard uses the `${DS_PROMETHEUS}` template variable to reference the data source — you can point multiple Arbitex environments at different Prometheus instances by importing the same JSON with a different data source selection.
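To script imports (for example in CI), Grafana's HTTP API exposes `POST /api/dashboards/import`, which takes the dashboard JSON plus an `inputs` list binding `DS_PROMETHEUS` to a concrete data source. A minimal sketch of building that payload — the data source UID, Grafana URL, and token are placeholders, not values shipped with Arbitex:

```python
def build_import_payload(dashboard: dict, datasource_uid: str, folder_id: int = 0) -> dict:
    """Build the request body for Grafana's POST /api/dashboards/import,
    binding the DS_PROMETHEUS template input to a concrete data source."""
    return {
        "dashboard": dashboard,
        "overwrite": True,          # replace an existing import with the same UID
        "folderId": folder_id,
        "inputs": [{
            "name": "DS_PROMETHEUS",
            "type": "datasource",
            "pluginId": "prometheus",
            "value": datasource_uid,  # UID of the target Prometheus data source
        }],
    }

# In practice: dashboard = json.load(open("deploy/grafana/system-health.json"))
dashboard = {"uid": "arbitex-system-health", "title": "System Health"}
payload = build_import_payload(dashboard, datasource_uid="prod-prometheus")
# Then POST it with a service-account token, e.g.:
# requests.post(f"{GRAFANA_URL}/api/dashboards/import", json=payload,
#               headers={"Authorization": f"Bearer {TOKEN}"})
```

Because `overwrite` is true, re-running the script updates a previously imported copy in place rather than creating a duplicate.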
Dashboard reference
## Dashboard reference

### System Health (`system-health.json`)

- UID: `arbitex-system-health`
- Tags: `arbitex`, `platform`, `system-health`
- Variable: `$job` — selects Prometheus scrape job(s); supports an all-jobs wildcard.
Tracks platform-level HTTP health.
| Panel | Type | Metric(s) | Description |
|---|---|---|---|
| Request Rate (by method) | Time series | http_requests_total by method | Per-method request rate (req/s). Useful for spotting unexpected changes in GET/POST ratios. |
| Request Latency Percentiles (p50/p95/p99) | Time series | http_request_duration_seconds_bucket | Latency percentiles across all endpoints. Threshold coloring: yellow >500 ms, red >2 s. |
| Error Rate (4xx/5xx) | Time series | http_requests_total by status_code | 4xx (orange) and 5xx (red) error rates. |
| Active Connections | Stat | http_active_connections | Current active HTTP connections. Threshold: yellow >500, red >1000. |
| 5xx Error Ratio | Stat | Derived from http_requests_total | 5xx errors as a fraction of all requests. Threshold: yellow >1%, red >5%. |
| Request Rate by Endpoint | Time series (stacked bars) | http_requests_total by endpoint | Per-endpoint traffic breakdown. Useful for identifying hot endpoints. |
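The latency panels above apply PromQL's `histogram_quantile()` to the `http_request_duration_seconds_bucket` series. As a rough illustration of what that function computes, here is a simplified re-implementation (Prometheus adds range extrapolation and edge-case handling beyond this sketch); the bucket counts are invented:

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """Approximate PromQL histogram_quantile(): buckets is a sorted list of
    (upper_bound, cumulative_count) pairs ending with (inf, total).
    Linearly interpolates within the bucket containing the target rank."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the +Inf bucket
            # linear interpolation between the bucket's lower and upper bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Hypothetical cumulative counts for http_request_duration_seconds buckets
buckets = [(0.1, 600), (0.5, 900), (2.0, 990), (float("inf"), 1000)]
p95 = histogram_quantile(0.95, buckets)  # rank 950 falls in the 0.5–2.0 s bucket
```

The interpolation is why a reported p95 can land between bucket bounds even though the raw data only records per-bucket counts.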
### DLP Analysis (`dlp-analysis.json`)

- UID: `arbitex-dlp-analysis`
- Tags: `arbitex`, `platform`, `dlp`
- Variable: none (platform-wide).
Monitors DLP scan performance and trigger patterns.
| Panel | Type | Metric(s) | Description |
|---|---|---|---|
| DLP Scan Latency p95 | Stat | dlp_scan_duration_seconds_bucket | Current p95 scan latency. Threshold: yellow >100 ms, red >500 ms. Ties to the DLPScanLatencyHigh alert. |
| DLP Scan Latency p99 | Stat | dlp_scan_duration_seconds_bucket | Current p99 scan latency. Threshold: yellow >250 ms, red >1 s. |
| DLP Trigger Rate (per min) | Stat | dlp_trigger_total | DLP triggers per minute. Threshold: yellow >10, red >50. |
| Scan Throughput (ops/s) | Stat | dlp_scans_total | Total scan operations per second. |
| DLP Scan Latency Distribution (p50/p95/p99) | Time series | dlp_scan_duration_seconds_bucket | Full latency curve over time. |
| DLP Trigger Rate by Tier | Time series | dlp_trigger_total by tier | Trigger rate split by detection tier: Tier 1 (Regex), Tier 2 (NER/GLiNER), Tier 3 (DeBERTa NLI), Presidio Bridge. High Tier 3 rates increase latency. |
| Entity Types Detected (Top Triggers) | Table | dlp_trigger_total by entity_type | Instant snapshot of which entity types are firing most frequently. Color-graded by trigger rate. |
| Scan Throughput vs Trigger Rate | Time series | dlp_scans_total, dlp_trigger_total | Side-by-side comparison of total scans versus trigger rate — reveals what fraction of traffic is triggering DLP rules. |
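The throughput-versus-trigger comparison in the last panel reduces to the ratio of two counter rates. A toy sketch of that arithmetic — PromQL's `rate()` additionally extrapolates and handles counter resets — with invented sample values:

```python
def per_second_rate(earlier: float, later: float, window_seconds: float) -> float:
    """Simplified rate(): average per-second increase of a monotonic counter."""
    return max(later - earlier, 0.0) / window_seconds

# Hypothetical counter values sampled 5 minutes (300 s) apart
scan_rate = per_second_rate(120_000, 150_000, 300)   # dlp_scans_total
trigger_rate = per_second_rate(1_800, 2_100, 300)    # dlp_trigger_total
trigger_fraction = trigger_rate / scan_rate          # fraction of scans that trigger
```

A rising `trigger_fraction` with flat throughput usually means the traffic mix changed, not the load.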
### Provider Performance (`provider-performance.json`)

- UID: `arbitex-provider-performance`
- Tags: `arbitex`, `platform`, `providers`
- Variable: `$provider` — multi-select; filters all panels to selected provider(s).
Per-provider LLM performance metrics.
| Panel | Type | Metric(s) | Description |
|---|---|---|---|
| Per-Provider Latency (p50/p95) | Time series | provider_request_duration_seconds_bucket by provider | Latency percentiles per provider. Useful for comparing upstream response times. |
| Per-Provider Error Rate | Time series | provider_errors_total by provider | Error rate per provider (errors/s). Ties to the ProviderErrorRateHigh alert. |
| Token Throughput by Provider (prompt/completion) | Time series | provider_tokens_total by provider, token_type | Prompt and completion token throughput split by provider. |
| Request Distribution by Model | Pie chart (donut) | provider_request_duration_seconds_count by model | Share of requests by model over the selected time range. |
| Provider / Model Summary Table | Table | Multi-query join | Per-provider, per-model table showing request rate (req_rate), error rate (error_rate, color-graded at 1%/5%), and p95 latency (p95_latency, color-graded at 2s/10s). |
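The summary table's "multi-query join" amounts to matching series from two queries on shared labels. A hedged sketch of that join with invented per-(provider, model) rates — the provider and model names are placeholders:

```python
# Hypothetical results of two queries, keyed by (provider, model)
req_rates = {("openai", "gpt-4o"): 12.0, ("anthropic", "claude-3"): 8.0}  # req/s
err_rates = {("openai", "gpt-4o"): 0.3}                                   # errors/s

def summary_rows(req_rates: dict, err_rates: dict) -> list:
    """Join on (provider, model); series absent from err_rates mean zero errors."""
    return [
        {"provider": p, "model": m, "req_rate": req,
         "error_pct": 100.0 * err_rates.get((p, m), 0.0) / req}
        for (p, m), req in req_rates.items()
    ]

rows = summary_rows(req_rates, err_rates)
```

Defaulting missing error series to zero mirrors how a healthy provider simply has no `provider_errors_total` samples to join against.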
### Usage & Billing (`usage-billing.json`)

- UID: `arbitex-usage-billing`
- Tags: `arbitex`, `platform`, `billing`, `usage`
- Variable: `$org_id` — multi-select; filters org-scoped panels to selected organization(s).
Tenant usage, token consumption, and budget utilization.
| Panel | Type | Metric(s) | Description |
|---|---|---|---|
| Total Chat Request Rate | Stat | http_requests_total{endpoint=~"/api/chat.*"} | Platform-wide chat request rate (req/s). |
| Total Token Throughput (tokens/s) | Stat | provider_tokens_total | Combined prompt + completion tokens per second. |
| Rate Limit Rejections (per min) | Stat | rate_limit_rejections_total | Rate-limited requests per minute. Threshold: yellow >10, red >100. |
| Max Budget Utilization (any org) | Stat | budget_utilization_ratio | Highest budget utilization ratio across all orgs. Threshold: yellow >70%, red >90%. Ties to the BudgetThreshold80/95 alerts. |
| Request Volume by Org | Time series | http_requests_total by org_id | Per-org request volume over time. |
| Token Counts by Org (prompt/completion) | Time series | provider_tokens_total by org_id, token_type | Per-org prompt and completion token consumption. |
| Budget Utilization by Org (%) | Gauge | budget_utilization_ratio by org_id | Gauge visualization showing each org’s budget usage (0–100%). Threshold coloring: green <70%, yellow 70–90%, red >90%. |
| Rate Limit Hits by Org | Time series (bars) | rate_limit_rejections_total by org_id | Per-org rate limit rejections. |
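The budget gauge's color bands, and the "max across all orgs" stat, are simple threshold checks over `budget_utilization_ratio`. A sketch with invented per-org ratios:

```python
def budget_color(ratio: float) -> str:
    """Map budget_utilization_ratio to the gauge's bands:
    green <70%, yellow 70–90%, red >90%."""
    if ratio > 0.9:
        return "red"
    if ratio >= 0.7:
        return "yellow"
    return "green"

# Hypothetical budget_utilization_ratio by org_id
utilization = {"org-a": 0.42, "org-b": 0.88, "org-c": 0.96}
worst = max(utilization.values())  # the "Max Budget Utilization (any org)" stat
colors = {org: budget_color(r) for org, r in utilization.items()}
```

Taking the max rather than the mean means one org nearing its budget turns the stat yellow even when overall utilization is low.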
### Security Events (`security-events.json`)

- UID: `arbitex-security-events`
- Tags: `arbitex`, `platform`, `security`
- Variable: none (platform-wide).
Security signal monitoring — authentication failures, access controls, and anomaly detection.
| Panel | Type | Metric(s) | Description |
|---|---|---|---|
| Auth Failures (per min) | Stat | auth_failures_total | Auth failures per minute. Threshold: yellow >5, red >20. |
| Rate Limit Rejections (per min) | Stat | rate_limit_rejections_total | Rate-limited requests per minute. |
| mTLS Verification Failures (per min) | Stat | mtls_verification_failures_total | Mutual TLS handshake failures. Threshold: yellow >1, red >5. |
| IP Allowlist Blocks (per min) | Stat | ip_allowlist_blocks_total | Requests blocked by IP allowlist. Threshold: yellow >1, red >10. |
| GeoIP Anon IP Detections (per min) | Stat | geoip_anonymous_ip_detections_total | Requests from VPN/Tor/proxy IPs detected by GeoIP enrichment. |
| Auth Failures (window total) | Stat | auth_failures_total (increase over range) | Cumulative auth failures in the selected time window. |
| Security Event Timeline | Time series | All five security metrics | Combined time series: Auth Failures (red), Rate Limit (orange), mTLS Failures (purple), IP Blocks (yellow), Anon IP (light-blue). |
| Auth Failures by Reason | Time series | auth_failures_total by reason | Failure rate broken down by failure reason (e.g., invalid token, expired token, missing key). |
| Auth Failure Summary (window) | Table | auth_failures_total by reason (increase) | Ranked table of failure counts by reason over the time window. |
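Several of the stats above use `increase()` over the dashboard's time range rather than an instantaneous rate. A simplified version of that window total, including the counter-reset handling Prometheus applies when a process restarts (sample values invented):

```python
def window_increase(samples: list) -> float:
    """Approximate PromQL increase(): total growth of a counter across samples,
    treating any drop as a counter reset (the counter restarted from 0)."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # after a reset, the current value itself is the growth since restart
        total += cur - prev if cur >= prev else cur
    return total

# Hypothetical auth_failures_total samples; the drop 47 -> 3 is a restart
failures = window_increase([40, 47, 3, 9])  # 7 + 3 + 6
```

Without the reset handling, a restart would show up as a large negative spike and understate the window total.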
### Compliance (`compliance.json`)

- UID: `arbitex-compliance`
- Tags: `arbitex`, `platform`, `compliance`
- Variable: `$framework` — multi-select; filters policy violation panels to selected compliance frameworks.
Policy enforcement and audit integrity metrics for compliance reporting.
| Panel | Type | Metric(s) | Description |
|---|---|---|---|
| Policy Violations (per min) | Stat | policy_violations_total | Real-time policy violation rate. Threshold: yellow >1, red >10. |
| Audit Chain Breaks (window) | Stat | audit_chain_breaks_total (increase over range) | HMAC chain verification failures. Any non-zero value is red — ties to the AuditChainBreak alert. |
| Policy Pack Evaluations (ops/s) | Stat | policy_pack_evaluations_total | Policy pack evaluation throughput. |
| Policy Violations (window total) | Stat | policy_violations_total (increase over range) | Cumulative violations in the selected time window. |
| Policy Violations by Framework | Time series | policy_violations_total by framework | Violation rate per compliance framework (e.g., HIPAA, SOC2, GDPR). |
| DLP Entity Distribution (window) | Pie chart (donut) | dlp_trigger_total by entity_type | Distribution of detected entity types over the selected range. |
| Violation Counts by Framework (window) | Table | policy_violations_total by framework (increase) | Ranked table of violation counts per framework. |
| Policy Pack Violation Rate by Pack | Bar gauge | policy_pack_evaluations_total by pack_name | Fraction of evaluations resulting in a violation, per pack. Threshold: green <50%, yellow 50–80%, red >80%. |
## Customizing dashboards

All six dashboards are editable in Grafana (the JSON files set `"editable": true`). To extend a dashboard without losing upstream updates, use Grafana's built-in copy feature:
- Open the dashboard.
- Click the gear icon → Save As — enter a new title (e.g., “Arbitex — DLP Analysis (Custom)”).
- The copy is stored in Grafana’s database and survives re-imports of the original file.
## See also

- OpenTelemetry SDK integration — how to enable OTel tracing and metrics export
- Prometheus alerting reference — alert rules that fire on these metrics