# Grafana dashboard catalog
Arbitex ships six Grafana dashboard JSON files in deploy/grafana/. Each dashboard targets a Prometheus data source and auto-refreshes every 30 seconds. All dashboards require Grafana 10.0.0 or later.
## Importing dashboards

1. In Grafana, navigate to Dashboards → Import.
2. Upload the JSON file from `deploy/grafana/` or paste its contents.
3. When prompted, select your Prometheus data source for the `DS_PROMETHEUS` input.
4. Click Import.
Each dashboard uses the `${DS_PROMETHEUS}` template variable to reference the data source — you can point multiple Arbitex environments at different Prometheus instances by importing the same JSON with a different data source selection.
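To script imports (for example in CI), Grafana's HTTP API exposes `POST /api/dashboards/import`, which takes the dashboard JSON plus an `inputs` list binding `DS_PROMETHEUS` to a concrete data source. A minimal sketch of building that payload — the data source UID, Grafana URL, and token are placeholders, not values shipped with Arbitex:

```python
def build_import_payload(dashboard: dict, datasource_uid: str, folder_id: int = 0) -> dict:
    """Build the request body for Grafana's POST /api/dashboards/import,
    binding the DS_PROMETHEUS template input to a concrete data source."""
    return {
        "dashboard": dashboard,
        "overwrite": True,          # replace an existing import with the same UID
        "folderId": folder_id,
        "inputs": [{
            "name": "DS_PROMETHEUS",
            "type": "datasource",
            "pluginId": "prometheus",
            "value": datasource_uid,  # UID of the target Prometheus data source
        }],
    }

# In practice: dashboard = json.load(open("deploy/grafana/system-health.json"))
dashboard = {"uid": "arbitex-system-health", "title": "System Health"}
payload = build_import_payload(dashboard, datasource_uid="prod-prometheus")
# Then POST it with a service-account token, e.g.:
# requests.post(f"{GRAFANA_URL}/api/dashboards/import", json=payload,
#               headers={"Authorization": f"Bearer {TOKEN}"})
```

Because `overwrite` is true, re-running the script updates a previously imported copy in place rather than creating a duplicate.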
Dashboard reference
## Dashboard reference

### System Health (`system-health.json`)

- UID: `arbitex-system-health`
- Tags: `arbitex`, `platform`, `system-health`
- Variable: `$job` — selects Prometheus scrape job(s); supports an all-jobs wildcard.
Tracks platform-level HTTP health.
| Panel | Type | Metric(s) | Description |
|---|---|---|---|
| Request Rate (by method) | Time series | http_requests_total by method | Per-method request rate (req/s). Useful for spotting unexpected changes in GET/POST ratios. |
| Request Latency Percentiles (p50/p95/p99) | Time series | http_request_duration_seconds_bucket | Latency percentiles across all endpoints. Threshold coloring: yellow >500 ms, red >2 s. |
| Error Rate (4xx/5xx) | Time series | http_requests_total by status_code | 4xx (orange) and 5xx (red) error rates. |
| Active Connections | Stat | http_active_connections | Current active HTTP connections. Threshold: yellow >500, red >1000. |
| 5xx Error Ratio | Stat | Derived from http_requests_total | 5xx errors as a fraction of all requests. Threshold: yellow >1%, red >5%. |
| Request Rate by Endpoint | Time series (stacked bars) | http_requests_total by endpoint | Per-endpoint traffic breakdown. Useful for identifying hot endpoints. |
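The latency panels above apply PromQL's `histogram_quantile()` to the `http_request_duration_seconds_bucket` series. As a rough illustration of what that function computes, here is a simplified re-implementation (Prometheus adds range extrapolation and edge-case handling beyond this sketch); the bucket counts are invented:

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """Approximate PromQL histogram_quantile(): buckets is a sorted list of
    (upper_bound, cumulative_count) pairs ending with (inf, total).
    Linearly interpolates within the bucket containing the target rank."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the +Inf bucket
            # linear interpolation between the bucket's lower and upper bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Hypothetical cumulative counts for http_request_duration_seconds buckets
buckets = [(0.1, 600), (0.5, 900), (2.0, 990), (float("inf"), 1000)]
p95 = histogram_quantile(0.95, buckets)  # rank 950 falls in the 0.5–2.0 s bucket
```

The interpolation is why a reported p95 can land between bucket bounds even though the raw data only records per-bucket counts.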
### DLP Analysis (`dlp-analysis.json`)

- UID: `arbitex-dlp-analysis`
- Tags: `arbitex`, `platform`, `dlp`
- Variable: none (platform-wide).
Monitors DLP scan performance and trigger patterns.
| Panel | Type | Metric(s) | Description |
|---|---|---|---|
| DLP Scan Latency p95 | Stat | dlp_scan_duration_seconds_bucket | Current p95 scan latency. Threshold: yellow >100 ms, red >500 ms. Ties to the DLPScanLatencyHigh alert. |
| DLP Scan Latency p99 | Stat | dlp_scan_duration_seconds_bucket | Current p99 scan latency. Threshold: yellow >250 ms, red >1 s. |
| DLP Trigger Rate (per min) | Stat | dlp_trigger_total | DLP triggers per minute. Threshold: yellow >10, red >50. |
| Scan Throughput (ops/s) | Stat | dlp_scans_total | Total scan operations per second. |
| DLP Scan Latency Distribution (p50/p95/p99) | Time series | dlp_scan_duration_seconds_bucket | Full latency curve over time. |
| DLP Trigger Rate by Tier | Time series | dlp_trigger_total by tier | Trigger rate split by detection tier: Tier 1 (Regex), Tier 2 (NER/GLiNER), Tier 3 (DeBERTa NLI), Presidio Bridge. High Tier 3 rates increase latency. |
| Entity Types Detected (Top Triggers) | Table | dlp_trigger_total by entity_type | Instant snapshot of which entity types are firing most frequently. Color-graded by trigger rate. |
| Scan Throughput vs Trigger Rate | Time series | dlp_scans_total, dlp_trigger_total | Side-by-side comparison of total scans versus trigger rate — reveals what fraction of traffic is triggering DLP rules. |
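The throughput-versus-trigger comparison in the last panel reduces to the ratio of two counter rates. A toy sketch of that arithmetic — PromQL's `rate()` additionally extrapolates and handles counter resets — with invented sample values:

```python
def per_second_rate(earlier: float, later: float, window_seconds: float) -> float:
    """Simplified rate(): average per-second increase of a monotonic counter."""
    return max(later - earlier, 0.0) / window_seconds

# Hypothetical counter values sampled 5 minutes (300 s) apart
scan_rate = per_second_rate(120_000, 150_000, 300)   # dlp_scans_total
trigger_rate = per_second_rate(1_800, 2_100, 300)    # dlp_trigger_total
trigger_fraction = trigger_rate / scan_rate          # fraction of scans that trigger
```

A rising `trigger_fraction` with flat throughput usually means the traffic mix changed, not the load.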
### Provider Performance (`provider-performance.json`)

- UID: `arbitex-provider-performance`
- Tags: `arbitex`, `platform`, `providers`
- Variable: `$provider` — multi-select; filters all panels to selected provider(s).
Per-provider LLM performance metrics.
| Panel | Type | Metric(s) | Description |
|---|---|---|---|
| Per-Provider Latency (p50/p95) | Time series | provider_request_duration_seconds_bucket by provider | Latency percentiles per provider. Useful for comparing upstream response times. |
| Per-Provider Error Rate | Time series | provider_errors_total by provider | Error rate per provider (errors/s). Ties to the ProviderErrorRateHigh alert. |
| Token Throughput by Provider (prompt/completion) | Time series | provider_tokens_total by provider, token_type | Prompt and completion token throughput split by provider. |
| Request Distribution by Model | Pie chart (donut) | provider_request_duration_seconds_count by model | Share of requests by model over the selected time range. |
| Provider / Model Summary Table | Table | Multi-query join | Per-provider, per-model table showing request rate (req_rate), error rate (error_rate, color-graded at 1%/5%), and p95 latency (p95_latency, color-graded at 2s/10s). |
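The summary table's "multi-query join" amounts to matching series from two queries on shared labels. A hedged sketch of that join with invented per-(provider, model) rates — the provider and model names are placeholders:

```python
# Hypothetical results of two queries, keyed by (provider, model)
req_rates = {("openai", "gpt-4o"): 12.0, ("anthropic", "claude-3"): 8.0}  # req/s
err_rates = {("openai", "gpt-4o"): 0.3}                                   # errors/s

def summary_rows(req_rates: dict, err_rates: dict) -> list:
    """Join on (provider, model); series absent from err_rates mean zero errors."""
    return [
        {"provider": p, "model": m, "req_rate": req,
         "error_pct": 100.0 * err_rates.get((p, m), 0.0) / req}
        for (p, m), req in req_rates.items()
    ]

rows = summary_rows(req_rates, err_rates)
```

Defaulting missing error series to zero mirrors how a healthy provider simply has no `provider_errors_total` samples to join against.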
### Usage & Billing (`usage-billing.json`)

- UID: `arbitex-usage-billing`
- Tags: `arbitex`, `platform`, `billing`, `usage`
- Variable: `$org_id` — multi-select; filters org-scoped panels to selected organization(s).
Tenant usage, token consumption, and budget utilization.
| Panel | Type | Metric(s) | Description |
|---|---|---|---|
| Total Chat Request Rate | Stat | http_requests_total{endpoint=~"/api/chat.*"} | Platform-wide chat request rate (req/s). |
| Total Token Throughput (tokens/s) | Stat | provider_tokens_total | Combined prompt + completion tokens per second. |
| Rate Limit Rejections (per min) | Stat | rate_limit_rejections_total | Rate-limited requests per minute. Threshold: yellow >10, red >100. |
| Max Budget Utilization (any org) | Stat | budget_utilization_ratio | Highest budget utilization ratio across all orgs. Threshold: yellow >70%, red >90%. Ties to the BudgetThreshold80/95 alerts. |
| Request Volume by Org | Time series | http_requests_total by org_id | Per-org request volume over time. |
| Token Counts by Org (prompt/completion) | Time series | provider_tokens_total by org_id, token_type | Per-org prompt and completion token consumption. |
| Budget Utilization by Org (%) | Gauge | budget_utilization_ratio by org_id | Gauge visualization showing each org’s budget usage (0–100%). Threshold coloring: green <70%, yellow 70–90%, red >90%. |
| Rate Limit Hits by Org | Time series (bars) | rate_limit_rejections_total by org_id | Per-org rate limit rejections. |
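The budget gauge's color bands, and the "max across all orgs" stat, are simple threshold checks over `budget_utilization_ratio`. A sketch with invented per-org ratios:

```python
def budget_color(ratio: float) -> str:
    """Map budget_utilization_ratio to the gauge's bands:
    green <70%, yellow 70–90%, red >90%."""
    if ratio > 0.9:
        return "red"
    if ratio >= 0.7:
        return "yellow"
    return "green"

# Hypothetical budget_utilization_ratio by org_id
utilization = {"org-a": 0.42, "org-b": 0.88, "org-c": 0.96}
worst = max(utilization.values())  # the "Max Budget Utilization (any org)" stat
colors = {org: budget_color(r) for org, r in utilization.items()}
```

Taking the max rather than the mean means one org nearing its budget turns the stat yellow even when overall utilization is low.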
### Security Events (`security-events.json`)

- UID: `arbitex-security-events`
- Tags: `arbitex`, `platform`, `security`
- Variable: none (platform-wide).
Security signal monitoring — authentication failures, access controls, and anomaly detection.
| Panel | Type | Metric(s) | Description |
|---|---|---|---|
| Auth Failures (per min) | Stat | auth_failures_total | Auth failures per minute. Threshold: yellow >5, red >20. |
| Rate Limit Rejections (per min) | Stat | rate_limit_rejections_total | Rate-limited requests per minute. |
| mTLS Verification Failures (per min) | Stat | mtls_verification_failures_total | Mutual TLS handshake failures. Threshold: yellow >1, red >5. |
| IP Allowlist Blocks (per min) | Stat | ip_allowlist_blocks_total | Requests blocked by IP allowlist. Threshold: yellow >1, red >10. |
| GeoIP Anon IP Detections (per min) | Stat | geoip_anonymous_ip_detections_total | Requests from VPN/Tor/proxy IPs detected by GeoIP enrichment. |
| Auth Failures (window total) | Stat | auth_failures_total (increase over range) | Cumulative auth failures in the selected time window. |
| Security Event Timeline | Time series | All five security metrics | Combined time series: Auth Failures (red), Rate Limit (orange), mTLS Failures (purple), IP Blocks (yellow), Anon IP (light-blue). |
| Auth Failures by Reason | Time series | auth_failures_total by reason | Failure rate broken down by failure reason (e.g., invalid token, expired token, missing key). |
| Auth Failure Summary (window) | Table | auth_failures_total by reason (increase) | Ranked table of failure counts by reason over the time window. |
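Several of the stats above use `increase()` over the dashboard's time range rather than an instantaneous rate. A simplified version of that window total, including the counter-reset handling Prometheus applies when a process restarts (sample values invented):

```python
def window_increase(samples: list) -> float:
    """Approximate PromQL increase(): total growth of a counter across samples,
    treating any drop as a counter reset (the counter restarted from 0)."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # after a reset, the current value itself is the growth since restart
        total += cur - prev if cur >= prev else cur
    return total

# Hypothetical auth_failures_total samples; the drop 47 -> 3 is a restart
failures = window_increase([40, 47, 3, 9])  # 7 + 3 + 6
```

Without the reset handling, a restart would show up as a large negative spike and understate the window total.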
### Compliance (`compliance.json`)

- UID: `arbitex-compliance`
- Tags: `arbitex`, `platform`, `compliance`
- Variable: `$framework` — multi-select; filters policy violation panels to selected compliance frameworks.
Policy enforcement and audit integrity metrics for compliance reporting.
| Panel | Type | Metric(s) | Description |
|---|---|---|---|
| Policy Violations (per min) | Stat | policy_violations_total | Real-time policy violation rate. Threshold: yellow >1, red >10. |
| Audit Chain Breaks (window) | Stat | audit_chain_breaks_total (increase over range) | HMAC chain verification failures. Any non-zero value is red — ties to the AuditChainBreak alert. |
| Policy Pack Evaluations (ops/s) | Stat | policy_pack_evaluations_total | Policy pack evaluation throughput. |
| Policy Violations (window total) | Stat | policy_violations_total (increase over range) | Cumulative violations in the selected time window. |
| Policy Violations by Framework | Time series | policy_violations_total by framework | Violation rate per compliance framework (e.g., HIPAA, SOC2, GDPR). |
| DLP Entity Distribution (window) | Pie chart (donut) | dlp_trigger_total by entity_type | Distribution of detected entity types over the selected range. |
| Violation Counts by Framework (window) | Table | policy_violations_total by framework (increase) | Ranked table of violation counts per framework. |
| Policy Pack Violation Rate by Pack | Bar gauge | policy_pack_evaluations_total by pack_name | Fraction of evaluations resulting in a violation, per pack. Threshold: green <50%, yellow 50–80%, red >80%. |
## Customizing dashboards

All six dashboards are editable in Grafana (the JSON files set `"editable": true`). To extend a dashboard without losing upstream updates, use Grafana's built-in copy feature:
- Open the dashboard.
- Click the gear icon → Save As — enter a new title (e.g., “Arbitex — DLP Analysis (Custom)”).
- The copy is stored in Grafana’s database and survives re-imports of the original file.
## See also

- OpenTelemetry SDK integration — how to enable OTel tracing and metrics export
- Prometheus alerting reference — alert rules that fire on these metrics