# OpenTelemetry configuration guide
The Arbitex platform ships an opt-in OpenTelemetry (OTel) SDK integration (platform-0046). When enabled, the platform exports distributed traces and metrics to any OTLP-compatible backend. This guide walks through enabling OTel, connecting to common backends, using the Grafana dashboards, and diagnosing configuration problems.
For the internal SDK architecture reference (TracerProvider internals, ContextVar bridge, Python package requirements), see OpenTelemetry SDK integration.
## Prerequisites

- Arbitex platform 0046+ (platform-0046 ships the OTel SDK).
- An OTLP-compatible backend (Jaeger, Grafana Tempo, Grafana Cloud, Datadog Agent, Honeycomb, or any OTel Collector).
- Docker image built with `OTEL_EXTRAS=1`, or the optional OTel packages installed in your environment (see Required packages).
## Quick start

OTel is disabled by default. To enable it, set one environment variable:

```shell
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
```

That is all that is required for a working integration. When set, the platform initializes the TracerProvider and MeterProvider at startup and begins exporting traces and metrics to the endpoint.

Verify by checking the startup log:
```
OTel initialized: service=arbitex-platform, endpoint=http://otel-collector:4317
```

## Environment variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Yes | — | gRPC OTLP exporter endpoint, e.g. `http://otel-collector:4317`. Setting this variable activates OTLP export. Prometheus `/metrics` scraping is always available regardless of this setting. |
| `OTEL_SERVICE_NAME` | No | `arbitex-platform` | Service name on all spans and metrics (the `service.name` resource attribute). |
| `OTEL_RESOURCE_ATTRIBUTES` | No | — | Additional resource attributes in `key=value,key=value` format, appended to all telemetry. |
| `OTEL_TRACES_SAMPLER` | No | `parentbased_always_on` | Trace sampling strategy. Set to `traceidratio` with `OTEL_TRACES_SAMPLER_ARG` to reduce span volume. |
| `OTEL_TRACES_SAMPLER_ARG` | No | `1.0` | Sampling ratio when using the `traceidratio` sampler. Example: `0.1` samples 10% of traces. |
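The `traceidratio` decision is deterministic: the OTel SDK keeps a trace when the low 64 bits of its trace ID fall below a bound derived from the ratio. A simplified Python sketch of that logic (illustrative only; the real sampler is `TraceIdRatioBased` in the OTel SDK):

```python
# Simplified sketch of the traceidratio sampling decision.
# The sampler compares the low 64 bits of the 128-bit trace ID
# against a bound derived from the sampling ratio.

TRACE_ID_MASK = (1 << 64) - 1  # low 64 bits of the trace ID

def should_sample(trace_id: int, ratio: float) -> bool:
    """Return True if a trace with this ID is kept at the given ratio."""
    bound = round(ratio * (TRACE_ID_MASK + 1))
    return (trace_id & TRACE_ID_MASK) < bound
```

Because the decision depends only on the trace ID, every service sampling at the same ratio keeps the same traces, so sampled traces stay complete across service boundaries.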
## Metric export interval

The OTLP metric exporter uses a `PeriodicExportingMetricReader` with a fixed 60-second flush interval (`export_interval_millis=60000`, set in `backend/app/core/telemetry.py`). This interval is not currently configurable via environment variable. The Prometheus metric reader exports on demand, at every scrape (typically every 15–30 seconds).
## Log export

When `OTEL_EXPORTER_OTLP_ENDPOINT` is set, the platform also exports Python log records at WARNING level and above to the OTLP backend via the OTel Logs SDK (`BatchLogRecordProcessor`). This enables log-trace correlation in backends that support OTLP logs (Grafana Tempo + Loki, Honeycomb, Datadog). Log records include `trace_id` and `span_id` fields for correlation.

Log export uses the same gRPC endpoint as traces and metrics; there is no separate log endpoint.

Use an `https://` prefix for TLS-terminated endpoints:

```shell
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.internal:4317
```

The gRPC exporter uses insecure (plaintext) mode when the endpoint starts with `http://`. Always use TLS in production.
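The scheme-to-TLS rule fits in a few lines. A hypothetical helper (not the platform's actual code) showing how an exporter's insecure flag could be derived from the endpoint:

```python
from urllib.parse import urlparse

def exporter_is_insecure(endpoint: str) -> bool:
    """Plaintext gRPC for http:// endpoints, TLS for https:// endpoints."""
    scheme = urlparse(endpoint).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported OTLP endpoint scheme: {scheme!r}")
    return scheme == "http"
```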
## Deployment configuration

### Docker Compose

```yaml
services:
  api:
    image: arbitex-platform:latest
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4317"
      OTEL_SERVICE_NAME: "arbitex-platform"
      OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=production"
```

### Kubernetes (Helm)

```yaml
api:
  env:
    OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.monitoring.svc.cluster.local:4317"
    OTEL_SERVICE_NAME: "arbitex-platform"
    OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=production,k8s.namespace=arbitex"
```

### Outpost deployments

Set the same variables in the outpost environment. Point the OTLP endpoint at a collector co-located in your private network; do not route outpost spans through the public Arbitex cloud endpoint.

```yaml
# outpost values-prod.yaml
env:
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.corp.internal:4317"
  OTEL_SERVICE_NAME: "arbitex-outpost"
```
## Auto-instrumented services

The OTel SDK auto-instruments three libraries at startup. No code changes are required.
| Service | What is traced |
|---|---|
| FastAPI | Every HTTP request/response — method, path, status code, duration |
| httpx | Every outbound HTTP call — provider API calls, webhook deliveries, management plane syncs |
| SQLAlchemy | Every database query — table, operation, duration |
Auto-instrumentation is silently skipped for any package that is not installed. If you see missing spans for database queries, verify that `opentelemetry-instrumentation-sqlalchemy` is installed.
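The skip-if-missing behavior is the standard guarded-import pattern. A stdlib-only sketch (the helper name is hypothetical) of applying instrumentation only when its package is importable:

```python
import importlib.util

def instrument_if_available(package: str, apply_instrumentation) -> bool:
    """Call apply_instrumentation() only if `package` is importable.

    `package` must be a top-level importable name. Returns True when
    instrumentation was applied, False when it was skipped.
    """
    if importlib.util.find_spec(package) is None:
        print(f"{package} instrumentation skipped: package not found")
        return False
    apply_instrumentation()
    return True
```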
## Span naming

FastAPI spans are named `{method} {path_template}`, for example `POST /api/chat/completions`. httpx spans use the target host and path. SQLAlchemy spans include the table name and operation.
## Custom metrics

The platform emits a set of custom metrics via the OTel MeterProvider, in addition to the Prometheus metrics exposed at `/metrics`. Both sets share the same metric names. When OTel is enabled, the MeterProvider exports them every 60 seconds via `PeriodicExportingMetricReader`.
Key platform metrics exported to OTel:
| Metric | Type | Description |
|---|---|---|
| `arbitex.requests.total` | Counter | Total AI requests processed |
| `arbitex.dlp.scan_duration_seconds` | Histogram | DLP pipeline scan latency per tier |
| `arbitex.provider.latency_seconds` | Histogram | AI provider response latency by provider |
| `arbitex.policy.rule_evaluations_total` | Counter | Policy rule evaluations by action type |
| `arbitex.override.count` | Counter | ALLOW_WITH_OVERRIDE overrides by rule |
| `arbitex.budget.utilization_ratio` | Gauge | Budget utilization fraction (0.0–1.0) |
All metrics include `service.name`, `deployment.environment`, and any `OTEL_RESOURCE_ATTRIBUTES` as labels.
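Note how the names map between the two export paths: the alert rules later in this guide reference `arbitex_requests_total`, which is the Prometheus-exposed form of the OTel name `arbitex.requests.total` (Prometheus metric names cannot contain dots). A simplified sketch of the sanitization rule:

```python
import re

def prometheus_name(otel_name: str) -> str:
    """Map an OTel metric name to its Prometheus-exposed form (simplified).

    Replaces every character that is invalid in a Prometheus metric
    name (anything outside [a-zA-Z0-9_:]) with an underscore.
    """
    return re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
```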
## Backend configuration

### Jaeger

Jaeger 1.35+ supports native OTLP gRPC ingestion on port 4317.

- Start Jaeger with OTLP enabled:

  ```shell
  docker run --rm -p 16686:16686 -p 4317:4317 \
    jaegertracing/all-in-one:latest \
    --collector.otlp.grpc.host-port=:4317
  ```

- Set the platform environment variables:

  ```shell
  OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
  OTEL_SERVICE_NAME=arbitex-platform
  ```

- Open the Jaeger UI at `http://localhost:16686`. Select service `arbitex-platform` and search for recent traces.

Kubernetes: Deploy Jaeger via its Helm chart and point the endpoint at the collector service.
### Grafana Cloud (Tempo + Prometheus)

Grafana Cloud provides hosted Tempo (traces) and Prometheus (metrics) with OTLP ingestion endpoints.

- In Grafana Cloud, navigate to Connections > Add new connection > OpenTelemetry.
- Note the OTLP endpoint and generate an API token with write access to Tempo and Prometheus.
- Configure the platform:

  ```shell
  OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod-us-central-0.grafana.net:443
  OTEL_SERVICE_NAME=arbitex-platform
  OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
  ```

  Because Grafana Cloud requires authentication, you must also configure the exporter headers as an additional environment variable:

  ```shell
  OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <base64(instance_id:api_token)>"
  ```

  Replace `<base64(instance_id:api_token)>` with your base64-encoded Grafana Cloud credentials. The exact header format is shown in the Grafana Cloud connection wizard.

- In Grafana Cloud, open Explore > Tempo to browse traces, and Explore > Prometheus to query platform metrics.
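The Basic credential is the base64 encoding of `instance_id:api_token`. A small sketch for building the header value (the helper name and the credentials below are placeholders, not real values):

```python
import base64

def grafana_otlp_basic_header(instance_id: str, api_token: str) -> str:
    """Build the Authorization value for OTEL_EXPORTER_OTLP_HEADERS."""
    credentials = f"{instance_id}:{api_token}".encode()
    return "Basic " + base64.b64encode(credentials).decode()

# Placeholder credentials for illustration only:
header = grafana_otlp_basic_header("123456", "glc_example_token")
```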
### Datadog

The Datadog Agent 7.x+ ships an OTLP receiver. Enable it in `datadog.yaml`:

```yaml
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
```

Or with the Helm chart:

```yaml
datadog:
  otlp:
    receiver:
      protocols:
        grpc:
          enabled: true
          endpoint: "0.0.0.0:4317"
```

Configure the platform:

```shell
OTEL_EXPORTER_OTLP_ENDPOINT=http://datadog-agent:4317
OTEL_SERVICE_NAME=arbitex-platform
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
```

Traces appear in Datadog APM under APM > Services > arbitex-platform. Metrics appear in the Metrics Explorer with the `otel.*` namespace prefix.

Note: Datadog requires the Agent to be running and reachable. If the Agent is not on the same host/network, adjust the endpoint hostname.
### Generic OTel Collector

For any other backend, deploy an OTel Collector in front of your backend and point the platform at the collector's OTLP gRPC receiver (default port 4317).

Example minimal collector configuration:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```
## Grafana dashboards

Arbitex ships six pre-built Grafana dashboards. Import them from the `observability/grafana/` directory in the platform repository. All dashboards use the `arbitex_` Prometheus metric prefix and the `arbitex-platform` service name in Tempo.
### Dashboard catalog

**1. Request Volume & Latency**
Top-level overview of AI request throughput and end-to-end latency. Key panels: requests/second (counter rate), p50/p95/p99 provider latency histograms, error rate by provider, requests by model. Use this as the primary SLO dashboard.
**2. DLP Pipeline**
Data loss prevention pipeline performance. Panels: DLP scan duration histogram (`arbitex.dlp.scan_duration_seconds`) by tier (basic/standard/enterprise), entity detection hit rate by entity type, tier escalation rate, DLP-blocked request count. Useful for tuning DLP performance and identifying scan bottlenecks.
**3. Policy Engine**
Policy rule evaluation metrics. Panels: evaluations per second by action type (ALLOW/BLOCK/PROMPT/REDACT/ALLOW_WITH_OVERRIDE), override count by rule (`arbitex.override.count`), average policy evaluation duration (`eval_duration_ms`), top 10 rules by fire rate. Use to audit governance action frequency.
**4. Provider Health**
Real-time provider availability and error tracking. Panels: provider request success rate, error breakdown by HTTP status code (429 rate limit, 5xx errors), provider latency comparison, active provider kill switch status. Alert when any provider success rate drops below threshold.
**5. Budget & Quota Utilization**
Budget and quota consumption across the platform. Panels: budget utilization ratio (`arbitex.budget.utilization_ratio`) by org/tenant, quota usage by user/group, throttle event rate (80%/95%/100% threshold crossings), projected overage timeline. Links to the usage dashboard for drilldown.
**6. Infrastructure**
Platform infrastructure health. Panels: API process CPU and memory, database connection pool saturation, Redis hit rate, active SSE connections, background task queue depth, worker thread count. Use for capacity planning and incident investigation.
### Navigating traces from dashboard panels

- Open the DLP Pipeline Latency dashboard panel.
- Click a spike in the `arbitex.dlp.scan_duration_seconds` histogram.
- In the drill-down view, select View traces for this time window (requires a Grafana Tempo data source linked to the same trace backend).
- The Explore panel shows matching spans. Click a span to see the full trace: inbound HTTP → DLP scan → provider call.
### Log-to-trace correlation

When OTel is enabled, every structured log line written during a request includes a `trace_id` field that matches the OTel-generated trace ID. In Grafana Logs (Loki), use the derived fields feature to create a link from `trace_id` log fields to the Tempo trace:

- In the Loki data source settings, add a derived field:
  - Name: `trace_id`
  - Regex: `"trace_id":"([0-9a-f]{32})"`
  - URL: `${__value.raw}` (Tempo data source)
- Log lines in Explore will show a Jump to trace link next to each `trace_id` value.
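You can sanity-check the derived-field regex against a sample log line before wiring it into Loki. A quick sketch using Python's `re` (the log line is a fabricated example):

```python
import re

# The same pattern configured as the derived field's regex:
TRACE_ID_PATTERN = re.compile(r'"trace_id":"([0-9a-f]{32})"')

# Fabricated structured log line for illustration:
line = '{"level":"info","msg":"request complete","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736"}'

match = TRACE_ID_PATTERN.search(line)
trace_id = match.group(1) if match else None
```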
## Prometheus alert rules

Arbitex ships eight Prometheus alert rules in `observability/prometheus/arbitex_alerts.yml`. Import or include them in your Prometheus configuration alongside the `scrape_configs` targeting `/metrics`.
### Alert catalog

**1. ArbitexHighErrorRate**

```yaml
expr: rate(arbitex_request_errors_total[5m]) / rate(arbitex_requests_total[5m]) > 0.05
for: 2m
severity: warning
```

Fires when the platform-wide request error rate exceeds 5% for 2 consecutive minutes. Common causes: provider outage, authentication misconfiguration, or downstream rate limiting.

**2. ArbitexProviderDown**

```yaml
expr: arbitex_provider_success_rate < 0.5
for: 1m
severity: critical
```

Fires when any individual AI provider's success rate drops below 50% for 1 minute. Suggests a provider API outage or kill switch activation.

**3. ArbitexDLPScanLatencyHigh**

```yaml
expr: histogram_quantile(0.95, rate(arbitex_dlp_scan_duration_seconds_bucket[5m])) > 2.0
for: 5m
severity: warning
```

Fires when the 95th-percentile DLP scan duration exceeds 2 seconds for 5 minutes. Indicates NER/GLiNER model slowness or resource contention on the scan workers.

**4. ArbitexBudgetNearLimit**

```yaml
expr: arbitex_budget_utilization_ratio > 0.9
for: 10m
severity: warning
```

Fires when any tenant/org budget utilization ratio exceeds 90% for 10 minutes. Corresponds to the platform's 95% throttle tier. Coordinate with the tenant to review quota allocation.

**5. ArbitexOverrideRateHigh**

```yaml
expr: rate(arbitex_override_count_total[1h]) > 10
for: 30m
severity: info
```

Fires when governance override events (ALLOW_WITH_OVERRIDE or PROMPT confirmations) exceed 10/hour for 30 minutes. May indicate a policy rule is too broadly configured, generating unnecessary interruptions.

**6. ArbitexPolicyEvalLatencyHigh**

```yaml
expr: histogram_quantile(0.99, rate(arbitex_policy_eval_duration_ms_bucket[5m])) > 100
for: 5m
severity: warning
```

Fires when policy evaluation p99 latency exceeds 100 ms. Policy evaluation runs in the hot path of every request; high latency here directly impacts end-user response time.

**7. ArbitexOutpostHeartbeatMissed**

```yaml
expr: time() - arbitex_outpost_last_heartbeat_timestamp > 300
for: 0m
severity: critical
```

Fires immediately when an outpost has not sent a heartbeat in more than 5 minutes (2.5× the 120-second heartbeat interval). Indicates the outpost is unreachable, crashed, or has lost network connectivity.

**8. ArbitexAuditQueueDepthHigh**

```yaml
expr: arbitex_audit_queue_depth > 1000
for: 5m
severity: warning
```

Fires when the audit log delivery queue depth exceeds 1000 events for 5 minutes. May indicate SIEM connector delivery failures or downstream SIEM backpressure. Check the SIEM health endpoint (`GET /api/admin/siem/health`) and the connector logs.
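Alerts 3 and 6 rely on `histogram_quantile`, which estimates a quantile by linear interpolation inside cumulative histogram buckets. A simplified sketch of that estimation (Prometheus's real implementation handles more edge cases, such as empty or single-bucket histograms):

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (inf, total_count),
    mirroring Prometheus's `le` label convention.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                return prev_bound  # quantile falls in the open-ended bucket
            # Linear interpolation within this bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (upper - prev_bound)
        prev_bound, prev_count = upper, count
    return prev_bound
```

For example, with buckets `[(0.5, 50), (1.0, 90), (2.0, 100), (inf, 100)]` the 0.95 quantile lands halfway into the 1.0–2.0 bucket, giving 1.5.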
## Troubleshooting

### OTel does not initialize

Symptom: Log line `OTel disabled (OTEL_EXPORTER_OTLP_ENDPOINT not set)` instead of the initialized message.

Fix: Verify `OTEL_EXPORTER_OTLP_ENDPOINT` is set in the container environment and check for typos. The variable must be present at startup; it cannot be set dynamically after the process starts.

### OTel packages not installed

Symptom: Log line `OTel packages not installed, skipping initialization`.

Fix: The optional OTel packages are not in the Python environment. Rebuild the Docker image with `OTEL_EXTRAS=1`:

```shell
docker build --build-arg OTEL_EXTRAS=1 -t arbitex-platform:latest .
```

Or install the packages manually:

```shell
pip install \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-fastapi \
  opentelemetry-instrumentation-httpx \
  opentelemetry-instrumentation-sqlalchemy
```
### No spans appearing in backend

Symptom: OTel initializes but traces do not appear in Jaeger/Grafana/Datadog.

Checks:

- Verify the collector/backend is reachable from the platform container. Run a connectivity test:

  ```shell
  docker exec arbitex-api curl -v telnet://otel-collector:4317
  ```

- Check for TLS mismatch: if the endpoint uses `http://` but the collector requires TLS, switch to `https://`.
- Check for authentication issues (Grafana Cloud, Datadog): confirm the `OTEL_EXPORTER_OTLP_HEADERS` value is correct.
- The `BatchSpanProcessor` buffers spans and flushes them periodically. Send a few requests to generate spans, then wait up to 10 seconds for the first flush.
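If curl is not available in the container, an equivalent TCP-level check can be done with Python's standard library (the helper name is hypothetical; this only confirms the port accepts connections, it does not validate TLS or credentials):

```python
import socket

def otlp_port_reachable(host: str, port: int = 4317, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the collector port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it from inside the platform container, e.g. `otlp_port_reachable("otel-collector")`.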
### Traces appear but metrics do not

Symptom: Traces are visible but no metrics arrive at the backend.

Checks:

- Metrics are flushed every 60 seconds by `PeriodicExportingMetricReader`. Wait at least 90 seconds before concluding metrics are absent.
- Confirm the backend accepts metrics on the same OTLP gRPC endpoint as traces. Some backends use separate ports or endpoints for metrics and traces; if so, configure a Collector that fans out from a single receiver.
### Missing database query spans

Symptom: FastAPI and httpx spans are visible but no SQLAlchemy spans.

Fix: Verify `opentelemetry-instrumentation-sqlalchemy` is installed in the environment. Check the startup log for a warning like `SQLAlchemy instrumentation skipped: package not found`.
### trace_id not appearing in log lines

Symptom: OTel is enabled but log lines do not include `trace_id`.

Cause: The `_ContextVarBridgeProcessor` writes the OTel trace ID into the legacy `observability.py` ContextVars, which are read by the structured logger. If `trace_id` is absent in logs, the bridge processor may not be active; verify the platform version is 0046+.
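The bridge pattern itself is simple to illustrate. A stdlib-only sketch (names are hypothetical; the platform's actual bridge is the `_ContextVarBridgeProcessor`) of propagating a trace ID from a ContextVar into log records via a logging filter:

```python
import contextvars
import logging

# A span processor would populate this at span start (sketch):
current_trace_id: contextvars.ContextVar = contextvars.ContextVar(
    "current_trace_id", default=None
)

class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # annotate only; never drop records
```

A formatter such as `%(levelname)s trace_id=%(trace_id)s %(message)s` can then render the field on every line.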
## See also

- OpenTelemetry SDK integration — internal architecture: TracerProvider, MeterProvider, ContextVar bridge, graceful shutdown
- Grafana dashboard catalog — pre-built dashboards for DLP, provider latency, and request volume
- Prometheus alerting reference — alert rules for DLP, providers, and infrastructure
- Outpost health monitoring — outpost-specific health signals