
OpenTelemetry configuration guide

The Arbitex platform ships an opt-in OpenTelemetry (OTel) SDK integration (platform-0046). When enabled, the platform exports distributed traces and metrics to any OTLP-compatible backend. This guide walks through enabling OTel, connecting to common backends, using the Grafana dashboards, and diagnosing configuration problems.

For the internal SDK architecture reference (TracerProvider internals, ContextVar bridge, Python package requirements), see OpenTelemetry SDK integration.

Prerequisites
  • Arbitex platform 0046+ (platform-0046 ships the OTel SDK).
  • An OTLP-compatible backend (Jaeger, Grafana Tempo, Grafana Cloud, Datadog Agent, Honeycomb, or any OTel Collector).
  • Docker image built with OTEL_EXTRAS=1 or optional OTel packages installed in your environment (see Required packages).

OTel is disabled by default. To enable it, set one environment variable:

```sh
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
```

That is all that is required for a working integration. When set, the platform initializes the TracerProvider and MeterProvider at startup and begins exporting traces and metrics to the endpoint.

Verify by checking the startup log:

OTel initialized: service=arbitex-platform, endpoint=http://otel-collector:4317

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Yes | — | gRPC OTLP exporter endpoint, e.g. `http://otel-collector:4317`. Setting this variable activates OTel OTLP export. Prometheus `/metrics` scraping is always available regardless of this setting. |
| `OTEL_SERVICE_NAME` | No | `arbitex-platform` | Service name on all spans and metrics (`service.name` resource attribute). |
| `OTEL_RESOURCE_ATTRIBUTES` | No | — | Additional resource attributes in `key=value,key=value` format, appended to all telemetry. |
| `OTEL_TRACES_SAMPLER` | No | `parentbased_always_on` | Trace sampling strategy. Set to `traceidratio` with `OTEL_TRACES_SAMPLER_ARG` to reduce span volume. |
| `OTEL_TRACES_SAMPLER_ARG` | No | `1.0` | Sampling ratio when using the `traceidratio` sampler. Example: `0.1` samples 10% of traces. |
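Taken together, these variables drive a simple resolution step at startup. The sketch below is illustrative only — the function name and return shape are hypothetical, not the platform's actual API — but the defaults mirror the table above:

```python
import os

# Hypothetical sketch of startup configuration resolution.
# Defaults mirror the documented environment variables.
def resolve_otel_config(env=None):
    env = os.environ if env is None else env
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if not endpoint:
        return None  # no endpoint set: OTel stays disabled
    return {
        "endpoint": endpoint,
        "service_name": env.get("OTEL_SERVICE_NAME", "arbitex-platform"),
        "sampler": env.get("OTEL_TRACES_SAMPLER", "parentbased_always_on"),
        "sampler_arg": float(env.get("OTEL_TRACES_SAMPLER_ARG", "1.0")),
    }
```

The key behavior to note is the first branch: without `OTEL_EXPORTER_OTLP_ENDPOINT`, every other variable is ignored.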

The OTLP metric exporter uses a PeriodicExportingMetricReader with a fixed 60-second flush interval (export_interval_millis=60000, set in backend/app/core/telemetry.py). This interval is not currently configurable via environment variable. The Prometheus metric reader exports on-demand (every scrape, typically 15–30 seconds).

When OTEL_EXPORTER_OTLP_ENDPOINT is set, the platform also exports Python WARNING-and-above log records to the OTLP backend via the OTel Logs SDK (BatchLogRecordProcessor). This enables log-trace correlation in backends that support OTLP logs (Grafana Tempo + Loki, Honeycomb, Datadog). Log records include trace_id and span_id fields for correlation.

Log export uses the same gRPC endpoint as traces and metrics — there is no separate log endpoint.

Use an https:// prefix for TLS-terminated endpoints:

```sh
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.internal:4317
```

The gRPC exporter uses insecure mode when the endpoint starts with http://. Always use TLS in production.
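The scheme check can be expressed directly. This helper is a hypothetical illustration of the rule, not platform code:

```python
from urllib.parse import urlparse

def exporter_uses_insecure_channel(endpoint: str) -> bool:
    # http:// -> plaintext gRPC channel; https:// -> TLS
    return urlparse(endpoint).scheme == "http"
```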


Docker Compose:

```yaml
services:
  api:
    image: arbitex-platform:latest
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4317"
      OTEL_SERVICE_NAME: "arbitex-platform"
      OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=production"
```

Kubernetes (values file):

```yaml
api:
  env:
    OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.monitoring.svc.cluster.local:4317"
    OTEL_SERVICE_NAME: "arbitex-platform"
    OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=production,k8s.namespace=arbitex"
```

Set the same variables in the outpost environment. Point the OTLP endpoint at a collector co-located in your private network — do not route outpost spans through the public Arbitex cloud endpoint.

```yaml
# outpost values-prod.yaml
env:
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.corp.internal:4317"
  OTEL_SERVICE_NAME: "arbitex-outpost"
```

The OTel SDK auto-instruments three libraries at startup. No code changes are required.

| Service | What is traced |
| --- | --- |
| FastAPI | Every HTTP request/response: method, path, status code, duration |
| httpx | Every outbound HTTP call: provider API calls, webhook deliveries, management plane syncs |
| SQLAlchemy | Every database query: table, operation, duration |

Auto-instrumentation is silently skipped for any package that is not installed. If you see missing spans for database queries, verify that opentelemetry-instrumentation-sqlalchemy is installed.

FastAPI spans are named {method} {path_template} — for example, POST /api/chat/completions. httpx spans use the target host and path. SQLAlchemy spans include the table name and operation.

Custom metrics
The platform emits a set of custom metrics via the OTel MeterProvider. These are in addition to the Prometheus metrics exposed at /metrics. Both sets share the same metric names when OTel is enabled — the MeterProvider exports them every 60 seconds via PeriodicExportingMetricReader.

Key platform metrics exported to OTel:

| Metric | Type | Description |
| --- | --- | --- |
| `arbitex.requests.total` | Counter | Total AI requests processed |
| `arbitex.dlp.scan_duration_seconds` | Histogram | DLP pipeline scan latency per tier |
| `arbitex.provider.latency_seconds` | Histogram | AI provider response latency by provider |
| `arbitex.policy.rule_evaluations_total` | Counter | Policy rule evaluations by action type |
| `arbitex.override.count` | Counter | ALLOW_WITH_OVERRIDE overrides by rule |
| `arbitex.budget.utilization_ratio` | Gauge | Budget utilization fraction (0.0–1.0) |

All metrics include service.name, deployment.environment, and any OTEL_RESOURCE_ATTRIBUTES as labels.
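The `key=value,key=value` format of `OTEL_RESOURCE_ATTRIBUTES` parses straightforwardly. A minimal sketch (the function name is illustrative, not the SDK's parser):

```python
def parse_resource_attributes(raw: str) -> dict:
    # "deployment.environment=production,k8s.namespace=arbitex" -> dict
    attrs = {}
    for pair in raw.split(","):
        if "=" in pair:
            key, value = pair.split("=", 1)  # values may themselves contain "="
            attrs[key.strip()] = value.strip()
    return attrs
```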

Jaeger
Jaeger 1.35+ supports native OTLP gRPC on port 4317.

  1. Start Jaeger with OTLP enabled:

     ```sh
     docker run --rm -p 16686:16686 -p 4317:4317 \
       jaegertracing/all-in-one:latest \
       --collector.otlp.grpc.host-port=:4317
     ```

  2. Set the platform environment variables:

     ```sh
     OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
     OTEL_SERVICE_NAME=arbitex-platform
     ```

  3. Open the Jaeger UI at http://localhost:16686. Select service arbitex-platform and search for recent traces.

Kubernetes: Deploy Jaeger via its Helm chart and point the endpoint at the collector service.

Grafana Cloud
Grafana Cloud provides hosted Tempo (traces) and Prometheus (metrics) with OTLP ingestion endpoints.

  1. In Grafana Cloud, navigate to Connections > Add new connection > OpenTelemetry.
  2. Note the OTLP endpoint and generate an API token with write access to Tempo and Prometheus.
  3. Configure the platform:

     ```sh
     OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod-us-central-0.grafana.net:443
     OTEL_SERVICE_NAME=arbitex-platform
     OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
     ```

Because Grafana Cloud requires authentication, you must configure the exporter headers. Set these as additional environment variables:

```sh
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <base64(instance_id:api_token)>"
```

Replace <base64(instance_id:api_token)> with your base64-encoded Grafana Cloud credentials. The exact header format is shown in the Grafana Cloud connection wizard.
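Encoding the credential pair is standard Basic auth. The helper below is a hypothetical convenience, not part of the platform; the credential values are placeholders:

```python
import base64

def grafana_otlp_headers(instance_id: str, api_token: str) -> str:
    # base64("instance_id:api_token") forms the Basic auth header value
    token = base64.b64encode(f"{instance_id}:{api_token}".encode()).decode()
    return f"Authorization=Basic {token}"
```

Pass the result as the value of `OTEL_EXPORTER_OTLP_HEADERS`.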

  4. In Grafana Cloud, open Explore > Tempo to browse traces. Open Explore > Prometheus to query platform metrics.

Datadog
The Datadog Agent 7.x+ ships an OTLP receiver. Enable it in datadog.yaml:

```yaml
# datadog.yaml
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
```

Or with the Helm chart:

```yaml
datadog:
  otlp:
    receiver:
      protocols:
        grpc:
          enabled: true
          endpoint: "0.0.0.0:4317"
```

Configure the platform:

```sh
OTEL_EXPORTER_OTLP_ENDPOINT=http://datadog-agent:4317
OTEL_SERVICE_NAME=arbitex-platform
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
```

Traces appear in Datadog APM under APM > Services > arbitex-platform. Metrics appear in the Metrics Explorer with the otel.* namespace prefix.

Note: Datadog requires the Agent to be running and reachable. If the Agent is not on the same host/network, adjust the endpoint hostname.

Other backends (OTel Collector)
For any other backend, deploy an OTel Collector in front of your backend and point the platform at the collector’s OTLP gRPC receiver (default port 4317).

Example minimal collector configuration:

```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```

Arbitex ships six pre-built Grafana dashboards. Import them from the observability/grafana/ directory in the platform repository. All dashboards use the arbitex_ Prometheus metric prefix and the arbitex-platform service name in Tempo.

1. Request Volume & Latency

Top-level overview of AI request throughput and end-to-end latency. Key panels: requests/second (counter rate), p50/p95/p99 provider latency histograms, error rate by provider, requests by model. Use this as the primary SLO dashboard.

2. DLP Pipeline

Data loss prevention pipeline performance. Panels: DLP scan duration histogram (arbitex.dlp.scan_duration_seconds) by tier (basic/standard/enterprise), entity detection hit rate by entity type, tier escalation rate, DLP-blocked request count. Useful for tuning DLP performance and identifying scan bottlenecks.

3. Policy Engine

Policy rule evaluation metrics. Panels: evaluations per second by action type (ALLOW/BLOCK/PROMPT/REDACT/ALLOW_WITH_OVERRIDE), override count by rule (arbitex.override.count), average policy evaluation duration (eval_duration_ms), top 10 rules by fire rate. Use to audit governance action frequency.

4. Provider Health

Real-time provider availability and error tracking. Panels: provider request success rate, error breakdown by HTTP status code (429 rate limit, 5xx errors), provider latency comparison, active provider kill switch status. Alert when any provider success rate drops below threshold.

5. Budget & Quota Utilization

Budget and quota consumption across the platform. Panels: budget utilization ratio (arbitex.budget.utilization_ratio) by org/tenant, quota usage by user/group, throttle event rate (80%/95%/100% threshold crossings), projected overage timeline. Links to the usage dashboard for drilldown.

6. Infrastructure

Platform infrastructure health. Panels: API process CPU and memory, database connection pool saturation, Redis hit rate, active SSE connections, background task queue depth, worker thread count. Use for capacity planning and incident investigation.

To drill down from a metrics spike into the underlying traces:

  1. Open the latency panel on the DLP Pipeline dashboard.
  2. Click a spike in the arbitex.dlp.scan_duration_seconds histogram.
  3. In the drill-down view, select View traces for this time window (requires Grafana Tempo data source linked to the same trace backend).
  4. The Explore panel shows matching spans. Click a span to see the full trace: inbound HTTP → DLP scan → provider call.

When OTel is enabled, every structured log line written during a request includes a trace_id field that matches the OTel-generated trace ID. In Grafana Logs (Loki), use the Derive fields feature to create a link from trace_id log fields to the Tempo trace:

  1. In the Loki data source settings, add a derived field:
    • Name: trace_id
    • Regex: "trace_id":"([0-9a-f]{32})"
    • URL: ${__value.raw} (Tempo data source)
  2. Log lines in Explore will show a Jump to trace link next to each trace_id value.
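The derived-field regex above can be verified locally before configuring Loki. This snippet applies the same pattern to a sample log line (the log line itself is illustrative; the trace ID is the W3C example value):

```python
import re

# Same pattern as the Loki derived-field regex above.
TRACE_ID_RE = re.compile(r'"trace_id":"([0-9a-f]{32})"')

def extract_trace_id(log_line: str):
    m = TRACE_ID_RE.search(log_line)
    return m.group(1) if m else None
```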

Arbitex ships eight Prometheus alert rules in observability/prometheus/arbitex_alerts.yml. Import or include them in your Prometheus configuration alongside the scrape_configs targeting /metrics.

1. ArbitexHighErrorRate

```yaml
expr: rate(arbitex_request_errors_total[5m]) / rate(arbitex_requests_total[5m]) > 0.05
for: 2m
severity: warning
```

Fires when the platform-wide request error rate exceeds 5% for 2 consecutive minutes. Common causes: provider outage, authentication misconfiguration, or downstream rate limiting.
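The alert condition reduces to a ratio test. A hypothetical sketch (not platform code), including the zero-traffic guard that PromQL handles implicitly by returning no samples:

```python
def high_error_rate(error_rate: float, request_rate: float, threshold: float = 0.05) -> bool:
    # Mirrors rate(errors)/rate(requests) > 0.05 from the rule above.
    if request_rate == 0:
        return False  # no traffic: nothing to alert on
    return error_rate / request_rate > threshold
```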

2. ArbitexProviderDown

```yaml
expr: arbitex_provider_success_rate < 0.5
for: 1m
severity: critical
```

Fires when any individual AI provider’s success rate drops below 50% for 1 minute. Suggests a provider API outage or kill switch activation.

3. ArbitexDLPScanLatencyHigh

```yaml
expr: histogram_quantile(0.95, rate(arbitex_dlp_scan_duration_seconds_bucket[5m])) > 2.0
for: 5m
severity: warning
```

Fires when the 95th-percentile DLP scan duration exceeds 2 seconds for 5 minutes. Indicates NER/GLiNER model slowness or resource contention on the scan workers.

4. ArbitexBudgetNearLimit

```yaml
expr: arbitex_budget_utilization_ratio > 0.9
for: 10m
severity: warning
```

Fires when any tenant/org budget utilization ratio exceeds 90% for 10 minutes, providing warning just ahead of the platform’s 95% throttle tier. Coordinate with the tenant to review quota allocation.

5. ArbitexOverrideRateHigh

```yaml
expr: rate(arbitex_override_count_total[1h]) > 10
for: 30m
severity: info
```

Fires when governance override events (ALLOW_WITH_OVERRIDE or PROMPT confirmations) exceed 10/hour for 30 minutes. May indicate a policy rule is too broadly configured, generating unnecessary interruptions.

6. ArbitexPolicyEvalLatencyHigh

```yaml
expr: histogram_quantile(0.99, rate(arbitex_policy_eval_duration_ms_bucket[5m])) > 100
for: 5m
severity: warning
```

Fires when policy evaluation p99 latency exceeds 100ms. Policy evaluation runs in the hot path of every request; high latency here directly impacts end-user response time.

7. ArbitexOutpostHeartbeatMissed

```yaml
expr: time() - arbitex_outpost_last_heartbeat_timestamp > 300
for: 0m
severity: critical
```

Fires immediately when an outpost has not sent a heartbeat in more than 5 minutes (2.5× the 120-second heartbeat interval). Indicates the outpost is unreachable, crashed, or has lost network connectivity.
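The 300-second threshold follows from the heartbeat interval and grace factor. A hypothetical sketch of the same check:

```python
HEARTBEAT_INTERVAL_S = 120  # documented outpost heartbeat interval

def outpost_heartbeat_missed(now_ts: float, last_heartbeat_ts: float,
                             grace_factor: float = 2.5) -> bool:
    # 120 s * 2.5 = 300 s, matching the alert threshold above
    return (now_ts - last_heartbeat_ts) > HEARTBEAT_INTERVAL_S * grace_factor
```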

8. ArbitexAuditQueueDepthHigh

```yaml
expr: arbitex_audit_queue_depth > 1000
for: 5m
severity: warning
```

Fires when the audit log delivery queue depth exceeds 1000 events for 5 minutes. May indicate SIEM connector delivery failures or downstream SIEM backpressure. Check the SIEM health endpoint (GET /api/admin/siem/health) and the connector logs.

Troubleshooting
Symptom: Log line OTel disabled (OTEL_EXPORTER_OTLP_ENDPOINT not set) instead of the initialized message.

Fix: Verify OTEL_EXPORTER_OTLP_ENDPOINT is set in the container environment. Check for typos. The variable must be present at startup — it cannot be set dynamically after the process starts.


Symptom: Log line OTel packages not installed, skipping initialization.

Fix: The optional OTel packages are not in the Python environment. Rebuild the Docker image with OTEL_EXTRAS=1:

```sh
docker build --build-arg OTEL_EXTRAS=1 -t arbitex-platform:latest .
```

Or install the packages manually:

```sh
pip install \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-fastapi \
  opentelemetry-instrumentation-httpx \
  opentelemetry-instrumentation-sqlalchemy
```
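To confirm the packages are actually importable in the running environment, a generic check like the following can help. The function is a hypothetical diagnostic, not a platform utility:

```python
import importlib.util

def missing_modules(module_names):
    # Report which modules cannot be resolved in the current environment.
    missing = []
    for name in module_names:
        try:
            if importlib.util.find_spec(name) is None:
                missing.append(name)
        except ModuleNotFoundError:
            # Parent package itself is absent (e.g. "opentelemetry")
            missing.append(name)
    return missing
```

For example, `missing_modules(["opentelemetry.sdk", "opentelemetry.instrumentation.sqlalchemy"])` returns an empty list only when both packages are installed.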

Symptom: OTel initializes but traces do not appear in Jaeger/Grafana/Datadog.

Checks:

  1. Verify the collector/backend is reachable from the platform container. Run a connectivity test:
     ```sh
     docker exec arbitex-api curl -v telnet://otel-collector:4317
     ```
  2. Check for TLS mismatch: if the endpoint uses http:// but the collector requires TLS, switch to https://.
  3. Check for authentication issues (Grafana Cloud, Datadog): confirm the OTEL_EXPORTER_OTLP_HEADERS value is correct.
  4. The BatchSpanProcessor buffers spans and flushes them periodically. Send a few requests to generate spans. Wait up to 10 seconds for the first flush.

Symptom: Traces are visible but no metrics arrive at the backend.

Checks:

  1. Metrics are flushed every 60 seconds by PeriodicExportingMetricReader. Wait at least 90 seconds before concluding metrics are absent.
  2. Confirm the backend accepts metrics on the same OTLP gRPC endpoint as traces. Some backends use separate ports or endpoints for metrics vs. traces — if so, configure a Collector that fans out from a single receiver.

Symptom: FastAPI and httpx spans are visible but no SQLAlchemy spans.

Fix: Verify opentelemetry-instrumentation-sqlalchemy is installed in the environment. Check the startup log for a warning like SQLAlchemy instrumentation skipped: package not found.


Symptom: OTel is enabled but log lines do not include trace_id.

Cause: The _ContextVarBridgeProcessor writes the OTel trace ID into the legacy observability.py ContextVars, which are read by the structured logger. If trace_id is absent in logs, the bridge processor may not be active — verify the platform version is 0046+.