# OpenTelemetry configuration guide
The Arbitex platform ships an opt-in OpenTelemetry (OTel) SDK integration (platform-0046). When enabled, the platform exports distributed traces and metrics to any OTLP-compatible backend. This guide walks through enabling OTel, connecting to common backends, using the Grafana dashboards, and diagnosing configuration problems.
For the internal SDK architecture reference (TracerProvider internals, ContextVar bridge, Python package requirements), see OpenTelemetry SDK integration.
## Prerequisites

- Arbitex platform 0046+ (platform-0046 ships the OTel SDK).
- An OTLP-compatible backend (Jaeger, Grafana Tempo, Grafana Cloud, Datadog Agent, Honeycomb, or any OTel Collector).
- Docker image built with `OTEL_EXTRAS=1`, or the optional OTel packages installed in your environment (see Required packages).
## Quick start

OTel is disabled by default. To enable it, set one environment variable:

```shell
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
```

That is all that is required for a working integration. When set, the platform initializes the TracerProvider and MeterProvider at startup and begins exporting traces and metrics to the endpoint.

Verify by checking the startup log:
```
OTel initialized: service=arbitex-platform, endpoint=http://otel-collector:4317
```

## Environment variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Yes | — | gRPC OTLP exporter endpoint, e.g. `http://otel-collector:4317`. Setting this variable activates OTLP export. Prometheus `/metrics` scraping is always available regardless of this setting. |
| `OTEL_SERVICE_NAME` | No | `arbitex-platform` | Service name on all spans and metrics (the `service.name` resource attribute). |
| `OTEL_RESOURCE_ATTRIBUTES` | No | — | Additional resource attributes in `key=value,key=value` format, appended to all telemetry. |
| `OTEL_TRACES_SAMPLER` | No | `parentbased_always_on` | Trace sampling strategy. Set to `traceidratio` with `OTEL_TRACES_SAMPLER_ARG` to reduce span volume. |
| `OTEL_TRACES_SAMPLER_ARG` | No | `1.0` | Sampling ratio when using the `traceidratio` sampler. Example: `0.1` samples 10% of traces. |
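The `traceidratio` decision is deterministic: the OTel SDK keeps a trace when the low 64 bits of its trace ID fall below a bound derived from the ratio. A simplified Python sketch of that logic (illustrative only; the real sampler is `TraceIdRatioBased` in the OTel SDK):

```python
# Simplified sketch of the traceidratio sampling decision.
# The sampler compares the low 64 bits of the 128-bit trace ID
# against a bound derived from the sampling ratio.

TRACE_ID_MASK = (1 << 64) - 1  # low 64 bits of the trace ID

def should_sample(trace_id: int, ratio: float) -> bool:
    """Return True if a trace with this ID is kept at the given ratio."""
    bound = round(ratio * (TRACE_ID_MASK + 1))
    return (trace_id & TRACE_ID_MASK) < bound
```

Because the decision depends only on the trace ID, every service sampling at the same ratio keeps the same traces, so sampled traces stay complete across service boundaries.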
## Metric export interval

The OTLP metric exporter uses a `PeriodicExportingMetricReader` with a fixed 60-second flush interval (`export_interval_millis=60000`, set in `backend/app/core/telemetry.py`). This interval is not currently configurable via environment variable. The Prometheus metric reader exports on demand, at every scrape (typically every 15–30 seconds).
## Log export

When `OTEL_EXPORTER_OTLP_ENDPOINT` is set, the platform also exports Python log records at WARNING level and above to the OTLP backend via the OTel Logs SDK (`BatchLogRecordProcessor`). This enables log-trace correlation in backends that support OTLP logs (Grafana Tempo + Loki, Honeycomb, Datadog). Log records include `trace_id` and `span_id` fields for correlation.

Log export uses the same gRPC endpoint as traces and metrics; there is no separate log endpoint.

Use an `https://` prefix for TLS-terminated endpoints:

```shell
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.internal:4317
```

The gRPC exporter uses insecure (plaintext) mode when the endpoint starts with `http://`. Always use TLS in production.
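The scheme-to-TLS rule fits in a few lines. A hypothetical helper (not the platform's actual code) showing how an exporter's insecure flag could be derived from the endpoint:

```python
from urllib.parse import urlparse

def exporter_is_insecure(endpoint: str) -> bool:
    """Plaintext gRPC for http:// endpoints, TLS for https:// endpoints."""
    scheme = urlparse(endpoint).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported OTLP endpoint scheme: {scheme!r}")
    return scheme == "http"
```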
## Deployment configuration

### Docker Compose

```yaml
services:
  api:
    image: arbitex-platform:latest
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4317"
      OTEL_SERVICE_NAME: "arbitex-platform"
      OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=production"
```

### Kubernetes (Helm)

```yaml
api:
  env:
    OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.monitoring.svc.cluster.local:4317"
    OTEL_SERVICE_NAME: "arbitex-platform"
    OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=production,k8s.namespace=arbitex"
```

### Outpost deployments

Set the same variables in the outpost environment. Point the OTLP endpoint at a collector co-located in your private network; do not route outpost spans through the public Arbitex cloud endpoint.

```yaml
# outpost values-prod.yaml
env:
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.corp.internal:4317"
  OTEL_SERVICE_NAME: "arbitex-outpost"
```
## Auto-instrumented services

The OTel SDK auto-instruments three libraries at startup. No code changes are required.
| Service | What is traced |
|---|---|
| FastAPI | Every HTTP request/response — method, path, status code, duration |
| httpx | Every outbound HTTP call — provider API calls, webhook deliveries, management plane syncs |
| SQLAlchemy | Every database query — table, operation, duration |
Auto-instrumentation is silently skipped for any package that is not installed. If you see missing spans for database queries, verify that `opentelemetry-instrumentation-sqlalchemy` is installed.
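The skip-if-missing behavior is the standard guarded-import pattern. A stdlib-only sketch (the helper name is hypothetical) of applying instrumentation only when its package is importable:

```python
import importlib.util

def instrument_if_available(package: str, apply_instrumentation) -> bool:
    """Call apply_instrumentation() only if `package` is importable.

    `package` must be a top-level importable name. Returns True when
    instrumentation was applied, False when it was skipped.
    """
    if importlib.util.find_spec(package) is None:
        print(f"{package} instrumentation skipped: package not found")
        return False
    apply_instrumentation()
    return True
```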
## Span naming

FastAPI spans are named `{method} {path_template}`, for example `POST /api/chat/completions`. httpx spans use the target host and path. SQLAlchemy spans include the table name and operation.
## Custom metrics

The platform emits a set of custom metrics via the OTel MeterProvider, in addition to the Prometheus metrics exposed at `/metrics`. Both sets share the same metric names. When OTel is enabled, the MeterProvider exports them every 60 seconds via `PeriodicExportingMetricReader`.
Key platform metrics exported to OTel:
| Metric | Type | Description |
|---|---|---|
| `arbitex.requests.total` | Counter | Total AI requests processed |
| `arbitex.dlp.scan_duration_seconds` | Histogram | DLP pipeline scan latency per tier |
| `arbitex.provider.latency_seconds` | Histogram | AI provider response latency by provider |
| `arbitex.policy.rule_evaluations_total` | Counter | Policy rule evaluations by action type |
| `arbitex.override.count` | Counter | ALLOW_WITH_OVERRIDE overrides by rule |
| `arbitex.budget.utilization_ratio` | Gauge | Budget utilization fraction (0.0–1.0) |
All metrics include `service.name`, `deployment.environment`, and any `OTEL_RESOURCE_ATTRIBUTES` as labels.
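Note how the names map between the two export paths: the alert rules later in this guide reference `arbitex_requests_total`, which is the Prometheus-exposed form of the OTel name `arbitex.requests.total` (Prometheus metric names cannot contain dots). A simplified sketch of the sanitization rule:

```python
import re

def prometheus_name(otel_name: str) -> str:
    """Map an OTel metric name to its Prometheus-exposed form (simplified).

    Replaces every character that is invalid in a Prometheus metric
    name (anything outside [a-zA-Z0-9_:]) with an underscore.
    """
    return re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
```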
## Backend configuration

### Jaeger

Jaeger 1.35+ supports native OTLP gRPC ingestion on port 4317.

- Start Jaeger with OTLP enabled:

  ```shell
  docker run --rm -p 16686:16686 -p 4317:4317 \
    jaegertracing/all-in-one:latest \
    --collector.otlp.grpc.host-port=:4317
  ```

- Set the platform environment variables:

  ```shell
  OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
  OTEL_SERVICE_NAME=arbitex-platform
  ```

- Open the Jaeger UI at `http://localhost:16686`. Select service `arbitex-platform` and search for recent traces.

Kubernetes: Deploy Jaeger via its Helm chart and point the endpoint at the collector service.
### Grafana Cloud (Tempo + Prometheus)

Grafana Cloud provides hosted Tempo (traces) and Prometheus (metrics) with OTLP ingestion endpoints.

- In Grafana Cloud, navigate to Connections > Add new connection > OpenTelemetry.
- Note the OTLP endpoint and generate an API token with write access to Tempo and Prometheus.
- Configure the platform:

  ```shell
  OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod-us-central-0.grafana.net:443
  OTEL_SERVICE_NAME=arbitex-platform
  OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
  ```

  Because Grafana Cloud requires authentication, you must also configure the exporter headers as an additional environment variable:

  ```shell
  OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <base64(instance_id:api_token)>"
  ```

  Replace `<base64(instance_id:api_token)>` with your base64-encoded Grafana Cloud credentials. The exact header format is shown in the Grafana Cloud connection wizard.

- In Grafana Cloud, open Explore > Tempo to browse traces, and Explore > Prometheus to query platform metrics.
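The Basic credential is the base64 encoding of `instance_id:api_token`. A small sketch for building the header value (the helper name and the credentials below are placeholders, not real values):

```python
import base64

def grafana_otlp_basic_header(instance_id: str, api_token: str) -> str:
    """Build the Authorization value for OTEL_EXPORTER_OTLP_HEADERS."""
    credentials = f"{instance_id}:{api_token}".encode()
    return "Basic " + base64.b64encode(credentials).decode()

# Placeholder credentials for illustration only:
header = grafana_otlp_basic_header("123456", "glc_example_token")
```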
### Datadog

The Datadog Agent 7.x+ ships an OTLP receiver. Enable it in `datadog.yaml`:

```yaml
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
```

Or with the Helm chart:

```yaml
datadog:
  otlp:
    receiver:
      protocols:
        grpc:
          enabled: true
          endpoint: "0.0.0.0:4317"
```

Configure the platform:

```shell
OTEL_EXPORTER_OTLP_ENDPOINT=http://datadog-agent:4317
OTEL_SERVICE_NAME=arbitex-platform
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
```

Traces appear in Datadog APM under APM > Services > arbitex-platform. Metrics appear in the Metrics Explorer with the `otel.*` namespace prefix.

Note: Datadog requires the Agent to be running and reachable. If the Agent is not on the same host/network, adjust the endpoint hostname.
### Generic OTel Collector

For any other backend, deploy an OTel Collector in front of your backend and point the platform at the collector's OTLP gRPC receiver (default port 4317).

Example minimal collector configuration:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```
## Grafana dashboards

Arbitex ships six pre-built Grafana dashboards. Import them from the `observability/grafana/` directory in the platform repository. All dashboards use the `arbitex_` Prometheus metric prefix and the `arbitex-platform` service name in Tempo.
### Dashboard catalog

**1. Request Volume & Latency**
Top-level overview of AI request throughput and end-to-end latency. Key panels: requests/second (counter rate), p50/p95/p99 provider latency histograms, error rate by provider, requests by model. Use this as the primary SLO dashboard.
**2. DLP Pipeline**
Data loss prevention pipeline performance. Panels: DLP scan duration histogram (`arbitex.dlp.scan_duration_seconds`) by tier (basic/standard/enterprise), entity detection hit rate by entity type, tier escalation rate, DLP-blocked request count. Useful for tuning DLP performance and identifying scan bottlenecks.
**3. Policy Engine**
Policy rule evaluation metrics. Panels: evaluations per second by action type (ALLOW/BLOCK/PROMPT/REDACT/ALLOW_WITH_OVERRIDE), override count by rule (`arbitex.override.count`), average policy evaluation duration (`eval_duration_ms`), top 10 rules by fire rate. Use to audit governance action frequency.
**4. Provider Health**
Real-time provider availability and error tracking. Panels: provider request success rate, error breakdown by HTTP status code (429 rate limit, 5xx errors), provider latency comparison, active provider kill switch status. Alert when any provider success rate drops below threshold.
**5. Budget & Quota Utilization**
Budget and quota consumption across the platform. Panels: budget utilization ratio (`arbitex.budget.utilization_ratio`) by org/tenant, quota usage by user/group, throttle event rate (80%/95%/100% threshold crossings), projected overage timeline. Links to the usage dashboard for drilldown.
**6. Infrastructure**
Platform infrastructure health. Panels: API process CPU and memory, database connection pool saturation, Redis hit rate, active SSE connections, background task queue depth, worker thread count. Use for capacity planning and incident investigation.
### Navigating traces from dashboard panels

- Open the DLP Pipeline Latency dashboard panel.
- Click a spike in the `arbitex.dlp.scan_duration_seconds` histogram.
- In the drill-down view, select View traces for this time window (requires a Grafana Tempo data source linked to the same trace backend).
- The Explore panel shows matching spans. Click a span to see the full trace: inbound HTTP → DLP scan → provider call.
### Log-to-trace correlation

When OTel is enabled, every structured log line written during a request includes a `trace_id` field that matches the OTel-generated trace ID. In Grafana Logs (Loki), use the derived fields feature to create a link from `trace_id` log fields to the Tempo trace:

- In the Loki data source settings, add a derived field:
  - Name: `trace_id`
  - Regex: `"trace_id":"([0-9a-f]{32})"`
  - URL: `${__value.raw}` (Tempo data source)
- Log lines in Explore will show a Jump to trace link next to each `trace_id` value.
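You can sanity-check the derived-field regex against a sample log line before wiring it into Loki. A quick sketch using Python's `re` (the log line is a fabricated example):

```python
import re

# The same pattern configured as the derived field's regex:
TRACE_ID_PATTERN = re.compile(r'"trace_id":"([0-9a-f]{32})"')

# Fabricated structured log line for illustration:
line = '{"level":"info","msg":"request complete","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736"}'

match = TRACE_ID_PATTERN.search(line)
trace_id = match.group(1) if match else None
```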
## Prometheus alert rules

Arbitex ships eight Prometheus alert rules in `observability/prometheus/arbitex_alerts.yml`. Import or include them in your Prometheus configuration alongside the `scrape_configs` targeting `/metrics`.
### Alert catalog

**1. ArbitexHighErrorRate**

```yaml
expr: rate(arbitex_request_errors_total[5m]) / rate(arbitex_requests_total[5m]) > 0.05
for: 2m
severity: warning
```

Fires when the platform-wide request error rate exceeds 5% for 2 consecutive minutes. Common causes: provider outage, authentication misconfiguration, or downstream rate limiting.

**2. ArbitexProviderDown**

```yaml
expr: arbitex_provider_success_rate < 0.5
for: 1m
severity: critical
```

Fires when any individual AI provider's success rate drops below 50% for 1 minute. Suggests a provider API outage or kill switch activation.

**3. ArbitexDLPScanLatencyHigh**

```yaml
expr: histogram_quantile(0.95, rate(arbitex_dlp_scan_duration_seconds_bucket[5m])) > 2.0
for: 5m
severity: warning
```

Fires when the 95th-percentile DLP scan duration exceeds 2 seconds for 5 minutes. Indicates NER/GLiNER model slowness or resource contention on the scan workers.

**4. ArbitexBudgetNearLimit**

```yaml
expr: arbitex_budget_utilization_ratio > 0.9
for: 10m
severity: warning
```

Fires when any tenant/org budget utilization ratio exceeds 90% for 10 minutes. Corresponds to the platform's 95% throttle tier. Coordinate with the tenant to review quota allocation.

**5. ArbitexOverrideRateHigh**

```yaml
expr: rate(arbitex_override_count_total[1h]) > 10
for: 30m
severity: info
```

Fires when governance override events (ALLOW_WITH_OVERRIDE or PROMPT confirmations) exceed 10/hour for 30 minutes. May indicate a policy rule is too broadly configured, generating unnecessary interruptions.

**6. ArbitexPolicyEvalLatencyHigh**

```yaml
expr: histogram_quantile(0.99, rate(arbitex_policy_eval_duration_ms_bucket[5m])) > 100
for: 5m
severity: warning
```

Fires when policy evaluation p99 latency exceeds 100 ms. Policy evaluation runs in the hot path of every request; high latency here directly impacts end-user response time.

**7. ArbitexOutpostHeartbeatMissed**

```yaml
expr: time() - arbitex_outpost_last_heartbeat_timestamp > 300
for: 0m
severity: critical
```

Fires immediately when an outpost has not sent a heartbeat in more than 5 minutes (2.5× the 120-second heartbeat interval). Indicates the outpost is unreachable, crashed, or has lost network connectivity.

**8. ArbitexAuditQueueDepthHigh**

```yaml
expr: arbitex_audit_queue_depth > 1000
for: 5m
severity: warning
```

Fires when the audit log delivery queue depth exceeds 1000 events for 5 minutes. May indicate SIEM connector delivery failures or downstream SIEM backpressure. Check the SIEM health endpoint (`GET /api/admin/siem/health`) and the connector logs.
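Alerts 3 and 6 rely on `histogram_quantile`, which estimates a quantile by linear interpolation inside cumulative histogram buckets. A simplified sketch of that estimation (Prometheus's real implementation handles more edge cases, such as empty or single-bucket histograms):

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (inf, total_count),
    mirroring Prometheus's `le` label convention.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                return prev_bound  # quantile falls in the open-ended bucket
            # Linear interpolation within this bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (upper - prev_bound)
        prev_bound, prev_count = upper, count
    return prev_bound
```

For example, with buckets `[(0.5, 50), (1.0, 90), (2.0, 100), (inf, 100)]` the 0.95 quantile lands halfway into the 1.0–2.0 bucket, giving 1.5.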
## Troubleshooting

### OTel does not initialize

Symptom: Log line `OTel disabled (OTEL_EXPORTER_OTLP_ENDPOINT not set)` instead of the initialized message.

Fix: Verify `OTEL_EXPORTER_OTLP_ENDPOINT` is set in the container environment and check for typos. The variable must be present at startup; it cannot be set dynamically after the process starts.

### OTel packages not installed

Symptom: Log line `OTel packages not installed, skipping initialization`.

Fix: The optional OTel packages are not in the Python environment. Rebuild the Docker image with `OTEL_EXTRAS=1`:

```shell
docker build --build-arg OTEL_EXTRAS=1 -t arbitex-platform:latest .
```

Or install the packages manually:

```shell
pip install \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-fastapi \
  opentelemetry-instrumentation-httpx \
  opentelemetry-instrumentation-sqlalchemy
```
### No spans appearing in backend

Symptom: OTel initializes but traces do not appear in Jaeger/Grafana/Datadog.

Checks:

- Verify the collector/backend is reachable from the platform container. Run a connectivity test:

  ```shell
  docker exec arbitex-api curl -v telnet://otel-collector:4317
  ```

- Check for TLS mismatch: if the endpoint uses `http://` but the collector requires TLS, switch to `https://`.
- Check for authentication issues (Grafana Cloud, Datadog): confirm the `OTEL_EXPORTER_OTLP_HEADERS` value is correct.
- The `BatchSpanProcessor` buffers spans and flushes them periodically. Send a few requests to generate spans, then wait up to 10 seconds for the first flush.
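If curl is not available in the container, an equivalent TCP-level check can be done with Python's standard library (the helper name is hypothetical; this only confirms the port accepts connections, it does not validate TLS or credentials):

```python
import socket

def otlp_port_reachable(host: str, port: int = 4317, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the collector port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it from inside the platform container, e.g. `otlp_port_reachable("otel-collector")`.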
### Traces appear but metrics do not

Symptom: Traces are visible but no metrics arrive at the backend.

Checks:

- Metrics are flushed every 60 seconds by `PeriodicExportingMetricReader`. Wait at least 90 seconds before concluding metrics are absent.
- Confirm the backend accepts metrics on the same OTLP gRPC endpoint as traces. Some backends use separate ports or endpoints for metrics and traces; if so, configure a Collector that fans out from a single receiver.
### Missing database query spans

Symptom: FastAPI and httpx spans are visible but no SQLAlchemy spans.

Fix: Verify `opentelemetry-instrumentation-sqlalchemy` is installed in the environment. Check the startup log for a warning like `SQLAlchemy instrumentation skipped: package not found`.
### trace_id not appearing in log lines

Symptom: OTel is enabled but log lines do not include `trace_id`.

Cause: The `_ContextVarBridgeProcessor` writes the OTel trace ID into the legacy `observability.py` ContextVars, which are read by the structured logger. If `trace_id` is absent in logs, the bridge processor may not be active; verify the platform version is 0046+.
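The bridge pattern itself is simple to illustrate. A stdlib-only sketch (names are hypothetical; the platform's actual bridge is the `_ContextVarBridgeProcessor`) of propagating a trace ID from a ContextVar into log records via a logging filter:

```python
import contextvars
import logging

# A span processor would populate this at span start (sketch):
current_trace_id: contextvars.ContextVar = contextvars.ContextVar(
    "current_trace_id", default=None
)

class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # annotate only; never drop records
```

A formatter such as `%(levelname)s trace_id=%(trace_id)s %(message)s` can then render the field on every line.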
## See also

- OpenTelemetry SDK integration — internal architecture: TracerProvider, MeterProvider, ContextVar bridge, graceful shutdown
- Grafana dashboard catalog — pre-built dashboards for DLP, provider latency, and request volume
- Prometheus alerting reference — alert rules for DLP, providers, and infrastructure
- Outpost health monitoring — outpost-specific health signals