Credential Intelligence — Outpost

Credential Intelligence (CredInt) detects known-compromised credentials in AI traffic — passwords, API keys, tokens, and connection strings that have appeared in breach data. On the Arbitex SaaS platform, CredInt is provided centrally. On the Outpost, CredInt runs entirely within the customer VPC using a bundled bloom filter, with no credentials leaving the customer environment for lookup.

This document covers the Outpost-specific CredInt architecture. For the platform-level CredInt overview (what it detects, audit log fields, policy actions), see Credential Intelligence.


CredInt adds a fourth tier to the Outpost DLP pipeline. It answers a specific question the earlier tiers cannot: “Has this credential appeared in known breach data?”

The earlier tiers detect credentials by shape and pattern:

  • Tier 1 (Regex): Structured patterns — SSN formats, credit card numbers, API key formats, connection strings
  • Tier 2 (NER/spaCy): Named entity recognition — email addresses, usernames, tokens adjacent to credential-shaped context
  • Tier 3 (DeBERTa): Contextual classification — credential-sharing intent even when patterns don’t match

CredInt (Tier 4) checks the specific values detected by the earlier tiers against the Arbitex breach corpus. A Tier 1 regex match on a password field tells you a password is present. A CredInt hit on that same value tells you the password matches the breach corpus, so it has probably already been compromised and may be in active use by attackers.

Tier 1 Regex → structured pattern matching (< 1 ms)
Tier 2 NER/spaCy → named entity recognition (5–50 ms)
Tier 3 DeBERTa → contextual classification (50–500 ms, optional)
Tier 4 CredInt bloom → known-compromised credential matching (< 1 ms)

CredInt is placed last in the cascade because:

  1. Tiers 1–3 block or redact many credentials before CredInt is reached. CredInt adds signal for credentials that escape pattern matching — dictionary passwords, novel formats, or values that DeBERTa classifies as non-credentials.
  2. CredInt provides corpus-membership signal that the earlier tiers cannot. “This credential is in the breach corpus” is distinct from “this text contains a credential-shaped string.”
  3. CredInt does not short-circuit the pipeline. It contributes its entity detections and lets the policy resolver determine the final action — consistent with how the other tiers operate.
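
The non-short-circuiting behaviour described above can be sketched as follows. This is a minimal illustration, not the Outpost implementation: the `Entity` class, the tier functions, the `BREACHED` set (standing in for the bloom filter), the `POLICY` table, and the `ALLOW` action are all hypothetical names introduced for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Entity:
    type: str
    source: str
    tier: int
    value: str


# A tier receives the prompt plus everything found so far and returns new entities.
Tier = Callable[[str, List[Entity]], List[Entity]]

BREACHED = {"hunter2"}  # hypothetical stand-in for the bloom filter
POLICY = {"PASSWORD": "REDACT", "COMPROMISED_CREDENTIAL": "BLOCK"}  # hypothetical
SEVERITY = {"ALLOW": 0, "REDACT": 1, "BLOCK": 2}  # ALLOW is an assumed default


def tier1_regex(prompt: str, found: List[Entity]) -> List[Entity]:
    """Shape-based detection: anything following 'password:' or 'password='."""
    return [Entity("PASSWORD", "regex", 1, m.group(1))
            for m in re.finditer(r"password[=:]\s*(\S+)", prompt)]


def tier4_credint(prompt: str, found: List[Entity]) -> List[Entity]:
    """Corpus membership: re-check values that earlier tiers already detected."""
    return [Entity("COMPROMISED_CREDENTIAL", "credint", 4, e.value)
            for e in found if e.value in BREACHED]


def run_cascade(prompt: str, tiers: List[Tier]) -> List[Entity]:
    """Run every tier in order; no tier stops the pipeline early."""
    entities: List[Entity] = []
    for tier in tiers:
        entities.extend(tier(prompt, entities))
    return entities


def resolve_action(entities: List[Entity]) -> str:
    """Pick the strictest action that any detected entity maps to."""
    actions = [POLICY.get(e.type, "ALLOW") for e in entities]
    return max(actions, key=SEVERITY.__getitem__, default="ALLOW")
```

In this sketch, a prompt containing a breached password yields two entities (one from Tier 1, one from Tier 4), and the resolver alone decides the outcome.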

Before querying the bloom filter, a lightweight token extractor identifies credential-shaped tokens in the prompt:

  • Email-password pairs (user@example.com:password123)
  • Colon-separated username:password pairs
  • High-entropy strings adjacent to @ or :// patterns
  • API key formats not already handled by Tier 1 regex

This pre-filter limits bloom filter lookups per request to a manageable count (typically 0–20 per prompt). Non-credential text never reaches the bloom filter.
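
A pre-filter of this kind might look like the sketch below. The regular expressions, the 3.0 bits-per-character entropy threshold, and the choice of what string is returned for lookup are assumptions made for illustration; the actual extractor's rules are not specified in this document.

```python
import math
import re
from collections import Counter
from typing import List

# user:secret or user@host:secret, but not URL schemes such as "https://".
PAIR_RE = re.compile(r"(?<![\w/])([\w.+-]+(?:@[\w.-]+)?):(?!//)([^\s:@/]{4,})")
# Secret embedded in a connection string: scheme://user:secret@host
URL_CRED_RE = re.compile(r"://[^\s:/@]+:([^\s@/]+)@")

MIN_ENTROPY_BITS = 3.0  # assumed threshold, bits per character


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def extract_candidates(prompt: str, max_tokens: int = 20) -> List[str]:
    """Return credential-shaped tokens worth checking against the filter."""
    candidates: List[str] = []
    for m in PAIR_RE.finditer(prompt):
        candidates.append(m.group(0))  # the whole user:secret pair
    for m in URL_CRED_RE.finditer(prompt):
        secret = m.group(1)
        if shannon_entropy(secret) >= MIN_ENTROPY_BITS:
            candidates.append(secret)
    # De-duplicate while preserving order, then cap the lookup count.
    return list(dict.fromkeys(candidates))[:max_tokens]
```

Ordinary prose and plain URLs produce no candidates here, so in this sketch they would never reach the bloom filter.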


The Outpost uses a hybrid architecture: a bundled filter plus an optional CDN refresh (ADR-0001). This design satisfies two competing requirements: air-gap viability (all lookup data must be available offline) and freshness (newly breached credentials should be detectable as soon as possible).

At Docker image build time, a compressed bloom filter binary is embedded in the image layer. This filter is always present, regardless of network connectivity.

  • Corpus size: 861M+ compromised credentials
  • FPR target: 10% (see Bloom filter trade-offs below)
  • Compressed size: ~440–470 MB
  • Memory footprint at runtime: ~470 MB RSS
  • Lookup latency: < 1 μs per entry (O(k) hash operations, k ≈ 5–7)

The bundled filter is fully self-contained. No network calls are made at startup or during request processing.
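
To illustrate why a lookup costs only k hash operations, here is a minimal in-memory bloom filter. It is a generic textbook construction (double hashing derived from SHA-256), not the format of the bundled `.bf` file, whose layout and hash functions are not described in this document.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Textbook bloom filter: k bit positions per item, no stored values."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive k positions from two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


bf = BloomFilter(num_bits=1024, num_hashes=5)
bf.add("hunter2")
print("hunter2" in bf)  # an added item is always reported as present
```

The filter stores only bit positions, never the credentials themselves, and a membership test touches at most k bits regardless of corpus size.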

When CREDINT_DOWNLOAD_URL is set and the container has internet access, the Outpost attempts to download a fresher filter from Arbitex CDN at startup. If the download succeeds within the configured timeout (default: 45 seconds), the downloaded filter replaces the bundled one in memory. If the download fails or times out, startup continues using the bundled filter — no error, no startup failure.

The downloaded filter carries a metadata header with a corpus snapshot date. If that date is older than the bundled filter's snapshot date, the bundled filter is used (this guards against a CDN misconfiguration serving a downgrade).

In air-gap deployments, CREDINT_DOWNLOAD_URL is left unset (the default). The bundled filter is loaded at startup. No network calls are made at any point. Full CredInt capability is available from first boot.

Air-gap mode:               Internet-connected mode:
┌──────────────┐            ┌──────────────┐
│ Bundled .bf  │            │ Bundled .bf  │ ← fallback
│ (in image)   │            │ (in image)   │
└──────┬───────┘            └──────┬───────┘
       │ loaded at startup         │ if download fails/times out
       ▼                           ▼
CredInt active              CDN refresh attempt (45s timeout)
                                   │ if download succeeds
                            Downloaded .bf
                            (newer snapshot)
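
The startup decision can be summarised in a few lines of code. The sketch below is an assumption-laden outline: `FilterInfo`, `choose_filter`, and the injected `fetch` callable are hypothetical names, and the real loader's error handling may differ. It also applies the FPR rejection rule described for CREDINT_FPR_THRESHOLD.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass(frozen=True)
class FilterInfo:
    source: str          # "bundled" or "cdn"
    snapshot_date: date  # from the filter's metadata header
    fpr: float


def choose_filter(
    bundled: FilterInfo,
    download_url: str,
    fetch: Callable[[str], FilterInfo],
    fpr_threshold: float = 0.10,
) -> FilterInfo:
    """Pick the filter to hold in memory; every failure falls back to bundled."""
    if not download_url:
        return bundled  # air-gap mode: no network call at all
    try:
        downloaded = fetch(download_url)  # expected to enforce the timeout itself
    except OSError:  # includes TimeoutError and connection errors
        return bundled  # non-fatal: startup continues
    if downloaded.snapshot_date < bundled.snapshot_date:
        return bundled  # guard against a CDN downgrade
    if downloaded.fpr > fpr_threshold:
        return bundled  # reject filters above the configured FPR
    return downloaded
```

Passing `fetch` in as a parameter keeps the decision logic testable without network access.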

All CredInt configuration is via environment variables in .env.

| Variable | Default | Description |
| --- | --- | --- |
| CREDINT_ENABLED | true | Master switch. Set to false to disable CredInt entirely. |
| CREDINT_BLOOM_PATH | /app/credint.bf | Path to the bundled filter inside the container. Set at image build time; do not change unless supplying a custom filter. |
| CREDINT_DOWNLOAD_URL | "" (empty) | CDN URL for startup filter refresh. Empty = air-gap mode. |
| CREDINT_DOWNLOAD_TIMEOUT_SECONDS | 45 | Maximum wait for CDN download at startup. |
| CREDINT_FPR_THRESHOLD | 0.10 | Downloaded filters with FPR higher than this value are rejected. |

For air-gap deployments, only CREDINT_ENABLED needs attention. The defaults are correct for air-gap.
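
As an illustration of how these variables and their defaults fit together, the sketch below reads them into a typed object. `CredIntConfig` and `load_config` are hypothetical names, and the boolean parsing rule (only the string "true" enables CredInt) is an assumption rather than documented behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CredIntConfig:
    enabled: bool
    bloom_path: str
    download_url: str
    download_timeout_seconds: int
    fpr_threshold: float

    @property
    def air_gap(self) -> bool:
        """An empty download URL means no CDN refresh is attempted."""
        return self.download_url == ""


def load_config(env: Mapping[str, str] = os.environ) -> CredIntConfig:
    """Read CredInt settings, falling back to the documented defaults."""
    return CredIntConfig(
        enabled=env.get("CREDINT_ENABLED", "true").strip().lower() == "true",
        bloom_path=env.get("CREDINT_BLOOM_PATH", "/app/credint.bf"),
        download_url=env.get("CREDINT_DOWNLOAD_URL", ""),
        download_timeout_seconds=int(env.get("CREDINT_DOWNLOAD_TIMEOUT_SECONDS", "45")),
        fpr_threshold=float(env.get("CREDINT_FPR_THRESHOLD", "0.10")),
    )
```

With an empty environment this yields the air-gap configuration, which matches the statement that the defaults need no changes for air-gap deployments.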

Helm chart values (Kubernetes deployments)

For Kubernetes deployments using the Outpost Helm chart, the corresponding values in values.yaml:

credint:
  enabled: true
  downloadUrl: ""              # empty = air-gap mode
  downloadTimeoutSeconds: 45
resources:
  limits:
    memory: 2Gi                # Required: bloom filter adds ~470 MB RSS
The memory limit must account for the bloom filter’s runtime footprint. The minimum recommended container memory limit with CredInt enabled is 2 Gi.


The bundled filter is built at 10% false positive rate (FPR). The implications:

For 861M corpus entries:

| FPR | Compressed size | RAM footprint |
| --- | --- | --- |
| 1% | ~1.0–1.1 GB | ~1.1 GB |
| 5% | ~580–640 MB | ~640 MB |
| 10% | ~440–470 MB | ~470 MB |

The 10% FPR filter keeps the image layer within the 5 Gi container image target. Filters with a lower FPR are available via the CDN refresh path for internet-connected customers who need fewer false positives.
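
The size-versus-FPR relationship follows from the standard bloom filter sizing formula, m = -n · ln(p) / (ln 2)², where n is the number of entries and p the target FPR. The sketch below applies that textbook formula to an 861M-entry corpus. It is an estimate only: the shipped filter's exact parameters and compression are not described here, so its sizes differ somewhat from the figures quoted above.

```python
import math


def bloom_size_bytes(num_entries: int, fpr: float) -> float:
    """Uncompressed size of an optimally sized bloom filter, in bytes."""
    bits = -num_entries * math.log(fpr) / (math.log(2) ** 2)
    return bits / 8


CORPUS_ENTRIES = 861_000_000  # corpus size quoted in this document

for fpr in (0.01, 0.05, 0.10):
    size_gb = bloom_size_bytes(CORPUS_ENTRIES, fpr) / 1e9
    print(f"FPR {fpr:.0%}: about {size_gb:.2f} GB")
```

Under these assumptions the estimate comes out at roughly 1.03 GB, 0.67 GB, and 0.52 GB for 1%, 5%, and 10% respectively, which shows why halving the size requires accepting a much higher FPR.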

A 10% FPR means that roughly 1 in 10 credential-shaped tokens that are not in the breach corpus will still trigger a false CredInt hit. However:

  • The regex pre-filter (Tier 1) and NER (Tier 2) eliminate most non-credential text before it reaches CredInt. Only tokens that already passed credential-shape heuristics are checked.
  • Credential-shaped tokens in normal business prompts are rare. The user-visible false positive rate (false alarms per thousand requests) is very low in practice.
  • The policy action for a CredInt hit is configurable — REDACT (strip the credential from the prompt) or BLOCK (reject the request). REDACT is more tolerant of false positives.
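
To make the per-request effect concrete: if a prompt contains n credential-shaped tokens that are not in the corpus, and each lookup is treated as an independent trial with false positive probability p, the chance of at least one false hit is 1 - (1 - p)^n. The independence assumption and the function name below are illustrative simplifications.

```python
def false_hit_probability(fpr: float, clean_tokens: int) -> float:
    """Probability that at least one of n non-breached tokens matches falsely."""
    return 1.0 - (1.0 - fpr) ** clean_tokens


for n in (0, 1, 5):
    print(n, round(false_hit_probability(0.10, n), 3))
```

A prompt with no credential-shaped tokens can never produce a false hit, a single token gives 0.1, and five tokens give about 0.41. The low user-visible rate therefore depends on such tokens being rare in normal traffic, and REDACT limits the cost when a false hit does occur.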

Customers requiring lower FPR in an internet-connected deployment can set CREDINT_DOWNLOAD_URL to pull a 1% FPR filter at startup. This adds startup time (download + load) and, at roughly 1.1 GB, a larger runtime memory footprint than the bundled filter, but it does not affect the image size.

The bundled filter is static between image releases. The breach-to-detection window equals the image release cadence:

  • At a weekly release cadence: credentials breached within the past 7 days may be missing from the filter.
  • At a monthly release cadence: up to 30 days of new breach data may be missing.

This is the same trade-off air-gap customers accept for GeoIP MMDB data (also bundled at image build time). For regulated customers operating in air-gap mode, this staleness window is an accepted security cost — the alternative (any network call for lookup) is disqualifying.

Internet-connected customers using the CDN refresh path receive the latest filter snapshot available at startup (typically updated weekly or on significant breach events), reducing the staleness window significantly without requiring an image rebuild.


| Aspect | Air-gap mode | Internet-connected mode |
| --- | --- | --- |
| CREDINT_DOWNLOAD_URL | Not set | Set to Arbitex CDN URL |
| Filter source | Bundled in image | CDN download at startup (bundled as fallback) |
| Filter freshness | Image release cadence | CDN snapshot cadence (typically weekly) |
| Startup time impact | None | Up to 45s for download + load |
| Network dependency | None | CDN reachable at startup (optional; failure is non-fatal) |
| Credential data egress | None | None (only the filter binary is downloaded, not credentials) |
| FPR | 10% (bundled) | Lower FPR available via CDN |

When CredInt detects a match, the audit entry includes:

| Field | Value |
| --- | --- |
| dlp_entities[].type | COMPROMISED_CREDENTIAL |
| dlp_entities[].source | credint |
| dlp_entities[].tier | 4 |
| dlp_action | Determined by policy (REDACT or BLOCK) |
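
When consuming audit entries downstream, CredInt hits can be picked out using the fields listed above. The sketch below assumes an entry has already been parsed into a dictionary with this shape; the surrounding record structure, the second entity in the sample, and the helper name `credint_hits` are hypothetical.

```python
from typing import Any, Dict, List


def credint_hits(entry: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the entities in one audit entry that came from CredInt."""
    return [
        e for e in entry.get("dlp_entities", [])
        if e.get("source") == "credint" and e.get("type") == "COMPROMISED_CREDENTIAL"
    ]


# Hypothetical parsed audit entry using only the documented field names.
sample_entry = {
    "dlp_entities": [
        {"type": "COMPROMISED_CREDENTIAL", "source": "credint", "tier": 4},
        {"type": "API_KEY", "source": "regex", "tier": 1},  # illustrative non-CredInt entity
    ],
    "dlp_action": "REDACT",
}

hits = credint_hits(sample_entry)
print(len(hits), sample_entry["dlp_action"])
```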

For full audit log field reference, see Credential Intelligence.