Credential Intelligence — Outpost

Credential Intelligence (CredInt) detects known-compromised credentials in AI traffic — passwords, API keys, tokens, and connection strings that have appeared in breach data. On the Arbitex SaaS platform, CredInt is provided centrally. On the Outpost, CredInt runs entirely within the customer VPC using a bundled bloom filter, with no credentials leaving the customer environment for lookup.

This document covers the Outpost-specific CredInt architecture. For the platform-level CredInt overview (what it detects, audit log fields, policy actions), see Credential Intelligence.

What CredInt detects

CredInt adds a fourth tier to the Outpost DLP pipeline. It answers a specific question the earlier tiers cannot: “Has this credential appeared in known breach data?”

The earlier tiers detect credentials by shape and pattern:

Tier 1 (Regex): Structured patterns — SSN formats, credit card numbers, API key formats, connection strings
Tier 2 (NER/spaCy): Named entity recognition — email addresses, usernames, tokens adjacent to credential-shaped context
Tier 3 (DeBERTa): Contextual classification — credential-sharing intent even when patterns don’t match

CredInt (Tier 4) checks the specific values detected by the earlier tiers against the Arbitex breach corpus. A Tier 1 regex match on a password field tells you a password is present. A CredInt hit on that same value tells you the password has already been compromised and is likely in active use by attackers.

DLP cascade with CredInt

Tier 1  Regex          → structured pattern matching (< 1 ms)
Tier 2  NER/spaCy      → named entity recognition (5–50 ms)
Tier 3  DeBERTa        → contextual classification (50–500 ms, optional)
Tier 4  CredInt bloom  → known-compromised credential matching (< 1 ms)

CredInt is placed last in the cascade because:

Tiers 1–3 block or redact many credentials before CredInt is reached. CredInt adds signal for credentials that escape pattern matching — dictionary passwords, novel formats, or values that DeBERTa classifies as non-credentials.
CredInt provides corpus-membership signal that the earlier tiers cannot. “This credential is in the breach corpus” is distinct from “this text contains a credential-shaped string.”
CredInt does not short-circuit the pipeline. It contributes its entity detections and lets the policy resolver determine the final action — consistent with how the other tiers operate.

What triggers a CredInt check

Before querying the bloom filter, a lightweight token extractor identifies credential-shaped tokens in the prompt:

Email-password pairs (user@example.com:password123)
Colon-separated username:password pairs
High-entropy strings adjacent to @ or :// patterns
API key formats not already handled by Tier 1 regex

This pre-filter limits bloom filter lookups per request to a manageable count (typically 0–20 per prompt). Non-credential text never reaches the bloom filter.
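The pre-filter described above can be sketched roughly as follows. This is a minimal illustration, not the shipped extractor: the regular expressions, the entropy threshold, the 12-character minimum, and the names `extract_candidates` and `MAX_LOOKUPS_PER_PROMPT` are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Assumed cap, mirroring the "0 to 20 lookups per prompt" figure in the text.
MAX_LOOKUPS_PER_PROMPT = 20

# Hypothetical pattern for user:secret and user@example.com:secret pairs.
# The (?!//) lookahead skips URL schemes such as "https://".
PAIR_RE = re.compile(r"(?<![\w/])([\w.+-]+(?:@[\w.-]+)?):(?!//)(\S+)")

# Hypothetical pattern for long tokens directly before "@" or after "://".
ADJACENT_RE = re.compile(r"([A-Za-z0-9_+/=-]{12,})@|://([A-Za-z0-9_+=-]{12,})")


def shannon_entropy(text):
    """Shannon entropy of the string, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_candidates(prompt, min_entropy=3.5):
    """Return credential-shaped tokens worth a bloom filter lookup."""
    found = [m.group(0) for m in PAIR_RE.finditer(prompt)]
    for m in ADJACENT_RE.finditer(prompt):
        token = m.group(1) or m.group(2)
        if shannon_entropy(token) >= min_entropy:
            found.append(token)
    unique = list(dict.fromkeys(found))  # de-duplicate, keep first-seen order
    return unique[:MAX_LOOKUPS_PER_PROMPT]
```

Ordinary prose yields an empty list, so the bloom filter is never consulted for it. A real extractor would need stricter rules than these, since a pattern this loose also matches strings such as clock times.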

Architecture: hybrid bloom filter

The Outpost uses a hybrid architecture: a bundled filter plus an optional CDN refresh (ADR-0001). This design satisfies two competing requirements: air-gap viability (all lookup data must be available offline) and freshness (newly breached credentials should be detectable as soon as possible).

Bundled filter (baseline)

At Docker image build time, a compressed bloom filter binary is embedded in the image layer. This filter is always present, regardless of network connectivity.

Corpus size: 861M+ compromised credentials
FPR target: 10% (see Bloom filter trade-offs below)
Compressed size: ~440–470 MB
Memory footprint at runtime: ~470 MB RSS
Lookup latency: < 1 μs per entry (O(k) hash operations, k ≈ 5–7)

The bundled filter is fully self-contained. No network calls are made at startup or during request processing.
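To illustrate why a lookup costs only O(k) hash operations and needs no network access, here is a minimal in-memory bloom filter. It is a sketch under assumptions: the bundled filter's binary format and hash functions are not documented here, so the SHA-256 double-hashing scheme and the `BloomFilter` class are illustrative choices only.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: k bit positions per item, derived by double hashing."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: two 64-bit values taken from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly in the corpus" (false positives can occur).
        # False means "definitely not in the corpus".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


corpus_filter = BloomFilter(num_bits=1 << 20, num_hashes=6)
corpus_filter.add("hunter2")
```

A lookup touches only `num_hashes` bits of a local byte array, which is why the check is fast and works offline. The filter stores bit positions rather than the credentials themselves.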

CDN refresh (internet-connected mode)

When CREDINT_DOWNLOAD_URL is set and the container has internet access, the Outpost attempts to download a fresher filter from Arbitex CDN at startup. If the download succeeds within the configured timeout (default: 45 seconds), the downloaded filter replaces the bundled one in memory. If the download fails or times out, startup continues using the bundled filter — no error, no startup failure.

The downloaded filter carries a metadata header with a corpus snapshot date. If the downloaded filter’s snapshot date is older than the bundled filter’s, the bundled filter is used (this guards against a CDN misconfiguration serving a downgrade).
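The startup selection rule (use the download if it arrived and is not older than the bundled snapshot, otherwise keep the bundled filter) can be expressed in a few lines. The `FilterInfo` type and `choose_filter` function are hypothetical names for this sketch, and the treatment of equal snapshot dates is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FilterInfo:
    source: str          # "bundled" or "cdn" (labels assumed for the example)
    snapshot_date: date  # corpus snapshot date from the metadata header


def choose_filter(bundled: FilterInfo, downloaded: Optional[FilterInfo]) -> FilterInfo:
    """Pick the filter to serve lookups after the startup refresh attempt."""
    if downloaded is None:
        # Download failed or timed out: fall back without raising an error.
        return bundled
    if downloaded.snapshot_date < bundled.snapshot_date:
        # Downgrade guard: never replace the bundled filter with older data.
        return bundled
    # Assumption: a download with the same snapshot date is accepted.
    return downloaded
```

Passing `None` for `downloaded` models both the timeout and the failed-download case, which the text treats identically.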

Air-gap mode

In air-gap deployments, CREDINT_DOWNLOAD_URL is left unset (the default). The bundled filter is loaded at startup. No network calls are made at any point. Full CredInt capability is available from first boot.

Air-gap mode:                    Internet-connected mode:
┌──────────────┐                 ┌──────────────┐
│ Bundled .bf  │                 │ Bundled .bf  │ ← kept if download
│ (in image)   │                 │ (in image)   │   fails or times out
└──────┬───────┘                 └──────┬───────┘
       │ loaded at startup              │ loaded at startup
       ▼                                ▼
  CredInt active                   CDN refresh attempt (45s timeout)
                                        │ if download succeeds
                                        ▼
                                   Downloaded .bf
                                   (newer snapshot)

Configuration

All CredInt configuration is via environment variables in .env.

Variable	Default	Description
`CREDINT_ENABLED`	`true`	Master switch. Set to `false` to disable CredInt entirely.
`CREDINT_BLOOM_PATH`	`/app/credint.bf`	Path to the bundled filter inside the container. Set at image build time — do not change unless supplying a custom filter.
`CREDINT_DOWNLOAD_URL`	`""` (empty)	CDN URL for startup filter refresh. Empty = air-gap mode.
`CREDINT_DOWNLOAD_TIMEOUT_SECONDS`	`45`	Maximum wait for CDN download at startup.
`CREDINT_FPR_THRESHOLD`	`0.10`	Downloaded filters with FPR higher than this value are rejected.

For air-gap deployments, only CREDINT_ENABLED needs attention. The defaults are correct for air-gap.
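As an illustration of how these variables and their defaults fit together, the sketch below reads them into a typed object. The `CredIntConfig` class, the `load_config` function, and the rule that any value other than `false` counts as enabled are assumptions for the example, not the Outpost's actual parsing code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class CredIntConfig:
    enabled: bool
    bloom_path: str
    download_url: str
    download_timeout_seconds: int
    fpr_threshold: float

    @property
    def air_gap(self) -> bool:
        # An empty download URL means no startup refresh is attempted.
        return self.download_url == ""

    def accepts_downloaded_fpr(self, fpr: float) -> bool:
        # Downloaded filters with a higher FPR than the threshold are rejected.
        return fpr <= self.fpr_threshold


def load_config(env: Optional[Mapping[str, str]] = None) -> CredIntConfig:
    """Build the config from environment variables, using documented defaults."""
    env = os.environ if env is None else env
    return CredIntConfig(
        # Assumption: only the literal "false" (any case) disables CredInt.
        enabled=env.get("CREDINT_ENABLED", "true").strip().lower() != "false",
        bloom_path=env.get("CREDINT_BLOOM_PATH", "/app/credint.bf"),
        download_url=env.get("CREDINT_DOWNLOAD_URL", ""),
        download_timeout_seconds=int(env.get("CREDINT_DOWNLOAD_TIMEOUT_SECONDS", "45")),
        fpr_threshold=float(env.get("CREDINT_FPR_THRESHOLD", "0.10")),
    )
```

Calling `load_config({})` produces the air-gap defaults, which matches the statement that an air-gap deployment needs no CredInt-specific settings beyond the master switch.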

Helm chart values (Kubernetes deployments)

For Kubernetes deployments using the Outpost Helm chart, the corresponding values in values.yaml are:

credint:
  enabled: true
  downloadUrl: ""          # empty = air-gap mode
  downloadTimeoutSeconds: 45

resources:
  limits:
    memory: 2Gi            # Required — bloom filter adds ~470 MB RSS

The memory limit must account for the bloom filter’s runtime footprint. The minimum recommended container memory limit with CredInt enabled is 2 Gi.

Bloom filter trade-offs

The bundled filter is built at 10% false positive rate (FPR). The implications:

FPR vs size

For 861M corpus entries:

FPR	Compressed size	RAM footprint
1%	~1.0–1.1 GB	~1.1 GB
5%	~580–640 MB	~640 MB
10%	~440–470 MB	~470 MB

The 10% FPR filter keeps the image layer within the 5 Gi container image target. Lower-FPR filters are available via the CDN refresh path for internet-connected customers who need fewer false positives.
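The relationship between FPR and size follows from the textbook sizing formula for a classic bloom filter, which needs about -ln(p) / (ln 2)^2 bits per entry at false positive rate p. The sketch below applies that formula. It is a back-of-the-envelope aid only: the shipped filter's exact parameters, variant, and size units are not specified here, so the results land in the same range as the table without matching it exactly.

```python
import math


def bloom_bits_per_entry(fpr):
    """Bits per stored entry for an optimally sized classic bloom filter."""
    return -math.log(fpr) / (math.log(2) ** 2)


def bloom_size_bytes(entries, fpr):
    """Approximate uncompressed filter size in bytes."""
    return math.ceil(entries * bloom_bits_per_entry(fpr) / 8)


CORPUS_ENTRIES = 861_000_000  # corpus size quoted in this document

for fpr in (0.01, 0.05, 0.10):
    size_mb = bloom_size_bytes(CORPUS_ENTRIES, fpr) / 1e6
    print(f"FPR {fpr:.0%}: {bloom_bits_per_entry(fpr):.2f} bits/entry, ~{size_mb:.0f} MB")
```

The formula gives about 4.8 bits per entry at 10% FPR and about 9.6 at 1%. Cutting false positives by a factor of ten therefore roughly doubles the filter, which is the trade-off the table shows.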

What 10% FPR means in practice

A 10% FPR means that, on average, 1 in 10 credential-shaped tokens that are not in the breach corpus will still trigger a CredInt hit. However:

The regex pre-filter (Tier 1) and NER (Tier 2) eliminate most non-credential text before it reaches CredInt. Only tokens that already passed credential-shape heuristics are checked.
Credential-shaped tokens in normal business prompts are rare. The user-visible false positive rate (false alarms per thousand requests) is very low in practice.
The policy action for a CredInt hit is configurable — REDACT (strip the credential from the prompt) or BLOCK (reject the request). REDACT is more tolerant of false positives.

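Customers requiring lower FPR in an internet-connected deployment can set CREDINT_DOWNLOAD_URL to pull a 1% FPR filter at startup. This uses more startup time (download + load) but does not affect the image size. Per the FPR vs size table, a 1% filter needs roughly 1.1 GB of RAM, so size the container memory limit accordingly.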
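To make the per-token rate concrete at the request level, the arithmetic can be sketched as below. It assumes every checked token is absent from the breach corpus and that false positives are independent across tokens. Both are simplifications, and the function names are illustrative.

```python
def expected_false_hits(clean_tokens, fpr):
    """Average number of false CredInt hits among tokens not in the corpus."""
    return clean_tokens * fpr


def prob_any_false_hit(clean_tokens, fpr):
    """Chance that at least one clean token triggers a false hit."""
    return 1.0 - (1.0 - fpr) ** clean_tokens


for tokens in (0, 1, 5, 20):
    print(
        f"{tokens:>2} tokens: "
        f"10% FPR -> {prob_any_false_hit(tokens, 0.10):.3f}, "
        f"1% FPR -> {prob_any_false_hit(tokens, 0.01):.3f}"
    )
```

A prompt with no credential-shaped tokens can never produce a false hit, which is why the pre-filter matters more to the user-visible rate than the per-token FPR. For prompts that do carry many such tokens, the gap between the 10% and 1% filters becomes significant.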

Staleness in air-gap deployments

The bundled filter is static between image releases. The breach-to-detection window equals the image release cadence:

At a weekly release cadence: up to 7 days of new breach data is missing.
At a monthly release cadence: up to 30 days of new breach data is missing.

This is the same trade-off air-gap customers accept for GeoIP MMDB data (also bundled at image build time). For regulated customers operating in air-gap mode, this staleness window is an accepted security cost — the alternative (any network call for lookup) is disqualifying.

Internet-connected customers using the CDN refresh path receive the latest filter snapshot available at startup (typically updated weekly or on significant breach events), reducing the staleness window significantly without requiring an image rebuild.

Deployment model comparison

Aspect	Air-gap mode	Internet-connected mode
`CREDINT_DOWNLOAD_URL`	Not set	Set to Arbitex CDN URL
Filter source	Bundled in image	CDN download at startup (bundled as fallback)
Filter freshness	Image release cadence	CDN snapshot cadence (typically weekly)
Startup time impact	None	Up to 45s for download + load
Network dependency	None	CDN reachable at startup (optional — failure is non-fatal)
Credential data egress	None	None — only filter binary is downloaded, not credentials
FPR	10% (bundled)	Lower FPR available via CDN

Audit log fields

When CredInt detects a match, the audit entry includes:

Field	Value
`dlp_entities[].type`	`COMPROMISED_CREDENTIAL`
`dlp_entities[].source`	`credint`
`dlp_entities[].tier`	`4`
`dlp_action`	Determined by policy (`REDACT` or `BLOCK`)

For full audit log field reference, see Credential Intelligence.
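As a usage illustration, a log consumer could pick out CredInt detections from an audit entry using the fields above. The overall entry shape, the sample values, and the `credint_hits` helper are assumptions for this sketch. Only the four field names and their values come from the table.

```python
def credint_hits(audit_entry):
    """Return the DLP entities in an audit entry that came from CredInt."""
    return [
        entity
        for entity in audit_entry.get("dlp_entities", [])
        if entity.get("source") == "credint"
        and entity.get("type") == "COMPROMISED_CREDENTIAL"
    ]


# Hypothetical entry: real entries carry more fields than shown here, and
# the "API_KEY" / "regex" values on the first entity are made up.
sample_entry = {
    "dlp_action": "REDACT",
    "dlp_entities": [
        {"type": "API_KEY", "source": "regex", "tier": 1},
        {"type": "COMPROMISED_CREDENTIAL", "source": "credint", "tier": 4},
    ],
}

hits = credint_hits(sample_entry)
```

Filtering on both `source` and `type` keeps the helper from picking up entities that other tiers contribute to the same entry.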