DLP Overview

The Arbitex Data Loss Prevention (DLP) pipeline inspects every request and response passing through the gateway. It identifies sensitive data — credit card numbers, personally identifiable information, protected health information, credentials, and custom patterns you define — and produces a list of findings: entity type, confidence score, and position in the text.

The DLP pipeline handles detection. The Policy Engine handles enforcement — it receives the DLP findings and decides what action to take (block, redact, allow, or route). The two systems are complementary: the DLP pipeline finds sensitive data, and the Policy Engine acts on what was found.

```mermaid
flowchart TD
    Input["Text Input"]
    Input --> T1["Tier 1: Pattern Matching\n65+ regex patterns + checksum validation\n<1ms latency"]
    T1 --> T2["Tier 2: Entity Recognition (NER)\nGPU-accelerated microservice\n60-160ms latency"]
    T2 --> T3["Tier 3: Contextual Validation\nDeBERTa NLI model\n17-90ms latency"]
    T3 --> Combine["Combine all findings"]
    Combine --> PE["Policy Engine\nevaluates findings\nagainst policy chain"]
    style T1 fill:#e8f5e9
    style T2 fill:#fff3e0
    style T3 fill:#fce4ec
```

The DLP pipeline runs three inspection tiers. Each tier adds a different detection capability, and findings from all tiers are combined before an enforcement action is taken.

The first tier applies regular expression patterns against the full text of every request and response. These patterns detect structured sensitive data with well-defined formats:

  • Payment card numbers (Visa, Mastercard, Amex, Discover, JCB, UnionPay, Diners Club) with Luhn checksum validation
  • Government identifiers (Social Security numbers, passport formats, tax IDs)
  • Financial identifiers (IBAN, SWIFT/BIC, routing numbers, ABA numbers)
  • Healthcare identifiers (DEA numbers, NPI, medical record number formats)
  • Network and infrastructure data (IPv4/IPv6 addresses, private keys, API tokens, connection strings)

Pattern matching runs in-process with sub-millisecond latency. Checksum validators (Luhn, IBAN, DEA, NPI, and others) reduce false positives by verifying that detected strings are structurally valid, not just format matches.
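The checksum step can be illustrated with the Luhn algorithm used for payment card numbers. This is a minimal sketch, not the gateway's actual implementation; the function name is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum.

    Tier 1 applies a check like this after a regex match, so that
    strings which merely look like card numbers are discarded.
    """
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 12:  # too short to be a payment card number
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if the result > 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # well-known test number -> True
print(luhn_valid("4111 1111 1111 1112"))  # fails the checksum -> False
```

A string that matches the card-number regex but fails the checksum never becomes a finding, which is how the validators cut false positives.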

The second tier uses named entity recognition (NER) to detect unstructured sensitive data that regular expressions cannot reliably match:

  • Person names in free text
  • Organization names and corporate identifiers
  • Addresses and location references
  • Medical terms and clinical language
  • Dates of birth and age references in context

Entity recognition runs as a GPU-accelerated microservice. It processes text in parallel with Tier 1 — the pipeline does not wait for pattern matching to complete before starting entity recognition.
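The parallel execution of Tier 1 and Tier 2 can be sketched as two independent tasks whose results are merged. The `run_tier1`/`run_tier2` functions below are placeholders, not real API calls.

```python
from concurrent.futures import ThreadPoolExecutor

def run_tier1(text):
    # Placeholder for in-process regex matching (sub-millisecond).
    return [{"entity_type": "credit_card", "tier": 1, "confidence": 0.99}]

def run_tier2(text):
    # Placeholder for the GPU-backed NER microservice call (60-160 ms).
    return [{"entity_type": "person_name", "tier": 2, "confidence": 0.91}]

def detect(text):
    # Tier 1 and Tier 2 are independent, so they run concurrently;
    # the pipeline does not serialize one behind the other.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(run_tier1, text)
        f2 = pool.submit(run_tier2, text)
        return f1.result() + f2.result()

findings = detect("Patient Jordan Smith paid with 4111 1111 1111 1111.")
```

Because the slower NER call dominates, running the tiers concurrently keeps end-to-end detection latency close to the Tier 2 latency alone.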

The third tier applies a contextual natural language inference (NLI) model to validate findings from Tier 1 and Tier 2. This tier answers the question: “Is this finding actually sensitive data, or is it a coincidental match?”

For example, the string “Jordan Smith” might appear in a conversation about a basketball player (not sensitive) or in a medical record (sensitive). The contextual validator examines the surrounding text to determine whether the finding should be escalated or dismissed.

The contextual validator significantly reduces false positives without reducing detection coverage. It runs as a separate GPU-accelerated microservice with automatic health monitoring — if the service is unavailable, the pipeline continues with Tier 1 and Tier 2 findings only (fail-open).
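The fail-open behavior amounts to a guarded call with a fallback. A minimal sketch, assuming a hypothetical `nli_client` with a `validate` method:

```python
def validate_contextually(findings, text, nli_client, timeout_s=0.2):
    """Ask the Tier 3 NLI service to confirm or dismiss findings.

    Fail-open: if the service errors out or times out, the pipeline
    keeps the Tier 1/Tier 2 findings unchanged rather than blocking
    traffic on a degraded dependency.
    """
    try:
        return nli_client.validate(findings, text, timeout=timeout_s)
    except Exception:
        # Service unavailable -> continue with unvalidated findings.
        return findings
```

The trade-off: during a Tier 3 outage, detection coverage is unchanged but false positives are not filtered, so policy actions may fire more often than usual.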

The DLP pipeline produces findings. The Policy Engine consumes those findings and determines the enforcement action. This separation means you configure what to detect in the DLP pipeline and what to do about it in your policy rules.

A DLP finding includes:

| Field | Description |
| --- | --- |
| entity_type | The type of sensitive data detected (e.g., credit_card, ssn, person_name) |
| tier | Which DLP tier produced the detection (1, 2, or 3) |
| confidence | Detection confidence score (0.0–1.0) |
| offset | Character position in the text where the finding starts |
| length | Character length of the detected span |
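Concretely, a finding for a detected card number might look like the following. Field names come from the table above; the values are illustrative.

```python
# Hypothetical finding for the span "4111 1111 1111 1111"
# starting at character 27 of the inspected text.
finding = {
    "entity_type": "credit_card",  # what was detected
    "tier": 1,                     # produced by pattern matching
    "confidence": 0.99,            # checksum-validated, so high confidence
    "offset": 27,                  # character position where the span starts
    "length": 19,                  # 16 digits plus 3 spaces
}
```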

Policy rules reference DLP findings through the entity_types and entity_confidence_min conditions. For example, a rule with entity_types: ["credit_card"] and entity_confidence_min: 0.85 fires only when the DLP pipeline detects a credit card number with at least 85% confidence. The rule’s action (BLOCK, REDACT, ALLOW, ROUTE_TO) determines what happens next.
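The matching logic a rule like this implies can be sketched as follows. The dictionary shapes are illustrative, not the Policy Engine's real schema.

```python
def rule_matches(rule, findings):
    """Return True if any finding satisfies the rule's DLP conditions."""
    return any(
        f["entity_type"] in rule["entity_types"]
        and f["confidence"] >= rule["entity_confidence_min"]
        for f in findings
    )

rule = {
    "entity_types": ["credit_card"],
    "entity_confidence_min": 0.85,
    "action": "BLOCK",
}

findings = [{"entity_type": "credit_card", "confidence": 0.99}]
print(rule_matches(rule, findings))  # -> True: the rule's action fires
```

Note that both conditions must hold on the same finding: a low-confidence card match plus a high-confidence unrelated entity does not trigger the rule.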

See Policy Rule Reference for the full list of conditions that reference DLP findings, and Policy Engine overview for how enforcement decisions are made.

The DLP pipeline runs detection patterns from three sources:

  1. Compliance Bundle patterns — detection rules from activated compliance frameworks (PCI-DSS, HIPAA, etc.). These patterns are read-only and cannot be suppressed. They run alongside all other detection rules.
  2. Organization custom patterns — regular expressions you define for your specific environment (e.g., internal project names, proprietary identifiers).
  3. Platform default patterns — the built-in pattern library. You can suppress individual platform defaults at the organization level if they produce false positives in your environment.

All detection sources contribute findings to the same output list. The Policy Engine evaluates those findings against your policy chain to determine the enforcement action. Enforcement precedence is controlled by the policy chain sequence order, not the detection source. See Policy Engine overview — Conflict resolution by sequencing.

Organization admins can define custom DLP rules that run alongside the platform defaults. Each custom rule specifies:

  • Name: A human-readable label for the rule
  • Pattern: A regular expression that defines what to detect
  • Action: Block, redact, or log
  • Severity: Critical, high, medium, or low (affects audit log priority and alert routing)

Custom rules are cached per organization with a 60-second TTL. Changes to rules take effect within one minute across all gateway instances.
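The per-organization cache behaves like a simple TTL lookup. A minimal sketch, assuming a hypothetical `fetch_from_store` callable that loads rules from the backing store:

```python
import time

_CACHE = {}        # org_id -> (fetched_at, rules)
RULES_TTL_S = 60   # matches the documented 60-second TTL

def get_custom_rules(org_id, fetch_from_store):
    """Return an org's custom DLP rules, refetching after the TTL expires."""
    now = time.monotonic()
    cached = _CACHE.get(org_id)
    if cached and now - cached[0] < RULES_TTL_S:
        return cached[1]  # still fresh: serve from cache
    rules = fetch_from_store(org_id)
    _CACHE[org_id] = (now, rules)
    return rules
```

Because each gateway instance caches independently, a rule change propagates to every instance within one TTL window rather than instantly.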

If a platform default pattern produces false positives in your environment, you can suppress it at the organization level. Suppressed patterns stop producing findings for your organization but remain active for all other organizations on the platform.

Important: Compliance Bundle detection patterns cannot be suppressed. If your organization has activated HIPAA compliance, the HIPAA-specific detection patterns run regardless of any suppression configuration.
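The suppression rule can be sketched as a filter that exempts compliance findings. The `source` and `pattern_id` fields are illustrative, not the real finding schema.

```python
def apply_suppressions(findings, suppressed_pattern_ids):
    """Drop findings from suppressed platform-default patterns.

    Compliance Bundle findings are never filtered, matching the rule
    that compliance patterns cannot be suppressed.
    """
    return [
        f for f in findings
        if f.get("source") == "compliance_bundle"
        or f.get("pattern_id") not in suppressed_pattern_ids
    ]

findings = [
    {"pattern_id": "ipv4_address", "source": "platform_default"},
    {"pattern_id": "hipaa_mrn", "source": "compliance_bundle"},
]
# Even though both IDs are "suppressed", the compliance finding survives.
result = apply_suppressions(findings, {"ipv4_address", "hipaa_mrn"})
```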

The DLP pipeline is designed for real-time operation. Typical latencies:

| Tier | Typical latency | Notes |
| --- | --- | --- |
| Tier 1 (pattern matching) | Sub-millisecond | Runs in-process, no network hop |
| Tier 2 (entity recognition) | 60–160 ms | GPU-accelerated microservice |
| Tier 3 (contextual validation) | 17–90 ms | GPU-accelerated microservice |

All three tiers run with automatic health monitoring. If a GPU microservice becomes unavailable, the pipeline degrades gracefully — Tier 1 pattern matching always runs, and the remaining tiers contribute findings when available. This fail-open design keeps gateway latency bounded even during infrastructure issues.