DLP Overview
The Arbitex Data Loss Prevention (DLP) pipeline inspects every request and response passing through the gateway. It identifies sensitive data — credit card numbers, personally identifiable information, protected health information, credentials, and custom patterns you define — and produces a list of findings: entity type, confidence score, and position in the text.
The DLP pipeline handles detection. The Policy Engine handles enforcement — it receives the DLP findings and decides what action to take (block, redact, allow, or route). The two systems are complementary: the DLP pipeline finds sensitive data, and the Policy Engine acts on what was found.
```mermaid
flowchart TD
    Input["Text Input"]
    Input --> T1["Tier 1: Pattern Matching\n65+ regex patterns + checksum validation\n<1ms latency"]
    T1 --> T2["Tier 2: Entity Recognition (NER)\nGPU-accelerated microservice\n60-160ms latency"]
    T2 --> T3["Tier 3: Contextual Validation\nDeBERTa NLI model\n17-90ms latency"]
    T3 --> Combine["Combine all findings"]
    Combine --> PE["Policy Engine\nevaluates findings\nagainst policy chain"]

    style T1 fill:#e8f5e9
    style T2 fill:#fff3e0
    style T3 fill:#fce4ec
```

How the Pipeline Works
The DLP pipeline runs three inspection tiers. Each tier adds a different detection capability, and findings from all tiers are combined before an enforcement action is taken.
Tier 1: Pattern Matching
The first tier applies regular expression patterns against the full text of every request and response. These patterns detect structured sensitive data with well-defined formats:
- Payment card numbers (Visa, Mastercard, Amex, Discover, JCB, UnionPay, Diners Club) with Luhn checksum validation
- Government identifiers (Social Security numbers, passport formats, tax IDs)
- Financial identifiers (IBAN, SWIFT/BIC, routing numbers, ABA numbers)
- Healthcare identifiers (DEA numbers, NPI, medical record number formats)
- Network and infrastructure data (IPv4/IPv6 addresses, private keys, API tokens, connection strings)
Pattern matching runs in-process with sub-millisecond latency. Checksum validators (Luhn, IBAN, DEA, NPI, and others) reduce false positives by verifying that detected strings are structurally valid, not just format matches.
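The Luhn check mentioned above can be sketched as follows. This is an illustrative implementation of the standard algorithm, not the gateway's actual code, and the 12-digit minimum is an assumption for the sketch:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum.

    Illustrative sketch of the kind of checksum validation Tier 1
    applies after a regex match; not the gateway's actual code.
    """
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 12:  # too short to be a payment card number
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if the result > 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```

A regex match that fails this check (for example, a random 16-digit number) is discarded before it ever becomes a finding, which is how checksum validation cuts false positives.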
Tier 2: Entity Recognition
The second tier uses named entity recognition (NER) to detect unstructured sensitive data that regular expressions cannot reliably match:
- Person names in free text
- Organization names and corporate identifiers
- Addresses and location references
- Medical terms and clinical language
- Dates of birth and age references in context
Entity recognition runs as a GPU-accelerated microservice. It processes text in parallel with Tier 1 — the pipeline does not wait for pattern matching to complete before starting entity recognition.
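The parallel execution of Tiers 1 and 2 can be sketched with a thread pool. The tier functions here are hypothetical stand-ins (the real Tier 2 is a separate GPU microservice, not an in-process call):

```python
from concurrent.futures import ThreadPoolExecutor

def run_tier1(text: str) -> list[dict]:
    # Stand-in for in-process regex matching (Tier 1).
    return [{"entity_type": "credit_card", "tier": 1}]

def run_tier2(text: str) -> list[dict]:
    # Stand-in for the NER microservice call (Tier 2).
    return [{"entity_type": "person_name", "tier": 2}]

def detect(text: str) -> list[dict]:
    # Tiers 1 and 2 start together; neither waits for the other.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(run_tier1, text)
        f2 = pool.submit(run_tier2, text)
        return f1.result() + f2.result()
```

Because Tier 1 finishes in under a millisecond, total detection latency is dominated by the slower of the two GPU tiers rather than the sum of all tiers.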
Tier 3: Contextual Validation
The third tier applies a contextual natural language inference (NLI) model to validate findings from Tier 1 and Tier 2. This tier answers the question: “Is this finding actually sensitive data, or is it a coincidental match?”
For example, the string “Jordan Smith” might appear in a conversation about a basketball player (not sensitive) or in a medical record (sensitive). The contextual validator examines the surrounding text to determine whether the finding should be escalated or dismissed.
The contextual validator significantly reduces false positives without reducing detection coverage. It runs as a separate GPU-accelerated microservice with automatic health monitoring — if the service is unavailable, the pipeline continues with Tier 1 and Tier 2 findings only (fail-open).
How DLP Findings Drive Enforcement
The DLP pipeline produces findings. The Policy Engine consumes those findings and determines the enforcement action. This separation means you configure what to detect in the DLP pipeline and what to do about it in your policy rules.
A DLP finding includes:
| Field | Description |
|---|---|
| `entity_type` | The type of sensitive data detected (e.g., `credit_card`, `ssn`, `person_name`) |
| `tier` | Which DLP tier produced the detection (1, 2, or 3) |
| `confidence` | Detection confidence score (0.0–1.0) |
| `offset` | Character position in the text where the finding starts |
| `length` | Character length of the detected span |
Policy rules reference DLP findings through the `entity_types` and `entity_confidence_min` conditions. For example, a rule with `entity_types: ["credit_card"]` and `entity_confidence_min: 0.85` fires only when the DLP pipeline detects a credit card number with at least 85% confidence. The rule’s action (`BLOCK`, `REDACT`, `ALLOW`, `ROUTE_TO`) determines what happens next.
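The matching described above can be sketched as a predicate over findings. The field names come from the table; the full Policy Engine evaluates an entire policy chain, so this single-rule check is only an illustration:

```python
def rule_matches(rule: dict, findings: list[dict]) -> bool:
    """Check a rule's entity_types / entity_confidence_min conditions.

    Illustrative sketch; the actual Policy Engine evaluates a full
    policy chain, not one rule in isolation.
    """
    wanted = set(rule.get("entity_types", []))
    min_conf = rule.get("entity_confidence_min", 0.0)
    return any(
        f["entity_type"] in wanted and f["confidence"] >= min_conf
        for f in findings
    )

findings = [
    {"entity_type": "credit_card", "tier": 1, "confidence": 0.92,
     "offset": 17, "length": 16},
]
rule = {"entity_types": ["credit_card"],
        "entity_confidence_min": 0.85,
        "action": "BLOCK"}
# rule_matches(rule, findings) -> True: credit_card found at 0.92 >= 0.85
```

If the only finding had confidence 0.80, the rule would not fire and a later rule in the chain would be evaluated instead.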
See Policy Rule Reference for the full list of conditions that reference DLP findings, and Policy Engine overview for how enforcement decisions are made.
Detection Rule Sources
The DLP pipeline runs detection patterns from three sources:
- Compliance Bundle patterns — detection rules from activated compliance frameworks (PCI-DSS, HIPAA, etc.). These patterns are read-only and cannot be suppressed. They run alongside all other detection rules.
- Organization custom patterns — regular expressions you define for your specific environment (e.g., internal project names, proprietary identifiers).
- Platform default patterns — the built-in pattern library. You can suppress individual platform defaults at the organization level if they produce false positives in your environment.
All detection sources contribute findings to the same output list. The Policy Engine evaluates those findings against your policy chain to determine the enforcement action. Enforcement precedence is controlled by the policy chain sequence order, not the detection source. See Policy Engine overview — Conflict resolution by sequencing.
Configuring Custom Rules
Organization admins can define custom DLP rules that run alongside the platform defaults. Each custom rule specifies:
- Name: A human-readable label for the rule
- Pattern: A regular expression that defines what to detect
- Action: Block, redact, or log
- Severity: Critical, high, medium, or low (affects audit log priority and alert routing)
Custom rules are cached per organization with a 60-second TTL. Changes to rules take effect within one minute across all gateway instances.
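A per-organization cache with a 60-second TTL can be sketched as follows. The `load_rules` callable stands in for the real rule-store lookup, which this doc does not describe:

```python
import time

class RuleCache:
    """Per-organization rule cache with a 60-second TTL (sketch).

    `load_rules` is a hypothetical stand-in for the actual rule store.
    """
    TTL = 60.0

    def __init__(self, load_rules):
        self._load = load_rules
        self._cache = {}  # org_id -> (expires_at, rules)

    def get(self, org_id: str):
        now = time.monotonic()
        entry = self._cache.get(org_id)
        if entry and entry[0] > now:
            return entry[1]  # cache entry still fresh
        rules = self._load(org_id)  # refresh from the rule store
        self._cache[org_id] = (now + self.TTL, rules)
        return rules
```

A TTL of this shape is why rule changes propagate within one minute: each gateway instance re-reads an organization's rules no later than 60 seconds after its last fetch.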
Suppressing Platform Defaults
If a platform default pattern produces false positives in your environment, you can suppress it at the organization level. Suppressed patterns stop producing findings for your organization but remain active for all other organizations on the platform.
Important: Compliance Bundle detection patterns cannot be suppressed. If your organization has activated HIPAA compliance, the HIPAA-specific detection patterns run regardless of any suppression configuration.
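The precedence described here (suppression applies only to platform defaults, never to compliance-bundle patterns) can be sketched as a filter. The `source` and `pattern_id` fields are illustrative assumptions, not documented finding fields:

```python
def active_findings(findings, suppressed_pattern_ids):
    """Drop findings from suppressed platform-default patterns.

    Compliance-bundle and custom-pattern findings always pass through.
    Sketch only: 'source' and 'pattern_id' are hypothetical fields.
    """
    kept = []
    for f in findings:
        if (f["source"] == "platform_default"
                and f["pattern_id"] in suppressed_pattern_ids):
            continue  # suppressed by the organization
        kept.append(f)  # compliance/custom findings are never filtered
    return kept
```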
Pipeline Performance
The DLP pipeline is designed for real-time operation. Typical latencies:
| Tier | Typical Latency | Notes |
|---|---|---|
| Tier 1 (Pattern matching) | Sub-millisecond | Runs in-process, no network hop |
| Tier 2 (Entity recognition) | 60-160ms | GPU-accelerated microservice |
| Tier 3 (Contextual validation) | 17-90ms | GPU-accelerated microservice |
All three tiers run with automatic health monitoring. If a GPU microservice becomes unavailable, the pipeline degrades gracefully — Tier 1 pattern matching always runs, and the remaining tiers contribute findings when available. This fail-open design keeps gateway latency bounded even during infrastructure issues.
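The fail-open behavior can be sketched as follows, with hypothetical tier callables and `ConnectionError` standing in for whatever failure the health monitor surfaces:

```python
def run_pipeline(text, tier1, tier2, tier3_validate):
    """Fail-open sketch: Tier 1 always runs; GPU tiers are best-effort."""
    findings = tier1(text)  # in-process pattern matching, always available
    try:
        findings += tier2(text)
    except ConnectionError:
        pass  # NER service down: continue with Tier 1 findings only
    try:
        findings = tier3_validate(text, findings)
    except ConnectionError:
        pass  # validator down: keep unvalidated findings (fail-open)
    return findings
```

The design trade-off is explicit: an outage of a GPU tier reduces detection coverage or precision for its duration, but never blocks traffic or adds unbounded latency.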
See also
- Policy Engine overview — How DLP findings are consumed by the policy evaluation pipeline
- Policy Rule Reference — Conditions that reference DLP entity types and confidence scores
- Compliance Bundles — Pre-configured policy packs that include DLP detection patterns
- Audit Log — How DLP findings are recorded in audit entries
- Credential Intelligence — Breach corpus checks on credential-like material detected by DLP