DLP Regulatory Patterns
Arbitex ships four regulatory DLP detectors targeting US healthcare and tax identity data. These patterns extend the core DLP library with structured identifier formats that appear in enterprise AI workloads handling regulated data.
All four detectors are bundled in the regulatory pattern group. They are available in Outpost DLP rules without additional configuration when running a build that includes platform-0058 or later.
Detector Overview
| Detector | Entity Type Label | Compliance Framework | Default Sensitivity |
|---|---|---|---|
| Employer Identification Number | EIN | IRS, SOX | medium |
| National Drug Code | NDC | FDA, HIPAA | high |
| Healthcare Common Procedure Coding System | HCPCS | CMS, HIPAA | high |
| ICD-10 Diagnostic Code | ICD10 | WHO, HIPAA | high |
EIN — Employer Identification Number
Format
An EIN is a 9-digit number issued by the IRS to identify business entities for tax purposes.
Canonical format: XX-XXXXXXX

    12-3456789

The first two digits are the EIN prefix. The prefix is always numeric; it historically indicated the geographic area of the filer and now identifies the IRS campus or channel that assigned the number. The remaining seven digits are a serial number.
Accepted variants the detector normalizes:
| Variant | Example |
|---|---|
| Hyphenated (canonical) | 12-3456789 |
| Digit-only (no separator) | 123456789 |
| With spaces | 12 3456789 |
The detector matches all three variants. Normalization is applied before structural validation.
Validation Logic
EINs do not have a mathematical checksum, so validation is structural:
- Length: exactly 9 digits after stripping non-numeric characters.
- Prefix exclusion: reject prefixes `00`, `07`, `08`, `09`, `17`, `18`, `19`, `28`, `29`, `49`, `69`, `70`, `78`, `79`, `89`, `96`, `97`. These are unassigned or retired by the IRS.
- All-same-digit rejection: `11-1111111`, `99-9999999`, etc. are invalid.
- Context scoring: presence of adjacent tokens (`EIN`, `FEIN`, `tax id`, `employer id`, `federal tax`) raises the confidence score. Isolated digit strings without context are scored lower and may fall below the detection threshold.
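The structural checks above can be sketched in a few lines of Python. This is an illustrative sketch, not the Arbitex implementation: the function names, the 40-character context window, and the boolean `has_context` flag (standing in for a real confidence score) are all assumptions made for the example.

```python
import re

# Prefixes listed above as unassigned or retired.
INVALID_PREFIXES = {
    "00", "07", "08", "09", "17", "18", "19", "28", "29",
    "49", "69", "70", "78", "79", "89", "96", "97",
}

# Context tokens from the list above, matched on word boundaries.
CONTEXT_RE = re.compile(
    r"\b(?:ein|fein|tax id|employer id|federal tax)\b", re.IGNORECASE
)

# Matches the three accepted variants: 12-3456789, 12 3456789, 123456789.
EIN_RE = re.compile(r"(?<!\d)(\d{2})[- ]?(\d{7})(?!\d)")


def is_structurally_valid(candidate: str) -> bool:
    """Apply the length, prefix-exclusion, and all-same-digit checks."""
    digits = re.sub(r"\D", "", candidate)
    if len(digits) != 9:
        return False
    if digits[:2] in INVALID_PREFIXES:
        return False
    if len(set(digits)) == 1:
        return False
    return True


def find_eins(text: str, window: int = 40):
    """Yield (normalized_ein, has_context) for each structurally valid candidate.

    `window` is a hypothetical context radius in characters.
    """
    for m in EIN_RE.finditer(text):
        if not is_structurally_valid(m.group(0)):
            continue
        nearby = text[max(0, m.start() - window): m.end() + window]
        has_context = CONTEXT_RE.search(nearby) is not None
        yield f"{m.group(1)}-{m.group(2)}", has_context
```

A caller could treat `has_context` as the input to a `context_required` style rule: candidates with `False` would be dropped when context is mandatory.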
Compliance Framework Mapping
| Framework | Relevance |
|---|---|
| IRS regulations | EINs are federal tax identifiers; fraudulent use of an EIN can carry federal penalties |
| SOX | SOX § 802 audit trail requirements cover financial identifiers including EINs in tax filings |
| GLBA | EINs in loan origination and banking contexts fall within GLBA financial data scope |
| PCI-DSS | EINs are not PCI-scoped, but may appear in merchant onboarding flows alongside PAN data |
False Positive Mitigation
Common false positive sources:
- Phone numbers: a complete US phone number has 10 digits, but truncated or partially extracted numbers can produce 9-digit strings that resemble the `XX-XXXXXXX` format. The detector rejects strings where the prefix matches known US area codes (200–999 range, excluding unassigned). Context tokens help distinguish the remaining cases.
- Account numbers: internal account numbering schemes sometimes produce 9-digit hyphenated IDs. If your systems use a similar scheme, add those formats to the DLP allowlist (`entity_allowlist` in your policy bundle).
- ZIP+4 codes: the hyphenated `XXXXX-XXXX` format does not match the 2+7 structure, so these are not triggered. A ZIP+4 written without the hyphen is a bare 9-digit string and is treated like any other digit-only candidate.
Recommended mitigation:

    # In policy bundle: suppress EIN detection in specific contexts
    dlp_rules:
      - entity_type: EIN
        context_required: true   # only fire with adjacent EIN keywords
        min_confidence: 0.85     # raise threshold above default 0.70

NDC — National Drug Code
Format
An NDC is a product identifier for drug products, used by pharmaceutical manufacturers, distributors, and packagers. The FDA format has 10 digits; the billing format used in claims has 11.
Three-segment structure: LABELER-PRODUCT-PACKAGE
The FDA defines three hyphenation formats:
| Format | Segment split | Example |
|---|---|---|
| 4-4-2 | Labeler (4) + Product (4) + Package (2) | 0002-3227-01 |
| 5-3-2 | Labeler (5) + Product (3) + Package (2) | 12345-678-90 |
| 5-4-1 | Labeler (5) + Product (4) + Package (1) | 00002-3227-1 |
The FDA NDC database uses these 10-digit representations; claims processing uses an 11-digit 5-4-2 form, created by padding the short segment with a leading zero (for example, `0002-3227-01` becomes `00002-3227-01`). The detector matches all three hyphenation formats as well as digit-only strings without hyphens.
Accepted variants:
| Variant | Example |
|---|---|
| 4-4-2 | 0002-3227-01 |
| 5-3-2 | 12345-678-90 |
| 5-4-2 (11-digit billing form) | 00002-3227-01 |
| 5-4-1 | 00002-3227-1 |
| 11-digit no separator | 00002322701 |
Validation Logic
- Digit count: after stripping hyphens, exactly 10 or 11 digits.
- Segment bounds: each segment has the digit count expected for the detected format.
- Labeler prefix validation: the detector checks the labeler segment (the first 5 digits of the 11-digit form) against the FDA Labeler Code registry snapshot (updated with each Arbitex release). Labeler codes not in the registry reduce confidence but do not suppress detection at default thresholds.
- Package code: the package segment must be numeric (`00` through `99`); alpha characters in the package position are rejected.
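The digit-count and segment-bounds checks can be illustrated with a small normalizer that converts a hyphenated NDC to the 11-digit 5-4-2 form. This is a sketch under stated assumptions, not the detector's code: the function name `to_11_digit` is hypothetical, and the labeler registry lookup is left out because it depends on an external snapshot.

```python
import re
from typing import Optional

# One to three hyphen-separated digit groups in labeler-product-package order.
NDC_HYPHENATED = re.compile(r"^(\d{4,5})-(\d{3,4})-(\d{1,2})$")

# The three FDA 10-digit layouts plus the 11-digit 5-4-2 billing layout.
VALID_LAYOUTS = {(4, 4, 2), (5, 3, 2), (5, 4, 1), (5, 4, 2)}


def to_11_digit(ndc: str) -> Optional[str]:
    """Return the 11-digit form of a hyphenated NDC, or None if the layout is invalid."""
    m = NDC_HYPHENATED.match(ndc.strip())
    if not m:
        return None
    labeler, product, package = m.groups()
    if (len(labeler), len(product), len(package)) not in VALID_LAYOUTS:
        return None
    # Zero-pad whichever segment is short so the result is always 5-4-2.
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)
```

The sketch only accepts hyphenated input on purpose: a 10-digit string without hyphens does not reveal which segment is short, so it cannot be padded to 11 digits unambiguously.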
Compliance Framework Mapping
| Framework | Relevance |
|---|---|
| HIPAA | NDCs appear in ePHI: prescriptions, pharmacy claims (NCPDP), and medication reconciliation records. Redaction is required when ePHI is processed outside covered entity context. |
| DEA regulations | Controlled substance NDCs (Schedule II–V) have additional handling requirements under DEA 21 CFR. |
| FDA 21 CFR Part 207 | NDC assignment and disclosure rules govern how the code itself may be used commercially. |
False Positive Mitigation
- ZIP codes: 5-digit strings do not match NDC segment structure.
- Phone numbers: 10-digit strings in phone format (area code 200–999) are excluded by the phone-number prior.
- Internal product codes: if your internal SKU scheme produces NDC-shaped strings, add a regex allowlist entry scoped to your product code prefix range.

    dlp_rules:
      - entity_type: NDC
        context_required: true   # require 'NDC', 'drug code', 'medication' nearby
        entity_allowlist:
          - regex: "^9999-"      # suppress internal SKU prefix 9999

HCPCS — Healthcare Common Procedure Coding System
Format
HCPCS (pronounced “hick-picks”) codes are five-character alphanumeric identifiers used for Medicare and Medicaid billing. They divide into two levels:
| Level | Range | Description |
|---|---|---|
| Level I | 00100–99499 | CPT codes (AMA-maintained) — numeric 5-digit |
| Level II | A0000–V9999 | CMS alphanumeric — letter + 4 digits |
Examples:

    99213   # Level I: Office visit, established patient (15 min)
    J0696   # Level II: Injection, ceftriaxone sodium
    A0427   # Level II: Ambulance transport, ALS1

The detector matches both levels. Level II codes carry a single uppercase letter prefix drawn from the set listed under Validation Logic; letters outside that set, including B, D, I, and O, are not matched.
Validation Logic
- Level I: 5-digit numeric, range `00100`–`99499`. Codes outside CPT-assigned ranges reduce confidence.
- Level II: letter in `[ACEFGHJKLMNPQRSUVWX]` followed by 4 digits. The full 5-character string must match the pattern.
- Modifier codes: 2-character modifiers (e.g., `LT`, `RT`, `59`) appended after a space or hyphen are stripped before matching; the modifier itself is not detected as a separate entity.
- Context scoring: adjacent tokens `CPT`, `HCPCS`, `procedure code`, `billing code`, `claim` raise confidence.
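The pattern rules above translate fairly directly into regular expressions. The following is a minimal sketch with a hypothetical `classify_hcpcs` helper; it simplifies one point by rejecting Level I values outside `00100`–`99499` outright, whereas the detector is described as only reducing confidence for them, and it omits context scoring.

```python
import re
from typing import Optional

LEVEL_I = re.compile(r"\d{5}")
LEVEL_II = re.compile(r"[ACEFGHJKLMNPQRSUVWX]\d{4}")
# A 2-character modifier appended after a space or hyphen, e.g. "-59" or " LT".
MODIFIER = re.compile(r"[ -][A-Z0-9]{2}$")


def classify_hcpcs(token: str) -> Optional[str]:
    """Return "Level I", "Level II", or None for a single candidate token."""
    # Strip a trailing modifier before matching the 5-character code.
    code = MODIFIER.sub("", token.strip())
    if LEVEL_I.fullmatch(code):
        # Simplification: treat out-of-range numeric codes as non-matches.
        return "Level I" if 100 <= int(code) <= 99499 else None
    if LEVEL_II.fullmatch(code):
        return "Level II"
    return None
```

Because a bare 5-digit string passes the Level I check, a production rule would still need the context tokens above to separate procedure codes from ZIP codes and similar numbers.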
Compliance Framework Mapping
| Framework | Relevance |
|---|---|
| HIPAA | HCPCS codes appear in claims and remittances (EDI 837/835) — ePHI when paired with patient identifiers |
| CMS regulations | Improper use of HCPCS codes constitutes Medicare/Medicaid fraud (False Claims Act exposure) |
| SOX | Healthcare revenue recognition tied to HCPCS billing codes — relevant for public healthcare companies |
False Positive Mitigation
- ZIP codes: 5-digit numeric strings can collide with Level I codes. Without medical context tokens, confidence is suppressed below the default detection threshold.
- Short product codes: the Level II format (`A0000`) is narrow enough that accidental collision with non-medical product codes is rare. The letter-prefix exclusion list (B, D, I, O) further reduces overlap.
- Version identifiers: software version strings like `V2.4.0` are not 5-character codes and do not match.
ICD-10 — International Classification of Diseases, 10th Revision
Format
ICD-10 codes are 3–7 alphanumeric characters identifying diagnoses (ICD-10-CM) and procedures (ICD-10-PCS).
ICD-10-CM (Clinical Modification, diagnoses):

    A00.0     # Cholera due to Vibrio cholerae 01, biovar cholerae
    M54.5     # Low back pain
    Z23       # Encounter for immunization (3-character, no decimal)
    S52.501A  # Unspecified fracture of lower end of right radius, initial encounter

Structure: [Letter][2 digits].[Up to 4 additional alphanumeric chars]
ICD-10-PCS (Procedure Coding System, inpatient procedures):

    0BH17EZ   # Insertion of endotracheal airway into trachea, via natural or artificial opening

Structure: 7 alphanumeric characters, no decimal.
Validation Logic
ICD-10-CM:
- Header: first character is a letter (A–Z).
- Core: second and third characters are digits.
- Decimal expansion: optional decimal after position 3, followed by 1–4 alphanumeric characters.
- Chapter validation: the first character maps to an ICD-10 chapter. U was originally reserved, but chapter U codes (`U00`–`U99`) are now used for COVID-19 and provisional assignments, so the detector matches every letter A–Z.
- Length: minimum 3 characters (header + 2 digits); maximum 7 characters, not counting the decimal point.

ICD-10-PCS:
- Length: exactly 7 alphanumeric characters.
- Character set: digits and uppercase letters only; lowercase is not valid.
- Section code: first character maps to a valid PCS section (0–9, B–D, F–H, X).

Shared logic:
- Context scoring: adjacent tokens `ICD`, `ICD-10`, `diagnosis code`, `DX`, `procedure code`, `PX`, `encounter for` raise confidence materially.
- Isolated short codes: 3-character codes without context are scored below the default threshold to reduce false positives from accidental 3-character alphanumeric collisions.
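As a rough illustration of the CM and PCS structure rules, the sketch below classifies a single token with two regular expressions. The helper name `classify_icd10` is hypothetical, and the context scoring and short-code threshold described under shared logic are intentionally left out.

```python
import re
from typing import Optional

# ICD-10-CM: letter, two digits, optional decimal followed by 1-4 alphanumerics.
ICD10_CM = re.compile(r"[A-Z]\d{2}(?:\.[A-Z0-9]{1,4})?")

# ICD-10-PCS: seven uppercase alphanumerics; first character is a listed
# section code (0-9, B-D, F-H, X).
ICD10_PCS = re.compile(r"[0-9B-DF-HX][0-9A-Z]{6}")


def classify_icd10(token: str) -> Optional[str]:
    """Return "CM", "PCS", or None for a single candidate token."""
    if ICD10_CM.fullmatch(token):
        return "CM"
    if ICD10_PCS.fullmatch(token):
        return "PCS"
    return None
```

Both patterns are case-sensitive, so lowercase input such as `0bh17ez` is rejected without any extra check.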
Compliance Framework Mapping
| Framework | Relevance |
|---|---|
| HIPAA | ICD-10 codes are ePHI when paired with patient identifiers; disclosure without authorization violates HIPAA Privacy Rule |
| CMS / Medicare | ICD-10-CM is mandatory for all HIPAA-covered transactions since October 2015 |
| SOX | ICD-10-based revenue cycle data is material financial information for public healthcare companies |
| FDA post-market surveillance | ICD-10 codes can appear in adverse event reports; medical device reports are governed by FDA 21 CFR Part 803 |
False Positive Mitigation
- Short alphanumeric IDs: 3-character codes with a letter prefix match a wide range of identifier formats. Require `context_required: true` in production policies.
- Version strings: version-like strings with a letter and two digits before the dot, such as `A12.3` or `B30.0`, match short CM codes; shorter forms such as `A1.2` lack the two-digit core and do not match. If version numbers cause false positives, add a context token requirement.
- Ticker symbols: stock tickers (e.g., `CVS`, `UNH`) are letters only; the CM format requires digits in positions 2–3, so letter-only tickers do not match. Lowercase and mixed-case strings are also excluded.

    dlp_rules:
      - entity_type: ICD10
        context_required: true   # require 'ICD', 'DX', 'diagnosis' nearby
        min_confidence: 0.80

Configuring Regulatory Detectors
Enable in Policy Bundle
All four detectors are available when the regulatory pattern group is loaded. In your policy bundle:

    {
      "dlp_config": {
        "enabled_groups": ["pii", "financial", "regulatory"],
        "entity_sensitivity": {
          "EIN": "medium",
          "NDC": "high",
          "HCPCS": "high",
          "ICD10": "high"
        }
      }
    }

Per-Detector Tuning

    {
      "dlp_config": {
        "entity_overrides": {
          "EIN": { "min_confidence": 0.85, "context_required": true },
          "ICD10": { "min_confidence": 0.80, "context_required": true }
        }
      }
    }

Verify Detection
Use the DLP simulation endpoint to test patterns:

    curl -X POST https://api.arbitex.ai/api/admin/dlp/simulate \
      -H "Authorization: Bearer $ADMIN_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "text": "Patient prescribed J0696 for infection, diagnosis M54.5, NDC 00002-3227-01",
        "entity_types": ["NDC", "HCPCS", "ICD10"]
      }'

The response includes matched spans, entity type, confidence score, and the context tokens that influenced scoring.
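For scripted checks, the same request can be assembled with the Python standard library. The sketch below mirrors the curl call and only builds the request object; the helper name `build_simulate_request` is hypothetical, and nothing is assumed about the response beyond what is described above.

```python
import json
import urllib.request

# Endpoint taken from the curl example.
SIMULATE_URL = "https://api.arbitex.ai/api/admin/dlp/simulate"


def build_simulate_request(text, entity_types, admin_token):
    """Build (but do not send) the POST request equivalent to the curl call."""
    payload = json.dumps({"text": text, "entity_types": entity_types}).encode("utf-8")
    return urllib.request.Request(
        SIMULATE_URL,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; keeping construction separate from sending makes the payload easy to inspect in tests before any network call is made.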