DLP Regulatory Patterns

Arbitex ships four regulatory DLP detectors targeting US healthcare and tax identity data. These patterns extend the core DLP library with structured identifier formats that appear in enterprise AI workloads handling regulated data.

All four detectors are bundled in the regulatory pattern group. They are available in Outpost DLP rules without additional configuration when running a build that includes platform-0058 or later.


| Detector | Entity Type Label | Compliance Framework | Default Sensitivity |
|---|---|---|---|
| Employer Identification Number | EIN | IRS, SOX | medium |
| National Drug Code | NDC | FDA, HIPAA | high |
| Healthcare Common Procedure Coding System | HCPCS | CMS, HIPAA | high |
| ICD-10 Diagnostic Code | ICD10 | WHO, HIPAA | high |

An EIN is a 9-digit number issued by the IRS to identify business entities for tax purposes.

Canonical format: XX-XXXXXXX

12-3456789

The first two digits are the EIN prefix. It formerly indicated the filing location and now identifies the IRS campus or channel that assigned the number. The remaining seven digits are a serial number.

Accepted variants the detector normalizes:

| Variant | Example |
|---|---|
| Hyphenated (canonical) | 12-3456789 |
| Digit-only (no separator) | 123456789 |
| With spaces | 12 3456789 |

The detector matches all three variants. Normalization is applied before structural validation.

EINs do not have a mathematical checksum, so validation is structural:

  1. Length — exactly 9 digits after stripping non-numeric characters.
  2. Prefix exclusion — reject prefixes 00, 07, 08, 09, 17, 18, 19, 28, 29, 49, 69, 70, 78, 79, 89, 96, 97 — these are unassigned or retired by the IRS.
  3. All-same-digit rejection — 11-1111111, 99-9999999, etc. are invalid.
  4. Context scoring — presence of adjacent tokens (EIN, FEIN, tax id, employer id, federal tax) raises the confidence score. Isolated digit strings without context are scored lower and may fall below the detection threshold.
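The four checks above can be sketched in a few lines of Python. This is an illustration only, not the Arbitex implementation: the function names, the 40-character context window, and the 0.5 / 0.9 scores are assumptions chosen so that the documented default threshold of 0.70 separates the two cases.

```python
import re

# Prefixes listed in the prefix-exclusion rule.
EXCLUDED_PREFIXES = {
    "00", "07", "08", "09", "17", "18", "19", "28", "29",
    "49", "69", "70", "78", "79", "89", "96", "97",
}

# Context tokens named in the context-scoring rule, matched as whole words.
CONTEXT = re.compile(
    r"\b(?:ein|fein|tax id|employer id|federal tax)\b", re.IGNORECASE
)

# Candidate shapes: hyphenated, space-separated, or digit-only.
EIN_CANDIDATE = re.compile(r"(?<![0-9])[0-9]{2}[- ]?[0-9]{7}(?![0-9])")


def is_structurally_valid(candidate: str) -> bool:
    """Apply the length, prefix, and all-same-digit checks."""
    digits = re.sub(r"[^0-9]", "", candidate)  # normalization step
    if len(digits) != 9:
        return False
    if digits[:2] in EXCLUDED_PREFIXES:
        return False
    return len(set(digits)) > 1


def score_ein(text: str, start: int, end: int, window: int = 40) -> float:
    """Hypothetical scoring: 0.9 with a nearby context token, 0.5 without."""
    if not is_structurally_valid(text[start:end]):
        return 0.0
    nearby = text[max(0, start - window):end + window]
    return 0.9 if CONTEXT.search(nearby) else 0.5


def find_eins(text: str, threshold: float = 0.70):
    """Return (match, score) pairs that reach the threshold."""
    hits = []
    for m in EIN_CANDIDATE.finditer(text):
        score = score_ein(text, m.start(), m.end())
        if score >= threshold:
            hits.append((m.group(), score))
    return hits
```

With these assumed scores, a candidate next to the word "EIN" is reported, while the same digits in isolation fall below the threshold.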

| Framework | Relevance |
|---|---|
| IRS regulations | EINs are federal tax identifiers; their unauthorized disclosure or use is a federal offense |
| SOX | SOX § 802 audit trail requirements cover financial identifiers including EINs in tax filings |
| GLBA | EINs in loan origination and banking contexts fall within GLBA financial data scope |
| PCI-DSS | EINs are not PCI-scoped, but may appear in merchant onboarding flows alongside PAN data |

Common false positive sources:

  • Phone numbers — a full US phone number has 10 digits, so it does not fit the 9-digit XX-XXXXXXX shape; collisions come from truncated or irregularly separated phone fragments. The detector rejects strings where the prefix matches known US area codes (200–999 range, excluding unassigned). Context tokens help distinguish the remaining cases.
  • Account numbers — Internal account numbering schemes sometimes produce 9-digit hyphenated IDs. If your systems use a similar scheme, add those formats to the DLP allowlist (entity_allowlist in your policy bundle).
  • ZIP+4 codes — the XXXXX-XXXX format does not match the 2+7 structure, so these do not trigger the detector.

Recommended mitigation:

# In policy bundle: suppress EIN detection in specific contexts
dlp_rules:
  - entity_type: EIN
    context_required: true # only fire with adjacent EIN keywords
    min_confidence: 0.85 # raise threshold above default 0.70

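An NDC is a three-segment product identifier for drugs, written as 10 digits in FDA listings and as 11 digits in claims processing. The FDA assigns the labeler segment to manufacturers, repackagers, and distributors; the labeler assigns the product and package segments.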

Three-segment structure: LABELER-PRODUCT-PACKAGE

The FDA defines three hyphenation formats:

| Format | Segment split | Example |
|---|---|---|
| 4-4-2 | Labeler (4) + Product (4) + Package (2) | 0002-3227-01 |
| 5-3-2 | Labeler (5) + Product (3) + Package (2) | 00002-322-01 |
| 5-4-1 | Labeler (5) + Product (4) + Package (1) | 00002-3227-1 |

All three FDA formats contain 10 digits. Claims processing uses an 11-digit form, which pads the short segment with a leading zero to give a 5-4-2 layout. The detector matches all three hyphenation formats as well as unhyphenated strings of 10 or 11 digits.
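The 10-digit and 11-digit forms are related by zero-padding: each FDA format has one segment that is shorter than its slot in the 11-digit 5-4-2 layout, and that segment gains a leading zero. The sketch below shows the conversion under that assumption. The function name is hypothetical, and nothing here is taken from the Arbitex detector itself.

```python
FDA_SHAPES = {(4, 4, 2), (5, 3, 2), (5, 4, 1)}


def ndc10_to_ndc11(hyphenated: str) -> str:
    """Pad a hyphenated 10-digit NDC to the 11-digit 5-4-2 layout."""
    parts = hyphenated.split("-")
    if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"not a three-segment NDC: {hyphenated!r}")
    shape = tuple(len(p) for p in parts)
    if shape not in FDA_SHAPES:
        raise ValueError(f"unrecognized NDC format: {shape}")
    labeler, product, package = parts
    # Only the short segment changes; zfill leaves full-width segments alone.
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)
```

The hyphens carry information: an unhyphenated 10-digit string does not say which segment is short, so it cannot be padded to 11 digits unambiguously.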

Accepted variants:

| Variant | Example |
|---|---|
| 4-4-2 | 0002-3227-01 |
| 5-3-2 | 00002-322-01 |
| 5-4-1 | 00002-3227-1 |
| 11-digit no separator | 00002322701 |

  1. Digit count — after stripping hyphens, exactly 10 or 11 digits.
  2. Segment bounds — each segment within the expected digit-count per the detected format.
  3. Labeler prefix validation — the detector checks the first 5 digits against the FDA Labeler Code registry snapshot (updated with each Arbitex release). Labeler codes not in the registry reduce confidence but do not suppress detection at default thresholds.
  4. Package code — package segment 00 through 99 only; alpha characters in package position are rejected.
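As an illustration of the structural checks above, the following sketch classifies a candidate and lowers its confidence when the labeler is unknown. Several parts are assumptions rather than documented behavior: the `KNOWN_LABELERS` set stands in for the registry snapshot, the 0.8 and 0.7 values are placeholders, and the 5-4-2 pattern is included only because the digit-count rule allows 11 digits.

```python
import re

# Segment widths for the three FDA hyphenation formats, plus the
# 11-digit 5-4-2 claims layout (assumption: accepted when hyphenated).
FORMAT_PATTERNS = {
    "4-4-2": re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{2}"),
    "5-3-2": re.compile(r"[0-9]{5}-[0-9]{3}-[0-9]{2}"),
    "5-4-1": re.compile(r"[0-9]{5}-[0-9]{4}-[0-9]"),
    "5-4-2": re.compile(r"[0-9]{5}-[0-9]{4}-[0-9]{2}"),
}

# Hypothetical stand-in for the labeler registry snapshot.
KNOWN_LABELERS = {"00002"}


def check_ndc(candidate: str):
    """Return (format_name, confidence), or None if the structure is invalid."""
    digits = candidate.replace("-", "")
    if not re.fullmatch(r"[0-9]{10,11}", digits):
        return None  # digit-count rule; also rejects letters in any segment
    if "-" in candidate:
        fmt = next(
            (name for name, pattern in FORMAT_PATTERNS.items()
             if pattern.fullmatch(candidate)),
            None,
        )
        if fmt is None:
            return None  # segment-bounds rule
        labeler = candidate.split("-")[0].zfill(5)
    else:
        fmt = "unhyphenated"
        # A bare 10-digit string does not reveal where the labeler ends.
        labeler = digits[:5] if len(digits) == 11 else None
    # Unknown labelers reduce confidence but do not suppress the match.
    return fmt, (0.8 if labeler in KNOWN_LABELERS else 0.7)
```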

| Framework | Relevance |
|---|---|
| HIPAA | NDCs appear in ePHI: prescriptions, pharmacy claims (NCPDP), and medication reconciliation records. Redaction is required when ePHI is processed outside covered entity context. |
| DEA regulations | Controlled substance NDCs (Schedule II–V) have additional handling requirements under DEA 21 CFR. |
| FDA 21 CFR Part 207 | NDC assignment and disclosure rules govern how the code itself may be used commercially. |

Common false positive sources:

  • ZIP codes — 5-digit strings do not match NDC segment structure.
  • Phone numbers — 10-digit strings in phone format (area code 200–999) are excluded by the phone-number prior.
  • Internal product codes — if your internal SKU scheme produces NDC-shaped strings, add a regex allowlist entry scoped to your product code prefix range.

Recommended mitigation:

dlp_rules:
  - entity_type: NDC
    context_required: true # require 'NDC', 'drug code', 'medication' nearby
entity_allowlist:
  - regex: "^9999-" # suppress internal SKU prefix 9999

HCPCS — Healthcare Common Procedure Coding System

HCPCS (pronounced “hick-picks”) codes are five-character alphanumeric identifiers used for Medicare and Medicaid billing. They divide into two levels:

| Level | Range | Description |
|---|---|---|
| Level I | 00100–99499 | CPT codes (AMA-maintained) — numeric 5-digit |
| Level II | A0000–V9999 | CMS alphanumeric — letter + 4 digits |

Examples:

99213 # Level I: Office visit, established patient (15 min)
J0696 # Level II: Injection, ceftriaxone sodium
A0427 # Level II: Ambulance transport, ALS1

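The detector matches both levels. Level II codes carry a single uppercase letter prefix. The detector excludes the prefixes B, D, I, and O; T is also outside the accepted set, although CMS has since assigned T codes. The full set of accepted letters is the character class given in the Level II validation rule.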

  1. Level I — 5-digit numeric, range 00100–99499. Codes outside CPT-assigned ranges reduce confidence.
  2. Level II — letter in [ACEFGHJKLMNPQRSUVWX] followed by 4 digits. The full 5-character string must match the pattern.
  3. Modifier codes — 2-character modifiers (e.g., LT, RT, 59) appended after a space or hyphen are stripped before matching; the modifier itself is not detected as a separate entity.
  4. Context scoring — adjacent tokens CPT, HCPCS, procedure code, billing code, claim raise confidence.
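The format rules above translate fairly directly into regular expressions. The sketch below is an illustration with hypothetical names: it uses the Level II character class exactly as listed, strips a trailing modifier before matching, and treats the 00100–99499 bound as a hard limit (an assumption; it does not model CPT-assigned sub-ranges or context scoring).

```python
import re

LEVEL_I = re.compile(r"[0-9]{5}")
LEVEL_II = re.compile(r"[ACEFGHJKLMNPQRSUVWX][0-9]{4}")
# Optional 2-character modifier after a space or hyphen,
# e.g. "99213-59" or "J0696 LT".
MODIFIER = re.compile(r"[ -][A-Z0-9]{2}$")


def classify_hcpcs(candidate: str):
    """Return 'level_i', 'level_ii', or None for a single candidate token."""
    code = MODIFIER.sub("", candidate.strip())
    if LEVEL_I.fullmatch(code):
        return "level_i" if 100 <= int(code) <= 99499 else None
    if LEVEL_II.fullmatch(code):
        return "level_ii"
    return None
```

Because a bare 5-digit number passes the Level I check, a real detector would still need the context scoring described above to separate codes from ZIP codes and other numeric IDs.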

| Framework | Relevance |
|---|---|
| HIPAA | HCPCS codes appear in claims and remittances (EDI 837/835) — ePHI when paired with patient identifiers |
| CMS regulations | Improper use of HCPCS codes constitutes Medicare/Medicaid fraud (False Claims Act exposure) |
| SOX | Healthcare revenue recognition tied to HCPCS billing codes — relevant for public healthcare companies |

Common false positive sources:

  • Zip codes — 5-digit numeric strings can collide with Level I codes. Without medical context tokens, confidence is suppressed below the default detection threshold.
  • Short product codes — Level II format (A0000) is narrow enough that accidental collision with non-medical product codes is rare. The letter-prefix exclusion list (B, D, I, O) further reduces overlap.
  • Version identifiers — Software version strings like V2.4.0 are not 5-character and do not match.

ICD-10 — International Classification of Diseases, 10th Revision

ICD-10 codes are 3–7 alphanumeric characters identifying diagnoses (ICD-10-CM) and procedures (ICD-10-PCS).

ICD-10-CM (Clinical Modification — diagnoses):

A00.0 # Cholera due to Vibrio cholerae 01, biovar cholerae
M54.5 # Low back pain
Z23 # Encounter for immunization (3-character, no decimal)
S52.501A # Unspecified fracture of the lower end of right radius, initial encounter for closed fracture

Structure: [Letter][2 digits].[Up to 4 additional alphanumeric chars]

ICD-10-PCS (Procedure Coding System — inpatient procedures):

0BH17EZ # Insertion of endotracheal airway into trachea, via natural or artificial opening

Structure: 7 alphanumeric characters, no decimal.

ICD-10-CM:

  1. Header — first character is a letter (A–Z).
  2. Core — second and third characters are digits.
  3. Decimal expansion — optional decimal after position 3, followed by 1–4 alphanumeric characters.
  4. Chapter validation — the first character maps to a WHO ICD-10 chapter. Every letter A–Z is accepted, including U: it was originally reserved, but codes U00–U99 are now used for COVID-19 and provisional assignments, so the detector matches them.
  5. Length — minimum 3 characters (header + 2 digits); maximum 7 characters plus decimal.

ICD-10-PCS:

  1. Length — exactly 7 alphanumeric characters.
  2. Character set — digits and uppercase letters only; lowercase not valid.
  3. Section code — first character maps to a valid PCS section (0–9, B–D, F–H, X).

Shared logic:

  • Context scoring — adjacent tokens ICD, ICD-10, diagnosis code, DX, procedure code, PX, encounter for raise confidence materially.
  • Isolated short codes — 3-character codes without context are scored below default threshold to reduce false positives from accidental 3-character alphanumeric collisions.
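The CM and PCS rules above can be expressed as two patterns plus a check for short codes. This is a minimal sketch with hypothetical names. It assumes CM codes are uppercase, which the rules state only for PCS, and it leaves out the context scoring itself.

```python
import re

# ICD-10-CM: letter, two digits, optional decimal with 1-4 alphanumerics.
ICD10_CM = re.compile(r"[A-Z][0-9]{2}(?:\.[A-Z0-9]{1,4})?")
# ICD-10-PCS: 7 uppercase alphanumerics; first character is a valid
# section (0-9, B-D, F-H, X).
ICD10_PCS = re.compile(r"[0-9BCDFGHX][A-Z0-9]{6}")


def classify_icd10(candidate: str):
    """Return 'cm', 'pcs', or None for a single candidate token."""
    if ICD10_CM.fullmatch(candidate):
        return "cm"
    if ICD10_PCS.fullmatch(candidate):
        return "pcs"
    return None


def needs_context(candidate: str) -> bool:
    """3-character CM codes are only trusted with a nearby context token."""
    return classify_icd10(candidate) == "cm" and len(candidate) == 3
```

Under these patterns, `Z23` is a valid CM code that still requires context, while `CVS` and `A1.2` do not match at all because positions 2 and 3 are not both digits.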

| Framework | Relevance |
|---|---|
| HIPAA | ICD-10 codes are ePHI when paired with patient identifiers; disclosure without authorization violates HIPAA Privacy Rule |
| CMS / Medicare | ICD-10-CM has been mandatory for all HIPAA-covered transactions since October 2015 |
| SOX | ICD-10-based revenue cycle data is material financial information for public healthcare companies |
| FDA post-market surveillance | ICD-10 codes appear in adverse event reports (FAERS); handling governed by FDA 21 CFR Part 803 |

Common false positive sources:

  • Short alphanumeric IDs — 3-character codes with a letter prefix match a wide range of identifier formats. Require context_required: true in production policies.
  • Version strings — identifiers shaped like A10.2 or B30.0 (letter, two digits, decimal part) match short CM codes, while shorter forms such as A1.2 lack the two-digit core and do not. If false positives occur from version numbers, use a context token requirement.
  • Ticker symbols — Stock tickers (e.g., CVS, UNH) are 3-character letter-only; the CM format requires 2 digits in positions 2–3, so most tickers are excluded. Mixed-case tickers are fully excluded.

Recommended mitigation:

dlp_rules:
  - entity_type: ICD10
    context_required: true # require 'ICD', 'DX', 'diagnosis' nearby
    min_confidence: 0.80

All four detectors are available when the regulatory pattern group is loaded. In your policy bundle:

{
  "dlp_config": {
    "enabled_groups": ["pii", "financial", "regulatory"],
    "entity_sensitivity": {
      "EIN": "medium",
      "NDC": "high",
      "HCPCS": "high",
      "ICD10": "high"
    }
  }
}

To override the confidence threshold and context requirement per entity:

{
  "dlp_config": {
    "entity_overrides": {
      "EIN": {
        "min_confidence": 0.85,
        "context_required": true
      },
      "ICD10": {
        "min_confidence": 0.80,
        "context_required": true
      }
    }
  }
}

Use the DLP simulation endpoint to test patterns:

curl -X POST https://api.arbitex.ai/api/admin/dlp/simulate \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Patient prescribed J0696 for infection, diagnosis M54.5, NDC 00002-3227-01",
    "entity_types": ["NDC", "HCPCS", "ICD10"]
  }'

The response includes the matched spans, entity type, confidence score, and the context tokens that influenced scoring.
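For scripted tests, the same request can be issued from Python with only the standard library. The sketch below mirrors the curl call above. The helper names are hypothetical, and because the response schema is not specified here beyond the fields just listed, the result is returned as parsed JSON without assuming key names.

```python
import json
import urllib.request

SIMULATE_URL = "https://api.arbitex.ai/api/admin/dlp/simulate"


def build_simulate_request(text, entity_types, token):
    """Build the POST request without sending it."""
    payload = json.dumps({"text": text, "entity_types": entity_types})
    return urllib.request.Request(
        SIMULATE_URL,
        data=payload.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def simulate(text, entity_types, token):
    """Send the request and return the decoded JSON response."""
    request = build_simulate_request(text, entity_types, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `simulate(sample_text, ["NDC", "HCPCS", "ICD10"], os.environ["ADMIN_TOKEN"])` performs the same call as the curl example. Keeping request construction separate from sending makes the payload easy to inspect in unit tests.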