PII Detection Methods
Effective PII detection combines multiple approaches: rule-based regex patterns for structured data, NER models for names and entities, and ML classifiers for context-dependent identification. Each method has strengths and limitations.
1. Regex-Based Detection
Regular expressions are the foundation of PII detection for structured, predictable formats like SSNs, emails, and credit card numbers:
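
    import re

    # Compiled patterns for PII types that follow a predictable format.
    PII_PATTERNS = {
        "SSN": re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
        # The TLD class is [A-Za-z]; writing [A-Z|a-z] would also accept a literal "|".
        "EMAIL": re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
        # (?<!\w) rather than \b, so a leading "+1" or "(" is part of the match.
        "PHONE_US": re.compile(r'(?<!\w)(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b'),
        "CREDIT_CARD": re.compile(r'\b(?:\d{4}[-\s]?){3}\d{4}\b'),
        # Matches the dotted-quad shape only; octets above 255 are not rejected.
        "IP_ADDRESS": re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b'),
        # MM/DD/YYYY or MM-DD-YYYY, years 1900-2099.
        "DATE_OF_BIRTH": re.compile(r'\b(?:0[1-9]|1[0-2])[/-](?:0[1-9]|[12]\d|3[01])[/-](?:19|20)\d{2}\b'),
    }

    def detect_pii_regex(text: str) -> list:
        """Return one finding per regex match, with character offsets."""
        findings = []
        for pii_type, pattern in PII_PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append({
                    "type": pii_type,
                    "value": match.group(),
                    "start": match.start(),
                    "end": match.end(),
                    "confidence": 0.95,  # fixed heuristic score, not a measured probability
                })
        return findings
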
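A format match alone says little about whether a 16-digit string is really a card number. One common way to cut false positives is to run the Luhn checksum on each candidate. The sketch below assumes the same credit card pattern as above; the names `CARD_CANDIDATE`, `luhn_valid`, and `find_valid_cards` are illustrative and not part of the original code.

```python
import re

# Same shape as the CREDIT_CARD pattern: four groups of four digits.
CARD_CANDIDATE = re.compile(r'\b(?:\d{4}[-\s]?){3}\d{4}\b')

def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0

def find_valid_cards(text: str) -> list[str]:
    """Keep only regex candidates that also pass the checksum."""
    return [m.group() for m in CARD_CANDIDATE.finditer(text)
            if luhn_valid(m.group())]
```

Passing the checksum does not prove that a number is a live card, so a result like this is better treated as a confidence boost than as a final verdict.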
2. Named Entity Recognition (NER)
NER models identify and classify named entities in unstructured text. spaCy provides pre-trained NER models that detect PERSON, ORG, GPE, and other entity types:

    import spacy

    nlp = spacy.load("en_core_web_trf")  # Transformer-based model

    def detect_pii_ner(text: str) -> list:
        doc = nlp(text)
        pii_entities = []
        # Map NER labels to PII categories
        pii_labels = {"PERSON", "ORG", "GPE", "DATE", "MONEY"}
        for ent in doc.ents:
            if ent.label_ in pii_labels:
                pii_entities.append({
                    "type": ent.label_,
                    "value": ent.text,
                    "start": ent.start_char,
                    "end": ent.end_char,
                    "confidence": 0.85,  # fixed heuristic; doc.ents carries no per-entity score
                })
        return pii_entities

    # Example usage
    text = "Dr. Sarah Johnson from Boston called about patient record #4521."
    results = detect_pii_ner(text)
    # Typical result (exact spans can vary with the model version):
    # [{"type": "PERSON", "value": "Sarah Johnson", ...},
    #  {"type": "GPE", "value": "Boston", ...}]

3. Transformer-Based ML Detection
Transformer models fine-tuned specifically for PII can be more accurate than general-purpose NER, especially for context-dependent cases:

    from transformers import pipeline

    # Use a PII-specific NER model
    pii_detector = pipeline(
        "token-classification",
        model="lakshyakh93/deberta_finetuned_pii",
        aggregation_strategy="simple",  # merge sub-word tokens into whole entities
    )

    text = "Contact John at john.doe@email.com or 555-123-4567"
    results = pii_detector(text)

    for entity in results:
        print(f"{entity['entity_group']}: {entity['word']} "
              f"(confidence: {entity['score']:.2f})")

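To combine this output with the regex and spaCy detectors, it helps to convert it to the same dictionary shape (`type`, `value`, `start`, `end`, `confidence`). The sketch below relies on the `entity_group`, `score`, `start`, and `end` keys that the token-classification pipeline returns when an aggregation strategy is set. The function name `normalize_hf_entities`, the `min_score` threshold, and the label names in the sample data are illustrative assumptions.

```python
def normalize_hf_entities(entities: list[dict], text: str,
                          min_score: float = 0.5) -> list[dict]:
    """Convert aggregated pipeline entities to the shared finding format."""
    findings = []
    for ent in entities:
        if ent["score"] < min_score:
            continue  # drop low-confidence predictions
        start, end = ent["start"], ent["end"]
        findings.append({
            "type": ent["entity_group"],
            # Slice the source text: the "word" field may have altered
            # spacing or casing after tokenization.
            "value": text[start:end],
            "start": start,
            "end": end,
            "confidence": float(ent["score"]),  # numpy float -> plain float
        })
    return findings

# Hand-written sample in the pipeline's output shape (labels are made up).
sample_text = "Contact John at john.doe@email.com"
sample_entities = [
    {"entity_group": "FIRSTNAME", "score": 0.98, "word": "john",
     "start": 8, "end": 12},
    {"entity_group": "EMAIL", "score": 0.30, "word": "john.doe@email.com",
     "start": 16, "end": 34},
]
normalized = normalize_hf_entities(sample_entities, sample_text)
```

Here the second entity falls below the threshold and is dropped; in practice the threshold would be tuned on labelled data rather than fixed at 0.5.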
4. LLM-Based Detection
Large Language Models can serve as sophisticated PII detectors, leveraging their contextual understanding:
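
    import json

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def detect_pii_llm(text: str) -> list:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""Analyze the following text and identify ALL
    personally identifiable information (PII).
    Return only a JSON array in which each object has:
    - entity_type (PERSON, EMAIL, PHONE, SSN, ADDRESS, etc.)
    - value (the detected PII text)
    - confidence (0.0 to 1.0)

    Text: {text}"""
            }]
        )
        # The reply text is in the first content block. json.loads raises
        # a ValueError if the model returns anything other than valid JSON.
        return json.loads(response.content[0].text)
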
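Model replies are free text, so the JSON may arrive wrapped in commentary or contain malformed fields. The sketch below shows one defensive way to parse such a reply, assuming it contains a JSON array of objects with the `entity_type`, `value`, and `confidence` fields requested in the prompt. The helper name `parse_llm_findings` is hypothetical and not part of any SDK.

```python
import json
import re

def parse_llm_findings(raw: str) -> list[dict]:
    """Extract a JSON array of findings from a free-text model reply."""
    # Take everything from the first "[" to the last "]".
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group())
    except json.JSONDecodeError:
        return []
    findings = []
    for item in items:
        # Skip entries that lack the required fields.
        if not isinstance(item, dict):
            continue
        if "entity_type" not in item or "value" not in item:
            continue
        try:
            confidence = float(item.get("confidence", 0.0))
        except (TypeError, ValueError):
            confidence = 0.0
        findings.append({
            "type": str(item["entity_type"]),
            "value": str(item["value"]),
            # Clamp to the documented 0.0-1.0 range.
            "confidence": min(max(confidence, 0.0), 1.0),
        })
    return findings

reply = ('Here is the result: [{"entity_type": "EMAIL", '
         '"value": "a@b.com", "confidence": 0.9}, {"note": "ignored"}]')
parsed = parse_llm_findings(reply)
```

Returning an empty list on a parse failure keeps the pipeline running, but it also hides errors, so a production system would probably log or retry such replies.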
Comparison of Detection Methods
| Method | Structured PII | Names/Entities | Context-Dependent | Speed | Cost |
|---|---|---|---|---|---|
| Regex | Excellent | Poor | None | Very fast | Free |
| spaCy NER | Poor | Good | Limited | Fast | Free |
| Transformer NER | Good | Excellent | Good | Medium | Compute |
| LLM-based | Good | Excellent | Excellent | Slow | API cost |
Ensemble Approach
Production PII detection systems combine multiple methods for maximum coverage:

    def detect_pii_ensemble(text: str) -> list:
        # Layer 1: Fast regex for structured patterns
        regex_results = detect_pii_regex(text)

        # Layer 2: NER for names and entities
        ner_results = detect_pii_ner(text)

        # Layer 3: Merge and deduplicate. merge_detections is not defined in
        # this section; it is assumed to tag every result with a "detected_by"
        # value of "regex", "ner", or "both".
        all_results = merge_detections(regex_results, ner_results)

        # Layer 4: Confidence scoring. Agreement between detectors earns a
        # small boost, capped at 1.0.
        for result in all_results:
            if result.get("detected_by") == "both":
                result["confidence"] = min(result["confidence"] + 0.1, 1.0)

        return all_results

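The ensemble function calls `merge_detections` without defining it. One possible sketch, under the assumption that two findings refer to the same item when their character spans overlap, is shown below. The overlap rule, the choice to keep the regex finding's type and span, and the `detected_by` values are assumptions made for illustration, not the original author's implementation.

```python
def merge_detections(regex_results: list[dict],
                     ner_results: list[dict]) -> list[dict]:
    """Merge two finding lists, marking spans found by both detectors."""
    # Copy each regex finding so the inputs are not mutated.
    merged = [dict(r, detected_by="regex") for r in regex_results]
    for ner in ner_results:
        # Look for a regex-origin finding whose span overlaps this one.
        overlap = next(
            (m for m in merged
             if m["detected_by"] != "ner"
             and ner["start"] < m["end"] and m["start"] < ner["end"]),
            None,
        )
        if overlap is None:
            merged.append(dict(ner, detected_by="ner"))
        else:
            # Keep the regex finding's type and span, note the agreement,
            # and carry over the higher of the two confidence scores.
            overlap["detected_by"] = "both"
            overlap["confidence"] = max(overlap["confidence"],
                                        ner["confidence"])
    return sorted(merged, key=lambda r: r["start"])

regex_hits = [{"type": "DATE_OF_BIRTH", "value": "01/02/1990",
               "start": 10, "end": 20, "confidence": 0.95}]
ner_hits = [{"type": "DATE", "value": "01/02/1990",
             "start": 10, "end": 20, "confidence": 0.85},
            {"type": "PERSON", "value": "Ann",
             "start": 0, "end": 3, "confidence": 0.85}]
merged_hits = merge_detections(regex_hits, ner_hits)
```

The nested scan is quadratic in the number of findings, which is usually acceptable for short documents; an interval tree or a sort-and-sweep pass would scale better for long ones.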
Lilly Tech Systems