Intermediate

PII Detection Tools

Production PII detection relies on proven tools and frameworks. This lesson covers the leading open-source and cloud-based options, from Microsoft Presidio to LLM guardrail frameworks.

Microsoft Presidio

Presidio is an open-source SDK by Microsoft for PII detection and anonymization. It combines regex, NER, and custom recognizers in a modular architecture:

Python - Microsoft Presidio
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

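# Initialize engines. The default AnalyzerEngine uses a spaCy English model
# for NER, so that model must be available in the environment.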
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Detect PII
text = "John Smith's SSN is 123-45-6789 and email is john@example.com"
results = analyzer.analyze(
    text=text,
    language="en",
    entities=["PERSON", "EMAIL_ADDRESS", "US_SSN", "PHONE_NUMBER"]
)

# Print detections
for r in results:
    print(f"{r.entity_type}: {text[r.start:r.end]} (score: {r.score})")

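# Anonymize: by default, each detected span is replaced with its entity type in angle brackets
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)
# Intended output (exact detections depend on the Presidio version and NLP model;
# some versions may reject well-known placeholder SSNs such as 123-45-6789 as invalid):
# "<PERSON>'s SSN is <US_SSN> and email is <EMAIL_ADDRESS>"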
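
To make the anonymization step less opaque, here is a minimal, library-free sketch of what replacing detected spans with entity placeholders involves. It is not Presidio's implementation: `Detection`, `replace_spans`, and `span_of` are hypothetical names, and the offsets are computed from the sample text with `str.find` rather than by a real detector.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for an analyzer result: a typed character span."""
    entity_type: str
    start: int
    end: int


def replace_spans(text: str, detections: list[Detection]) -> str:
    """Replace each detected span with a <ENTITY_TYPE> placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for d in sorted(detections, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"<{d.entity_type}>" + text[d.end:]
    return text


sample = "John Smith's email is john@example.com"


def span_of(entity_type: str, needle: str) -> Detection:
    """Build a Detection by locating a known substring (illustration only)."""
    start = sample.find(needle)
    return Detection(entity_type, start, start + len(needle))


detections = [
    span_of("PERSON", "John Smith"),
    span_of("EMAIL_ADDRESS", "john@example.com"),
]
redacted = replace_spans(sample, detections)
print(redacted)
```

Processing spans from the end of the string backwards is the key design choice: replacing a span changes the length of the text, which would shift every offset to its right.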

Custom Presidio Recognizers

Python - Custom Recognizer
from presidio_analyzer import PatternRecognizer, Pattern

# Create a custom recognizer for employee IDs
emp_id_pattern = Pattern(
    name="employee_id",
    regex=r"\bEMP-\d{6}\b",
    score=0.9
)

emp_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[emp_id_pattern]
)

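# Register with the analyzer created in the previous example.
# If analyze() is called with an explicit entities list, include "EMPLOYEE_ID" in it.
analyzer.registry.add_recognizer(emp_recognizer)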
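
Before registering a pattern, it can help to sanity-check the regex on its own. The sketch below uses only the standard library `re` module with the same pattern as the recognizer above; the sample strings are made up for illustration.

```python
import re

# Same pattern as the custom recognizer above
EMP_ID = re.compile(r"\bEMP-\d{6}\b")

samples = [
    "Badge EMP-123456 was issued today",   # valid: exactly six digits
    "Badge EMP-12345 was issued today",    # too short: only five digits
    "Badge EMP-1234567 was issued today",  # too long: no word boundary after six digits
    "Badge TEMP-123456 was issued today",  # no word boundary before "EMP"
]

matches = [EMP_ID.findall(s) for s in samples]
for s, m in zip(samples, matches):
    print(f"{s!r} -> {m}")
```

The `\b` anchors are what keep the pattern from matching inside longer tokens. The pattern is also case-sensitive, so `emp-123456` would not match unless the regex is adjusted.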

spaCy NER Pipelines

spaCy provides fast, production-ready NER models that serve as a foundation for PII detection:

Python - spaCy PII Pipeline
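import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register the custom attribute before the component that writes to it runs.
# The guard avoids an error if this code is executed more than once.
if not Doc.has_extension("pii_entities"):
    Doc.set_extension("pii_entities", default=[])

# Custom component that keeps only the entity labels treated as PII here
@Language.component("pii_detector")
def pii_detector(doc):
    pii_labels = {"PERSON", "ORG", "GPE", "DATE"}
    doc._.pii_entities = [
        ent for ent in doc.ents
        if ent.label_ in pii_labels
    ]
    return doc

# Load the transformer-based model for best accuracy
# (requires the en_core_web_trf package to be installed)
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("pii_detector", last=True)

doc = nlp("Dr. Maria Rodriguez from Mayo Clinic called on March 15.")
for ent in doc._.pii_entities:
    print(f"{ent.label_}: {ent.text}")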

Cloud-Based PII Detection

AWS Comprehend

Python - AWS Comprehend PII Detection
import boto3

comprehend = boto3.client("comprehend")

response = comprehend.detect_pii_entities(
    Text="Call John at 555-123-4567 or john@email.com",
    LanguageCode="en"
)

for entity in response["Entities"]:
    print(f"{entity['Type']}: score {entity['Score']:.2f}")
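
The Comprehend response reports character offsets rather than the matched text, so a small helper is usually needed to recover the spans. The sketch below assumes the documented response shape (`Entities` with `Type`, `Score`, `BeginOffset`, `EndOffset`). The `extract_pii` helper, the 0.8 threshold, and the hand-written `mock_response` are illustrative assumptions, not real API output.

```python
def extract_pii(text: str, response: dict, min_score: float = 0.8) -> list[tuple[str, str]]:
    """Return (type, matched_text) pairs for entities at or above min_score."""
    found = []
    for entity in response["Entities"]:
        if entity["Score"] >= min_score:
            span = text[entity["BeginOffset"]:entity["EndOffset"]]
            found.append((entity["Type"], span))
    return found


text = "Call John at 555-123-4567 or john@email.com"

# Hand-written response in the documented shape. The scores are made up
# for illustration and are not real Comprehend output.
mock_response = {
    "Entities": [
        {"Type": "NAME", "Score": 0.55, "BeginOffset": 5, "EndOffset": 9},
        {"Type": "PHONE", "Score": 0.99, "BeginOffset": 13, "EndOffset": 25},
        {"Type": "EMAIL", "Score": 0.99, "BeginOffset": 29, "EndOffset": 43},
    ]
}

pii = extract_pii(text, mock_response)
print(pii)
```

Filtering on the score trades recall for precision: a lower threshold catches more PII but also flags more false positives.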

Google Cloud DLP

Python - Google Cloud DLP
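import google.cloud.dlp_v2

dlp = google.cloud.dlp_v2.DlpServiceClient()

# Placeholders: substitute your own GCP project ID and input text
project_id = "your-project-id"
text = "Call John at 555-123-4567 or john@email.com"

inspect_config = {
    "info_types": [
        {"name": "PERSON_NAME"},
        {"name": "EMAIL_ADDRESS"},
        {"name": "PHONE_NUMBER"},
        {"name": "US_SOCIAL_SECURITY_NUMBER"},
    ],
    # Report only findings rated LIKELY or VERY_LIKELY
    "min_likelihood": "LIKELY",
    # Include the matched text in each finding
    "include_quote": True,
}

response = dlp.inspect_content(
    request={"parent": f"projects/{project_id}",
             "inspect_config": inspect_config,
             "item": {"value": text}}
)

# Print each finding with its info type and likelihood
for finding in response.result.findings:
    print(f"{finding.info_type.name}: {finding.quote} ({finding.likelihood.name})")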
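
The `min_likelihood` setting is a threshold on an ordered scale, which is easy to misread as an exact match. The sketch below mirrors that scale with a local `IntEnum` (it does not import the client library) to show which levels a given threshold keeps. The `passes` helper is a hypothetical name.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    """Local mirror of the DLP likelihood scale, ordered from least to most likely."""
    VERY_UNLIKELY = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5


def passes(min_likelihood: Likelihood) -> list[str]:
    """Names of the likelihood levels that meet or exceed the threshold."""
    return [level.name for level in Likelihood if level >= min_likelihood]


kept = passes(Likelihood.LIKELY)
print(kept)
```

Lowering the threshold to `POSSIBLE` would surface more findings at the cost of more false positives.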

LLM Guardrails for PII

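LLM guardrails can intercept PII in a prompt before it reaches the model and scan the model's response for leaked PII. The example below builds a minimal guardrail on top of Presidio: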

Python - Guardrails for LLM PII Protection
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

class LLMPIIGuardrail:
    def __init__(self):
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()

    def sanitize_input(self, prompt: str) -> str:
        """Strip PII from user prompt before sending to LLM."""
        results = self.analyzer.analyze(text=prompt, language="en")
        if results:
            anonymized = self.anonymizer.anonymize(text=prompt, analyzer_results=results)
            return anonymized.text
        return prompt

    def check_output(self, response: str) -> str:
        """Scan LLM response for leaked PII."""
        results = self.analyzer.analyze(text=response, language="en")
        if results:
            return self.anonymizer.anonymize(text=response, analyzer_results=results).text
        return response
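
The guardrail only helps if every model call goes through both checks. Here is a minimal sketch of that wiring, kept library-free so it runs anywhere: `guarded_call`, `redact_emails`, and `echo_llm` are hypothetical names, and the regex-based sanitizer is a toy stand-in for the Presidio-backed methods above, not a substitute for them.

```python
import re
from typing import Callable

# Deliberately simple email pattern, for illustration only
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails(text: str) -> str:
    """Toy sanitizer standing in for the Presidio-backed guardrail methods."""
    return EMAIL.sub("<EMAIL_ADDRESS>", text)


def guarded_call(prompt: str,
                 llm: Callable[[str], str],
                 sanitize_input: Callable[[str], str],
                 check_output: Callable[[str], str]) -> str:
    """Sanitize the prompt, call the model, then scan the response."""
    safe_prompt = sanitize_input(prompt)
    raw_response = llm(safe_prompt)
    return check_output(raw_response)


def echo_llm(prompt: str) -> str:
    """Hypothetical model stub: echoes the prompt and leaks an address."""
    return f"You said: {prompt} Contact admin@corp.example for help."


answer = guarded_call(
    "My email is jane@example.com.", echo_llm, redact_emails, redact_emails
)
print(answer)
```

With the class above, the same wrapper could be called with `guardrail.sanitize_input` and `guardrail.check_output` in place of the toy sanitizer.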

Tool Comparison

| Tool | Type | Languages | Custom Entities | Best For |
|---|---|---|---|---|
| Presidio | Open-source | Many | Yes | Flexible, customizable pipelines |
| spaCy | Open-source | Many | Yes (training) | NER-focused detection |
| AWS Comprehend | Cloud API | Many | Limited | AWS-native workflows |
| Google DLP | Cloud API | Many | Yes | GCP-native, most PII types |

Recommendation: Start with Microsoft Presidio for its flexibility, extensibility, and zero API cost. Add cloud services (AWS Comprehend, Google DLP) when you need their specific capabilities or are already on that cloud platform.