Intermediate

PII Detection Tools

Production PII detection relies on proven tools and frameworks. This lesson covers the leading open-source and cloud-based options, from Microsoft Presidio to LLM guardrail frameworks.

Microsoft Presidio

Presidio is an open-source SDK by Microsoft for PII detection and anonymization. It combines regex, NER, and custom recognizers in a modular architecture:

Python - Microsoft Presidio

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Initialize engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Detect PII
text = "John Smith's SSN is 123-45-6789 and email is john@example.com"
results = analyzer.analyze(
    text=text,
    language="en",
    entities=["PERSON", "EMAIL_ADDRESS", "US_SSN", "PHONE_NUMBER"]
)

# Print detections
for r in results:
    print(f"{r.entity_type}: {text[r.start:r.end]} (score: {r.score})")

# Anonymize
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)
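# Typical result: "<PERSON>'s SSN is <US_SSN> and email is <EMAIL_ADDRESS>"
# Exact detections depend on the installed NLP model and recognizer version;
# some versions reject well-known sample SSNs such as 123-45-6789.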
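
To make the anonymization step more concrete, here is a minimal, library-free sketch of what placeholder replacement does with detection spans. The `Detection` class and `replace_with_placeholders` function are hypothetical names used for illustration and are not part of Presidio; the real anonymizer also resolves overlapping results and supports configurable operators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Illustrative stand-in for an analyzer result (not a Presidio class)."""
    entity_type: str
    start: int
    end: int


def replace_with_placeholders(text: str, detections: list[Detection]) -> str:
    """Replace each detected span with <ENTITY_TYPE>.

    Spans are processed from right to left so that earlier offsets stay
    valid while the string changes length. Assumes spans do not overlap.
    """
    for d in sorted(detections, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"<{d.entity_type}>" + text[d.end:]
    return text


sample = "Contact Jane Doe at jane@example.com"
spans = [
    Detection("PERSON", 8, 16),          # "Jane Doe"
    Detection("EMAIL_ADDRESS", 20, 36),  # "jane@example.com"
]
print(replace_with_placeholders(sample, spans))
```

Working from the last span to the first is the design choice that matters here: replacing left to right would shift every later offset.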

Custom Presidio Recognizers

Python - Custom Recognizer

from presidio_analyzer import PatternRecognizer, Pattern

# Create a custom recognizer for employee IDs
emp_id_pattern = Pattern(
    name="employee_id",
    regex=r"\bEMP-\d{6}\b",
    score=0.9
)

emp_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[emp_id_pattern]
)

# Register with the analyzer created in the previous example
analyzer.registry.add_recognizer(emp_recognizer)
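
Before registering a pattern, it can help to check the regex on its own. The sketch below uses only the standard `re` module with the same pattern as the recognizer above; the sample strings are made up for illustration.

```python
import re

# Same pattern as the employee ID recognizer above
EMP_ID_RE = re.compile(r"\bEMP-\d{6}\b")

samples = [
    "Badge EMP-123456 was issued today.",   # valid: exactly six digits
    "Badge EMP-12345 was issued today.",    # too short: five digits
    "Badge EMP-1234567 was issued today.",  # too long: no word boundary after six digits
    "Badge XEMP-123456 was issued today.",  # no word boundary before EMP
]

matches = [EMP_ID_RE.findall(s) for s in samples]
for s, m in zip(samples, matches):
    print(f"{s!r} -> {m}")
```

Only the first sample matches. When `analyzer.analyze` is called with an explicit `entities` list, that list also needs to include `"EMPLOYEE_ID"` for the new recognizer to be consulted.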

spaCy NER Pipelines

spaCy provides fast, production-ready NER models that serve as a foundation for PII detection:

Python - spaCy PII Pipeline

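import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Load transformer-based model for best accuracy
# (install it first: python -m spacy download en_core_web_trf)
nlp = spacy.load("en_core_web_trf")

# Register the custom Doc attribute before the component uses it;
# force=True avoids an error if this cell is run more than once
Doc.set_extension("pii_entities", default=[], force=True)

# Entity labels treated as PII in this example
PII_LABELS = {"PERSON", "ORG", "GPE", "DATE"}

# Custom component that keeps only the PII-relevant entities
@Language.component("pii_detector")
def pii_detector(doc):
    doc._.pii_entities = [
        ent for ent in doc.ents
        if ent.label_ in PII_LABELS
    ]
    return doc

# Run the component last so it sees the NER output
nlp.add_pipe("pii_detector", last=True)

doc = nlp("Dr. Maria Rodriguez from Mayo Clinic called on March 15.")
for ent in doc._.pii_entities:
    print(f"{ent.label_}: {ent.text}")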
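
The labels used above (PERSON, ORG, GPE, DATE) are general-purpose entity types, and spaCy's pretrained English NER does not label structured identifiers such as email addresses. One common way to cover those is a small regex pass alongside the NER step. The sketch below is stdlib-only; the patterns and the `find_structured_pii` helper are simplified assumptions for illustration, not validated production rules.

```python
import re

# Simplified illustrative patterns (assumptions, not exhaustive)
STRUCTURED_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "US_PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def find_structured_pii(text: str) -> list[tuple[str, str, int, int]]:
    """Return (label, matched_text, start, end) tuples sorted by position."""
    found = []
    for label, pattern in STRUCTURED_PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((label, m.group(), m.start(), m.end()))
    return sorted(found, key=lambda item: item[2])


sample_text = "Reach Maria at maria@clinic.example or 555-123-4567."
hits = find_structured_pii(sample_text)
for label, value, start, end in hits:
    print(f"{label}: {value} [{start}:{end}]")
```

The returned offsets use the same start/end convention as spaCy's `ent.start_char` and `ent.end_char`, so the two result sets can be merged before redaction.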

Cloud-Based PII Detection

AWS Comprehend

Python - AWS Comprehend PII Detection

import boto3

comprehend = boto3.client("comprehend")

response = comprehend.detect_pii_entities(
    Text="Call John at 555-123-4567 or john@email.com",
    LanguageCode="en"
)

for entity in response["Entities"]:
    print(f"{entity['Type']}: score {entity['Score']:.2f}")
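
The response reports each entity by character offsets rather than echoing the matched text. A small helper can recover the substrings and apply a confidence threshold. The sketch below runs on a hand-written dictionary that mimics the response shape (`Type`, `Score`, `BeginOffset`, `EndOffset`); the scores are invented for illustration, and `extract_pii` is a hypothetical helper, not part of boto3.

```python
def extract_pii(text: str, response: dict, min_score: float = 0.8) -> list[tuple[str, str]]:
    """Return (type, matched_text) pairs for entities at or above min_score."""
    pairs = []
    for entity in response.get("Entities", []):
        if entity["Score"] >= min_score:
            matched = text[entity["BeginOffset"]:entity["EndOffset"]]
            pairs.append((entity["Type"], matched))
    return pairs


text = "Call John at 555-123-4567 or john@email.com"

# Hand-written stand-in for a detect_pii_entities response (scores are made up)
fake_response = {
    "Entities": [
        {"Type": "NAME", "Score": 0.99, "BeginOffset": 5, "EndOffset": 9},
        {"Type": "PHONE", "Score": 0.98, "BeginOffset": 13, "EndOffset": 25},
        {"Type": "EMAIL", "Score": 0.55, "BeginOffset": 29, "EndOffset": 43},
    ]
}

print(extract_pii(text, fake_response))
```

In this made-up response the email entity falls below the 0.8 threshold and is dropped. For redaction a lower threshold is usually the safer choice, since a missed entity costs more than an extra mask.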

Google Cloud DLP

Python - Google Cloud DLP

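import google.cloud.dlp_v2

dlp = google.cloud.dlp_v2.DlpServiceClient()

# Placeholder values: replace with your own GCP project ID and input text
project_id = "your-project-id"
text = "Call John at 555-123-4567 or john@email.com"

inspect_config = {
    "info_types": [
        {"name": "PERSON_NAME"},
        {"name": "EMAIL_ADDRESS"},
        {"name": "PHONE_NUMBER"},
        {"name": "US_SOCIAL_SECURITY_NUMBER"},
    ],
    "min_likelihood": "LIKELY",
    "include_quote": True,  # return the matched text with each finding
}

response = dlp.inspect_content(
    request={"parent": f"projects/{project_id}",
             "inspect_config": inspect_config,
             "item": {"value": text}}
)

# Print each finding's type, matched text, and likelihood
for finding in response.result.findings:
    print(f"{finding.info_type.name}: {finding.quote} ({finding.likelihood.name})")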
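
The `min_likelihood` setting is a threshold on an ordered scale that runs from VERY_UNLIKELY to VERY_LIKELY, so "LIKELY" keeps only findings rated LIKELY or VERY_LIKELY. The service applies this filter itself; the stdlib sketch below only illustrates the ordering, using a local `Likelihood` enum and hypothetical findings rather than the real client types.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    """Local mirror of the DLP likelihood ordering, for illustration only."""
    VERY_UNLIKELY = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5


def passes_threshold(finding_likelihood: str, min_likelihood: str = "LIKELY") -> bool:
    """True if a finding's likelihood is at or above the configured minimum."""
    return Likelihood[finding_likelihood] >= Likelihood[min_likelihood]


# Hypothetical findings as (info_type, likelihood) pairs
findings = [
    ("EMAIL_ADDRESS", "VERY_LIKELY"),
    ("PERSON_NAME", "POSSIBLE"),
    ("PHONE_NUMBER", "LIKELY"),
]
kept = [name for name, level in findings if passes_threshold(level)]
print(kept)
```

Lowering the threshold to "POSSIBLE" would also keep the PERSON_NAME finding, trading more false positives for fewer misses.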

LLM Guardrails for PII

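A guardrail layer sits between the user and the model: it strips PII from each prompt before the prompt reaches the model, and it scans each response for leaked PII before the response reaches the user. The example below builds such a layer on Presidio: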

Python - Guardrails for LLM PII Protection

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

class LLMPIIGuardrail:
    def __init__(self):
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()

    def sanitize_input(self, prompt: str) -> str:
        """Strip PII from user prompt before sending to LLM."""
        results = self.analyzer.analyze(text=prompt, language="en")
        if results:
            anonymized = self.anonymizer.anonymize(text=prompt, analyzer_results=results)
            return anonymized.text
        return prompt

    def check_output(self, response: str) -> str:
        """Scan LLM response for leaked PII."""
        results = self.analyzer.analyze(text=response, language="en")
        if results:
            return self.anonymizer.anonymize(text=response, analyzer_results=results).text
        return response
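
One way to wire a guardrail into a request path is a wrapper that sanitizes the prompt, calls the model, and then checks the response. The sketch below is stdlib-only so it can run anywhere: `redact_emails` is a toy redactor standing in for a real PII engine, and `echo_model` is a hypothetical stand-in for an LLM client. Both are assumptions for illustration.

```python
import re
from typing import Callable

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact_emails(text: str) -> str:
    """Toy redactor standing in for a real PII engine (emails only)."""
    return EMAIL_RE.sub("<EMAIL_ADDRESS>", text)


def guarded_call(prompt: str,
                 model: Callable[[str], str],
                 sanitize: Callable[[str], str] = redact_emails,
                 check: Callable[[str], str] = redact_emails) -> str:
    """Sanitize the prompt, call the model, then scan the response."""
    safe_prompt = sanitize(prompt)
    raw_response = model(safe_prompt)
    return check(raw_response)


def echo_model(prompt: str) -> str:
    """Hypothetical model stub that echoes the prompt and leaks an address."""
    return f"You said: {prompt} (cc: admin@corp.example)"


result = guarded_call("Email me at john@example.com", echo_model)
print(result)
```

With the class above, `sanitize_input` and `check_output` could be passed as the `sanitize` and `check` arguments, so that the model never sees the raw prompt and the caller never sees an unscanned response.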

Tool Comparison

Tool	Type	Languages	Custom Entities	Best For
Presidio	Open-source	Many	Yes	Flexible, customizable pipelines
spaCy	Open-source	Many	Yes (training)	NER-focused detection
AWS Comprehend	Cloud API	English, Spanish (PII detection)	Limited	AWS-native workflows
Google DLP	Cloud API	Many	Yes	GCP-native, most PII types

Recommendation: Start with Microsoft Presidio for its flexibility, extensibility, and zero API cost. Add cloud services (AWS Comprehend, Google DLP) when you need their specific capabilities or are already on that cloud platform.
