Intermediate
PII Detection Tools
Production PII detection relies on proven tools and frameworks. This lesson covers the leading open-source and cloud-based options, from Microsoft Presidio to LLM guardrail frameworks.
Microsoft Presidio
Presidio is an open-source SDK by Microsoft for PII detection and anonymization. It combines regex, NER, and custom recognizers in a modular architecture:
Python - Microsoft Presidio
```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Initialize engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Detect PII
text = "John Smith's SSN is 123-45-6789 and email is john@example.com"
results = analyzer.analyze(
    text=text,
    language="en",
    entities=["PERSON", "EMAIL_ADDRESS", "US_SSN", "PHONE_NUMBER"],
)

# Print detections
for r in results:
    print(f"{r.entity_type}: {text[r.start:r.end]} (score: {r.score})")

# Anonymize: by default each detected span is replaced with <ENTITY_TYPE>
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)
# Illustrative output; exact detections depend on the Presidio version and NLP model:
# "<PERSON>'s SSN is <US_SSN> and email is <EMAIL_ADDRESS>"
```
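To make the detect-then-anonymize flow concrete, here is a minimal standard-library sketch of the same two stages. It is not how Presidio is implemented: the `PATTERNS` table, the `Detection` class, and the `detect` and `redact` functions are hypothetical names invented for illustration, and the two regexes are far simpler than Presidio's built-in recognizers.

```python
import re
from dataclasses import dataclass

# Hypothetical, simplified patterns for illustration only.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass
class Detection:
    entity_type: str
    start: int
    end: int


def detect(text: str) -> list[Detection]:
    """Stage 1: run every pattern over the text and collect character spans."""
    found = [
        Detection(entity_type, m.start(), m.end())
        for entity_type, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(found, key=lambda d: d.start)


def redact(text: str, detections: list[Detection]) -> str:
    """Stage 2: replace each span with <ENTITY_TYPE>.

    Spans are processed right to left so that earlier offsets stay valid
    while the string changes length.
    """
    for d in sorted(detections, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"<{d.entity_type}>" + text[d.end:]
    return text


sample = "SSN is 123-45-6789 and email is john@example.com"
print(redact(sample, detect(sample)))
# SSN is <US_SSN> and email is <EMAIL_ADDRESS>
```

Keeping detection and replacement separate is what allows the same detections to be logged, filtered by score, or anonymized in different ways without re-scanning the text.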
Custom Presidio Recognizers
Python - Custom Recognizer
```python
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

analyzer = AnalyzerEngine()

# Create a custom recognizer for employee IDs (format: EMP- followed by six digits)
emp_id_pattern = Pattern(
    name="employee_id",
    regex=r"\bEMP-\d{6}\b",
    score=0.9,
)
emp_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[emp_id_pattern],
)

# Add to the analyzer's registry so analyze() can return EMPLOYEE_ID results
analyzer.registry.add_recognizer(emp_recognizer)
```
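Before registering a pattern, it can help to check the regex on its own with Python's `re` module. The sketch below uses the same expression as the recognizer above; the sample sentences are made up for illustration.

```python
import re

# Same expression as the Pattern above.
EMP_ID = re.compile(r"\bEMP-\d{6}\b")

# Made-up sentences mapped to the matches we expect from each.
samples = {
    "Badge EMP-004217 was issued today.": ["EMP-004217"],
    "EMP-12345 is one digit short.": [],
    "EMP-1234567 is one digit too long.": [],
    "TEMP-123456 has a prefix glued on.": [],
}

for sentence, expected in samples.items():
    matches = EMP_ID.findall(sentence)
    print(f"{sentence!r} -> {matches}")
    assert matches == expected
```

The `\b` word boundaries are what reject the last two cases: without them, `EMP-123456` would match inside longer tokens.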
spaCy NER Pipelines
spaCy provides fast, production-ready NER models that serve as a foundation for PII detection:
Python - spaCy PII Pipeline
```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register the custom attribute before any component writes to it
Doc.set_extension("pii_entities", default=[])


# Custom component: keep only the entity labels we treat as PII
@Language.component("pii_detector")
def pii_detector(doc):
    pii_labels = {"PERSON", "ORG", "GPE", "DATE"}
    doc._.pii_entities = [ent for ent in doc.ents if ent.label_ in pii_labels]
    return doc


# Load the transformer-based model for best accuracy
# (the en_core_web_trf model package must be installed separately)
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("pii_detector", last=True)

doc = nlp("Dr. Maria Rodriguez from Mayo Clinic called on March 15.")
for ent in doc._.pii_entities:
    print(f"{ent.label_}: {ent.text}")
```
Cloud-Based PII Detection
AWS Comprehend
Python - AWS Comprehend PII Detection
```python
import boto3

comprehend = boto3.client("comprehend")

response = comprehend.detect_pii_entities(
    Text="Call John at 555-123-4567 or john@email.com",
    LanguageCode="en",
)

for entity in response["Entities"]:
    print(f"{entity['Type']}: score {entity['Score']:.2f}")
```
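Each item in `response["Entities"]` also carries `BeginOffset` and `EndOffset` fields, so the response can drive redaction directly. The following is a minimal sketch: `redact_from_offsets` and its `min_score` parameter are hypothetical names, and the entity list is a hand-written sample in the response's shape, not captured API output. It also assumes the detected spans do not overlap.

```python
def redact_from_offsets(text: str, entities: list[dict], min_score: float = 0.0) -> str:
    """Replace each detected span with its [TYPE] label.

    `entities` is expected to look like the "Entities" list returned by
    detect_pii_entities: dicts with Type, Score, BeginOffset, EndOffset.
    Entities scoring below `min_score` are left untouched.
    """
    kept = [e for e in entities if e["Score"] >= min_score]
    parts, cursor = [], 0
    # Walk the spans left to right, copying the text between them unchanged.
    for e in sorted(kept, key=lambda e: e["BeginOffset"]):
        parts.append(text[cursor:e["BeginOffset"]])
        parts.append(f"[{e['Type']}]")
        cursor = e["EndOffset"]
    parts.append(text[cursor:])
    return "".join(parts)


# Hand-written sample in the response shape (scores are invented).
sample_text = "Call John at 555-123-4567 or john@email.com"
sample_entities = [
    {"Type": "NAME", "Score": 0.62, "BeginOffset": 5, "EndOffset": 9},
    {"Type": "PHONE", "Score": 0.99, "BeginOffset": 13, "EndOffset": 25},
    {"Type": "EMAIL", "Score": 0.99, "BeginOffset": 29, "EndOffset": 43},
]

print(redact_from_offsets(sample_text, sample_entities))
# Call [NAME] at [PHONE] or [EMAIL]
print(redact_from_offsets(sample_text, sample_entities, min_score=0.9))
# Call John at [PHONE] or [EMAIL]
```

A score threshold trades recall for precision: raising it leaves more borderline detections in the text, so the right value depends on how costly a missed entity is for your use case.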
Google Cloud DLP
Python - Google Cloud DLP
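```python
import google.cloud.dlp_v2

# Placeholders: substitute your own project ID and input text
project_id = "your-project-id"
text = "Call John at 555-123-4567 or john@email.com"

dlp = google.cloud.dlp_v2.DlpServiceClient()

inspect_config = {
    "info_types": [
        {"name": "PERSON_NAME"},
        {"name": "EMAIL_ADDRESS"},
        {"name": "PHONE_NUMBER"},
        {"name": "US_SOCIAL_SECURITY_NUMBER"},
    ],
    # Only return findings rated LIKELY or higher
    "min_likelihood": "LIKELY",
}

response = dlp.inspect_content(
    request={
        "parent": f"projects/{project_id}",
        "inspect_config": inspect_config,
        "item": {"value": text},
    }
)

# Print each finding's info type and likelihood
for finding in response.result.findings:
    print(f"{finding.info_type.name}: {finding.likelihood.name}")
```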
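The `min_likelihood` setting acts as a threshold on an ordered scale that runs from `VERY_UNLIKELY` up to `VERY_LIKELY`. The sketch below illustrates that ordering locally. The `Likelihood` class here is a stand-in written for this example rather than the client library's own enum, and the findings are made up.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    """Local stand-in mirroring the order of DLP's likelihood levels."""
    VERY_UNLIKELY = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5


def meets_threshold(finding_likelihood: str, min_likelihood: str) -> bool:
    """True if a finding is at least as likely as the configured minimum."""
    return Likelihood[finding_likelihood] >= Likelihood[min_likelihood]


# Made-up (info type, likelihood) pairs for illustration.
findings = [
    ("EMAIL_ADDRESS", "VERY_LIKELY"),
    ("PERSON_NAME", "POSSIBLE"),
    ("PHONE_NUMBER", "LIKELY"),
]

kept = [name for name, level in findings if meets_threshold(level, "LIKELY")]
print(kept)
# ['EMAIL_ADDRESS', 'PHONE_NUMBER']
```

Lowering the threshold to `POSSIBLE` would also keep the `PERSON_NAME` finding, at the cost of more false positives.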
LLM Guardrails for PII
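LLM guardrail frameworks can intercept PII in prompts before they reach the model and scan responses for leaked PII before they reach the user. A minimal guardrail of this kind can be built on Presidio: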
Python - Guardrails for LLM PII Protection
```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine


class LLMPIIGuardrail:
    def __init__(self):
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()

    def sanitize_input(self, prompt: str) -> str:
        """Strip PII from user prompt before sending to LLM."""
        results = self.analyzer.analyze(text=prompt, language="en")
        if results:
            anonymized = self.anonymizer.anonymize(text=prompt, analyzer_results=results)
            return anonymized.text
        return prompt

    def check_output(self, response: str) -> str:
        """Scan LLM response for leaked PII."""
        results = self.analyzer.analyze(text=response, language="en")
        if results:
            return self.anonymizer.anonymize(text=response, analyzer_results=results).text
        return response
```
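One way to wire such a guardrail around a model call is to sanitize the prompt, call the model, then check the response. The sketch below shows that flow with stand-ins so it runs without Presidio or a model: `guarded_call`, `toy_sanitize`, and `echo_llm` are hypothetical names, and the single email regex is a deliberately crude substitute for a real detector. With the class above, `sanitize_input` and `check_output` would be passed in place of `toy_sanitize`.

```python
import re
from typing import Callable


def guarded_call(
    prompt: str,
    llm: Callable[[str], str],
    sanitize_input: Callable[[str], str],
    check_output: Callable[[str], str],
) -> str:
    """Sanitize the prompt, call the model, then scrub the response."""
    safe_prompt = sanitize_input(prompt)
    raw_response = llm(safe_prompt)
    return check_output(raw_response)


# Stand-in sanitizer: a crude email-only regex, for illustration.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.\w+")


def toy_sanitize(text: str) -> str:
    return EMAIL.sub("<EMAIL_ADDRESS>", text)


# Stand-in model: echoes the prompt and "leaks" an address of its own.
def echo_llm(prompt: str) -> str:
    return f"You said: {prompt}. Contact admin@example.org for help."


reply = guarded_call("My email is jane@example.com", echo_llm, toy_sanitize, toy_sanitize)
print(reply)
# You said: My email is <EMAIL_ADDRESS>. Contact <EMAIL_ADDRESS> for help.
```

Passing the model and both checks in as callables keeps the guardrail independent of any particular LLM client and makes the flow easy to test with stubs like these.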
Tool Comparison
| Tool | Type | Languages | Custom Entities | Best For |
|---|---|---|---|---|
| Presidio | Open-source | Many | Yes | Flexible, customizable pipelines |
| spaCy | Open-source | Many | Yes (training) | NER-focused detection |
| AWS Comprehend | Cloud API | Many | Limited | AWS-native workflows |
| Google DLP | Cloud API | Many | Yes | GCP-native, most PII types |
Recommendation: Start with Microsoft Presidio for its flexibility, extensibility, and zero API cost. Add cloud services (AWS Comprehend, Google DLP) when you need their specific capabilities or are already on that cloud platform.