PII Redaction Techniques
Once PII is detected, the next step is redaction — removing or transforming the information to protect privacy while preserving data utility. Different use cases require different redaction strategies.
Redaction Strategies Overview
| Strategy | Description | Reversible | Data Utility |
|---|---|---|---|
| Masking | Replace with placeholder characters | No | Low |
| Type Replacement | Replace with entity type label | No | Medium |
| Pseudonymization | Replace with fake but realistic data | With mapping | High |
| Tokenization | Replace with random tokens, store mapping | Yes | Medium |
| Generalization | Replace with broader category | No | Medium |
| Deletion | Remove PII entirely | No | Low |
1. Masking
Replace PII with placeholder characters like asterisks or X's. Simple but destroys data utility:
Python - Masking Redaction
```python
def mask_pii(text: str, entities: list) -> str:
    # Sort by position (reverse) so earlier offsets stay valid as we edit
    sorted_entities = sorted(entities, key=lambda e: e["start"], reverse=True)
    for entity in sorted_entities:
        mask = "*" * len(entity["value"])
        text = text[:entity["start"]] + mask + text[entity["end"]:]
    return text

# "Contact John Smith at john@email.com"
# becomes: "Contact ********** at **************"
```
2. Type Replacement
Replace PII with its entity type label. Preserves sentence structure and context:
Python - Type Replacement
```python
def replace_with_type(text: str, entities: list) -> str:
    sorted_entities = sorted(entities, key=lambda e: e["start"], reverse=True)
    for entity in sorted_entities:
        replacement = f"[{entity['type']}]"
        text = text[:entity["start"]] + replacement + text[entity["end"]:]
    return text

# "Contact John Smith at john@email.com"
# becomes: "Contact [PERSON] at [EMAIL]"
```
3. Pseudonymization
Replace PII with fake but realistic data. This preserves data utility for analytics and ML training while protecting identity:
Python - Pseudonymization with Faker
```python
from faker import Faker

fake = Faker()
Faker.seed(42)  # Reproducible fakes

GENERATORS = {
    "PERSON": fake.name,
    "EMAIL": fake.email,
    "PHONE": fake.phone_number,
    "ADDRESS": fake.address,
    "SSN": fake.ssn,
    "CREDIT_CARD": fake.credit_card_number,
}

def pseudonymize(text: str, entities: list) -> tuple[str, dict]:
    # Consistent mapping: the same input value always gets the same fake
    mapping = {}
    sorted_entities = sorted(entities, key=lambda e: e["start"], reverse=True)
    for entity in sorted_entities:
        key = entity["value"]
        if key not in mapping:
            generator = GENERATORS.get(entity["type"])
            mapping[key] = generator() if generator else "[REDACTED]"
        text = text[:entity["start"]] + mapping[key] + text[entity["end"]:]
    return text, mapping

# "Contact John Smith at john@email.com"
# becomes: "Contact Maria Garcia at fake42@example.net"
```
4. Tokenization
Replace PII with random tokens and store the mapping in a secure vault. This enables reversibility when authorized:
Python - Tokenization
```python
import uuid

class PIITokenizer:
    def __init__(self):
        self.vault = {}  # In production, use encrypted storage

    def tokenize(self, text: str, entities: list) -> str:
        for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
            token = f"TOK_{uuid.uuid4().hex[:8]}"
            self.vault[token] = entity["value"]
            text = text[:entity["start"]] + token + text[entity["end"]:]
        return text

    def detokenize(self, text: str) -> str:
        for token, value in self.vault.items():
            text = text.replace(token, value)
        return text
```
5. Generalization
Replace specific values with broader categories to reduce identifiability while retaining analytical value:
- Age 34 → Age range 30-39
- Zip code 02142 → State: Massachusetts
- Date 03/15/1990 → Year: 1990
- Salary $87,500 → Salary range: $80K-$90K
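The bucketing rules above can be sketched as small helper functions. This is a minimal illustration, not a standard library API; the function names and bucket sizes are assumptions chosen to match the examples:

```python
def generalize_age(age: int, bucket: int = 10) -> str:
    """Map an exact age to a decade range, e.g. 34 -> '30-39'."""
    low = (age // bucket) * bucket
    return f"{low}-{low + bucket - 1}"

def generalize_date(date_str: str) -> str:
    """Keep only the year from an MM/DD/YYYY date."""
    return date_str.split("/")[-1]

def generalize_salary(salary: float, bucket: int = 10_000) -> str:
    """Round a salary down to a $10K band, e.g. 87500 -> '$80K-$90K'."""
    low = int(salary // bucket) * bucket
    return f"${low // 1000}K-${(low + bucket) // 1000}K"
```

Coarser buckets reduce re-identification risk but also reduce analytical precision, so bucket sizes are a tuning decision, not a fixed rule.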
Choosing a strategy: Use type replacement for LLM input/output guardrails. Use pseudonymization when you need to preserve data relationships for analytics. Use tokenization when you need reversibility. Use generalization for reporting and aggregation.
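In practice these strategies share one mechanism (replace entity spans right to left so offsets stay valid) and differ only in how each replacement string is produced. A minimal sketch of a strategy dispatcher, with two illustrative strategies inlined (the names `redact` and `STRATEGIES` are assumptions, not from a specific library):

```python
def _mask(entity: dict) -> str:
    # Same-length asterisks, as in the masking strategy above
    return "*" * len(entity["value"])

def _type_label(entity: dict) -> str:
    # Entity-type placeholder, as in the type-replacement strategy above
    return f"[{entity['type']}]"

STRATEGIES = {"mask": _mask, "type": _type_label}

def redact(text: str, entities: list, strategy: str = "type") -> str:
    """Apply the chosen per-entity replacement, right to left to preserve offsets."""
    replace = STRATEGIES[strategy]
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:entity["start"]] + replace(entity) + text[entity["end"]:]
    return text
```

Pseudonymization and tokenization fit the same shape; their replacement functions just carry extra state (the fake-value mapping or the token vault).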