Intermediate

PII Redaction Techniques

Once PII is detected, the next step is redaction — removing or transforming the information to protect privacy while preserving data utility. Different use cases require different redaction strategies.

Redaction Strategies Overview

| Strategy | Description | Reversible | Data Utility |
| --- | --- | --- | --- |
| Masking | Replace with placeholder characters | No | Low |
| Type Replacement | Replace with entity type label | No | Medium |
| Pseudonymization | Replace with fake but realistic data | With mapping | High |
| Tokenization | Replace with random tokens, store mapping | Yes | Medium |
| Generalization | Replace with broader category | No | Medium |
| Deletion | Remove PII entirely | No | Low |
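
All of the code samples below assume detection has already produced a list of entity dicts with `type`, `value`, `start`, and `end` keys. This shape is an assumption for illustration (similar to what NER pipelines emit), not a standard format:

Python - Assumed Entity Format
```python
# Hypothetical detector output for the running example
text = "Contact John Smith at john@email.com"

entities = [
    {"type": "PERSON", "value": "John Smith", "start": 8, "end": 18},
    {"type": "EMAIL", "value": "john@email.com", "start": 22, "end": 36},
]

# start/end are character offsets into the original string
for e in entities:
    assert text[e["start"]:e["end"]] == e["value"]
```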

1. Masking

Replace PII with placeholder characters like asterisks or X's. Simple but destroys data utility:

Python - Masking Redaction
def mask_pii(text: str, entities: list) -> str:
    # Sort by position (reverse) to preserve offsets
    sorted_entities = sorted(entities, key=lambda e: e["start"], reverse=True)

    for entity in sorted_entities:
        mask = "*" * len(entity["value"])
        text = text[:entity["start"]] + mask + text[entity["end"]:]

    return text

# "Contact John Smith at john@email.com"
# becomes: "Contact ********** at **************"

2. Type Replacement

Replace PII with its entity type label. Preserves sentence structure and context:

Python - Type Replacement
def replace_with_type(text: str, entities: list) -> str:
    sorted_entities = sorted(entities, key=lambda e: e["start"], reverse=True)

    for entity in sorted_entities:
        replacement = f"[{entity['type']}]"
        text = text[:entity["start"]] + replacement + text[entity["end"]:]

    return text

# "Contact John Smith at john@email.com"
# becomes: "Contact [PERSON] at [EMAIL]"

3. Pseudonymization

Replace PII with fake but realistic data. This preserves data utility for analytics and ML training while protecting identity:

Python - Pseudonymization with Faker
from faker import Faker

fake = Faker()
Faker.seed(42)  # Reproducible fakes

GENERATORS = {
    "PERSON": fake.name,
    "EMAIL": fake.email,
    "PHONE": fake.phone_number,
    "ADDRESS": fake.address,
    "SSN": fake.ssn,
    "CREDIT_CARD": fake.credit_card_number,
}

def pseudonymize(text: str, entities: list) -> tuple[str, dict]:
    # Consistent mapping: same input always gets same fake
    mapping = {}
    sorted_entities = sorted(entities, key=lambda e: e["start"], reverse=True)

    for entity in sorted_entities:
        key = entity["value"]
        if key not in mapping:
            generator = GENERATORS.get(entity["type"])
            mapping[key] = generator() if generator else "[REDACTED]"
        text = text[:entity["start"]] + mapping[key] + text[entity["end"]:]

    return text, mapping

# "Contact John Smith at john@email.com"
# becomes: "Contact Maria Garcia at fake42@example.net"

4. Tokenization

Replace PII with random tokens and store the mapping in a secure vault. This enables reversibility when authorized:

Python - Tokenization
import uuid

class PIITokenizer:
    def __init__(self):
        self.vault = {}  # In production, use encrypted storage

    def tokenize(self, text: str, entities: list) -> str:
        for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
            token = f"TOK_{uuid.uuid4().hex[:8]}"
            self.vault[token] = entity["value"]
            text = text[:entity["start"]] + token + text[entity["end"]:]
        return text

    def detokenize(self, text: str) -> str:
        for token, value in self.vault.items():
            text = text.replace(token, value)
        return text
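
One refinement worth noting (my addition, not part of the class above): reuse the same token when the same value appears repeatedly, so equality relationships survive tokenization and downstream joins still work. A self-contained sketch:

Python - Consistent Tokenization
```python
import uuid

def tokenize_consistent(text: str, entities: list, vault: dict) -> str:
    # Reverse index (original value -> token) so repeats share one token
    reverse = {v: k for k, v in vault.items()}
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        value = entity["value"]
        token = reverse.get(value)
        if token is None:
            token = f"TOK_{uuid.uuid4().hex[:8]}"
            vault[token] = value
            reverse[value] = token
        text = text[:entity["start"]] + token + text[entity["end"]:]
    return text
```

With this, a repeated email maps to a single token across a whole corpus, combining pseudonymization's utility with tokenization's reversibility.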

5. Generalization

Replace specific values with broader categories to reduce identifiability while retaining analytical value:

  • Age 34 → Age range 30-39
  • Zip code 02142 → State: Massachusetts
  • Date 03/15/1990 → Year: 1990
  • Salary $87,500 → Salary range: $80K-$90K
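
The bullet examples above can be sketched as simple helpers. The bucket widths and output formats are my own choices, not a standard, and zip-to-state generalization is omitted because it needs a lookup table:

Python - Generalization Helpers
```python
def generalize_age(age: int, width: int = 10) -> str:
    # Bucket into fixed-width ranges: 34 -> "30-39"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_date_to_year(date_str: str) -> str:
    # "03/15/1990" (MM/DD/YYYY) -> "1990"
    return date_str.split("/")[-1]

def generalize_salary(salary: float, width: int = 10_000) -> str:
    # 87500 -> "$80K-$90K"
    low = int(salary // width) * width
    return f"${low // 1000}K-${(low + width) // 1000}K"

# generalize_age(34)                     -> "30-39"
# generalize_date_to_year("03/15/1990")  -> "1990"
# generalize_salary(87_500)              -> "$80K-$90K"
```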
Choosing a strategy: Use type replacement for LLM input/output guardrails. Use pseudonymization when you need to preserve data relationships for analytics. Use tokenization when you need reversibility. Use generalization for reporting and aggregation.