Intermediate

PII Redaction Techniques

Once PII is detected, the next step is redaction — removing or transforming the information to protect privacy while preserving data utility. Different use cases require different redaction strategies.

Redaction Strategies Overview

| Strategy | Description | Reversible | Data Utility |
| --- | --- | --- | --- |
| Masking | Replace with placeholder characters | No | Low |
| Type Replacement | Replace with entity type label | No | Medium |
| Pseudonymization | Replace with fake but realistic data | With mapping | High |
| Tokenization | Replace with random tokens, store mapping | Yes | Medium |
| Generalization | Replace with broader category | No | Medium |
| Deletion | Remove PII entirely | No | Low |
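
All of the code samples below assume detection has already produced a list of entity dicts with `type`, `value`, `start`, and `end` keys. This shape is an assumption for illustration (similar to what NER pipelines emit), not a standard format:

Python - Assumed Entity Format
```python
# Hypothetical detector output for the running example
text = "Contact John Smith at john@email.com"

entities = [
    {"type": "PERSON", "value": "John Smith", "start": 8, "end": 18},
    {"type": "EMAIL", "value": "john@email.com", "start": 22, "end": 36},
]

# start/end are character offsets into the original string
for e in entities:
    assert text[e["start"]:e["end"]] == e["value"]
```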

1. Masking

Replace PII with placeholder characters like asterisks or X's. Simple but destroys data utility:

Python - Masking Redaction
def mask_pii(text: str, entities: list) -> str:
    # Sort by position (reverse) to preserve offsets
    sorted_entities = sorted(entities, key=lambda e: e["start"], reverse=True)

    for entity in sorted_entities:
        mask = "*" * len(entity["value"])
        text = text[:entity["start"]] + mask + text[entity["end"]:]

    return text

# "Contact John Smith at john@email.com"
# becomes: "Contact ********** at **************"

2. Type Replacement

Replace PII with its entity type label. Preserves sentence structure and context:

Python - Type Replacement
def replace_with_type(text: str, entities: list) -> str:
    sorted_entities = sorted(entities, key=lambda e: e["start"], reverse=True)

    for entity in sorted_entities:
        replacement = f"[{entity['type']}]"
        text = text[:entity["start"]] + replacement + text[entity["end"]:]

    return text

# "Contact John Smith at john@email.com"
# becomes: "Contact [PERSON] at [EMAIL]"

3. Pseudonymization

Replace PII with fake but realistic data. This preserves data utility for analytics and ML training while protecting identity:

Python - Pseudonymization with Faker
from faker import Faker

fake = Faker()
Faker.seed(42)  # Reproducible fakes

GENERATORS = {
    "PERSON": fake.name,
    "EMAIL": fake.email,
    "PHONE": fake.phone_number,
    "ADDRESS": fake.address,
    "SSN": fake.ssn,
    "CREDIT_CARD": fake.credit_card_number,
}

def pseudonymize(text: str, entities: list) -> tuple[str, dict]:
    # Consistent mapping: same input always gets same fake
    mapping = {}
    sorted_entities = sorted(entities, key=lambda e: e["start"], reverse=True)

    for entity in sorted_entities:
        key = entity["value"]
        if key not in mapping:
            generator = GENERATORS.get(entity["type"])
            mapping[key] = generator() if generator else "[REDACTED]"
        text = text[:entity["start"]] + mapping[key] + text[entity["end"]:]

    return text, mapping

# "Contact John Smith at john@email.com"
# becomes: "Contact Maria Garcia at fake42@example.net"

4. Tokenization

Replace PII with random tokens and store the mapping in a secure vault. This enables reversibility when authorized:

Python - Tokenization
import uuid

class PIITokenizer:
    def __init__(self):
        self.vault = {}  # In production, use encrypted storage

    def tokenize(self, text: str, entities: list) -> str:
        for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
            token = f"TOK_{uuid.uuid4().hex[:8]}"
            self.vault[token] = entity["value"]
            text = text[:entity["start"]] + token + text[entity["end"]:]
        return text

    def detokenize(self, text: str) -> str:
        for token, value in self.vault.items():
            text = text.replace(token, value)
        return text
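
One refinement worth noting (my addition, not part of the class above): reuse the same token when the same value appears repeatedly, so equality relationships survive tokenization and downstream joins still work. A self-contained sketch:

Python - Consistent Tokenization
```python
import uuid

def tokenize_consistent(text: str, entities: list, vault: dict) -> str:
    # Reverse index (original value -> token) so repeats share one token
    reverse = {v: k for k, v in vault.items()}
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        value = entity["value"]
        token = reverse.get(value)
        if token is None:
            token = f"TOK_{uuid.uuid4().hex[:8]}"
            vault[token] = value
            reverse[value] = token
        text = text[:entity["start"]] + token + text[entity["end"]:]
    return text
```

With this, a repeated email maps to a single token across a whole corpus, combining pseudonymization's utility with tokenization's reversibility.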

5. Generalization

Replace specific values with broader categories to reduce identifiability while retaining analytical value:

  • Age 34 → Age range 30-39
  • Zip code 02142 → State: Massachusetts
  • Date 03/15/1990 → Year: 1990
  • Salary $87,500 → Salary range: $80K-$90K
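
The bullet examples above can be sketched as simple helpers. The bucket widths and output formats are my own choices, not a standard, and zip-to-state generalization is omitted because it needs a lookup table:

Python - Generalization Helpers
```python
def generalize_age(age: int, width: int = 10) -> str:
    # Bucket into fixed-width ranges: 34 -> "30-39"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_date_to_year(date_str: str) -> str:
    # "03/15/1990" (MM/DD/YYYY) -> "1990"
    return date_str.split("/")[-1]

def generalize_salary(salary: float, width: int = 10_000) -> str:
    # 87500 -> "$80K-$90K"
    low = int(salary // width) * width
    return f"${low // 1000}K-${(low + width) // 1000}K"

# generalize_age(34)                     -> "30-39"
# generalize_date_to_year("03/15/1990")  -> "1990"
# generalize_salary(87_500)              -> "$80K-$90K"
```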
Choosing a strategy: Use type replacement for LLM input/output guardrails. Use pseudonymization when you need to preserve data relationships for analytics. Use tokenization when you need reversibility. Use generalization for reporting and aggregation.