Model Inversion & Privacy Attacks

Machine learning models can inadvertently memorize and leak sensitive information about their training data. Privacy attacks exploit this memorization to extract personal data, determine whether specific individuals were in the training set, or reconstruct sensitive features. This lesson covers the major categories of privacy attacks and the defenses against them.

Privacy Attack Taxonomy

| Attack | Goal | Access Required | Information Extracted |
|---|---|---|---|
| Model Inversion | Reconstruct training data | White-box or black-box | Representative examples of training data |
| Membership Inference | Determine training set membership | Black-box | Whether a specific record was used for training |
| Attribute Inference | Infer sensitive attributes | Black-box | Missing or hidden features of training data |
| Data Extraction | Extract memorized data from LLMs | Black-box | Verbatim training data (PII, code, secrets) |

Model Inversion Attacks

Model inversion uses the model's outputs to reconstruct representative inputs for each class. For a facial recognition system, this means reconstructing approximate faces from the training data:

Python (Conceptual)
import torch

def model_inversion(model, target_class, input_shape,
                     num_steps=1000, lr=0.01):
    """Reconstruct a representative input for target_class."""
    # Start with random noise
    x = torch.randn(input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)

    for step in range(num_steps):
        optimizer.zero_grad()
        output = model(x.unsqueeze(0))

        # Maximize confidence for target class
        loss = -output[0, target_class]

        # Add regularization for realistic images
        loss += 0.001 * torch.norm(x)

        loss.backward()
        optimizer.step()

        # Clamp to valid range
        x.data = torch.clamp(x.data, 0, 1)

    return x.detach()

Membership Inference

Membership inference determines whether a given data point was in the model's training set. The key insight is that models tend to be more confident on training data than on unseen data:

  • Confidence-based — Training data typically receives higher confidence predictions
  • Loss-based — Training data has lower loss values than non-training data
  • Shadow model approach — Train shadow models to learn the membership signal, then use them to classify membership
  • Metric-based — Compare model output metrics (entropy, modified entropy) between members and non-members

Why It Matters: Membership inference can reveal whether a specific individual's data was used for training, which has direct implications for GDPR, HIPAA, and other data privacy regulations. A model that is vulnerable to membership inference may not comply with privacy requirements.
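The confidence-based signal can be turned into a simple threshold attack. The sketch below is a minimal illustration, not a production attack: `model` stands for any trained classifier returning logits, and the threshold (a placeholder here) would in practice be calibrated using the shadow-model approach.

```python
import torch
import torch.nn.functional as F

def confidence_membership_attack(model, x, y, threshold=0.9):
    """Predict membership: records where the model assigns high
    confidence to the true label are flagged as likely members."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)
        # Confidence the model assigns to each record's true label
        confidence = probs[torch.arange(len(y)), y]
    return confidence > threshold  # True = predicted training member
```

The same skeleton covers the loss-based variant: replace the confidence lookup with per-example cross-entropy loss and flag records whose loss falls below the threshold.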

Training Data Extraction from LLMs

Large language models can memorize and reproduce verbatim training data. Researchers have shown that prompting LLMs with specific prefixes can cause them to complete with memorized content including:

  • Personally identifiable information (names, addresses, phone numbers)
  • API keys and passwords included in code repositories
  • Copyrighted text passages
  • Private conversations or emails
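One practical way to audit model generations for the leak categories above is pattern matching against generated text. The regexes below are illustrative assumptions only; real PII detection requires far broader and more careful coverage.

```python
import re

# Illustrative patterns -- deliberately narrow, not exhaustive
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "api_key": re.compile(r"\b(?:sk|key)[-_][A-Za-z0-9]{16,}\b"),
}

def scan_for_pii(text):
    """Return the PII categories (and matches) found in a generation."""
    return {name: pattern.findall(text)
            for name, pattern in PII_PATTERNS.items()
            if pattern.search(text)}
```

Running such a scanner over large batches of sampled completions is one of the techniques researchers have used to quantify memorization in deployed LLMs.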

Privacy Defenses

| Defense | Mechanism | Trade-off |
|---|---|---|
| Differential Privacy | Add calibrated noise during training (DP-SGD) | Accuracy reduction proportional to privacy budget |
| Output Perturbation | Add noise to model outputs or round confidence scores | Reduced output precision |
| Regularization | L2 regularization, dropout to reduce memorization | May reduce model capacity |
| Knowledge Distillation | Train a student model that does not memorize individual examples | Slight accuracy loss |
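The output-perturbation row can be sketched in a few lines: add noise to the probability vector and coarsen it before returning it to the caller, blunting the fine-grained confidence signal that membership inference relies on. The parameters below are hypothetical and would need tuning against the accuracy/privacy trade-off.

```python
import torch

def perturb_output(probs, decimals=1, noise_scale=0.01):
    """Noise and round a probability vector before releasing it,
    reducing the precision available to confidence-based attacks."""
    noisy = probs + noise_scale * torch.randn_like(probs)
    # Round to a coarse grid, then repair the distribution
    rounded = torch.round(noisy * 10**decimals) / 10**decimals
    rounded = torch.clamp(rounded, min=0)
    return rounded / rounded.sum(dim=-1, keepdim=True)
```

Note the defense only degrades the attack signal; unlike differential privacy, it offers no formal guarantee.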

Ready to Learn Defenses?

The next lesson covers comprehensive defense strategies including adversarial training, defensive distillation, and input transformation methods.

Next: Defenses →