Model Inversion & Privacy Attacks

Machine learning models can inadvertently memorize and leak sensitive information about their training data. Privacy attacks exploit this memorization to extract personal data, determine whether specific individuals were in the training set, or reconstruct sensitive features. This lesson covers the major categories of privacy attacks and the defenses against them.

Privacy Attack Taxonomy

| Attack | Goal | Access Required | Information Extracted |
|---|---|---|---|
| Model Inversion | Reconstruct training data | White-box or black-box | Representative examples of training data |
| Membership Inference | Determine training set membership | Black-box | Whether a specific record was used for training |
| Attribute Inference | Infer sensitive attributes | Black-box | Missing or hidden features of training data |
| Data Extraction | Extract memorized data from LLMs | Black-box | Verbatim training data (PII, code, secrets) |

Model Inversion Attacks

Model inversion uses the model's outputs to reconstruct representative inputs for each class. For a facial recognition system, this means reconstructing approximate faces from the training data:

Python (Conceptual)
import torch

def model_inversion(model, target_class, input_shape,
                     num_steps=1000, lr=0.01):
    """Reconstruct a representative input for target_class."""
    # Start with random noise
    x = torch.randn(input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)

    for step in range(num_steps):
        optimizer.zero_grad()
        output = model(x.unsqueeze(0))

        # Maximize confidence for target class
        loss = -output[0, target_class]

        # Add regularization for realistic images
        loss += 0.001 * torch.norm(x)

        loss.backward()
        optimizer.step()

        # Clamp to valid range
        x.data = torch.clamp(x.data, 0, 1)

    return x.detach()

Membership Inference

Membership inference determines whether a given data point was in the model's training set. The key insight is that models tend to be more confident on training data than on unseen data:

  • Confidence-based — Training data typically receives higher confidence predictions
  • Loss-based — Training data has lower loss values than non-training data
  • Shadow model approach — Train shadow models to learn the membership signal, then use them to classify membership
  • Metric-based — Compare model output metrics (entropy, modified entropy) between members and non-members

Why It Matters: Membership inference can reveal whether a specific individual's data was used for training, which has direct implications for GDPR, HIPAA, and other data privacy regulations. A model that is vulnerable to membership inference may not comply with privacy requirements.
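The confidence-based signal can be turned into a simple threshold attack. The sketch below is a minimal illustration, not a production attack: `model` stands for any trained classifier returning logits, and the threshold (a placeholder here) would in practice be calibrated using the shadow-model approach.

```python
import torch
import torch.nn.functional as F

def confidence_membership_attack(model, x, y, threshold=0.9):
    """Predict membership: records where the model assigns high
    confidence to the true label are flagged as likely members."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)
        # Confidence the model assigns to each record's true label
        confidence = probs[torch.arange(len(y)), y]
    return confidence > threshold  # True = predicted training member
```

The same skeleton covers the loss-based variant: replace the confidence lookup with per-example cross-entropy loss and flag records whose loss falls below the threshold.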

Training Data Extraction from LLMs

Large language models can memorize and reproduce verbatim training data. Researchers have shown that prompting LLMs with specific prefixes can cause them to complete with memorized content including:

  • Personally identifiable information (names, addresses, phone numbers)
  • API keys and passwords included in code repositories
  • Copyrighted text passages
  • Private conversations or emails
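One practical way to audit model generations for the leak categories above is pattern matching against generated text. The regexes below are illustrative assumptions only; real PII detection requires far broader and more careful coverage.

```python
import re

# Illustrative patterns -- deliberately narrow, not exhaustive
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "api_key": re.compile(r"\b(?:sk|key)[-_][A-Za-z0-9]{16,}\b"),
}

def scan_for_pii(text):
    """Return the PII categories (and matches) found in a generation."""
    return {name: pattern.findall(text)
            for name, pattern in PII_PATTERNS.items()
            if pattern.search(text)}
```

Running such a scanner over large batches of sampled completions is one of the techniques researchers have used to quantify memorization in deployed LLMs.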

Privacy Defenses

| Defense | Mechanism | Trade-off |
|---|---|---|
| Differential Privacy | Add calibrated noise during training (DP-SGD) | Accuracy reduction proportional to privacy budget |
| Output Perturbation | Add noise to model outputs or round confidence scores | Reduced output precision |
| Regularization | L2 regularization, dropout to reduce memorization | May reduce model capacity |
| Knowledge Distillation | Train a student model that does not memorize individual examples | Slight accuracy loss |
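The output-perturbation row can be sketched in a few lines: add noise to the probability vector and coarsen it before returning it to the caller, blunting the fine-grained confidence signal that membership inference relies on. The parameters below are hypothetical and would need tuning against the accuracy/privacy trade-off.

```python
import torch

def perturb_output(probs, decimals=1, noise_scale=0.01):
    """Noise and round a probability vector before releasing it,
    reducing the precision available to confidence-based attacks."""
    noisy = probs + noise_scale * torch.randn_like(probs)
    # Round to a coarse grid, then repair the distribution
    rounded = torch.round(noisy * 10**decimals) / 10**decimals
    rounded = torch.clamp(rounded, min=0)
    return rounded / rounded.sum(dim=-1, keepdim=True)
```

Note the defense only degrades the attack signal; unlike differential privacy, it offers no formal guarantee.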

Ready to Learn Defenses?

The next lesson covers comprehensive defense strategies including adversarial training, defensive distillation, and input transformation methods.

Next: Defenses →