Model Inversion & Privacy Attacks
Machine learning models can inadvertently memorize and leak sensitive information about their training data. Privacy attacks exploit this memorization to extract personal data, determine whether specific individuals were in the training set, or reconstruct sensitive features. This lesson covers the major categories of privacy attacks and the defenses against them.
Privacy Attack Taxonomy
| Attack | Goal | Access Required | Information Extracted |
|---|---|---|---|
| Model Inversion | Reconstruct training data | White-box or black-box | Representative examples of training data |
| Membership Inference | Determine training set membership | Black-box | Whether a specific record was used for training |
| Attribute Inference | Infer sensitive attributes | Black-box | Missing or hidden features of training data |
| Data Extraction | Extract memorized data from LLMs | Black-box | Verbatim training data (PII, code, secrets) |
Model Inversion Attacks
Model inversion uses the model's outputs to reconstruct representative inputs for each class. For a facial recognition system, this means reconstructing approximate faces from the training data:
```python
import torch

def model_inversion(model, target_class, input_shape, num_steps=1000, lr=0.01):
    """Reconstruct a representative input for target_class."""
    # Start with random noise
    x = torch.randn(input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)

    for step in range(num_steps):
        optimizer.zero_grad()
        output = model(x.unsqueeze(0))
        # Maximize confidence for the target class
        loss = -output[0, target_class]
        # L2 regularization nudges the result toward realistic images
        loss += 0.001 * torch.norm(x)
        loss.backward()
        optimizer.step()
        # Clamp to the valid pixel range
        x.data = torch.clamp(x.data, 0, 1)

    return x.detach()
```
Membership Inference
Membership inference determines whether a given data point was in the model's training set. The key insight is that models tend to be more confident on training data than on unseen data:
- Confidence-based — Training data typically receives higher confidence predictions
- Loss-based — Training data has lower loss values than non-training data
- Shadow model approach — Train shadow models to learn the membership signal, then use them to classify membership
- Metric-based — Compare model output metrics (entropy, modified entropy) between members and non-members
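The loss-based signal above can be sketched as a simple threshold test. This is a minimal illustration, not a production attack: the function name and the threshold value are hypothetical, and in practice the threshold is calibrated using shadow models or held-out data.

```python
import torch
import torch.nn.functional as F

def loss_based_membership(model, x, y, threshold=0.5):
    """Flag a record as a likely training-set member if its loss is low.

    `threshold` is illustrative; real attacks calibrate it per-model,
    often per-class, using shadow models.
    """
    model.eval()
    with torch.no_grad():
        logits = model(x.unsqueeze(0))
        loss = F.cross_entropy(logits, torch.tensor([y]))
    # Members tend to have lower loss than non-members
    return loss.item() < threshold
```

Because a well-generalized model has similar loss on members and non-members, this test works best against overfit models, which is why regularization appears in the defenses table below.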
Training Data Extraction from LLMs
Large language models can memorize and reproduce verbatim training data. Researchers have shown that prompting LLMs with specific prefixes can cause them to complete with memorized content including:
- Personally identifiable information (names, addresses, phone numbers)
- API keys and passwords included in code repositories
- Copyrighted text passages
- Private conversations or emails
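A common extraction workflow is to sample many completions from the model and then rank them by perplexity: text the model assigns unusually high probability to is a candidate for verbatim memorization. The sketch below assumes a `token_logprobs(text)` interface (hypothetical, standing in for whatever API exposes per-token log-probabilities) and shows only the ranking step.

```python
import math

def rank_by_perplexity(candidates, token_logprobs):
    """Rank generated samples by perplexity, lowest (most suspicious) first.

    `token_logprobs(text)` is an assumed interface returning a list of
    per-token log-probabilities for `text` under the model.
    """
    scored = []
    for text in candidates:
        lps = token_logprobs(text)
        # Perplexity = exp(average negative log-probability per token)
        ppl = math.exp(-sum(lps) / len(lps))
        scored.append((ppl, text))
    return sorted(scored)
```

Low perplexity alone also flags generic boilerplate, so published attacks additionally compare against a second model or against the same model on lowercased text to filter out merely common strings.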
Privacy Defenses
| Defense | Mechanism | Trade-off |
|---|---|---|
| Differential Privacy | Add calibrated noise during training (DP-SGD) | Accuracy reduction proportional to privacy budget |
| Output Perturbation | Add noise to model outputs or round confidence scores | Reduced output precision |
| Regularization | L2 regularization, dropout to reduce memorization | May reduce model capacity |
| Knowledge Distillation | Train a student model that does not memorize individual examples | Slight accuracy loss |
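The DP-SGD mechanism from the table can be sketched as one training step: clip each example's gradient to bound any single record's influence, then add Gaussian noise before updating. This is a minimal illustration; the function name and hyperparameter values are assumptions, and real deployments (e.g. via a DP library) derive the noise multiplier from the target (epsilon, delta) privacy budget.

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD step: per-example gradient clipping plus Gaussian noise."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(batch_x, batch_y):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        # Clip this example's gradient so no single record dominates
        total_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in params))
        scale = min(1.0, clip_norm / (total_norm + 1e-6))
        for s, p in zip(summed, params):
            s += p.grad * scale

    batch_size = len(batch_x)
    with torch.no_grad():
        for s, p in zip(summed, params):
            # Noise scale is proportional to the clipping bound
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p -= lr * (s + noise) / batch_size
```

The per-example loop is the main cost of DP-SGD in practice; production implementations vectorize it, but the privacy guarantee comes from exactly these two operations: clipping and noising.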
Ready to Learn Defenses?
The next lesson covers comprehensive defense strategies including adversarial training, defensive distillation, and input transformation methods.