Query-Based Model Extraction
The most practical model extraction method: systematically query a target API, collect input-output pairs, and train a substitute model that replicates the target's behavior.
The Basic Attack Flow
1. Select query strategy: choose inputs that maximize information gain per query.
2. Query target API: send crafted inputs and collect predictions (labels, probabilities, or logits).
3. Build training dataset: pair queries with the target model's responses as labeled data.
4. Train substitute model: use knowledge distillation to train a copy.
5. Evaluate fidelity: measure agreement between the substitute and the target on held-out data.
6. Iterate: use active learning to query informative regions and improve fidelity.
Query Strategies
| Strategy | Description | Query Efficiency |
|---|---|---|
| Random Sampling | Send random inputs from the input domain | Low — many queries needed |
| Active Learning | Query inputs where the substitute model is most uncertain | High — often an order of magnitude fewer queries |
| Jacobian-Based | Use gradients of substitute to find informative queries | High — targets decision boundaries |
| Knockoff Nets | Use natural data distribution + task-relevant augmentation | Medium — good generalization |
| Data-Free Distillation | Generate synthetic queries using a generator network | High — no real data needed |
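The active-learning row above can be sketched as entropy-based uncertainty sampling over an unlabeled candidate pool. The function and variable names here are illustrative; any substitute model that returns class logits works:

```python
import torch

def select_queries(student_model, candidate_pool, budget):
    """Pick the pool inputs where the substitute is least certain.

    candidate_pool: tensor of unlabeled inputs; budget: number of API
    queries to spend this round.
    """
    with torch.no_grad():
        probs = torch.softmax(student_model(candidate_pool), dim=-1)
    # Predictive entropy is highest where the substitute is most uncertain
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    top = entropy.topk(budget).indices
    return candidate_pool[top]
```

The selected inputs are then sent to the target API, and the resulting labels are added to the substitute's training set for the next round.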
Knowledge Distillation for Extraction
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, temperature=3.0):
    """Soft label distillation from teacher (target API) outputs."""
    student_soft = F.log_softmax(student_logits / temperature, dim=-1)
    loss = F.kl_div(student_soft, teacher_probs, reduction="batchmean")
    return loss * (temperature ** 2)

def extract_model(target_api, student_model, query_set, epochs=50):
    """Extract target model via API queries."""
    optimizer = torch.optim.Adam(student_model.parameters())
    for epoch in range(epochs):
        for queries in query_set:
            # Query target API (the expensive part)
            teacher_probs = target_api.predict_proba(queries)

            # Train student to match teacher
            student_logits = student_model(queries)
            loss = distillation_loss(student_logits, teacher_probs)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student_model
```
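Step 5 of the attack flow, fidelity evaluation, can be measured as plain label agreement on held-out inputs. This sketch reuses the hypothetical `target_api.predict_proba` interface from the extraction loop above:

```python
import torch

def fidelity(student_model, target_api, eval_inputs):
    """Fraction of held-out inputs where substitute and target agree."""
    with torch.no_grad():
        student_pred = student_model(eval_inputs).argmax(dim=-1)
    target_pred = target_api.predict_proba(eval_inputs).argmax(dim=-1)
    return (student_pred == target_pred).float().mean().item()
```

Note that fidelity (agreement with the target) is the relevant metric here, not accuracy on ground-truth labels: a faithful copy reproduces the target's mistakes as well as its correct answers.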
LLM-Specific Extraction
Extracting large language models has unique challenges and approaches:
- Output distillation: Query the target LLM with diverse prompts and fine-tune a smaller model on the outputs
- Logit extraction: If the API returns token probabilities, these provide richer training signal than just text
- Task-specific extraction: Focus on extracting the model's behavior for a specific task rather than full capabilities
- Chain-of-thought extraction: Capture the reasoning patterns along with the final answers
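Output distillation from the list above reduces to collecting prompt/completion pairs and using them as a fine-tuning set for the smaller model. A minimal sketch, where `target_llm.complete` is a placeholder for whatever API client the attacker actually wraps:

```python
def build_distillation_set(target_llm, prompts):
    """Collect prompt/completion pairs for fine-tuning a substitute LLM.

    target_llm.complete(prompt) -> str is a hypothetical interface;
    substitute the real API client here.
    """
    dataset = []
    for prompt in prompts:
        # Each API call yields one supervised fine-tuning example
        completion = target_llm.complete(prompt)
        dataset.append({"prompt": prompt, "completion": completion})
    return dataset
```

Prompt diversity matters more than raw volume: the substitute only learns behaviors the prompt set actually elicits from the target.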
What Information Aids Extraction?
Probability Outputs
APIs that return class probabilities leak far more information than hard labels. Each probability vector essentially provides a "soft label" for training.
Confidence Scores
Even a single confidence score reveals information about the decision boundary and helps the attacker focus queries on uncertain regions.
Embedding Access
APIs that expose embedding vectors allow direct comparison of the model's internal representations, making extraction far easier.
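One way to exploit this: feed the same inputs to both models and compare the returned embeddings with cosine similarity, giving a direct signal for how closely the substitute's representations track the target's. A minimal sketch (the function name and tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def representation_agreement(substitute_emb, target_emb):
    """Mean cosine similarity between substitute and target embeddings.

    Both inputs are (batch, dim) tensors of embeddings for the same batch.
    """
    return F.cosine_similarity(substitute_emb, target_emb, dim=-1).mean().item()
```

This score can serve directly as a training objective for the substitute's encoder, which is much stronger supervision than matching output labels alone.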
Error Messages
Detailed error messages can reveal input constraints, feature expectations, and model architecture details useful for choosing a substitute architecture.
Lilly Tech Systems