Intermediate

Query-Based Model Extraction

Query-based extraction is the most practical model extraction method: systematically query a target API, collect input–output pairs, and train a substitute model that replicates the target's behavior.

The Basic Attack Flow

Query-Based Extraction Pipeline
Step 1: Select Query Strategy
  Choose inputs that maximize information gain per query

Step 2: Query Target API
  Send crafted inputs, collect predictions (labels, probabilities, logits)

Step 3: Build Training Dataset
  Pair queries with target model responses as labeled data

Step 4: Train Substitute Model
  Use knowledge distillation to train a copy

Step 5: Evaluate Fidelity
  Measure agreement between substitute and target on held-out data

Step 6: Iterate
  Use active learning to query informative regions, improve fidelity

Query Strategies

| Strategy | Description | Query Efficiency |
|---|---|---|
| Random Sampling | Send random inputs from the input domain | Low — many queries needed |
| Active Learning | Query inputs where the substitute model is most uncertain | High — often 10× fewer queries |
| Jacobian-Based | Use gradients of the substitute to find informative queries | High — targets decision boundaries |
| Knockoff Nets | Use a natural data distribution plus task-relevant augmentation | Medium — good generalization |
| Data-Free Distillation | Generate synthetic queries with a generator network | High — no real data needed |
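The active-learning strategy above can be sketched as entropy-based query selection: score a pool of candidate inputs by the substitute's predictive uncertainty and spend the query budget on the most uncertain ones. A minimal sketch; `substitute`, `candidate_pool`, and `budget` are placeholder names, not from any particular toolkit:

```python
import torch
import torch.nn.functional as F

def select_uncertain_queries(substitute, candidate_pool, budget):
    """Pick the `budget` candidates where the substitute is least certain.

    Uncertainty is measured as the entropy of the substitute's predicted
    class distribution; high entropy suggests the input lies near a
    decision boundary, where a target-API label is most informative.
    """
    with torch.no_grad():
        probs = F.softmax(substitute(candidate_pool), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    top = entropy.topk(budget).indices
    return candidate_pool[top]
```

In each extraction round, the attacker selects one such batch, queries the target API only for those inputs, and adds the resulting pairs to the substitute's training set.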

Knowledge Distillation for Extraction

Python - Model Extraction via Distillation
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, temperature=3.0):
    """Soft-label distillation from teacher (target API) outputs.

    The API returns probabilities at temperature 1, so only the student
    side is softened here; the T**2 factor keeps gradient magnitudes
    comparable across temperature settings.
    """
    student_soft = F.log_softmax(student_logits / temperature, dim=-1)
    loss = F.kl_div(student_soft, teacher_probs, reduction="batchmean")
    return loss * (temperature ** 2)

def extract_model(target_api, student_model, query_set, epochs=50):
    """Extract target model via API queries."""
    optimizer = torch.optim.Adam(student_model.parameters())

    for epoch in range(epochs):
        for queries in query_set:
            # Query target API (the expensive part); responses may arrive
            # as arrays, so convert to tensors before training
            teacher_probs = torch.as_tensor(target_api.predict_proba(queries))

            # Train student to match teacher
            student_logits = student_model(queries)
            loss = distillation_loss(student_logits, teacher_probs)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return student_model
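Step 5's fidelity check reduces to an agreement rate: run the same held-out inputs through both models and count matching predicted labels. A minimal sketch (the names are illustrative):

```python
import torch

def label_agreement(target_logits, substitute_logits):
    """Fraction of held-out inputs where both models predict the same class."""
    target_labels = target_logits.argmax(dim=-1)
    substitute_labels = substitute_logits.argmax(dim=-1)
    return (target_labels == substitute_labels).float().mean().item()

t = torch.tensor([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
s = torch.tensor([[0.7, 0.3], [0.1, 0.9], [0.4, 0.6]])
label_agreement(t, s)  # 2 of 3 labels match, ≈ 0.667
```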

LLM-Specific Extraction

Extracting large language models has unique challenges and approaches:

  • Output distillation: Query the target LLM with diverse prompts and fine-tune a smaller model on the outputs
  • Logit extraction: If the API returns token probabilities, these provide richer training signal than just text
  • Task-specific extraction: Focus on extracting the model's behavior for a specific task rather than full capabilities
  • Chain-of-thought extraction: Capture the reasoning patterns along with the final answers

Extraction fidelity: With as few as 10,000–100,000 well-chosen queries, an attacker can create a substitute model that agrees with the target on 80–95% of inputs. For simpler models (decision trees, SVMs), near-perfect extraction is possible with even fewer queries.
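For LLM output distillation, the collected prompt/response pairs are typically serialized into a fine-tuning dataset, e.g. one JSON record per line. A hedged sketch; the field names follow a common instruction-tuning convention and are not tied to any specific API or framework:

```python
import json

def build_distillation_dataset(pairs, path):
    """Write (prompt, target_response) pairs as a JSONL fine-tuning file.

    Each line becomes one training example; a smaller student model is
    then fine-tuned on these records with a standard causal-LM objective.
    """
    with open(path, "w", encoding="utf-8") as f:
        for prompt, response in pairs:
            record = {"instruction": prompt, "output": response}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```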

What Information Aids Extraction?

Probability Outputs

APIs that return class probabilities leak far more information than hard labels. Each probability vector essentially provides a "soft label" for training.
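As a rough intuition, the gap is easy to quantify: a hard label collapses the response to a single class index, while a probability vector additionally encodes the target's relative confidence across all classes. A toy comparison of the two training signals:

```python
import math

def entropy_bits(p):
    """Shannon entropy of a distribution, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

hard_label = [0.0, 1.0, 0.0]     # API returns only the argmax class
soft_label = [0.15, 0.70, 0.15]  # API returns full class probabilities

# The soft label exposes the target's uncertainty structure;
# the hard label reveals nothing beyond the winning class.
entropy_bits(hard_label)  # 0.0 bits
entropy_bits(soft_label)  # ≈ 1.18 bits
```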

Confidence Scores

Even a single confidence score reveals information about the decision boundary and helps the attacker focus queries on uncertain regions.

Embedding Access

APIs that expose embedding vectors allow direct comparison of the model's internal representations, making extraction far easier.
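With embedding access, the attacker can train the substitute's encoder to reproduce the target's representation space directly, rather than matching only final predictions. A minimal sketch assuming the API's embedding vectors have already been collected into a tensor; the function and argument names are hypothetical:

```python
import torch

def embedding_matching_loss(student_encoder, target_embeddings, inputs):
    """MSE between the substitute's embeddings and those exposed by the API.

    Matching internal representations is a much stronger constraint per
    query than matching output labels, which is why embedding access
    makes extraction far easier.
    """
    student_emb = student_encoder(inputs)
    return torch.nn.functional.mse_loss(student_emb, target_embeddings)
```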

Error Messages

Detailed error messages can reveal input constraints, feature expectations, and model architecture details useful for choosing a substitute architecture.

Defense preview: The API Protection lesson (Lesson 4) covers how to limit the information exposed through API responses to make extraction harder, including output perturbation, prediction truncation, and query budget enforcement.