Intermediate

Query-Based Model Extraction

Query-based extraction is the most practical model extraction method: systematically query a target API, collect input–output pairs, and train a substitute model that replicates the target's behavior.

The Basic Attack Flow

Query-Based Extraction Pipeline
Step 1: Select Query Strategy
  Choose inputs that maximize information gain per query

Step 2: Query Target API
  Send crafted inputs, collect predictions (labels, probabilities, logits)

Step 3: Build Training Dataset
  Pair queries with target model responses as labeled data

Step 4: Train Substitute Model
  Use knowledge distillation to train a copy

Step 5: Evaluate Fidelity
  Measure agreement between substitute and target on held-out data

Step 6: Iterate
  Use active learning to query informative regions, improve fidelity

Query Strategies

| Strategy | Description | Query Efficiency |
|---|---|---|
| Random Sampling | Send random inputs from the input domain | Low — many queries needed |
| Active Learning | Query inputs where the substitute model is most uncertain | High — often 10× fewer queries |
| Jacobian-Based | Use gradients of the substitute to find informative queries | High — targets decision boundaries |
| Knockoff Nets | Use a natural data distribution plus task-relevant augmentation | Medium — good generalization |
| Data-Free Distillation | Generate synthetic queries with a generator network | High — no real data needed |
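The active-learning strategy above can be sketched as entropy-based query selection: score a pool of candidate inputs by the substitute's predictive uncertainty and spend the query budget on the most uncertain ones. A minimal sketch; `substitute`, `candidate_pool`, and `budget` are placeholder names, not from any particular toolkit:

```python
import torch
import torch.nn.functional as F

def select_uncertain_queries(substitute, candidate_pool, budget):
    """Pick the `budget` candidates where the substitute is least certain.

    Uncertainty is measured as the entropy of the substitute's predicted
    class distribution; high entropy suggests the input lies near a
    decision boundary, where a target-API label is most informative.
    """
    with torch.no_grad():
        probs = F.softmax(substitute(candidate_pool), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    top = entropy.topk(budget).indices
    return candidate_pool[top]
```

In each extraction round, the attacker selects one such batch, queries the target API only for those inputs, and adds the resulting pairs to the substitute's training set.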

Knowledge Distillation for Extraction

Python - Model Extraction via Distillation
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, temperature=3.0):
    """Soft-label distillation from teacher (target API) outputs.

    The API returns probabilities at temperature 1, so only the student
    side is softened here; the T**2 factor keeps gradient magnitudes
    comparable across temperature settings.
    """
    student_soft = F.log_softmax(student_logits / temperature, dim=-1)
    loss = F.kl_div(student_soft, teacher_probs, reduction="batchmean")
    return loss * (temperature ** 2)

def extract_model(target_api, student_model, query_set, epochs=50):
    """Extract target model via API queries."""
    optimizer = torch.optim.Adam(student_model.parameters())

    for epoch in range(epochs):
        for queries in query_set:
            # Query target API (the expensive part); responses may arrive
            # as arrays, so convert to tensors before training
            teacher_probs = torch.as_tensor(target_api.predict_proba(queries))

            # Train student to match teacher
            student_logits = student_model(queries)
            loss = distillation_loss(student_logits, teacher_probs)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return student_model
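Step 5's fidelity check reduces to an agreement rate: run the same held-out inputs through both models and count matching predicted labels. A minimal sketch (the names are illustrative):

```python
import torch

def label_agreement(target_logits, substitute_logits):
    """Fraction of held-out inputs where both models predict the same class."""
    target_labels = target_logits.argmax(dim=-1)
    substitute_labels = substitute_logits.argmax(dim=-1)
    return (target_labels == substitute_labels).float().mean().item()

t = torch.tensor([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
s = torch.tensor([[0.7, 0.3], [0.1, 0.9], [0.4, 0.6]])
label_agreement(t, s)  # 2 of 3 labels match, ≈ 0.667
```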

LLM-Specific Extraction

Extracting large language models has unique challenges and approaches:

  • Output distillation: Query the target LLM with diverse prompts and fine-tune a smaller model on the outputs
  • Logit extraction: If the API returns token probabilities, these provide richer training signal than just text
  • Task-specific extraction: Focus on extracting the model's behavior for a specific task rather than full capabilities
  • Chain-of-thought extraction: Capture the reasoning patterns along with the final answers

Extraction fidelity: With as few as 10,000–100,000 well-chosen queries, an attacker can create a substitute model that agrees with the target on 80–95% of inputs. For simpler models (decision trees, SVMs), near-perfect extraction is possible with even fewer queries.
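For LLM output distillation, the collected prompt/response pairs are typically serialized into a fine-tuning dataset, e.g. one JSON record per line. A hedged sketch; the field names follow a common instruction-tuning convention and are not tied to any specific API or framework:

```python
import json

def build_distillation_dataset(pairs, path):
    """Write (prompt, target_response) pairs as a JSONL fine-tuning file.

    Each line becomes one training example; a smaller student model is
    then fine-tuned on these records with a standard causal-LM objective.
    """
    with open(path, "w", encoding="utf-8") as f:
        for prompt, response in pairs:
            record = {"instruction": prompt, "output": response}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```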

What Information Aids Extraction?

Probability Outputs

APIs that return class probabilities leak far more information than hard labels. Each probability vector essentially provides a "soft label" for training.
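As a rough intuition, the gap is easy to quantify: a hard label collapses the response to a single class index, while a probability vector additionally encodes the target's relative confidence across all classes. A toy comparison of the two training signals:

```python
import math

def entropy_bits(p):
    """Shannon entropy of a distribution, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

hard_label = [0.0, 1.0, 0.0]     # API returns only the argmax class
soft_label = [0.15, 0.70, 0.15]  # API returns full class probabilities

# The soft label exposes the target's uncertainty structure;
# the hard label reveals nothing beyond the winning class.
entropy_bits(hard_label)  # 0.0 bits
entropy_bits(soft_label)  # ≈ 1.18 bits
```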

Confidence Scores

Even a single confidence score reveals information about the decision boundary and helps the attacker focus queries on uncertain regions.

Embedding Access

APIs that expose embedding vectors allow direct comparison of the model's internal representations, making extraction far easier.
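With embedding access, the attacker can train the substitute's encoder to reproduce the target's representation space directly, rather than matching only final predictions. A minimal sketch assuming the API's embedding vectors have already been collected into a tensor; the function and argument names are hypothetical:

```python
import torch

def embedding_matching_loss(student_encoder, target_embeddings, inputs):
    """MSE between the substitute's embeddings and those exposed by the API.

    Matching internal representations is a much stronger constraint per
    query than matching output labels, which is why embedding access
    makes extraction far easier.
    """
    student_emb = student_encoder(inputs)
    return torch.nn.functional.mse_loss(student_emb, target_embeddings)
```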

Error Messages

Detailed error messages can reveal input constraints, feature expectations, and model architecture details useful for choosing a substitute architecture.

Defense preview: The API Protection lesson (Lesson 4) covers how to limit the information exposed through API responses to make extraction harder, including output perturbation, prediction truncation, and query budget enforcement.