Black-Box Attack Methods

Lesson 3 of 7 in the Adversarial Attacks & Defenses course.

Black-Box Adversarial Attacks

Black-box attacks operate without access to the model's internal parameters or gradients. The attacker can only query the model with inputs and observe the outputs (predictions, confidence scores, or labels). This is the most realistic threat model for attacking deployed ML systems accessible via APIs.

Despite the limited information, black-box attacks can be surprisingly effective through two main strategies: transfer-based attacks and query-based attacks.

Transfer-Based Attacks

Transfer attacks exploit a key property of adversarial examples: they often transfer across different models. An attacker can train a local substitute model and generate adversarial examples against it, and these examples often fool the target model as well.

  • Substitute model training: Train a model on a similar task using publicly available data or data collected by querying the target model
  • Generate adversarial examples: Use white-box attacks (FGSM, PGD) on the substitute model
  • Transfer: Apply the same adversarial examples to the target model
  • Success rate: Typically 30-70%, depending on how similar the substitute and target models are in architecture and training data
Python
import torch
import torch.nn.functional as F

class TransferAttack:
    """Transfer-based black-box attack using a substitute model."""

    def __init__(self, substitute_model, epsilon=0.03, num_steps=20, alpha=0.003):
        self.substitute = substitute_model
        self.epsilon = epsilon
        self.num_steps = num_steps
        self.alpha = alpha

    def generate(self, images, labels):
        """Generate adversarial examples on substitute, transfer to target."""
        # Random start inside the epsilon ball around the clean images
        adv_images = images.clone().detach()
        adv_images += torch.empty_like(adv_images).uniform_(
            -self.epsilon, self.epsilon
        )
        adv_images = torch.clamp(adv_images, 0.0, 1.0)

        for _ in range(self.num_steps):
            # White-box PGD step computed on the substitute model's gradient
            adv_images.requires_grad_(True)
            outputs = self.substitute(adv_images)
            loss = F.cross_entropy(outputs, labels)
            self.substitute.zero_grad()
            loss.backward()

            with torch.no_grad():
                adv_images = adv_images + self.alpha * adv_images.grad.sign()
                # Project back into the epsilon ball and the valid pixel range
                perturbation = torch.clamp(adv_images - images,
                                           -self.epsilon, self.epsilon)
                adv_images = torch.clamp(images + perturbation, 0.0, 1.0)

        return adv_images.detach()

    def evaluate_transfer(self, target_model, adv_images, labels):
        """Measure transfer success rate."""
        with torch.no_grad():
            predictions = target_model(adv_images).argmax(dim=1)
            success_rate = (predictions != labels).float().mean().item()
        return success_rate

Query-Based Attacks

Query-based attacks estimate gradients or optimize perturbations directly by querying the target model many times, trading query volume for attack effectiveness:

  • Score-based attacks: Use prediction confidence scores to estimate gradients via finite differences or random sampling
  • Decision-based attacks: Use only the top-1 predicted label; each query yields less information, so far more queries are needed
  • Gradient estimation: Methods like Natural Evolution Strategies (NES) or Simultaneous Perturbation Stochastic Approximation (SPSA) estimate gradients from query results (see the sketch after this list)
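
To make gradient estimation concrete, here is a minimal NES-style sketch that uses antithetic sampling against a score-based API. The function nes_estimate_gradient and the query_scores_fn wrapper are hypothetical names, not part of any library; the sketch assumes the API returns per-class probability scores for a batch containing a single image.
Python
import torch

def nes_estimate_gradient(query_scores_fn, image, label, sigma=0.001, num_samples=50):
    """Estimate the loss gradient w.r.t. the input using only model queries."""
    grad_estimate = torch.zeros_like(image)
    for _ in range(num_samples):
        noise = torch.randn_like(image)
        # Antithetic sampling: query symmetric points around the current image
        with torch.no_grad():
            p_plus = query_scores_fn(image + sigma * noise)
            p_minus = query_scores_fn(image - sigma * noise)
        # Loss is the negative log-probability of the true class
        loss_plus = -torch.log(p_plus[0, label] + 1e-12)
        loss_minus = -torch.log(p_minus[0, label] + 1e-12)
        grad_estimate += (loss_plus - loss_minus) * noise
    return grad_estimate / (2 * sigma * num_samples)

# The estimate plugs into the same PGD-style update used by white-box attacks,
#   adv = clamp(adv + alpha * nes_estimate_gradient(...).sign(), 0, 1),
# at a cost of 2 * num_samples queries per step.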
💡
Defense insight: Monitoring query patterns is one of the most effective defenses against black-box attacks. A sudden surge of similar queries from a single user, or queries that systematically explore the decision boundary, are strong indicators of an attack in progress.

Ensemble Transfer Attacks

Using an ensemble of substitute models significantly improves transfer rates (see the sketch after this list):

  1. Train multiple substitute models with different architectures (ResNet, VGG, DenseNet)
  2. Generate adversarial examples that fool all substitute models simultaneously
  3. The resulting examples are more likely to exploit universal vulnerabilities that transfer to the target
  4. Ensemble attacks can achieve 60-90% transfer rates compared to 30-50% for single-model transfer
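
A minimal sketch of step 2, assuming substitutes is a Python list of locally trained models; ensemble_pgd is an illustrative name, and the epsilon-ball PGD setup mirrors the TransferAttack class above.
Python
import torch
import torch.nn.functional as F

def ensemble_pgd(substitutes, images, labels, epsilon=0.03, alpha=0.003, num_steps=20):
    """PGD against the averaged loss of several substitute models."""
    adv = images.clone().detach()
    for _ in range(num_steps):
        adv.requires_grad_(True)
        # Average the cross-entropy over all substitutes so the perturbation
        # must fool every member of the ensemble at once
        loss = torch.stack(
            [F.cross_entropy(model(adv), labels) for model in substitutes]
        ).mean()
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()
            adv = images + torch.clamp(adv - images, -epsilon, epsilon)
            adv = torch.clamp(adv, 0.0, 1.0)
    return adv.detach()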

Practical Query Budget Considerations

In real-world attacks, every query costs time and money, and may be logged:

  • Typical API-based attacks require 1,000 to 100,000 queries per adversarial example
  • Rate limiting and query budgeting are effective defenses that increase attack cost
  • Hybrid approaches (transfer + limited queries) can reduce query requirements significantly
Python
# Query budget analysis for different black-box attacks
ATTACK_QUERY_BUDGETS = {
    "Transfer (no queries)": {"queries": 0, "success_rate": "30-50%"},
    "Ensemble Transfer": {"queries": 0, "success_rate": "60-80%"},
    "NES Gradient Estimation": {"queries": "~5000", "success_rate": "85-95%"},
    "SPSA": {"queries": "~2000", "success_rate": "80-90%"},
    "Boundary Attack": {"queries": "~25000", "success_rate": "90-99%"},
    "HopSkipJump": {"queries": "~5000", "success_rate": "90-95%"},
    "SimBA (Simple Black-box)": {"queries": "~3000", "success_rate": "85-95%"},
}

print(f"{'Attack Method':<30} {'Queries':>10} {'Success Rate':>15}")
print("-" * 60)
for method, info in ATTACK_QUERY_BUDGETS.items():
    print(f"{method:<30} {str(info['queries']):>10} {info['success_rate']:>15}")
Warning: Removing confidence scores from API responses (returning only class labels) does NOT prevent black-box attacks. Decision-based attacks like Boundary Attack and HopSkipJump work with only top-1 labels. Removing scores increases the query cost but does not eliminate the threat.
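
To illustrate why label-only access is enough, the following deliberately simplified sketch performs a random walk that keeps only candidates the target still misclassifies. query_label_fn is an assumed wrapper that returns just the top-1 label; the real Boundary Attack and HopSkipJump add step-size adaptation and geometric projections that this sketch omits.
Python
import torch

def label_only_walk(query_label_fn, x_orig, x_start, orig_label,
                    num_steps=1000, toward_step=0.01, noise_scale=0.1):
    """Shrink an adversarial perturbation using only top-1 label feedback.

    x_start must already be misclassified (e.g. an image of another class).
    """
    x_adv = x_start.clone()
    for _ in range(num_steps):
        direction = x_orig - x_adv
        # Random exploration scaled to the remaining distance, plus a small
        # pull toward the original image
        noise = noise_scale * direction.norm() * torch.randn_like(x_adv) / x_adv.numel() ** 0.5
        proposal = torch.clamp(x_adv + toward_step * direction + noise, 0.0, 1.0)
        if query_label_fn(proposal) != orig_label:
            x_adv = proposal  # still misclassified: keep the closer candidate
    return x_adv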

Defenses Against Black-Box Attacks

Effective defenses against black-box attacks include the following (a sketch of two of them follows the list):

  • Rate limiting to restrict query volume per user or IP
  • Query pattern detection to identify systematic exploration
  • Adding controlled noise to model outputs to frustrate gradient estimation
  • Using ensemble models on the defense side to reduce transfer vulnerability
  • Monitoring for anomalous query distributions that differ from normal usage
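
As a rough illustration, the sketch below wraps an arbitrary model with two of these defenses: a per-user sliding-window rate limit and small noise added to the returned scores. DefendedEndpoint and its parameters are illustrative names, not a real serving API.
Python
import time
from collections import deque

import torch

class DefendedEndpoint:
    """Illustrative serving wrapper: rate limiting plus noisy output scores."""

    def __init__(self, model, max_queries_per_minute=60, score_noise_std=0.01):
        self.model = model
        self.max_queries = max_queries_per_minute
        self.noise_std = score_noise_std
        self.history = {}  # user_id -> deque of recent query timestamps

    def predict(self, user_id, image):
        now = time.time()
        recent = self.history.setdefault(user_id, deque())
        # Enforce a sliding one-minute query budget per user
        while recent and now - recent[0] > 60.0:
            recent.popleft()
        if len(recent) >= self.max_queries:
            raise RuntimeError("Query budget exceeded; request throttled")
        recent.append(now)

        with torch.no_grad():
            scores = torch.softmax(self.model(image), dim=1)
        # Small random noise degrades finite-difference gradient estimates
        # while rarely changing the top-1 label seen by legitimate users
        noisy = scores + self.noise_std * torch.randn_like(scores)
        return noisy.clamp(0.0, 1.0)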

Summary

Black-box attacks are the most realistic threat model for production ML systems. Transfer attacks exploit cross-model vulnerabilities at zero query cost, while query-based attacks achieve higher success rates at the cost of more API calls. Understanding both strategies is essential for building effective defenses, particularly query monitoring and rate limiting. The next lesson examines why adversarial examples transfer between models.