Score-Based Black-Box Linf Attack

Designs a query-efficient black-box Linf evasion attack to improve attack success rate under a fixed per-sample query budget.

Trustworthy Learning · torchattacks

security-adversarial-attack-black-box-score

Description

Score-Based Query Black-Box Attack under Linf Constraint

Research Question

Can you design a stronger score-based query black-box attack that improves attack success rate (ASR) under a fixed query budget and L_inf perturbation constraint?

Background

Score-based query black-box attacks assume the attacker can query the victim model and observe its logits (or softmax scores) but cannot access gradients or weights. The attacker must search the input space using only forward queries while staying inside an L_inf ball around the clean image. This regime models realistic threat scenarios such as machine-learning-as-a-service (MLaaS) APIs that expose only prediction confidences.

Representative algorithms include the random-search-based Square Attack (Andriushchenko et al., 2020, arXiv:1912.00049), gradient-free SPSA-based attacks (Uesato et al., 2018, arXiv:1802.05666), and pixel-coordinate random search baselines. Across these methods the central tradeoff is between per-step exploration (which helps escape local minima) and per-step exploitation (which keeps the query budget low).

Objective

Implement a better black-box attack in bench/custom_attack.py:

Threat model: query black-box (no gradient access, only model logits).
Constraint: ||x_adv - x||_inf <= eps.
Budget: n_queries is a per-sample query budget.
Primary metric: maximize ASR under the fixed budget.
Tie-break: for similar ASR, lower avg_queries is better.
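The constraint above can be enforced with a small projection step. The following is a minimal sketch, assuming PyTorch tensors in [0, 1]; the helper name project_linf is hypothetical and not part of the benchmark code.

```python
import torch


def project_linf(x_adv: torch.Tensor, x: torch.Tensor, eps: float) -> torch.Tensor:
    """Project x_adv onto the L_inf ball of radius eps around x, then onto [0, 1].

    Hypothetical helper for illustration; not part of bench/custom_attack.py.
    """
    # Clip the perturbation so every coordinate stays within [-eps, eps].
    delta = torch.clamp(x_adv - x, min=-eps, max=eps)
    # Clip the result to the valid image range.
    return torch.clamp(x + delta, 0.0, 1.0)
```

Because x itself lies in [0, 1], the final clamp can only move a pixel closer to x, so the L_inf bound still holds after it (up to floating-point rounding).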

Editable Interface

You must implement:

run_attack(model, images, labels, eps, n_queries, device, n_classes) -> adv_images

Inputs:

model: black-box wrapper that returns logits only.
images: tensor of shape (N, C, H, W), in [0, 1].
labels: tensor of shape (N,).
n_classes: 10 for CIFAR-10, 100 for CIFAR-100.

Output:

adv_images: tensor with same shape as images, values in [0, 1].
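To make the interface concrete, here is a minimal sketch of a run_attack that satisfies the signature above. It is a plain random-search baseline, not the method evaluated on this page: it starts from a random ±eps perturbation and resamples a random 10% of pixels per step, keeping a candidate only if the margin (true logit minus best other logit) decreases. The 10% resampling rate and the helper name margin are assumptions made for illustration, and the sketch assumes n_queries >= 1.

```python
import torch


@torch.no_grad()
def margin(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """True-class logit minus the largest other logit (negative = misclassified)."""
    true = logits.gather(1, labels[:, None]).squeeze(1)
    other = logits.clone()
    other.scatter_(1, labels[:, None], float("-inf"))  # mask out the true class
    return true - other.max(dim=1).values


@torch.no_grad()
def run_attack(model, images, labels, eps, n_queries, device, n_classes):
    x = images.to(device)
    y = labels.to(device)
    # Random +-eps start; costs one query per sample.
    adv = torch.clamp(x + eps * torch.sign(torch.randn_like(x)), 0, 1)
    best = margin(model(adv), y)
    for _ in range(n_queries - 1):
        # Only spend queries on samples that are not yet misclassified.
        active = (best >= 0).nonzero(as_tuple=True)[0]
        if active.numel() == 0:
            break
        # Assumed proposal: resample the sign perturbation on ~10% of pixels.
        cand = adv[active].clone()
        mask = torch.rand_like(cand) < 0.1
        fresh = torch.clamp(x[active] + eps * torch.sign(torch.randn_like(cand)), 0, 1)
        cand = torch.where(mask, fresh, cand)
        m = margin(model(cand), y[active])
        better = m < best[active]
        idx = active[better]
        adv[idx] = cand[better]
        best[idx] = m[better]
    return adv
```

Each sample is queried at most n_queries times, so the batch total stays within batch_size * n_queries. n_classes is unused in this sketch but kept to match the required signature.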

Trusted Evaluation Logic

The evaluation logic in bench/run_eval.py is fixed and not editable.

It tracks all model queries through a wrapper.
If a batch exceeds its query budget (batch_size * n_queries), every sample in the batch is marked as an attack failure.
L_inf and [0, 1] validity are checked per sample; only the invalid samples are marked as attack failures.

Wrapper behavior and evaluation logic are fixed. Improvements should be confined to the attack algorithm in custom_attack.py.
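For local testing, the rules above can be approximated with a small stand-in. This is a sketch under stated assumptions, not the code in bench/run_eval.py: the class name CountingBlackBox, the function valid_mask, and the numerical tolerance tol are all hypothetical, and the real harness may use different names and a different tolerance.

```python
import torch


class CountingBlackBox(torch.nn.Module):
    """Hypothetical stand-in for the harness wrapper: logits only, queries counted."""

    def __init__(self, model: torch.nn.Module, budget: int):
        super().__init__()
        self.model = model
        self.budget = budget  # intended to be batch_size * n_queries
        self.used = 0

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One call costs x.shape[0] queries, regardless of repeats.
        self.used += x.shape[0]
        return self.model(x).detach()

    @property
    def over_budget(self) -> bool:
        return self.used > self.budget


def valid_mask(adv: torch.Tensor, x: torch.Tensor, eps: float, tol: float = 1e-6) -> torch.Tensor:
    """Per-sample check of the L_inf and [0, 1] constraints (tol is an assumption)."""
    linf = (adv - x).flatten(1).abs().amax(dim=1)
    flat = adv.flatten(1)
    in_range = (flat.amin(dim=1) >= 0) & (flat.amax(dim=1) <= 1)
    return (linf <= eps + tol) & in_range
```

Wrapping a local model this way before calling run_attack makes it easy to confirm that an attack stays within budget and returns valid samples before submitting it to the fixed harness.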

Query Semantics

One call to model(x) consumes x.shape[0] queries.
Repeated calls on the same sample still consume additional queries.
How the queries are partitioned into batches does not matter: only the total number of samples passed to the model counts against the budget.
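The accounting above reduces to summing the batch sizes of all model(x) calls. The short sketch below illustrates this with made-up numbers; the function name total_queries and the values of batch_size and n_queries are illustrative only.

```python
def total_queries(call_batch_sizes):
    """Total queries charged for a sequence of model(x) calls,
    where each call costs x.shape[0] queries."""
    return sum(call_batch_sizes)


batch_size, n_queries = 4, 3      # illustrative values only
budget = batch_size * n_queries   # 12 queries for the whole batch

full_batches = [4, 4, 4]          # three calls on the full batch
single_calls = [1] * 12           # twelve calls on one sample each
pruned = [4, 3, 2, 1, 1, 1]       # shrinking active set, same total

for calls in (full_batches, single_calls, pruned):
    print(total_queries(calls), total_queries(calls) <= budget)
```

All three call patterns consume exactly 12 queries, so they are charged identically.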

Evaluation Scenarios

Each scenario is a (model, dataset) pair drawn from {ResNet20, VGG11-BN, MobileNetV2} x {CIFAR-10, CIFAR-100}, using publicly available pretrained checkpoints.

Metrics

Reported metrics line format:

ATTACK_METRICS asr=... clean_acc=... robust_acc=... avg_queries=...

asr: attack success rate (higher is better) — primary metric.
clean_acc: accuracy of the model on the unperturbed batch (sanity check).
robust_acc: 1 - asr.
avg_queries: average number of model queries consumed per sample (lower is better, used as tie-break).
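A log line in the format above can be parsed with a short regular expression. This is a minimal sketch; the function name parse_attack_metrics is hypothetical, and the numbers in the example line are made up for illustration and are not results.

```python
import re


def parse_attack_metrics(line: str) -> dict:
    """Parse an 'ATTACK_METRICS key=value ...' line into a dict of floats."""
    if not line.startswith("ATTACK_METRICS"):
        raise ValueError("not an ATTACK_METRICS line")
    pairs = re.findall(r"(\w+)=([-+0-9.eE]+)", line)
    return {key: float(value) for key, value in pairs}


# Illustrative line with made-up values.
example = "ATTACK_METRICS asr=0.5 clean_acc=0.9 robust_acc=0.5 avg_queries=123.4"
metrics = parse_attack_metrics(example)
print(metrics["asr"], metrics["avg_queries"])
```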

Code

custom_attack.py


import torch
import torch.nn as nn

# =====================================================================
# EDITABLE: implement run_attack below
# =====================================================================
def run_attack(
    model: nn.Module,
    images: torch.Tensor,
    labels: torch.Tensor,
    eps: float,
    n_queries: int,
    device: torch.device,
    n_classes: int,
) -> torch.Tensor:

run_eval.py


"""Trusted evaluation harness for score-based query black-box attack task."""

import argparse
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchvision import datasets, transforms

from custom_attack import run_attack


class QueryLimitedBlackBox(torch.nn.Module):
    """Query-limited wrapper with no gradient path and budget tracking."""

Method Summary

Auto-summarized from each method's code by an LLM reviewer; this is not the model's original output.

Agents

Claude Opus 4.6 · Pseudocode · high

FGSM-init Square with restarts

Random ±eps init, a few SPSA-style sign-grad refinements, then a Square-Attack patch loop with adaptive patch fraction and per-sample restart on stagnation.

1. init $adv = \mathrm{clamp}(x + \epsilon\,\mathrm{sign}(z),\, 0, 1)$ with $z \sim \mathcal N(0, I)$; query margin $m_{\text{best}}$
2. for up to 4 cheap iters with $u_t = \mathrm{sign}(z_t)$, $z_t \sim \mathcal N(0, I)$: estimate $\hat g = \sum_t (\ell(x + 0.1\epsilon u_t) - \ell(x))\,u_t$
3. try candidate $cd = \mathrm{clamp}(x - \epsilon\,\mathrm{sign}(\hat g), 0, 1)$, accept if its margin is lower
4. while budget remains and unsolved samples exist (active set):
  a. choose patch fraction $p \in \{0.8, 0.5, 0.2, 0.1, 0.05\}$ by progress; patch side $s = \sqrt{p\,HW}$
  b. for each active sample, sample a random $s \times s$ patch and replace it with $\mathrm{clamp}(x + \epsilon\,\mathrm{sign}(z), 0, 1)$ for a fresh $z \sim \mathcal N(0, I)$
  c. accept the patch if the margin (true logit minus max other logit) drops; mark success if the argmax flips
  d. for samples with no improvement in $> \max(100, n_q/10)$ steps: full random ±eps re-init
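Steps 4a and 4b above can be sketched as a single proposal function. This is an illustrative reading of the pseudocode, not the agent's actual code: the function name propose_patch is hypothetical, the patch side is rounded and clipped to the image size as an assumption, and the sign noise is drawn independently per pixel as the pseudocode suggests.

```python
import math

import torch


def propose_patch(x: torch.Tensor, adv: torch.Tensor, eps: float, p: float) -> torch.Tensor:
    """Return one square-patch candidate per sample (hypothetical helper).

    Inside a random s x s window, the current perturbation is replaced by
    clamp(x + eps * sign(z), 0, 1); everything outside the window is kept.
    """
    n, c, h, w = x.shape
    # Patch side s = sqrt(p * H * W), rounded and clipped (assumption).
    s = max(1, min(h, w, int(round(math.sqrt(p * h * w)))))
    cand = adv.clone()
    for i in range(n):
        top = torch.randint(0, h - s + 1, (1,)).item()
        left = torch.randint(0, w - s + 1, (1,)).item()
        noise = eps * torch.sign(torch.randn(c, s, s, device=x.device))
        patch = x[i, :, top:top + s, left:left + s] + noise
        cand[i, :, top:top + s, left:left + s] = patch.clamp(0, 1)
    return cand
```

The caller would then query the model on the candidates and keep, per sample, only those whose margin decreased (step 4c).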

Δ vs. baseline: Compared with the Square baseline, this method writes the attack loop inline rather than calling torchattacks.Square. It adds an upfront SPSA-style random-direction warm-up, an active-set budget that prunes solved samples, and a per-sample stagnation restart that re-initializes the perturbation when no improvement has been seen for many queries.
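The SPSA-style warm-up mentioned here can be sketched as follows. This is a hedged illustration of the summarized idea rather than the agent's code: loss_fn is a hypothetical callable that queries the model and returns a per-sample loss where lower is better (for example the margin), and the function name spsa_warmup_candidate is also an assumption. Probes are evaluated at x + 0.1*eps*u without clamping, matching the summary.

```python
import torch


@torch.no_grad()
def spsa_warmup_candidate(loss_fn, x: torch.Tensor, eps: float,
                          n_dirs: int = 4, step: float = 0.1) -> torch.Tensor:
    """Build one candidate from a finite-difference gradient estimate.

    loss_fn(z) is assumed to return a tensor of shape (N,); lower is better.
    Costs 1 + n_dirs queries per sample.
    """
    base = loss_fn(x)
    g = torch.zeros_like(x)
    for _ in range(n_dirs):
        u = torch.sign(torch.randn_like(x))       # random +-1 direction
        diff = loss_fn(x + step * eps * u) - base  # one-sided difference
        g += diff.view(-1, 1, 1, 1) * u
    # Step against the estimated gradient to reduce the loss.
    return torch.clamp(x - eps * torch.sign(g), 0, 1)
```

The returned candidate would only replace the current adversarial image if its queried margin is lower, which costs one further query per sample.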

patch fractions = [0.8, 0.5, 0.2, 0.1, 0.05]
warm-up SPSA iters = min(4, n_q/25)
warm-up step size = 0.1*eps
stagnation threshold = max(100, n_q/10)
init = ±eps sign(N(0,I))
↻ Recovers Square Attack with margin loss and random init
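The hyperparameters above can be written as small helper functions. The summary only says the patch fraction is chosen "by progress", so the mapping below (evenly spaced thresholds on the fraction of budget used) is an assumption, as are the function names and the use of integer division for n_q/25 and n_q/10.

```python
PATCH_FRACTIONS = [0.8, 0.5, 0.2, 0.1, 0.05]


def patch_fraction(queries_used: int, n_queries: int) -> float:
    """Assumed schedule: step through PATCH_FRACTIONS as the budget is spent."""
    frac = min(max(queries_used / max(n_queries, 1), 0.0), 1.0)
    idx = min(int(frac * len(PATCH_FRACTIONS)), len(PATCH_FRACTIONS) - 1)
    return PATCH_FRACTIONS[idx]


def warmup_iters(n_queries: int) -> int:
    """min(4, n_q / 25), using integer division (assumption)."""
    return min(4, n_queries // 25)


def stagnation_threshold(n_queries: int) -> int:
    """max(100, n_q / 10), using integer division (assumption)."""
    return max(100, n_queries // 10)
```

With this schedule, early iterations use large patches for coarse exploration and later iterations use small patches for local refinement.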