Agent Conversation: anthropic/claude-opus-4.6

Model: anthropic/claude-opus-4.6
Total steps: 5
Messages: 11
User Step 0

Task: security-adversarial-attack-black-box-score

Score-Based Query Black-Box Attack under L_inf Constraint

Research Question

Can you design a stronger score-based query black-box attack that improves attack success rate (ASR) under a fixed query budget and L_inf perturbation constraint?

Objective

Implement a better black-box attack in bench/custom_attack.py:

  • Threat model: query black-box (no gradient access).
  • Constraint: ||x_adv - x||_inf <= eps.
  • Budget: n_queries is a per-sample query budget.
  • Primary metric: maximize ASR under fixed budget.
  • Tie-break: for similar ASR, lower avg_queries is better.
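The L_inf and [0, 1] constraints listed above can be enforced with a coordinate-wise clip. The following is a minimal sketch on plain Python lists, assuming flattened pixel values. `project_linf` and `linf_distance` are hypothetical helper names, not part of bench/custom_attack.py, where the same idea would be applied to tensors.

```python
def project_linf(x, x_adv, eps):
    """Clip x_adv so that |x_adv - x| <= eps per coordinate and values stay in [0, 1].

    x, x_adv: lists of floats (flattened pixels); hypothetical helper for illustration.
    """
    projected = []
    for xi, ai in zip(x, x_adv):
        lo = max(0.0, xi - eps)  # lower bound from both the eps-ball and the pixel range
        hi = min(1.0, xi + eps)  # upper bound from both the eps-ball and the pixel range
        projected.append(min(max(ai, lo), hi))
    return projected


def linf_distance(x, x_adv):
    """Largest absolute per-coordinate difference between two equally long lists."""
    return max(abs(a - b) for a, b in zip(x, x_adv))


clean = [0.0, 0.5, 1.0]
candidate = [-0.2, 0.9, 0.97]
adv = project_linf(clean, candidate, eps=0.1)
```

Projecting once before returning, as in this sketch, keeps every sample inside the feasible set even if an intermediate update overshoots.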

Editable Interface

You must implement:

run_attack(model, images, labels, eps, n_queries, device, n_classes) -> adv_images

Inputs:

  • model: black-box wrapper that returns logits only.
  • images: tensor of shape (N, C, H, W), in [0, 1].
  • labels: tensor of shape (N,).
  • n_classes: 10 for CIFAR-10, 100 for CIFAR-100.

Output:

  • adv_images: tensor with same shape as images, values in [0, 1].

Trusted Evaluation Logic

The evaluation logic in bench/run_eval.py is trusted and not editable.

  • It tracks all model queries through a wrapper.
  • If a batch exceeds its query budget (batch_size * n_queries), every sample in that batch is marked as an attack failure.
  • L_inf and [0, 1] validity are checked per sample; only the invalid samples are marked as attack failures.
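The two failure rules above can be sketched as a small function. This is only an illustration of the rules as stated, not the code in bench/run_eval.py, which is not shown in this log. The name `mark_successes` and the assumption that a valid sample succeeds exactly when it is misclassified are hypothetical.

```python
def mark_successes(misclassified, valid, queries_used, n_queries):
    """Return one success flag per sample under the rules described above.

    misclassified: list of bools, True if the model's prediction differs from the label
    valid:         list of bools, True if the sample satisfies the L_inf and [0, 1] checks
    queries_used:  total queries the batch consumed
    n_queries:     per-sample query budget
    """
    batch_size = len(misclassified)
    # Batch-level rule: exceeding batch_size * n_queries fails every sample.
    if queries_used > batch_size * n_queries:
        return [False] * batch_size
    # Sample-level rule: an invalid sample fails on its own, others are unaffected.
    return [m and v for m, v in zip(misclassified, valid)]


within_budget = mark_successes([True, True, False], [True, False, True], 300, 100)
over_budget = mark_successes([True, True, False], [True, True, True], 301, 100)
```

The asymmetry matters for attack design: a single extra query costs the whole batch, while a constraint violation costs only the offending sample.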

Do not attempt to bypass wrapper behavior or evaluation logic. Only improve the attack algorithm in custom_attack.py.

Query Semantics

  • One call to model(x) consumes x.shape[0] queries.
  • Repeated calls on the same sample still consume additional queries.
  • Batch partitioning does not change the cost: splitting the same samples across more or fewer calls consumes the same total budget.
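The query semantics above can be mirrored by a counting wrapper. The sketch below is an assumption about how such accounting might look. `QueryCounter` and `dummy_model` are hypothetical names, and a batch is represented as a plain list of samples rather than a tensor.

```python
class QueryCounter:
    """Wrap a callable model and count one query per sample in every call."""

    def __init__(self, model):
        self.model = model
        self.used = 0

    def __call__(self, batch):
        # One call on a batch of k samples consumes k queries,
        # even if some of those samples were queried before.
        self.used += len(batch)
        return self.model(batch)


def dummy_model(batch):
    """Stand-in for the black-box model: returns fixed two-class logits per sample."""
    return [[0.0, 1.0] for _ in batch]


samples = ["a", "b", "c", "d"]

one_call = QueryCounter(dummy_model)
one_call(samples)

two_calls = QueryCounter(dummy_model)
two_calls(samples[:2])
two_calls(samples[2:])

repeated = QueryCounter(dummy_model)
repeated(samples[:1])
repeated(samples[:1])
```

Both partitionings end with the same count, and querying the same sample twice is charged twice, which is why the attack in this log tracks a running `used` total before each model call.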

Performance Hint

The benchmark is query-heavy. Prefer batch-parallel score updates instead of pure per-sample loops when possible.

Evaluation Scenarios (6)

  • ResNet20 on CIFAR-10
  • VGG11-BN on CIFAR-10
  • MobileNetV2 on CIFAR-10
  • ResNet20 on CIFAR-100
  • VGG11-BN on CIFAR-100
  • MobileNetV2 on CIFAR-100

Reported metrics line format:

ATTACK_METRICS asr=... clean_acc=... robust_acc=... avg_queries=...
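A line in the format above can be turned into a dictionary with a short parser. This is a sketch under the assumption that every field is a `key=value` pair with a numeric value. `parse_attack_metrics` is a hypothetical helper and the sample line is made up for illustration.

```python
import re


def parse_attack_metrics(line):
    """Parse 'ATTACK_METRICS key=value ...' into a dict of floats."""
    prefix = "ATTACK_METRICS"
    if not line.startswith(prefix):
        raise ValueError("not an ATTACK_METRICS line")
    pairs = re.findall(r"(\w+)=(\S+)", line[len(prefix):])
    return {key: float(value) for key, value in pairs}


# Illustrative sample line, not copied from the benchmark output.
sample = "ATTACK_METRICS asr=0.9750 clean_acc=1.0000 robust_acc=0.0250 avg_queries=811.72"
metrics = parse_attack_metrics(sample)
```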

torchattacks/bench/custom_attack.py [EDITABLE — lines 7–56 only]

     1: import torch
     2: import torch.nn as nn
     3: 
     4: # =====================================================================
     5: # EDITABLE: implement run_attack below
     6: # =====================================================================
     7: def run_attack(
     8:     model: nn.Module,
     9:     images: torch.Tensor,
    10:     labels: torch.Tensor,
    11:     eps: float,
    12:     n_queries: int,
    13:     device: torch.device,
    14:     n_classes: int,
    15: ) -> torch.Tensor:
    16:     """
    17:     Score-based query black-box attack under Linf constraint.
    18: 
    19:     Args:
    20:         model: black-box wrapper that only exposes forward logits.
    21:     
Assistant Step 1: edit
Tool Input
{
  "op": "replace",
  "filename": "torchattacks/bench/custom_attack.py",
  "start_line": 7,
  "end_line": 56,
  "content": "def run_attack(\n    model: nn.Module,\n    images: torch.Tensor,\n    labels: torch.Tensor,\n    eps: float,\n    n_queries: int,\n    device: torch.device,\n    n_classes: int,\n) -> torch.Tensor:\n    model.eval()\n    N, C, H, W = images.shape\n    nq = max(1, int(n_queries)); budget = N * nq\n    with torch.no_grad():\n        def mloss(lg, y):\n            c = lg.gather(1, y.view(-1, 1)).squeeze(1)\n            t = lg.clone(); t.scatter_(1, y.view(-1, 1), -1e9)\n            return c - t.max(1)[0]\n        adv = torch.clamp(images + torch.sign(torch.randn_like(images)) * eps, 0., 1.)\n        lg0 = model(adv); used = N\n        best = mloss(lg0, labels); succ = lg0.argmax(1) != labels\n        ge = torch.zeros_like(images)\n        bl0 = mloss(model(images), labels); used += N\n        for _ in range(min(6, max(1, nq // 25))):\n            if used + N > budget: break\n            u = torch.sign(torch.randn_like(images))\n            ge += (mloss(model(torch.clamp(images+u*eps*0.1,0.,1.)),labels)-bl0).view(-1,1,1,1)*u; used += N\n        if used + N <= budget:\n            cd = torch.clamp(images - torch.sign(ge) * eps, 0., 1.)\n            lgc = model(cd); used += N; cl = mloss(lgc, labels)\n            imp = cl < best; adv = torch.where(imp.view(-1,1,1,1), cd, adv)\n            best = torch.where(imp, cl, best); succ = succ | (lgc.argmax(1) != labels)\n        p1q = used // N; sq_i = 0\n        hc = torch.arange(H, device=device).view(1, 1, H, 1)\n        wc = torch.arange(W, device=device).view(1, 1, 1, W)\n        while True:\n            idx = (~succ).nonzero(as_tuple=True)[0]; na = idx.numel()\n            if na == 0 or used + na > budget: break\n            sq_i += 1; pr = min((p1q + sq_i) / nq, 0.99)\n            p = 0.8 if pr<0.1 else 0.5 if pr<0.5 else 0.2 if pr<0.8 else 0.1 if pr<0.95 else 0.05\n            sz = max(1, int(round(p * min(H, W))))\n            hs = torch.randint(0, max(1,H-sz+1), (na,1,1,1), device=device)\n            ws = torch.randint(0, max(1,W-sz+1), (na,1,1,1), device=device)\n            mk = ((hc >= hs) & (hc < hs+sz) & (wc >= ws) & (wc < ws+sz)).float()\n            nv = torch.sign(torch.randn(na, C, H, W, device=device)) * eps\n            ca = adv[idx]*(1-mk) + torch.clamp(images[idx]+nv, 0., 1.)*mk\n            lga = model(ca); used += na\n            cla = mloss(lga, labels[idx]); imp = cla < best[idx]; ui = idx[imp]\n            if ui.numel() > 0: adv[ui] = ca[imp]; best[ui] = cla[imp]\n            succ[idx] = succ[idx] | (lga.argmax(1) != labels[idx])\n        return torch.clamp(images + torch.clamp(adv - images, -eps, eps), 0., 1.).detach()\n"
}
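Two building blocks of the edit above are easier to read in isolation: the margin loss computed by `mloss` and the schedule that shrinks the square side as the budget is spent. The sketch below restates both on plain Python values for a single sample. The function names are hypothetical, and torch is deliberately left out to keep the example dependency-free.

```python
def margin_loss(logits, label):
    """True-class logit minus the largest other logit.

    Positive means the sample is still classified correctly;
    negative means some other class wins, i.e. the attack succeeded.
    """
    best_other = max(v for i, v in enumerate(logits) if i != label)
    return logits[label] - best_other


def square_fraction(progress):
    """Piecewise-constant fraction of the image side, mirroring the thresholds in the edit."""
    if progress < 0.1:
        return 0.8
    if progress < 0.5:
        return 0.5
    if progress < 0.8:
        return 0.2
    if progress < 0.95:
        return 0.1
    return 0.05


def square_side(progress, height, width):
    """Side length in pixels of the square patch to resample, never below 1."""
    return max(1, int(round(square_fraction(progress) * min(height, width))))


# Side lengths on a 32x32 image at a few points of the budget: 26, 16, 6, 3, 2.
sides = [square_side(p, 32, 32) for p in (0.0, 0.3, 0.6, 0.9, 0.99)]
```

In the edit, `progress` corresponds to `pr`, the fraction of the per-sample budget already used (capped at 0.99), so early proposals repaint most of the image and late proposals touch only a few pixels.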
Tool Result Step 1
OK: Replaced lines 7..56 with 50 line(s) in torchattacks/bench/custom_attack.py. Editable range: 7–56.

[Current file: torchattacks/bench/custom_attack.py | editable: 7–56 | total: 60 lines]
     4| # =====================================================================
     5| # EDITABLE: implement run_attack below
     6| # =====================================================================
     7  def run_attack(
     8      model: nn.Module,
     9      images: torch.Tensor,
    10      labels: torch.Tensor,
    11      eps: float,
    12      n_queries: int,
    13      device: torch.device,
    14      n_classes: int,
    15  ) -> torch.Tensor:
    16      model.eval()
    17      N, C, H, W = images.shape
    18      nq = max(1, int(n_queries)); budget = N * nq
    19      with torch.no_grad():
    20          def mloss(lg, y):
    21              c = lg.gather(1, y.view(-1, 1)).squeeze(1)
    22              t = lg.clone(); t.scatter_(1, y.view(-1, 1), -1e9)
    23           
Assistant Step 2: test
Tool Result Step 2
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)

### ResNet20-C10 (scripts/eval_resnet20_cifar10.sh)
Attack results (ResNet20-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=118.39

### VGG11BN-C10 (scripts/eval_vgg11bn_cifar10.sh)
Attack results (VGG11BN-C10): ASR=0.9750, clean_acc=1.0000, robust_acc=0.0250, avg_queries=811.72

### MobileNetV2-C10 (scripts/eval_mobilenetv2_cifar10.sh)
Attack results (MobileNetV2-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=146.32

### ResNet20-C100 (scripts/eval_resnet20_cifar100.sh)
Attack results (ResNet20-C100): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=84.42

### VGG11BN-C100 (scripts/eval_vgg11bn_cifar100.sh)
Attack results (VGG11BN-C100): ASR=0.9800, clean_acc=1.0000, robust_acc=0.0200, avg_queries=998.10

[Leaderboard] Results saved: {'asr_ResNet20_C10': 1.0, 'avg_queries_ResNet20_C10': 118.39, 'elapsed_ResNet20-C10': 15, 'asr_VGG11BN_C10': 0.975, 'avg_queries_VGG
Assistant Step 3: edit
Tool Input
{
  "op": "replace",
  "filename": "torchattacks/bench/custom_attack.py",
  "start_line": 7,
  "end_line": 56,
  "content": "def run_attack(\n    model: nn.Module,\n    images: torch.Tensor,\n    labels: torch.Tensor,\n    eps: float,\n    n_queries: int,\n    device: torch.device,\n    n_classes: int,\n) -> torch.Tensor:\n    model.eval(); N,C,H,W = images.shape; nq=max(1,int(n_queries)); budget=N*nq\n    with torch.no_grad():\n        def ml(lg,y):\n            c=lg.gather(1,y.view(-1,1)).squeeze(1); t=lg.clone(); t.scatter_(1,y.view(-1,1),-1e9)\n            return c-t.max(1)[0]\n        adv=torch.clamp(images+torch.sign(torch.randn_like(images))*eps,0.,1.)\n        lg0=model(adv); used=N; best=ml(lg0,labels); succ=lg0.argmax(1)!=labels\n        ge=torch.zeros_like(images); bl0=ml(model(images),labels); used+=N\n        for _ in range(min(4,max(1,nq//25))):\n            if used+N>budget: break\n            u=torch.sign(torch.randn_like(images))\n            ge+=(ml(model(torch.clamp(images+u*eps*0.1,0.,1.)),labels)-bl0).view(-1,1,1,1)*u; used+=N\n        if used+N<=budget:\n            cd=torch.clamp(images-torch.sign(ge)*eps,0.,1.); lgc=model(cd); used+=N; cl=ml(lgc,labels)\n            imp=cl<best; adv=torch.where(imp.view(-1,1,1,1),cd,adv); best=torch.where(imp,cl,best)\n            succ=succ|(lgc.argmax(1)!=labels)\n        p1q=used//N; sq_i=0; nim=torch.zeros(N,dtype=torch.long,device=device)\n        hc=torch.arange(H,device=device).view(1,1,H,1); wc=torch.arange(W,device=device).view(1,1,1,W)\n        while True:\n            idx=(~succ).nonzero(as_tuple=True)[0]; na=idx.numel()\n            if na==0 or used+na>budget: break\n            ri=idx[nim[idx]>max(100,nq//10)]\n            if ri.numel()>0 and used+ri.numel()<=budget:\n                nr=ri.numel()\n                adv[ri]=torch.clamp(images[ri]+torch.sign(torch.randn(nr,C,H,W,device=device))*eps,0.,1.)\n                lgr=model(adv[ri]); used+=nr; best[ri]=ml(lgr,labels[ri]); nim[ri]=0\n                succ[ri]=lgr.argmax(1)!=labels[ri]\n                idx=(~succ).nonzero(as_tuple=True)[0]; na=idx.numel()\n                if na==0 or used+na>budget: break\n            sq_i+=1; pr=min((p1q+sq_i)/nq,0.99)\n            p=0.8 if pr<0.1 else 0.5 if pr<0.5 else 0.2 if pr<0.8 else 0.1 if pr<0.95 else 0.05\n            sz=max(1,int(round(p*min(H,W))))\n            hs=torch.randint(0,max(1,H-sz+1),(na,1,1,1),device=device)\n            ws=torch.randint(0,max(1,W-sz+1),(na,1,1,1),device=device)\n            mk=((hc>=hs)&(hc<hs+sz)&(wc>=ws)&(wc<ws+sz)).float()\n            nv=torch.sign(torch.randn(na,C,H,W,device=device))*eps\n            ca=adv[idx]*(1-mk)+torch.clamp(images[idx]+nv,0.,1.)*mk\n            lga=model(ca); used+=na; cla=ml(lga,labels[idx]); imp=cla<best[idx]; ui=idx[imp]\n            if ui.numel()>0: adv[ui]=ca[imp]; best[ui]=cla[imp]\n            nim[idx]+=1; nim[ui]=0; succ[idx]=succ[idx]|(lga.argmax(1)!=labels[idx])\n        return torch.clamp(images+torch.clamp(adv-images,-eps,eps),0.,1.).detach()\n"
}
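The main change in this edit is a random restart for samples whose loss has stopped improving, tracked by the per-sample counter `nim`. The bookkeeping can be sketched for one sample as follows. The helper names are hypothetical, and the `n_queries` values used below are illustrative, since the actual budget does not appear in this log.

```python
def restart_threshold(n_queries):
    """Number of consecutive non-improving proposals tolerated before a restart."""
    return max(100, n_queries // 10)


def update_stagnation(counter, improved):
    """Reset the counter on improvement, otherwise count one more failed proposal."""
    return 0 if improved else counter + 1


def needs_restart(counter, n_queries):
    """Strict comparison, matching `nim[idx] > max(100, nq // 10)` in the edit."""
    return counter > restart_threshold(n_queries)


# Simulate a sample that never improves under an illustrative budget of 1000.
counter = 0
first_restart_at = None
for step in range(1, 201):
    counter = update_stagnation(counter, improved=False)
    if needs_restart(counter, n_queries=1000):
        first_restart_at = step
        break
```

A restart is not free: re-evaluating the fresh random perturbation costs one query per restarted sample, and the previous best candidate for that sample is discarded.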
Tool Result Step 3
OK: Replaced lines 7..56 with 50 line(s) in torchattacks/bench/custom_attack.py. Editable range: 7–56.

[Current file: torchattacks/bench/custom_attack.py | editable: 7–56 | total: 60 lines]
     4| # =====================================================================
     5| # EDITABLE: implement run_attack below
     6| # =====================================================================
     7  def run_attack(
     8      model: nn.Module,
     9      images: torch.Tensor,
    10      labels: torch.Tensor,
    11      eps: float,
    12      n_queries: int,
    13      device: torch.device,
    14      n_classes: int,
    15  ) -> torch.Tensor:
    16      model.eval(); N,C,H,W = images.shape; nq=max(1,int(n_queries)); budget=N*nq
    17      with torch.no_grad():
    18          def ml(lg,y):
    19              c=lg.gather(1,y.view(-1,1)).squeeze(1); t=lg.clone(); t.scatter_(1,y.view(-1,1),-1e9)
    20              return c-t.max(1)[0]
    21          adv=torch.clamp(images+t
Assistant Step 4: test
Tool Result Step 4
[Test #2] (1 test remaining; call submit(n=2) to submit this result)

### ResNet20-C10 (scripts/eval_resnet20_cifar10.sh)
Attack results (ResNet20-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=114.16

### VGG11BN-C10 (scripts/eval_vgg11bn_cifar10.sh)
Attack results (VGG11BN-C10): ASR=0.9650, clean_acc=1.0000, robust_acc=0.0350, avg_queries=903.08

### MobileNetV2-C10 (scripts/eval_mobilenetv2_cifar10.sh)
Attack results (MobileNetV2-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=149.28

### ResNet20-C100 (scripts/eval_resnet20_cifar100.sh)
Attack results (ResNet20-C100): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=91.74

### VGG11BN-C100 (scripts/eval_vgg11bn_cifar100.sh)
Attack results (VGG11BN-C100): ASR=0.9800, clean_acc=1.0000, robust_acc=0.0200, avg_queries=961.33

[Leaderboard] Results saved: {'asr_ResNet20_C10': 1.0, 'avg_queries_ResNet20_C10': 114.16, 'elapsed_ResNet20-C10': 18, 'asr_VGG11BN_C10': 0.965, 'avg_queries_VGG1
Assistant Step 5: submit
Tool Input
{
  "n": 1
}
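Submitting test #1 is consistent with the numbers reported for the two test runs. The sketch below compares them on the five scenarios shown for both, using ASR as the primary key and avg_queries as the tie-break. Averaging across scenarios and the strict lexicographic ordering are assumptions made for this illustration, not the leaderboard's actual aggregation rule.

```python
# (ASR, avg_queries) per scenario, copied from the Test #1 and Test #2 results in this log.
test1 = {
    "ResNet20-C10": (1.0000, 118.39),
    "VGG11BN-C10": (0.9750, 811.72),
    "MobileNetV2-C10": (1.0000, 146.32),
    "ResNet20-C100": (1.0000, 84.42),
    "VGG11BN-C100": (0.9800, 998.10),
}
test2 = {
    "ResNet20-C10": (1.0000, 114.16),
    "VGG11BN-C10": (0.9650, 903.08),
    "MobileNetV2-C10": (1.0000, 149.28),
    "ResNet20-C100": (1.0000, 91.74),
    "VGG11BN-C100": (0.9800, 961.33),
}


def summarize(run):
    """Return (mean ASR, mean avg_queries) over the scenarios of one run."""
    n = len(run)
    mean_asr = sum(asr for asr, _ in run.values()) / n
    mean_queries = sum(q for _, q in run.values()) / n
    return mean_asr, mean_queries


def pick_better(runs):
    """Pick the run name with the highest mean ASR, breaking ties by fewer mean queries."""
    def key(name):
        mean_asr, mean_queries = summarize(runs[name])
        return (-mean_asr, mean_queries)
    return min(runs, key=key)


chosen = pick_better({"test1": test1, "test2": test2})
```

Under these assumptions the restart variant from test #2 loses on the primary metric, because its VGG11BN-C10 ASR drops from 0.9750 to 0.9650 while the other four ASR values are unchanged.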
Tool Result Step 5
[submit] Submitting result from test #1 as final.

### ResNet20-C10 (scripts/eval_resnet20_cifar10.sh)
Attack results (ResNet20-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=118.39

### VGG11BN-C10 (scripts/eval_vgg11bn_cifar10.sh)
Attack results (VGG11BN-C10): ASR=0.9750, clean_acc=1.0000, robust_acc=0.0250, avg_queries=811.72

### MobileNetV2-C10 (scripts/eval_mobilenetv2_cifar10.sh)
Attack results (MobileNetV2-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=146.32

### ResNet20-C100 (scripts/eval_resnet20_cifar100.sh)
Attack results (ResNet20-C100): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=84.42

### VGG11BN-C100 (scripts/eval_vgg11bn_cifar100.sh)
Attack results (VGG11BN-C100): ASR=0.9800, clean_acc=1.0000, robust_acc=0.0200, avg_queries=998.10

### MobileNetV2-C100 (scripts/eval_mobilenetv2_cifar100.sh)
Attack results (MobileNetV2-C100): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=106.45

[Leaderboard] Res