Agent Conversation: anthropic/claude-opus-4.6
Model: anthropic/claude-opus-4.6
Total steps: 5
Messages: 11
User (Step 0)
Task: security-adversarial-attack-black-box-score
Score-Based Query Black-Box Attack under Linf Constraint
Research Question
Can you design a stronger score-based query black-box attack that improves attack success rate (ASR) under a fixed query budget and L_inf perturbation constraint?
Objective
Implement a better black-box attack in bench/custom_attack.py:
- Threat model: query black-box (no gradient access).
- Constraint: ||x_adv - x||_inf <= eps.
- Budget: n_queries is a per-sample query budget.
- Primary metric: maximize ASR under a fixed budget.
- Tie-break: for similar ASR, lower avg_queries is better.
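A minimal sketch of the feasibility condition, using NumPy arrays as a stand-in for the torch tensors the benchmark passes in (an assumption made so the sketch is self-contained). `project_linf` and `is_valid` are hypothetical helper names, not part of the benchmark. The projection clips into the eps-ball first and then into [0, 1], the same order as the `return` line of the edit in Step 1.

```python
import numpy as np

def project_linf(x_adv, x, eps):
    """Hypothetical helper: clip x_adv into the L_inf ball around x, then into [0, 1]."""
    x_adv = np.clip(x_adv, x - eps, x + eps)
    return np.clip(x_adv, 0.0, 1.0)

def is_valid(x_adv, x, eps, tol=1e-6):
    """Hypothetical per-sample check: max |x_adv - x| <= eps and all values in [0, 1]."""
    n = len(x)
    dist = np.abs(x_adv - x).reshape(n, -1).max(axis=1)
    in_range = ((x_adv >= 0.0) & (x_adv <= 1.0)).reshape(n, -1).all(axis=1)
    return (dist <= eps + tol) & in_range

# Usage: a large random perturbation is pulled back onto the feasible set.
rng = np.random.default_rng(0)
x = rng.random((2, 3, 4, 4))
raw = x + rng.normal(scale=0.5, size=x.shape)
feasible = project_linf(raw, x, 8 / 255)
```

Because x itself lies in [0, 1], the second clip never pushes a value back outside the eps-ball, so one pass of each clip is enough.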
Editable Interface
You must implement:
run_attack(model, images, labels, eps, n_queries, device, n_classes) -> adv_images
Inputs:
- model: black-box wrapper that returns logits only.
- images: tensor of shape (N, C, H, W), in [0, 1].
- labels: tensor of shape (N,).
- n_classes: 10 for CIFAR-10, 100 for CIFAR-100.
Output:
- adv_images: tensor with the same shape as images, values in [0, 1].
Trusted Evaluation Logic
The evaluation logic in bench/run_eval.py is trusted and not editable.
- It tracks all model queries through a wrapper.
- If a batch exceeds its query budget (batch_size * n_queries), the entire batch is marked as attack failure.
- L_inf and [0, 1] validity are checked per sample; only invalid samples are marked as attack failure.
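The two failure rules above can be read as the following sketch. It assumes that an untargeted attack succeeds on a sample when the model's prediction differs from the label, and `score_batch` is a hypothetical name; the trusted logic in bench/run_eval.py is not shown in this transcript and may differ in detail.

```python
def score_batch(fooled, valid, queries_used, batch_size, n_queries):
    """Hypothetical per-sample success flags under the stated evaluation rules.

    fooled: list[bool], prediction differs from the label (assumed success criterion)
    valid:  list[bool], sample passed the L_inf and [0, 1] checks
    """
    if queries_used > batch_size * n_queries:
        # Budget exceeded: the whole batch counts as attack failure.
        return [False] * batch_size
    # Within budget: only invalid samples are turned into failures.
    return [f and v for f, v in zip(fooled, valid)]
```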
Do not attempt to bypass wrapper behavior or evaluation logic. Only improve the attack algorithm in custom_attack.py.
Query Semantics
- One call to model(x) consumes x.shape[0] queries.
- Repeated calls on the same sample still consume additional queries.
- Different batch partitioning should be treated as equivalent total budget usage.
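A hypothetical wrapper that charges queries the way these bullets describe: every call costs x.shape[0], whether or not the samples were seen before, so one call on 64 images and two calls on 32 images each cost 64 queries. The class name `QueryCounter` is invented for illustration, and a random linear map stands in for the classifier.

```python
import numpy as np

class QueryCounter:
    """Hypothetical query-counting wrapper: each call costs x.shape[0] queries."""

    def __init__(self, model_fn):
        self.model_fn = model_fn
        self.queries = 0

    def __call__(self, x):
        self.queries += x.shape[0]  # repeated samples are charged again
        return self.model_fn(x)

# A random linear "classifier" over flattened 3x2x2 inputs (stand-in only).
W = np.random.default_rng(0).normal(size=(12, 10))
counter = QueryCounter(lambda x: x.reshape(len(x), -1) @ W)
x = np.zeros((64, 3, 2, 2))
counter(x)                         # 64 queries in one batch
counter(x[:32]); counter(x[32:])   # another 64, split across two calls
```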
Performance Hint
The benchmark is query-heavy. Prefer batch-parallel score updates instead of pure per-sample loops when possible.
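A sketch of a batch-parallel greedy update in the spirit of this hint, assuming NumPy arrays and a loss where lower is better (as with the margin loss used in the edits below): all still-active samples are queried in one call, and improvements are accepted with boolean masks instead of a Python loop over samples. `accept_improvements` is a hypothetical name.

```python
import numpy as np

def accept_improvements(adv, best, idx, cand, cand_loss):
    """Greedy batched update, in place, for the active subset idx.

    adv:       (N, ...) current adversarial images
    best:      (N,) lowest loss found so far per sample
    idx:       (k,) indices of the samples queried this round
    cand:      (k, ...) candidate images for those samples
    cand_loss: (k,) loss of each candidate
    """
    improved = cand_loss < best[idx]
    upd = idx[improved]            # global indices whose candidate is better
    adv[upd] = cand[improved]
    best[upd] = cand_loss[improved]
    return improved
```

One model call on `cand` costs k queries, the same as k single-sample calls, so batching changes wall-clock time but not the budget.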
Evaluation Scenarios (6)
- ResNet20 on CIFAR-10
- VGG11-BN on CIFAR-10
- MobileNetV2 on CIFAR-10
- ResNet20 on CIFAR-100
- VGG11-BN on CIFAR-100
- MobileNetV2 on CIFAR-100
Reported metrics line format:
ATTACK_METRICS asr=... clean_acc=... robust_acc=... avg_queries=...
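A small parser for this line format. `parse_attack_metrics` is a hypothetical helper; the sample input reuses the VGG11BN-C10 numbers from Test #1, rewritten into this format for illustration rather than copied from an actual log.

```python
def parse_attack_metrics(line):
    """Parse 'ATTACK_METRICS asr=... clean_acc=... robust_acc=... avg_queries=...'."""
    tag, *fields = line.split()
    if tag != "ATTACK_METRICS":
        raise ValueError("not an ATTACK_METRICS line")
    # Each field is key=value; values are parsed as floats.
    return {k: float(v) for k, v in (f.split("=", 1) for f in fields)}

m = parse_attack_metrics(
    "ATTACK_METRICS asr=0.9750 clean_acc=1.0000 robust_acc=0.0250 avg_queries=811.72"
)
```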
torchattacks/bench/custom_attack.py [EDITABLE — lines 7–56 only]
1: import torch
2: import torch.nn as nn
3:
4: # =====================================================================
5: # EDITABLE: implement run_attack below
6: # =====================================================================
7: def run_attack(
8:     model: nn.Module,
9:     images: torch.Tensor,
10:     labels: torch.Tensor,
11:     eps: float,
12:     n_queries: int,
13:     device: torch.device,
14:     n_classes: int,
15: ) -> torch.Tensor:
16:     """
17:     Score-based query black-box attack under Linf constraint.
18:
19:     Args:
20:         model: black-box wrapper that only exposes forward logits.
21:
Assistant (Step 1): edit
Tool Input
{
  "op": "replace",
  "filename": "torchattacks/bench/custom_attack.py",
  "start_line": 7,
  "end_line": 56,
  "content": "def run_attack(\n    model: nn.Module,\n    images: torch.Tensor,\n    labels: torch.Tensor,\n    eps: float,\n    n_queries: int,\n    device: torch.device,\n    n_classes: int,\n) -> torch.Tensor:\n    model.eval()\n    N, C, H, W = images.shape\n    nq = max(1, int(n_queries)); budget = N * nq\n    with torch.no_grad():\n        def mloss(lg, y):\n            c = lg.gather(1, y.view(-1, 1)).squeeze(1)\n            t = lg.clone(); t.scatter_(1, y.view(-1, 1), -1e9)\n            return c - t.max(1)[0]\n        adv = torch.clamp(images + torch.sign(torch.randn_like(images)) * eps, 0., 1.)\n        lg0 = model(adv); used = N\n        best = mloss(lg0, labels); succ = lg0.argmax(1) != labels\n        ge = torch.zeros_like(images)\n        bl0 = mloss(model(images), labels); used += N\n        for _ in range(min(6, max(1, nq // 25))):\n            if used + N > budget: break\n            u = torch.sign(torch.randn_like(images))\n            ge += (mloss(model(torch.clamp(images+u*eps*0.1,0.,1.)),labels)-bl0).view(-1,1,1,1)*u; used += N\n        if used + N <= budget:\n            cd = torch.clamp(images - torch.sign(ge) * eps, 0., 1.)\n            lgc = model(cd); used += N; cl = mloss(lgc, labels)\n            imp = cl < best; adv = torch.where(imp.view(-1,1,1,1), cd, adv)\n            best = torch.where(imp, cl, best); succ = succ | (lgc.argmax(1) != labels)\n        p1q = used // N; sq_i = 0\n        hc = torch.arange(H, device=device).view(1, 1, H, 1)\n        wc = torch.arange(W, device=device).view(1, 1, 1, W)\n        while True:\n            idx = (~succ).nonzero(as_tuple=True)[0]; na = idx.numel()\n            if na == 0 or used + na > budget: break\n            sq_i += 1; pr = min((p1q + sq_i) / nq, 0.99)\n            p = 0.8 if pr<0.1 else 0.5 if pr<0.5 else 0.2 if pr<0.8 else 0.1 if pr<0.95 else 0.05\n            sz = max(1, int(round(p * min(H, W))))\n            hs = torch.randint(0, max(1,H-sz+1), (na,1,1,1), device=device)\n            ws = torch.randint(0, max(1,W-sz+1), (na,1,1,1), device=device)\n            mk = ((hc >= hs) & (hc < hs+sz) & (wc >= ws) & (wc < ws+sz)).float()\n            nv = torch.sign(torch.randn(na, C, H, W, device=device)) * eps\n            ca = adv[idx]*(1-mk) + torch.clamp(images[idx]+nv, 0., 1.)*mk\n            lga = model(ca); used += na\n            cla = mloss(lga, labels[idx]); imp = cla < best[idx]; ui = idx[imp]\n            if ui.numel() > 0: adv[ui] = ca[imp]; best[ui] = cla[imp]\n            succ[idx] = succ[idx] | (lga.argmax(1) != labels[idx])\n    return torch.clamp(images + torch.clamp(adv - images, -eps, eps), 0., 1.).detach()\n"
}
Tool Result (Step 1)
OK: Replaced lines 7..56 with 50 line(s) in torchattacks/bench/custom_attack.py. Editable range: 7–56.
[Current file: torchattacks/bench/custom_attack.py | editable: 7–56 | total: 60 lines]
4| # =====================================================================
5| # EDITABLE: implement run_attack below
6| # =====================================================================
7 def run_attack(
8     model: nn.Module,
9     images: torch.Tensor,
10     labels: torch.Tensor,
11     eps: float,
12     n_queries: int,
13     device: torch.device,
14     n_classes: int,
15 ) -> torch.Tensor:
16     model.eval()
17     N, C, H, W = images.shape
18     nq = max(1, int(n_queries)); budget = N * nq
19     with torch.no_grad():
20         def mloss(lg, y):
21             c = lg.gather(1, y.view(-1, 1)).squeeze(1)
22             t = lg.clone(); t.scatter_(1, y.view(-1, 1), -1e9)
23
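The `mloss` helper in the Step 1 edit is a margin loss: the true-class logit minus the largest other logit, which is negative exactly when the model misclassifies. The edit also spends a few queries on a warm start before the square phase: it probes random sign directions u at radius 0.1 * eps, weights each by the change in loss, and tries the single candidate x - eps * sign(ge). Below is a NumPy rendering of both steps; the function names and the toy linear model are assumptions for demonstration, not the edit's exact code.

```python
import numpy as np

def margin_loss(logits, y):
    """True-class logit minus the best other logit (negative => misclassified)."""
    n = len(y)
    true = logits[np.arange(n), y]
    other = logits.copy()
    other[np.arange(n), y] = -1e9    # mask out the true class
    return true - other.max(axis=1)

def sign_probe_candidate(model, x, y, eps, n_probes, rng):
    """Warm start: ge = sum_k (L(x + 0.1*eps*u_k) - L(x)) * u_k, then step against it.

    Costs 1 + n_probes queries per sample (the candidate itself is not scored here).
    """
    base = margin_loss(model(x), y)
    ge = np.zeros_like(x)
    for _ in range(n_probes):
        u = np.sign(rng.standard_normal(x.shape))
        probe = np.clip(x + 0.1 * eps * u, 0.0, 1.0)
        delta = margin_loss(model(probe), y) - base
        ge += delta.reshape(-1, *([1] * (x.ndim - 1))) * u
    # Move each pixel by eps against the estimated gradient sign to shrink the margin.
    return np.clip(x - eps * np.sign(ge), 0.0, 1.0)
```

In the edit, the number of probes is min(6, max(1, nq // 25)), so this warm start uses only a small, fixed share of the per-sample budget.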
Assistant (Step 2): test
Tool Result (Step 2)
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)
### ResNet20-C10 (scripts/eval_resnet20_cifar10.sh)
Attack results (ResNet20-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=118.39
### VGG11BN-C10 (scripts/eval_vgg11bn_cifar10.sh)
Attack results (VGG11BN-C10): ASR=0.9750, clean_acc=1.0000, robust_acc=0.0250, avg_queries=811.72
### MobileNetV2-C10 (scripts/eval_mobilenetv2_cifar10.sh)
Attack results (MobileNetV2-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=146.32
### ResNet20-C100 (scripts/eval_resnet20_cifar100.sh)
Attack results (ResNet20-C100): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=84.42
### VGG11BN-C100 (scripts/eval_vgg11bn_cifar100.sh)
Attack results (VGG11BN-C100): ASR=0.9800, clean_acc=1.0000, robust_acc=0.0200, avg_queries=998.10
[Leaderboard] Results saved: {'asr_ResNet20_C10': 1.0, 'avg_queries_ResNet20_C10': 118.39, 'elapsed_ResNet20-C10': 15, 'asr_VGG11BN_C10': 0.975, 'avg_queries_VGG
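For reference, the square phase of the Step 1 edit is similar in spirit to Square Attack: each iteration overwrites one random square window per active sample with values at x +/- eps, and the window side shrinks as the budget is consumed, with progress measured as (warm-start queries + square iterations) / nq. One difference is that the edit draws an independent sign for every pixel and channel inside the window, whereas the original Square Attack uses one value per channel for the whole window. A NumPy sketch of the edit's schedule and candidate construction; `square_fraction` and `square_candidate` are hypothetical names.

```python
import numpy as np

def square_fraction(progress):
    """Piecewise side-length fraction used by the Step 1 edit."""
    if progress < 0.1:
        return 0.8
    if progress < 0.5:
        return 0.5
    if progress < 0.8:
        return 0.2
    if progress < 0.95:
        return 0.1
    return 0.05

def square_candidate(adv, x, eps, progress, rng):
    """Overwrite one random square per sample with x +/- eps (per-pixel random signs)."""
    n, c, h, w = x.shape
    sz = max(1, int(round(square_fraction(progress) * min(h, w))))
    hs = rng.integers(0, max(1, h - sz + 1), size=(n, 1, 1, 1))  # top edge
    ws = rng.integers(0, max(1, w - sz + 1), size=(n, 1, 1, 1))  # left edge
    hc = np.arange(h).reshape(1, 1, h, 1)
    wc = np.arange(w).reshape(1, 1, 1, w)
    # (n, 1, h, w) mask, broadcast over channels.
    mask = ((hc >= hs) & (hc < hs + sz) & (wc >= ws) & (wc < ws + sz)).astype(x.dtype)
    vertex = np.clip(x + np.sign(rng.standard_normal(x.shape)) * eps, 0.0, 1.0)
    return adv * (1 - mask) + vertex * mask, sz
```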
Assistant (Step 3): edit
Tool Input
{
  "op": "replace",
  "filename": "torchattacks/bench/custom_attack.py",
  "start_line": 7,
  "end_line": 56,
  "content": "def run_attack(\n    model: nn.Module,\n    images: torch.Tensor,\n    labels: torch.Tensor,\n    eps: float,\n    n_queries: int,\n    device: torch.device,\n    n_classes: int,\n) -> torch.Tensor:\n    model.eval(); N,C,H,W = images.shape; nq=max(1,int(n_queries)); budget=N*nq\n    with torch.no_grad():\n        def ml(lg,y):\n            c=lg.gather(1,y.view(-1,1)).squeeze(1); t=lg.clone(); t.scatter_(1,y.view(-1,1),-1e9)\n            return c-t.max(1)[0]\n        adv=torch.clamp(images+torch.sign(torch.randn_like(images))*eps,0.,1.)\n        lg0=model(adv); used=N; best=ml(lg0,labels); succ=lg0.argmax(1)!=labels\n        ge=torch.zeros_like(images); bl0=ml(model(images),labels); used+=N\n        for _ in range(min(4,max(1,nq//25))):\n            if used+N>budget: break\n            u=torch.sign(torch.randn_like(images))\n            ge+=(ml(model(torch.clamp(images+u*eps*0.1,0.,1.)),labels)-bl0).view(-1,1,1,1)*u; used+=N\n        if used+N<=budget:\n            cd=torch.clamp(images-torch.sign(ge)*eps,0.,1.); lgc=model(cd); used+=N; cl=ml(lgc,labels)\n            imp=cl<best; adv=torch.where(imp.view(-1,1,1,1),cd,adv); best=torch.where(imp,cl,best)\n            succ=succ|(lgc.argmax(1)!=labels)\n        p1q=used//N; sq_i=0; nim=torch.zeros(N,dtype=torch.long,device=device)\n        hc=torch.arange(H,device=device).view(1,1,H,1); wc=torch.arange(W,device=device).view(1,1,1,W)\n        while True:\n            idx=(~succ).nonzero(as_tuple=True)[0]; na=idx.numel()\n            if na==0 or used+na>budget: break\n            ri=idx[nim[idx]>max(100,nq//10)]\n            if ri.numel()>0 and used+ri.numel()<=budget:\n                nr=ri.numel()\n                adv[ri]=torch.clamp(images[ri]+torch.sign(torch.randn(nr,C,H,W,device=device))*eps,0.,1.)\n                lgr=model(adv[ri]); used+=nr; best[ri]=ml(lgr,labels[ri]); nim[ri]=0\n                succ[ri]=lgr.argmax(1)!=labels[ri]\n                idx=(~succ).nonzero(as_tuple=True)[0]; na=idx.numel()\n                if na==0 or used+na>budget: break\n            sq_i+=1; pr=min((p1q+sq_i)/nq,0.99)\n            p=0.8 if pr<0.1 else 0.5 if pr<0.5 else 0.2 if pr<0.8 else 0.1 if pr<0.95 else 0.05\n            sz=max(1,int(round(p*min(H,W))))\n            hs=torch.randint(0,max(1,H-sz+1),(na,1,1,1),device=device)\n            ws=torch.randint(0,max(1,W-sz+1),(na,1,1,1),device=device)\n            mk=((hc>=hs)&(hc<hs+sz)&(wc>=ws)&(wc<ws+sz)).float()\n            nv=torch.sign(torch.randn(na,C,H,W,device=device))*eps\n            ca=adv[idx]*(1-mk)+torch.clamp(images[idx]+nv,0.,1.)*mk\n            lga=model(ca); used+=na; cla=ml(lga,labels[idx]); imp=cla<best[idx]; ui=idx[imp]\n            if ui.numel()>0: adv[ui]=ca[imp]; best[ui]=cla[imp]\n            nim[idx]+=1; nim[ui]=0; succ[idx]=succ[idx]|(lga.argmax(1)!=labels[idx])\n    return torch.clamp(images+torch.clamp(adv-images,-eps,eps),0.,1.).detach()\n"
}
Tool Result (Step 3)
OK: Replaced lines 7..56 with 50 line(s) in torchattacks/bench/custom_attack.py. Editable range: 7–56.
[Current file: torchattacks/bench/custom_attack.py | editable: 7–56 | total: 60 lines]
4| # =====================================================================
5| # EDITABLE: implement run_attack below
6| # =====================================================================
7 def run_attack(
8     model: nn.Module,
9     images: torch.Tensor,
10     labels: torch.Tensor,
11     eps: float,
12     n_queries: int,
13     device: torch.device,
14     n_classes: int,
15 ) -> torch.Tensor:
16     model.eval(); N,C,H,W = images.shape; nq=max(1,int(n_queries)); budget=N*nq
17     with torch.no_grad():
18         def ml(lg,y):
19             c=lg.gather(1,y.view(-1,1)).squeeze(1); t=lg.clone(); t.scatter_(1,y.view(-1,1),-1e9)
20             return c-t.max(1)[0]
21 adv=torch.clamp(images+t
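The main change in the Step 3 edit is a per-sample stagnation counter `nim`: it is incremented on every square iteration and reset on improvement. Once it exceeds max(100, nq // 10) for a still-unsuccessful sample, that sample is re-initialized at a fresh random vertex of the eps-ball and its best loss is overwritten, even if the new loss is worse. A NumPy sketch of that restart step, with the hypothetical name `restart_stalled` (the edit also refreshes the success flags from the restart logits, which is left to the caller here):

```python
import numpy as np

def restart_stalled(adv, best, nim, x, y, idx, eps, nq, model, loss_fn, rng):
    """Re-initialize active samples whose loss has not improved for too long.

    idx: indices of samples not yet fooled. Costs len(ri) queries.
    Returns the indices that were restarted.
    """
    patience = max(100, nq // 10)
    ri = idx[nim[idx] > patience]
    if ri.size == 0:
        return ri
    fresh = np.sign(rng.standard_normal(x[ri].shape)) * eps  # random eps-ball vertex
    adv[ri] = np.clip(x[ri] + fresh, 0.0, 1.0)
    best[ri] = loss_fn(model(adv[ri]), y[ri])  # old best is discarded, not compared
    nim[ri] = 0
    return ri
```

Discarding the stalled point trades accumulated progress for a new starting position; in Test #2 below, this variant did not raise ASR on any reported scenario.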
Assistant (Step 4): test
Tool Result (Step 4)
[Test #2] (1 test remaining; call submit(n=2) to submit this result)
### ResNet20-C10 (scripts/eval_resnet20_cifar10.sh)
Attack results (ResNet20-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=114.16
### VGG11BN-C10 (scripts/eval_vgg11bn_cifar10.sh)
Attack results (VGG11BN-C10): ASR=0.9650, clean_acc=1.0000, robust_acc=0.0350, avg_queries=903.08
### MobileNetV2-C10 (scripts/eval_mobilenetv2_cifar10.sh)
Attack results (MobileNetV2-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=149.28
### ResNet20-C100 (scripts/eval_resnet20_cifar100.sh)
Attack results (ResNet20-C100): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=91.74
### VGG11BN-C100 (scripts/eval_vgg11bn_cifar100.sh)
Attack results (VGG11BN-C100): ASR=0.9800, clean_acc=1.0000, robust_acc=0.0200, avg_queries=961.33
[Leaderboard] Results saved: {'asr_ResNet20_C10': 1.0, 'avg_queries_ResNet20_C10': 114.16, 'elapsed_ResNet20-C10': 18, 'asr_VGG11BN_C10': 0.965, 'avg_queries_VGG1
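A worked comparison of the two test runs under the task's stated priorities (ASR first, then avg_queries), using only the five scenarios that both truncated outputs report. Averaging across scenarios and the `asr_tol` threshold for "similar ASR" are assumptions; the leaderboard's actual aggregation is not shown.

```python
# (ASR, avg_queries) per scenario, copied from the Test #1 and Test #2 outputs.
test1 = {"ResNet20-C10": (1.0000, 118.39), "VGG11BN-C10": (0.9750, 811.72),
         "MobileNetV2-C10": (1.0000, 146.32), "ResNet20-C100": (1.0000, 84.42),
         "VGG11BN-C100": (0.9800, 998.10)}
test2 = {"ResNet20-C10": (1.0000, 114.16), "VGG11BN-C10": (0.9650, 903.08),
         "MobileNetV2-C10": (1.0000, 149.28), "ResNet20-C100": (1.0000, 91.74),
         "VGG11BN-C100": (0.9800, 961.33)}

def summarize(run):
    """Mean ASR and mean avg_queries over the scenarios in run."""
    asr = sum(a for a, _ in run.values()) / len(run)
    queries = sum(q for _, q in run.values()) / len(run)
    return asr, queries

def better(run_a, run_b, asr_tol=1e-3):
    """True if run_a wins: higher mean ASR, or similar ASR and fewer mean queries."""
    (asr_a, q_a), (asr_b, q_b) = summarize(run_a), summarize(run_b)
    if abs(asr_a - asr_b) > asr_tol:
        return asr_a > asr_b
    return q_a < q_b
```

On these five scenarios Test #1 has the higher mean ASR (about 0.991 versus 0.989) and also the lower mean query count (431.79 versus about 443.92), which is consistent with submitting test #1 in Step 5.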
Assistant (Step 5): submit
Tool Input
{
  "n": 1
}
Tool Result (Step 5)
[submit] Submitting result from test #1 as final.
### ResNet20-C10 (scripts/eval_resnet20_cifar10.sh)
Attack results (ResNet20-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=118.39
### VGG11BN-C10 (scripts/eval_vgg11bn_cifar10.sh)
Attack results (VGG11BN-C10): ASR=0.9750, clean_acc=1.0000, robust_acc=0.0250, avg_queries=811.72
### MobileNetV2-C10 (scripts/eval_mobilenetv2_cifar10.sh)
Attack results (MobileNetV2-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=146.32
### ResNet20-C100 (scripts/eval_resnet20_cifar100.sh)
Attack results (ResNet20-C100): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=84.42
### VGG11BN-C100 (scripts/eval_vgg11bn_cifar100.sh)
Attack results (VGG11BN-C100): ASR=0.9800, clean_acc=1.0000, robust_acc=0.0200, avg_queries=998.10
### MobileNetV2-C100 (scripts/eval_mobilenetv2_cifar100.sh)
Attack results (MobileNetV2-C100): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000, avg_queries=106.45
[Leaderboard] Res