Agent Conversation: anthropic/claude-opus-4.6
Model: anthropic/claude-opus-4.6
Total steps: 5
Messages: 11
User (Step 0)
Task: security-adversarial-attack-white-box-linf
White-Box Evasion Attack under Linf Constraint
Objective
Implement a stronger white-box L_inf attack in bench/custom_attack.py.
Your method should maximize attack success rate (ASR) under a strict perturbation budget:
- Threat model: white-box (full model access, including gradients).
- Norm constraint: ||x_adv - x||_inf <= eps.
- Budget: eps = 4/255.
Editable Interface
You must implement:
run_attack(model, images, labels, eps, device, n_classes) -> adv_images
Inputs:
- images: tensor of shape (N, C, H, W), values in [0, 1].
- labels: tensor of shape (N,).
- n_classes: 10 for CIFAR-10, 100 for CIFAR-100.
Output:
adv_images: same shape as images, also in [0, 1].
Evaluation Protocol
Each evaluation script:
- Loads one pretrained model.
- Collects up to 1000 samples that are initially classified correctly.
- Runs your run_attack.
- Checks L_inf validity.
- Reports: clean_acc, robust_acc, asr = 1 - robust_acc.
Important:
- ASR denominator is the number of initially correct samples.
- Invalid adversarial outputs (shape mismatch or violated norm) are treated as failure.
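The evaluation script's validity check is not reproduced in this log. A check consistent with the protocol above (shape match plus the L_inf budget; the small float tolerance is an assumption, not documented benchmark behavior) might look like:

```python
import torch

def is_valid_adv(adv: torch.Tensor, images: torch.Tensor, eps: float) -> bool:
    """Shape must match and every pixel must stay within the L_inf budget.
    The 1e-6 tolerance for floating-point round-off is an assumption."""
    if adv.shape != images.shape:
        return False
    return bool((adv - images).abs().max().item() <= eps + 1e-6)

eps = 4 / 255
images = torch.full((1, 3, 4, 4), 0.5)
ok = images + 0.9 * eps    # perturbation inside the budget
bad = images + 2.0 * eps   # perturbation violates the budget
```

Anything failing this check counts as an attack failure for that sample, so a safe implementation projects onto the eps ball before returning.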
Scenarios
Six scenarios are evaluated in parallel:
- ResNet20 on CIFAR-10
- VGG11-BN on CIFAR-10
- MobileNetV2 on CIFAR-10
- ResNet20 on CIFAR-100
- VGG11-BN on CIFAR-100
- MobileNetV2 on CIFAR-100
Baselines
- fgsm: one-step FGSM baseline (simplest first-order attack).
- pgd: iterative PGD baseline (strong first-order baseline).
- mifgsm: momentum iterative FGSM.
- autoattack: torchattacks.AutoAttack(version="standard") as a strong upper baseline.
Your goal is to improve ASR while respecting the L_inf budget.
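The pgd baseline's code is not shown in this log; a conventional L_inf PGD matching the run_attack signature's core arguments might look like the following sketch (the step size alpha, step count, and the tiny demo model are illustrative choices, not the benchmark's settings):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pgd(model: nn.Module, images: torch.Tensor, labels: torch.Tensor,
        eps: float, alpha: float, steps: int) -> torch.Tensor:
    """Iterative sign-gradient ascent on cross-entropy, projected onto the
    eps L_inf ball around the clean images after every step."""
    model.eval()
    x = images.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        loss = F.cross_entropy(model(x), labels)
        grad = torch.autograd.grad(loss, x)[0]
        x = x.detach() + alpha * grad.sign()
        # Project: clip the perturbation to [-eps, eps], then pixels to [0, 1].
        x = torch.clamp(images + torch.clamp(x - images, -eps, eps), 0.0, 1.0)
    return x.detach()

# Stand-in model for the demo, not one of the benchmark's networks.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
images = torch.rand(2, 3, 8, 8)
labels = torch.tensor([0, 1])
eps = 4 / 255
adv = pgd(model, images, labels, eps, alpha=eps / 4, steps=5)
```

The final clamp order matters: clipping the perturbation first and pixel range second guarantees both constraints simultaneously, which is exactly what the validity check demands.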
torchattacks/bench/custom_attack.py [EDITABLE — lines 3–26 only]
1: import torch
2: import torch.nn as nn
3:
4: # =====================================================================
5: # EDITABLE: implement run_attack below
6: # =====================================================================
7: def run_attack(
8: model: nn.Module,
9: images: torch.Tensor,
10: labels: torch.Tensor,
11: eps: float,
12: device: torch.device,
13: n_classes: int,
14: ) -> torch.Tensor:
15: """
16: White-box L_inf adversarial attack.
17: images: (N, C, H, W) in [0, 1] on device. labels: (N,) on device.
18: n_classes: 10 for CIFAR-10, 100 for CIFAR-100.
19: Returns adv_images satisfying (adv_images - images).abs().max() <= eps.
20: """
21: _ = (model, labels, eps, device, n_classes)
22: return images.clone()
23:
24: # =====================================================================
25: # END EDITABLE REGION
26: # =====================================================================
fgsm baseline — editable region [READ-ONLY — reference implementation]
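The FGSM reference body did not survive in this capture. A conventional one-step FGSM under the same interface would be roughly the following (the demo model and inputs are illustrative stand-ins):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm(model: nn.Module, images: torch.Tensor, labels: torch.Tensor,
         eps: float) -> torch.Tensor:
    """One step of sign-gradient ascent on cross-entropy. A single eps-sized
    step stays inside the L_inf ball by construction; clamping to [0, 1]
    can only shrink the perturbation further."""
    model.eval()
    x = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), labels)
    grad = torch.autograd.grad(loss, x)[0]
    return torch.clamp(images + eps * grad.sign(), 0.0, 1.0).detach()

# Stand-in model for the demo, not a benchmark network.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
images = torch.rand(2, 3, 8, 8)
labels = torch.tensor([0, 1])
eps = 4 / 255
adv = fgsm(model, images, labels, eps)
```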
Assistant (Step 1): edit
Tool Input
{
  "op": "replace",
  "filename": "torchattacks/bench/custom_attack.py",
  "start_line": 3,
  "end_line": 26,
  "content": "import torch.nn.functional as F\n\n# =====================================================================\n# MLT-WPGD: Multi-Loss Targeted PGD with Warm Restarts\n# Novel contribution: Combines diverse loss functions (CE, CW-margin, DLR,\n# targeted CE) across restarts to explore different gradient landscapes,\n# plus warm restarts that refine best-so-far adversarial examples.\n# =====================================================================\ndef run_attack(\n model: nn.Module,\n images: torch.Tensor,\n labels: torch.Tensor,\n eps: float,\n device: torch.device,\n n_classes: int,\n) -> torch.Tensor:\n model.eval()\n N = images.shape[0]\n best_adv = images.clone()\n best_margin = torch.full((N,), float('inf'), device=device)\n alpha, n_steps = eps / 4.0, 50\n def update_best(x):\n nonlocal best_adv, best_margin\n with torch.no_grad():\n lo = model(x)\n zy = lo.gather(1, labels.view(-1, 1)).squeeze(1)\n m = zy - lo.scatter(1, labels.view(-1, 1), -1e9).max(1)[0]\n b = m < best_margin\n best_adv[b] = x[b]; best_margin[b] = m[b]\n def pgd_run(x0, lfn):\n x = x0.detach().clone()\n mom = torch.zeros_like(x)\n for _ in range(n_steps):\n x.requires_grad_(True)\n g = torch.autograd.grad(lfn(model(x)).sum(), x)[0]\n g = g / (g.abs().mean(dim=[1, 2, 3], keepdim=True) + 1e-12)\n mom = mom + g\n x = x.detach() + alpha * mom.sign()\n x = torch.clamp(images + torch.clamp(x - images, -eps, eps), 0, 1).detach()\n return x\n ri = lambda: torch.clamp(images + torch.empty_like(images).uniform_(-eps, eps), 0, 1)\n wi = lambda: torch.clamp(images + torch.clamp(\n best_adv + torch.empty_like(images).uniform_(-eps * .3, eps * .3) - images,\n -eps, eps), 0, 1)\n # Loss functions: CE, CW-margin, DLR, targeted CE\n l_ce = lambda lo: F.cross_entropy(lo, labels, reduction='none')\n l_cw = lambda lo: (lo.scatter(1, labels.view(-1, 1), -1e9).max(1)[0]\n - lo.gather(1, labels.view(-1, 1)).squeeze(1))\n def l_dlr(lo):\n sl = lo.sort(1, descending=True)[0]\n return l_cw(lo) / (sl[:, 0] - sl[:, 2] + 1e-12)\n with torch.no_grad():\n tgt = model(images).scatter(1, labels.view(-1, 1), -1e9).argmax(1)\n l_tgt = lambda lo: -F.cross_entropy(lo, tgt, reduction='none')\n # Phase 1: diverse losses with random inits\n for lfn, init in [(l_ce, images.clone), (l_cw, ri), (l_dlr, ri), (l_tgt, ri)]:\n update_best(pgd_run(init(), lfn))\n # Phase 2: warm restarts refining best-so-far\n for lfn in [l_ce, l_cw]:\n update_best(pgd_run(wi(), lfn))\n return best_adv\n# =====================================================================\n# END EDITABLE REGION\n# =====================================================================\n"
}
Tool Result (Step 1)
OK: Replaced lines 3..26 with 64 line(s) in torchattacks/bench/custom_attack.py. Editable range: 3–66.
[Current file: torchattacks/bench/custom_attack.py | editable: 3–66 | total: 66 lines]
1| import torch
2| import torch.nn as nn
3 import torch.nn.functional as F
4
5 # =====================================================================
6 # MLT-WPGD: Multi-Loss Targeted PGD with Warm Restarts
7 # Novel contribution: Combines diverse loss functions (CE, CW-margin, DLR,
8 # targeted CE) across restarts to explore different gradient landscapes,
9 # plus warm restarts that refine best-so-far adversarial examples.
10 # =====================================================================
11 def run_attack(
12 model: nn.Module,
13 images: torch.Tensor,
14 labels: torch.Tensor,
15 eps: float,
16 device: torch.device,
17 n_classes: int,
18 ) -> torch.Tensor:
19
Assistant (Step 2): test
Tool Result (Step 2)
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)
### ResNet20-C10 (scripts/eval_resnet20_cifar10.sh)
Attack results (ResNet20-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### VGG11BN-C10 (scripts/eval_vgg11bn_cifar10.sh)
Attack results (VGG11BN-C10): ASR=0.9590, clean_acc=1.0000, robust_acc=0.0410
### MobileNetV2-C10 (scripts/eval_mobilenetv2_cifar10.sh)
Attack results (MobileNetV2-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### ResNet20-C100 (scripts/eval_resnet20_cifar100.sh)
Attack results (ResNet20-C100): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### VGG11BN-C100 (scripts/eval_vgg11bn_cifar100.sh)
Attack results (VGG11BN-C100): ASR=0.9580, clean_acc=1.0000, robust_acc=0.0420
[Leaderboard] Results saved: {'asr_ResNet20_C10': 1.0, 'elapsed_ResNet20-C10': 18, 'asr_VGG11BN_C10': 0.959, 'elapsed_VGG11BN-C10': 18, 'asr_MobileNetV2_C10': 1.0, 'elapsed_MobileNetV2-C10': 35, 'asr_ResNet20_C100': 1.0, 'elapsed_ResNet20-C100': 35, 'asr_VGG11
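The update_best helper in the submitted attack keeps, per sample, the candidate with the smallest margin (true-class logit minus the largest other-class logit; negative means misclassified). Isolated from the attack, that margin computation reads roughly as follows:

```python
import torch

def margin(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """True-class logit minus the largest other-class logit.
    A negative margin means the sample is already misclassified."""
    zy = logits.gather(1, labels.view(-1, 1)).squeeze(1)
    # Mask out the true class before taking the max over the rest.
    other = logits.scatter(1, labels.view(-1, 1), -1e9).max(1)[0]
    return zy - other

logits = torch.tensor([[2.0, 0.5, -1.0],   # correctly classified
                       [0.0, 1.0, 3.0]])   # misclassified
labels = torch.tensor([0, 0])
m = margin(logits, labels)  # tensor([ 1.5, -3.0])
```

Minimizing this margin across restarts is what lets the attack combine results from different loss functions: each restart only has to beat the best margin seen so far for each individual sample.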
Assistant (Step 3): edit
Tool Input
{
  "op": "replace",
  "filename": "torchattacks/bench/custom_attack.py",
  "start_line": 3,
  "end_line": 66,
  "content": "import torch.nn.functional as F\n\n# =====================================================================\n# MLT-WPGD v2: Multi-Loss Targeted PGD with Warm Restarts\n# Novel: diverse loss landscape exploration via multi-loss + multi-target\n# restarts, warm refinement from best-so-far, decaying momentum.\n# =====================================================================\ndef run_attack(\n model: nn.Module,\n images: torch.Tensor,\n labels: torch.Tensor,\n eps: float,\n device: torch.device,\n n_classes: int,\n) -> torch.Tensor:\n model.eval()\n N = images.shape[0]\n best_adv = images.clone()\n best_margin = torch.full((N,), float('inf'), device=device)\n def update_best(x):\n nonlocal best_adv, best_margin\n with torch.no_grad():\n lo = model(x)\n zy = lo.gather(1, labels.view(-1, 1)).squeeze(1)\n m = zy - lo.scatter(1, labels.view(-1, 1), -1e9).max(1)[0]\n b = m < best_margin\n best_adv[b] = x[b]; best_margin[b] = m[b]\n def pgd_run(x0, lfn, alpha, steps, decay=0.75):\n x = x0.detach().clone()\n mom = torch.zeros_like(x)\n for _ in range(steps):\n x.requires_grad_(True)\n g = torch.autograd.grad(lfn(model(x)).sum(), x)[0]\n g = g / (g.abs().mean(dim=[1, 2, 3], keepdim=True) + 1e-12)\n mom = decay * mom + g\n x = x.detach() + alpha * mom.sign()\n x = torch.clamp(images + torch.clamp(x - images, -eps, eps), 0, 1).detach()\n return x\n ri = lambda: torch.clamp(images + torch.empty_like(images).uniform_(-eps, eps), 0, 1)\n wi = lambda s: torch.clamp(images + torch.clamp(\n best_adv + torch.empty_like(images).uniform_(-eps * s, eps * s) - images,\n -eps, eps), 0, 1)\n # Loss functions\n l_ce = lambda lo: F.cross_entropy(lo, labels, reduction='none')\n l_cw = lambda lo: (lo.scatter(1, labels.view(-1, 1), -1e9).max(1)[0]\n - lo.gather(1, labels.view(-1, 1)).squeeze(1))\n def l_dlr(lo):\n sl = lo.sort(1, descending=True)[0]\n return l_cw(lo) / (sl[:, 0] - sl[:, 2] + 1e-12)\n # Get top-2 target classes\n with torch.no_grad():\n lo0 = model(images).scatter(1, labels.view(-1, 1), -1e9)\n tgts = lo0.topk(2, dim=1)[1]\n l_t0 = lambda lo: -F.cross_entropy(lo, tgts[:, 0], reduction='none')\n l_t1 = lambda lo: -F.cross_entropy(lo, tgts[:, 1], reduction='none')\n a1, a2 = eps / 4.0, eps / 8.0\n # Phase 1: diverse untargeted + targeted attacks (large step)\n update_best(pgd_run(images.clone(), l_ce, a1, 50))\n update_best(pgd_run(ri(), l_cw, a1, 50))\n update_best(pgd_run(ri(), l_dlr, a1, 50))\n update_best(pgd_run(ri(), l_t0, a1, 50))\n update_best(pgd_run(ri(), l_t1, a1, 50))\n # Phase 2: warm restarts with large perturbation (explore)\n update_best(pgd_run(wi(0.5), l_ce, a1, 50))\n update_best(pgd_run(wi(0.5), l_cw, a1, 50))\n # Phase 3: warm restarts with small perturbation + small step (refine)\n update_best(pgd_run(wi(0.2), l_dlr, a2, 60))\n return best_adv\n# =====================================================================\n# END EDITABLE REGION\n# =====================================================================\n"
}
Tool Result (Step 3)
OK: Replaced lines 3..66 with 71 line(s) in torchattacks/bench/custom_attack.py. Editable range: 3–73.
[Current file: torchattacks/bench/custom_attack.py | editable: 3–73 | total: 73 lines]
1| import torch
2| import torch.nn as nn
3 import torch.nn.functional as F
4
5 # =====================================================================
6 # MLT-WPGD v2: Multi-Loss Targeted PGD with Warm Restarts
7 # Novel: diverse loss landscape exploration via multi-loss + multi-target
8 # restarts, warm refinement from best-so-far, decaying momentum.
9 # =====================================================================
10 def run_attack(
11 model: nn.Module,
12 images: torch.Tensor,
13 labels: torch.Tensor,
14 eps: float,
15 device: torch.device,
16 n_classes: int,
17 ) -> torch.Tensor:
18 model.eval()
19 N = images.shape[0]
20 best_adv = images.clone(
Assistant (Step 4): test
Tool Result (Step 4)
[Test #2] (1 test remaining; call submit(n=2) to submit this result)
### ResNet20-C10 (scripts/eval_resnet20_cifar10.sh)
Attack results (ResNet20-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### VGG11BN-C10 (scripts/eval_vgg11bn_cifar10.sh)
Attack results (VGG11BN-C10): ASR=0.9630, clean_acc=1.0000, robust_acc=0.0370
### MobileNetV2-C10 (scripts/eval_mobilenetv2_cifar10.sh)
Attack results (MobileNetV2-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### ResNet20-C100 (scripts/eval_resnet20_cifar100.sh)
Attack results (ResNet20-C100): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### VGG11BN-C100 (scripts/eval_vgg11bn_cifar100.sh)
Attack results (VGG11BN-C100): ASR=0.9610, clean_acc=1.0000, robust_acc=0.0390
[Leaderboard] Results saved: {'asr_ResNet20_C10': 1.0, 'elapsed_ResNet20-C10': 24, 'asr_VGG11BN_C10': 0.963, 'elapsed_VGG11BN-C10': 24, 'asr_MobileNetV2_C10': 1.0, 'elapsed_MobileNetV2-C10': 49, 'asr_ResNet20_C100': 1.0, 'elapsed_ResNet20-C100': 49, 'asr_VGG11B
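The v2 attack's inner step normalizes the gradient by its per-image mean absolute value, accumulates it with decay 0.75 (MI-FGSM style), and projects onto the eps ball after every sign step. One such step, isolated from pgd_run with a dummy gradient for illustration:

```python
import torch

def momentum_step(x, grad, mom, images, eps, alpha, decay=0.75):
    """One decaying-momentum sign step with L_inf projection, mirroring the
    inner loop of pgd_run in the v2 attack."""
    # Normalize per image so momentum mixes gradients of comparable scale.
    g = grad / (grad.abs().mean(dim=[1, 2, 3], keepdim=True) + 1e-12)
    mom = decay * mom + g
    x = x + alpha * mom.sign()
    # Clip perturbation to [-eps, eps], then pixels to [0, 1].
    x = torch.clamp(images + torch.clamp(x - images, -eps, eps), 0.0, 1.0)
    return x, mom

eps = 4 / 255
images = torch.full((1, 3, 4, 4), 0.5)
x, mom = images.clone(), torch.zeros_like(images)
for _ in range(3):  # a few steps with a fixed dummy gradient
    x, mom = momentum_step(x, torch.ones_like(x), mom, images, eps, eps / 4.0)
```

With decay below 1 the momentum buffer stays bounded, so late steps can still change direction; decay = 1.0 would reduce this to the v1 attack's unbounded accumulation.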
Assistant (Step 5): submit
Tool Input
{
  "n": 2
}
Tool Result (Step 5)
[submit] Submitting result from test #2 as final.
### ResNet20-C10 (scripts/eval_resnet20_cifar10.sh)
Attack results (ResNet20-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### VGG11BN-C10 (scripts/eval_vgg11bn_cifar10.sh)
Attack results (VGG11BN-C10): ASR=0.9630, clean_acc=1.0000, robust_acc=0.0370
### MobileNetV2-C10 (scripts/eval_mobilenetv2_cifar10.sh)
Attack results (MobileNetV2-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### ResNet20-C100 (scripts/eval_resnet20_cifar100.sh)
Attack results (ResNet20-C100): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### VGG11BN-C100 (scripts/eval_vgg11bn_cifar100.sh)
Attack results (VGG11BN-C100): ASR=0.9610, clean_acc=1.0000, robust_acc=0.0390
### MobileNetV2-C100 (scripts/eval_mobilenetv2_cifar100.sh)
Attack results (MobileNetV2-C100): ASR=0.9990, clean_acc=1.0000, robust_acc=0.0010
[Leaderboard] Results saved: {'asr_ResNet20_C10': 1.0, 'elapsed_ResNet20-C10': 24, 'asr_VGG11BN_C10': 0.963, 'elapsed_VGG11BN-C10': 24,