Agent Conversation: gpt-5.4-pro
Model: gpt-5.4-pro
Total steps: 5
Messages: 11
User Step 0
Task: security-adversarial-attack-white-box-linf
White-Box Evasion Attack under Linf Constraint
Objective
Implement a stronger white-box L_inf attack in bench/custom_attack.py.
Your method should maximize attack success rate (ASR) under a strict perturbation budget:
- Threat model: white-box (full model access, including gradients).
- Norm constraint: ||x_adv - x||_inf <= eps.
- Budget: eps = 4/255.
Editable Interface
You must implement:
run_attack(model, images, labels, eps, device, n_classes) -> adv_images
Inputs:
- images: tensor of shape (N, C, H, W), values in [0, 1].
- labels: tensor of shape (N,).
- n_classes: 10 for CIFAR-10, 100 for CIFAR-100.
Output:
adv_images: same shape as images, also in [0, 1].
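As a reference point for this interface, a minimal `run_attack` could be a plain PGD loop. This is a sketch only: the random start, the step size `eps / 4`, and the 10 iterations are illustrative assumptions, not settings prescribed by the benchmark.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def run_attack(model, images, labels, eps, device, n_classes, steps=10):
    """Sketch: untargeted PGD under an L_inf budget (step size is a guess)."""
    model.eval()
    x = images.detach().to(device)
    y = labels.to(device)
    # Random start inside the eps-ball, clipped to the valid pixel range.
    adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    alpha = eps / 4
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), y)
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()
            # Project back into the L_inf ball, then into [0, 1].
            adv = torch.min(torch.max(adv, x - eps), x + eps).clamp(0.0, 1.0)
    return adv.detach()
```

Projecting into the eps-ball before clamping to [0, 1] guarantees the output passes both validity checks, since clamping can only shrink the perturbation.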
Evaluation Protocol
Each evaluation script:
- Loads one pretrained model.
- Collects up to 1000 samples that are initially classified correctly.
- Runs your run_attack.
- Checks L_inf validity.
- Reports: clean_acc, robust_acc, asr = 1 - robust_acc.
Important:
- ASR denominator is the number of initially correct samples.
- Invalid adversarial outputs (shape mismatch or violated norm) are treated as failure.
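The validity rules above can be sketched as a small checker (`check_adv` is a hypothetical helper, and the `tol` term is an assumption added to absorb float rounding; the harness's exact tolerance is not stated):

```python
import torch


def check_adv(images, adv, eps, tol=1e-6):
    """Sketch of the validity rules: matching shape, L_inf ball, [0, 1] range."""
    same_shape = adv.shape == images.shape
    within_ball = bool((adv - images).abs().max() <= eps + tol)
    in_range = bool(adv.min() >= 0.0) and bool(adv.max() <= 1.0)
    return same_shape and within_ball and in_range
```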
Scenarios
Six scenarios are evaluated in parallel:
- ResNet20 on CIFAR-10
- VGG11-BN on CIFAR-10
- MobileNetV2 on CIFAR-10
- ResNet20 on CIFAR-100
- VGG11-BN on CIFAR-100
- MobileNetV2 on CIFAR-100
Baselines
- fgsm: one-step FGSM baseline (simplest first-order attack).
- pgd: iterative PGD baseline (strong first-order baseline).
- mifgsm: momentum iterative FGSM.
- autoattack: torchattacks.AutoAttack(version="standard") as a strong upper baseline.
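For reference, the core idea behind the mifgsm baseline (accumulating L1-normalized gradients with a momentum decay before taking sign steps) can be sketched as follows. The helper name, `mu=1.0`, step size `eps / steps`, and 10 steps are common choices from the MI-FGSM literature, not necessarily the benchmark's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mifgsm(model, x, y, eps, steps=10, mu=1.0):
    """Sketch of MI-FGSM: momentum over L1-normalized gradients, sign steps."""
    model.eval()
    adv = x.clone().detach()
    g = torch.zeros_like(x)
    alpha = eps / steps
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), y)
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            # Normalize per-sample by mean |grad| so momentum terms are comparable.
            g = mu * g + grad / grad.abs().mean(dim=(1, 2, 3), keepdim=True).clamp_min(1e-12)
            adv = adv + alpha * g.sign()
            adv = torch.min(torch.max(adv, x - eps), x + eps).clamp(0.0, 1.0)
    return adv.detach()
```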
Your goal is to improve ASR while respecting the Linf budget.
torchattacks/bench/custom_attack.py [EDITABLE — lines 3–26 only]
1: import torch
2: import torch.nn as nn
3:
4: # =====================================================================
5: # EDITABLE: implement run_attack below
6: # =====================================================================
7: def run_attack(
8:     model: nn.Module,
9:     images: torch.Tensor,
10:     labels: torch.Tensor,
11:     eps: float,
12:     device: torch.device,
13:     n_classes: int,
14: ) -> torch.Tensor:
15:     """
16:     White-box L_inf adversarial attack.
17:     images: (N, C, H, W) in [0, 1] on device. labels: (N,) on device.
18:     n_classes: 10 for CIFAR-10, 100 for CIFAR-100.
19:     Returns adv_images satisfying (adv_images - images).abs().max() <= eps.
20:     """
21:     _ = (model, labels, eps, device, n_classes)
22:     return images.clone()
23:
24: # =====================================================================
25: # END EDITABLE REGION
26: # =====================================================================
fgsm baseline — editable region [READ-ONLY — reference implementation]
Assistant Step 1 (edit)
Tool Input
1{
2 "op": "replace",
3 "filename": "torchattacks/bench/custom_attack.py",
4 "content": "\n# =====================================================================\n# EDITABLE: implement run_attack below\n# =====================================================================\ndef run_attack(\n model: nn.Module,\n images: torch.Tensor,\n labels: torch.Tensor,\n eps: float,\n device: torch.device,\n n_classes: int,\n) -> torch.Tensor:\n import torch.nn.functional as F\n\n _ = device\n model.eval()\n x = images.detach()\n\n def proj(x_adv: torch.Tensor, x_ref: torch.Tensor) -> torch.Tensor:\n return torch.clamp(torch.max(torch.min(x_adv, x_ref + eps), x_ref - eps), 0.0, 1.0)\n\n def untarget_score(logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:\n y_logit = logits.gather(1, y[:, None]).squeeze(1)\n vals, idx = logits.sort(dim=1, descending=True)\n other = torch.where(idx[:, 0].eq(y), vals[:, 1], vals[:, 0])\n den_idx = min(3, logits.size(1) - 1)\n denom = (vals[:, 0] - vals[:, den_idx]).clamp_min(1e-12)\n return (other - y_logit) / denom + 0.05 * (other - y_logit)\n\n with torch.enable_grad():\n x0 = x.clone().requires_grad_(True)\n g0 = torch.autograd.grad(F.cross_entropy(model(x0), labels), x0)[0]\n x_adv = proj(x + 0.5 * eps * g0.sign() + 0.5 * torch.empty_like(x).uniform_(-eps, eps), x).detach()\n best_x = x_adv.clone()\n best_score = torch.full((x.size(0),), -1e9, device=x.device)\n mom = torch.zeros_like(x)\n prev_g = torch.zeros_like(x)\n alpha = torch.full((x.size(0), 1, 1, 1), eps / 3.0, device=x.device)\n ref_score = best_score.clone()\n\n for t in range(24):\n x_adv.requires_grad_(True)\n logits = model(x_adv)\n ce = F.cross_entropy(logits, labels, reduction=\"none\")\n score = untarget_score(logits, labels)\n obj = ((1.0 - (t + 1) / 24.0) * ce + ((t + 1) / 24.0) * score).sum()\n grad = torch.autograd.grad(obj, x_adv)[0]\n grad = grad / (grad.abs().mean(dim=(1, 2, 3), keepdim=True) + 1e-12)\n mom = 0.8 * mom + grad + 0.2 * (grad - prev_g)\n prev_g = grad.detach()\n with torch.no_grad():\n pred = logits.argmax(dim=1)\n 
cur_score = (score + 0.1 * ce).detach()\n succ = pred.ne(labels)\n cur_score = torch.where(succ, cur_score + 1e6, cur_score)\n better = cur_score > best_score\n best_score = torch.where(better, cur_score, best_score)\n best_x[better] = x_adv.detach()[better]\n x_adv = proj(x_adv.detach() + alpha * mom.sign(), x)\n if (t + 1) % 6 == 0 and t < 23:\n stalled = best_score <= (ref_score + 1e-4)\n alpha[stalled] = alpha[stalled] * 0.5\n x_adv[stalled] = best_x[stalled]\n mom[stalled] = 0\n prev_g[stalled] = 0\n ref_score = best_score.clone()\n\n with torch.no_grad():\n logits = model(x_adv)\n ce = F.cross_entropy(logits, labels, reduction=\"none\")\n score = untarget_score(logits, labels)\n cur_score = score + 0.1 * ce\n succ = logits.argmax(dim=1).ne(labels)\n cur_score = torch.where(succ, cur_score + 1e6, cur_score)\n better = cur_score > best_score\n best_x[better] = x_adv[better]\n best_score = torch.where(better, cur_score, best_score)\n logits_best = model(best_x)\n robust = logits_best.argmax(dim=1).eq(labels)\n\n if robust.any():\n xs, x_ref, ys = best_x[robust].clone(), x[robust], labels[robust]\n logits_s = logits_best[robust]\n xs_score = untarget_score(logits_s, ys)\n order = logits_s.argsort(dim=1, descending=True)\n targets = order[order.ne(ys[:, None])].view(xs.size(0), -1)[:, : (3 if n_classes <= 10 else 4)]\n for j in range(targets.size(1)):\n tgt = targets[:, j]\n x_t = proj(xs + torch.empty_like(xs).uniform_(-0.25 * eps, 0.25 * eps), x_ref).detach()\n mom = torch.zeros_like(xs)\n prev_g = torch.zeros_like(xs)\n alpha_t = torch.full((xs.size(0), 1, 1, 1), eps / 4.0, device=x.device)\n local_best = x_t.clone()\n local_score = torch.full((xs.size(0),), -1e9, device=x.device)\n ref_local = local_score.clone()\n for s in range(10):\n x_t.requires_grad_(True)\n logits_t = model(x_t)\n y_logit = logits_t.gather(1, ys[:, None]).squeeze(1)\n t_logit = logits_t.gather(1, tgt[:, None]).squeeze(1)\n obj = (t_logit - y_logit - 0.25 * F.cross_entropy(logits_t, tgt, 
reduction=\"none\")).sum()\n grad = torch.autograd.grad(obj, x_t)[0]\n grad = grad / (grad.abs().mean(dim=(1, 2, 3), keepdim=True) + 1e-12)\n mom = 0.8 * mom + grad + 0.2 * (grad - prev_g)\n prev_g = grad.detach()\n with torch.no_grad():\n score_t = (t_logit - y_logit).detach()\n succ_t = logits_t.argmax(dim=1).ne(ys)\n score_t = torch.where(succ_t, score_t + 1e6, score_t)\n better_t = score_t > local_score\n local_score = torch.where(better_t, score_t, local_score)\n local_best[better_t] = x_t.detach()[better_t]\n x_t = proj(x_t.detach() + alpha_t * mom.sign(), x_ref)\n if (s + 1) % 5 == 0 and s < 9:\n stalled = local_score <= (ref_local + 1e-4)\n alpha_t[stalled] = alpha_t[stalled] * 0.5\n x_t[stalled] = local_best[stalled]\n mom[stalled] = 0\n prev_g[stalled] = 0\n ref_local = local_score.clone()\n with torch.no_grad():\n logits_l = model(local_best)\n score_l = untarget_score(logits_l, ys)\n succ_l = logits_l.argmax(dim=1).ne(ys)\n score_l = torch.where(succ_l, score_l + 1e6, score_l)\n better_l = score_l > xs_score\n xs_score = torch.where(better_l, score_l, xs_score)\n xs[better_l] = local_best[better_l]\n best_x[robust] = xs\n\n return best_x.detach()\n\n# =====================================================================\n# END EDITABLE REGION\n# =====================================================================",
5 "after_line": 0,
6 "start_line": 3,
7 "end_line": 26
8}
Tool Result Step 1
OK: Replaced lines 3..26 with 135 line(s) in torchattacks/bench/custom_attack.py. Editable range: 3–137.
[Current file: torchattacks/bench/custom_attack.py | editable: 3–137 | total: 137 lines]
1| import torch
2| import torch.nn as nn
3|
4| # =====================================================================
5| # EDITABLE: implement run_attack below
6| # =====================================================================
7| def run_attack(
8|     model: nn.Module,
9|     images: torch.Tensor,
10|     labels: torch.Tensor,
11|     eps: float,
12|     device: torch.device,
13|     n_classes: int,
14| ) -> torch.Tensor:
15|     import torch.nn.functional as F
16|
17|     _ = device
18|     model.eval()
19|     x = images.detach()
20|
21|     def proj(x_adv: torch.Tensor, x_ref: torch.Tensor) -> torch.Tensor:
22|         return torch.clamp(torch.max(torch.min(x_adv, x_ref + eps), x_ref - eps), 0.0, 1.0)
Assistant Step 2 (test)
Tool Result Step 2
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)
### ResNet20-C10 (scripts/eval_resnet20_cifar10.sh)
Attack results (ResNet20-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### VGG11BN-C10 (scripts/eval_vgg11bn_cifar10.sh)
Attack results (VGG11BN-C10): ASR=0.9520, clean_acc=1.0000, robust_acc=0.0480
### MobileNetV2-C10 (scripts/eval_mobilenetv2_cifar10.sh)
Attack results (MobileNetV2-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### ResNet20-C100 (scripts/eval_resnet20_cifar100.sh)
Attack results (ResNet20-C100): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### VGG11BN-C100 (scripts/eval_vgg11bn_cifar100.sh)
Attack results (VGG11BN-C100): ASR=0.9540, clean_acc=1.0000, robust_acc=0.0460
[Leaderboard] Results saved: {'asr_ResNet20_C10': 1.0, 'elapsed_ResNet20-C10': 8, 'asr_VGG11BN_C10': 0.952, 'elapsed_VGG11BN-C10': 9, 'asr_MobileNetV2_C10': 1.0, 'elapsed_MobileNetV2-C10': 10, 'asr_ResNet20_C100': 1.0, 'elapsed_ResNet20-C100': 10, 'asr_VGG11BN
Assistant Step 3 (edit)
Tool Input
1{
2 "op": "replace",
3 "filename": "torchattacks/bench/custom_attack.py",
4 "content": " if robust.any():\n xs, x_ref, ys = best_x[robust].clone(), x[robust], labels[robust]\n logits_s = logits_best[robust]\n xs_score = untarget_score(logits_s, ys) + 0.1 * F.cross_entropy(logits_s, ys, reduction=\"none\")\n top_m = 5 if n_classes <= 10 else 7\n order = logits_s.argsort(dim=1, descending=True)\n targets = order[order.ne(ys[:, None])].view(xs.size(0), -1)[:, :top_m]\n\n with torch.no_grad():\n base_t = logits_s.gather(1, targets)\n weights = torch.softmax(base_t / 2.0, dim=1)\n\n x_t = proj(xs + torch.empty_like(xs).uniform_(-0.35 * eps, 0.35 * eps), x_ref).detach()\n mom = torch.zeros_like(xs)\n prev_g = torch.zeros_like(xs)\n alpha_t = torch.full((xs.size(0), 1, 1, 1), eps / 3.0, device=x.device)\n local_best = xs.clone()\n local_score = xs_score.clone()\n ref_local = local_score.clone()\n tau = 0.75\n\n for s in range(12):\n x_t.requires_grad_(True)\n logits_t = model(x_t)\n y_logit = logits_t.gather(1, ys[:, None]).squeeze(1)\n tgt_logits = logits_t.gather(1, targets)\n consensus = 0.5 * (weights * tgt_logits).sum(dim=1) + 0.5 * tau * torch.logsumexp(tgt_logits / tau, dim=1)\n ce_y = F.cross_entropy(logits_t, ys, reduction=\"none\")\n grad = torch.autograd.grad((consensus - y_logit + 0.2 * ce_y).sum(), x_t)[0]\n grad = grad / (grad.abs().mean(dim=(1, 2, 3), keepdim=True) + 1e-12)\n mom = 0.8 * mom + grad + 0.2 * (grad - prev_g)\n prev_g = grad.detach()\n with torch.no_grad():\n score_t = untarget_score(logits_t, ys) + 0.1 * ce_y + 0.1 * (consensus - y_logit)\n succ_t = logits_t.argmax(dim=1).ne(ys)\n score_t = torch.where(succ_t, score_t + 1e6, score_t)\n better_t = score_t > local_score\n local_score = torch.where(better_t, score_t, local_score)\n local_best[better_t] = x_t.detach()[better_t]\n x_t = proj(x_t.detach() + alpha_t * (mom + 0.25 * grad).sign(), x_ref)\n if (s + 1) % 4 == 0 and s < 11:\n snap = proj(x_ref + eps * (mom + grad).sign(), x_ref)\n logits_snap = model(snap)\n snap_cons = 0.5 * (weights * logits_snap.gather(1, 
targets)).sum(dim=1) + 0.5 * tau * torch.logsumexp(logits_snap.gather(1, targets) / tau, dim=1)\n snap_score = untarget_score(logits_snap, ys) + 0.1 * F.cross_entropy(logits_snap, ys, reduction=\"none\") + 0.1 * (snap_cons - logits_snap.gather(1, ys[:, None]).squeeze(1))\n snap_succ = logits_snap.argmax(dim=1).ne(ys)\n snap_score = torch.where(snap_succ, snap_score + 1e6, snap_score)\n better_snap = snap_score > local_score\n local_score = torch.where(better_snap, snap_score, local_score)\n local_best[better_snap] = snap[better_snap]\n stalled = local_score <= (ref_local + 1e-4)\n alpha_t[stalled] = alpha_t[stalled] * 0.5\n x_t[stalled] = snap[stalled]\n mom[stalled] = 0\n prev_g[stalled] = 0\n ref_local = local_score.clone()\n\n with torch.no_grad():\n xs_better = local_score > xs_score\n xs_score = torch.where(xs_better, local_score, xs_score)\n xs[xs_better] = local_best[xs_better]\n logits_r = model(xs)\n rem = logits_r.argmax(dim=1).eq(ys)\n\n if rem.any():\n xs_r, x_ref_r, ys_r = xs[rem].clone(), x_ref[rem], ys[rem]\n logits_r = logits_r[rem]\n score_r = untarget_score(logits_r, ys_r) + 0.1 * F.cross_entropy(logits_r, ys_r, reduction=\"none\")\n order_r = logits_r.argsort(dim=1, descending=True)\n targets_r = order_r[order_r.ne(ys_r[:, None])].view(xs_r.size(0), -1)[:, : (4 if n_classes <= 10 else 5)]\n for j in range(targets_r.size(1)):\n tgt = targets_r[:, j]\n x_p = proj(xs_r + torch.empty_like(xs_r).uniform_(-0.2 * eps, 0.2 * eps), x_ref_r).detach()\n mom = torch.zeros_like(xs_r)\n prev_g = torch.zeros_like(xs_r)\n alpha_p = torch.full((xs_r.size(0), 1, 1, 1), eps / 4.0, device=x.device)\n local_best = xs_r.clone()\n local_score = score_r.clone()\n ref_local = local_score.clone()\n for s in range(8):\n x_p.requires_grad_(True)\n logits_p = model(x_p)\n y_logit = logits_p.gather(1, ys_r[:, None]).squeeze(1)\n t_logit = logits_p.gather(1, tgt[:, None]).squeeze(1)\n ce_y = F.cross_entropy(logits_p, ys_r, reduction=\"none\")\n ce_t = F.cross_entropy(logits_p, 
tgt, reduction=\"none\")\n grad = torch.autograd.grad((t_logit - y_logit + 0.2 * ce_y - 0.4 * ce_t).sum(), x_p)[0]\n grad = grad / (grad.abs().mean(dim=(1, 2, 3), keepdim=True) + 1e-12)\n mom = 0.75 * mom + grad + 0.25 * (grad - prev_g)\n prev_g = grad.detach()\n with torch.no_grad():\n score_p = untarget_score(logits_p, ys_r) + 0.1 * ce_y + 0.1 * (t_logit - y_logit)\n succ_p = logits_p.argmax(dim=1).ne(ys_r)\n score_p = torch.where(succ_p, score_p + 1e6, score_p)\n better_p = score_p > local_score\n local_score = torch.where(better_p, score_p, local_score)\n local_best[better_p] = x_p.detach()[better_p]\n x_p = proj(x_p.detach() + alpha_p * (mom + 0.25 * grad).sign(), x_ref_r)\n if (s + 1) % 4 == 0 and s < 7:\n snap = proj(x_ref_r + eps * (mom + grad).sign(), x_ref_r)\n logits_snap = model(snap)\n y_snap = logits_snap.gather(1, ys_r[:, None]).squeeze(1)\n t_snap = logits_snap.gather(1, tgt[:, None]).squeeze(1)\n snap_score = untarget_score(logits_snap, ys_r) + 0.1 * F.cross_entropy(logits_snap, ys_r, reduction=\"none\") + 0.1 * (t_snap - y_snap)\n snap_succ = logits_snap.argmax(dim=1).ne(ys_r)\n snap_score = torch.where(snap_succ, snap_score + 1e6, snap_score)\n better_snap = snap_score > local_score\n local_score = torch.where(better_snap, snap_score, local_score)\n local_best[better_snap] = snap[better_snap]\n stalled = local_score <= (ref_local + 1e-4)\n alpha_p[stalled] = alpha_p[stalled] * 0.5\n x_p[stalled] = snap[stalled]\n mom[stalled] = 0\n prev_g[stalled] = 0\n ref_local = local_score.clone()\n with torch.no_grad():\n better_r = local_score > score_r\n score_r = torch.where(better_r, local_score, score_r)\n xs_r[better_r] = local_best[better_r]\n xs[rem] = xs_r\n\n best_x[robust] = xs",
5 "after_line": 82,
6 "start_line": 83,
7 "end_line": 131
8}
Tool Result Step 3
OK: Replaced lines 83..131 with 122 line(s) in torchattacks/bench/custom_attack.py. Editable range: 3–210.
[Current file: torchattacks/bench/custom_attack.py | editable: 3–210 | total: 210 lines]
1| import torch
2| import torch.nn as nn
3|
4| # =====================================================================
5| # EDITABLE: implement run_attack below
6| # =====================================================================
7| def run_attack(
8|     model: nn.Module,
9|     images: torch.Tensor,
10|     labels: torch.Tensor,
11|     eps: float,
12|     device: torch.device,
13|     n_classes: int,
14| ) -> torch.Tensor:
15|     import torch.nn.functional as F
16|
17|     _ = device
18|     model.eval()
19|     x = images.detach()
20|
21|     def proj(x_adv: torch.Tensor, x_ref: torch.Tensor) -> torch.Tensor:
22|         return torch.clamp(torch.max(torch.min(x_adv, x_ref + eps), x_ref - eps), 0.0, 1.0)
Assistant Step 4 (test)
Tool Result Step 4
[Test #2] (1 test remaining; call submit(n=2) to submit this result)
### ResNet20-C10 (scripts/eval_resnet20_cifar10.sh)
Attack results (ResNet20-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### VGG11BN-C10 (scripts/eval_vgg11bn_cifar10.sh)
Attack results (VGG11BN-C10): ASR=0.9510, clean_acc=1.0000, robust_acc=0.0490
### MobileNetV2-C10 (scripts/eval_mobilenetv2_cifar10.sh)
Attack results (MobileNetV2-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### ResNet20-C100 (scripts/eval_resnet20_cifar100.sh)
Attack results (ResNet20-C100): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### VGG11BN-C100 (scripts/eval_vgg11bn_cifar100.sh)
Attack results (VGG11BN-C100): ASR=0.9520, clean_acc=1.0000, robust_acc=0.0480
[Leaderboard] Results saved: {'asr_ResNet20_C10': 1.0, 'elapsed_ResNet20-C10': 8, 'asr_VGG11BN_C10': 0.951, 'elapsed_VGG11BN-C10': 10, 'asr_MobileNetV2_C10': 1.0, 'elapsed_MobileNetV2-C10': 10, 'asr_ResNet20_C100': 1.0, 'elapsed_ResNet20-C100': 10, 'asr_VGG11BN
Assistant Step 5 (submit)
Tool Input
1{
2 "n": 1
3}
Tool Result Step 5
[submit] Submitting result from test #1 as final.
### ResNet20-C10 (scripts/eval_resnet20_cifar10.sh)
Attack results (ResNet20-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### VGG11BN-C10 (scripts/eval_vgg11bn_cifar10.sh)
Attack results (VGG11BN-C10): ASR=0.9520, clean_acc=1.0000, robust_acc=0.0480
### MobileNetV2-C10 (scripts/eval_mobilenetv2_cifar10.sh)
Attack results (MobileNetV2-C10): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### ResNet20-C100 (scripts/eval_resnet20_cifar100.sh)
Attack results (ResNet20-C100): ASR=1.0000, clean_acc=1.0000, robust_acc=0.0000
### VGG11BN-C100 (scripts/eval_vgg11bn_cifar100.sh)
Attack results (VGG11BN-C100): ASR=0.9540, clean_acc=1.0000, robust_acc=0.0460
### MobileNetV2-C100 (scripts/eval_mobilenetv2_cifar100.sh)
Attack results (MobileNetV2-C100): ASR=0.9990, clean_acc=1.0000, robust_acc=0.0010
[Leaderboard] Results saved: {'asr_ResNet20_C10': 1.0, 'elapsed_ResNet20-C10': 8, 'asr_VGG11BN_C10': 0.952, 'elapsed_VGG11BN-C10': 9, 'a