Agent Conversation: anthropic/claude-opus-4.6
Task: security-adversarial-training
Adversarial Training for Model Robustness
Research Question
How can we design adversarial training methods that improve model robustness against L_inf adversarial attacks?
Background
Adversarial training is the most effective approach for improving neural network robustness against adversarial examples. The standard method (Madry et al., 2018) trains on PGD-generated adversarial examples using cross-entropy loss, but suffers from a trade-off between clean accuracy and robust accuracy. Advanced methods like TRADES and MART address this through different loss formulations that decouple the robustness objective from clean classification.
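The PGD inner maximization these methods share can be sketched in a few lines of PyTorch. This is an illustrative helper only (the name `pgd_attack` is hypothetical, not part of the benchmark):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps, alpha, steps):
    """L_inf PGD with random start: ascend the CE loss, project back into the eps-ball."""
    adv = images.detach() + torch.empty_like(images).uniform_(-eps, eps)
    adv = torch.clamp(adv, 0.0, 1.0)
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), labels)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()      # gradient-sign ascent step
        delta = torch.clamp(adv - images, -eps, eps)  # project onto the L_inf ball
        adv = torch.clamp(images + delta, 0.0, 1.0)   # stay in valid pixel range
    return adv.detach()
```

Standard PGD-AT (Madry et al., 2018) then simply minimizes the CE loss of the model on these adversarial examples.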
Task
Implement a novel adversarial training method in bench/custom_adv_train.py by modifying the AdversarialTrainer class. Your method should improve robust accuracy against white-box L_inf attacks while maintaining reasonable clean accuracy.
Interface
You must implement the AdversarialTrainer class with two methods:
- __init__(self, model, eps, alpha, attack_steps, num_classes, **kwargs): Initialize your trainer.
  - model: The neural network to train (nn.Module).
  - eps: L_inf perturbation budget (0.3 for MNIST, 8/255 for CIFAR).
  - alpha: Step size for the inner PGD attack.
  - attack_steps: Number of PGD steps for adversarial example generation.
  - num_classes: Number of output classes (10 or 100).
- train_step(self, images, labels, optimizer) -> dict: Perform one training step.
  - images: Clean images, shape (N, C, H, W), values in [0, 1].
  - labels: Ground-truth labels, shape (N,).
  - optimizer: SGD optimizer (lr, momentum, weight_decay already configured).
  - Returns: dict with at least a 'loss' key (float).
The training loop, learning rate schedule (cosine annealing), model architecture, and data loading are handled externally. You only control the adversarial training procedure within each step.
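For orientation, a minimal trainer that satisfies this interface while reproducing the pgdat baseline might look like the following. This is a sketch under the interface described above, not the benchmark's reference implementation:

```python
import torch
import torch.nn.functional as F

class AdversarialTrainer:
    """Interface-conforming sketch of standard PGD adversarial training (Madry et al., 2018)."""

    def __init__(self, model, eps, alpha, attack_steps, num_classes, **kwargs):
        self.model = model
        self.eps = eps
        self.alpha = alpha
        self.attack_steps = attack_steps
        self.num_classes = num_classes

    def train_step(self, images, labels, optimizer):
        # Inner maximization: PGD from a random start inside the eps-ball
        self.model.eval()
        adv = images.detach() + torch.empty_like(images).uniform_(-self.eps, self.eps)
        adv = torch.clamp(adv, 0.0, 1.0)
        for _ in range(self.attack_steps):
            adv.requires_grad_(True)
            loss = F.cross_entropy(self.model(adv), labels)
            grad = torch.autograd.grad(loss, adv)[0]
            adv = adv.detach() + self.alpha * grad.sign()
            adv = torch.clamp(images + torch.clamp(adv - images, -self.eps, self.eps),
                              0.0, 1.0)
        adv = adv.detach()
        # Outer minimization: CE loss on the adversarial examples
        self.model.train()
        loss = F.cross_entropy(self.model(adv), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return {'loss': loss.item()}
```

Switching the model to eval mode during attack generation avoids updating batch-norm statistics on intermediate perturbations; the real training loop, schedule, and data loading remain external as stated above.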
Evaluation
After training, models are evaluated on:
- Clean accuracy: Accuracy on unperturbed test images.
- Robust accuracy (FGSM): Accuracy under 1-step FGSM attack.
- Robust accuracy (PGD-50): Accuracy under 50-step PGD attack (primary metric).
Four scenarios (model + dataset):
- SmallCNN on MNIST (eps = 0.3)
- PreActResNet-18 on CIFAR-10 (eps = 8/255)
- VGG-11-BN on CIFAR-10 (eps = 8/255)
- PreActResNet-18 on CIFAR-100 (eps = 8/255)
Higher robust accuracy (PGD-50) across all scenarios is better.
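For intuition, the clean and FGSM metrics can be sketched as below (hypothetical helper names; the harness uses its own evaluation code, and PGD-50 iterates the same signed-gradient update 50 times with projection onto the eps-ball):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def clean_accuracy(model, images, labels):
    """Fraction of unperturbed test images classified correctly."""
    return (model(images).argmax(dim=1) == labels).float().mean().item()

def fgsm_accuracy(model, images, labels, eps):
    """Accuracy under a 1-step FGSM attack: one signed-gradient step of size eps."""
    x = images.clone().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x), labels), x)[0]
    adv = torch.clamp(images + eps * grad.sign(), 0.0, 1.0)
    return clean_accuracy(model, adv, labels)
```
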
Baselines
- standard: Vanilla training (no adversarial examples). High clean accuracy, ~0% robust accuracy.
- pgdat: PGD Adversarial Training (Madry et al., 2018). Trains on PGD adversarial examples with CE loss.
- trades: TRADES (Zhang et al., 2019). Balances clean and robust accuracy via KL-divergence regularization.
- mart: MART (Wang et al., 2020). Misclassification-aware regularization that focuses on hard examples.
- awp: AWP + TRADES (Wu et al., 2020). Adversarial weight perturbation combined with TRADES.
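The TRADES objective that several of these baselines build on can be sketched as follows. This is a simplified sketch of the outer loss only; the official method also generates x_adv by maximizing the same KL term:

```python
import torch
import torch.nn.functional as F

def trades_loss(model, x, x_adv, y, beta=6.0):
    """TRADES outer loss: clean CE plus beta-weighted KL between adv and clean predictions."""
    logits_clean = model(x)
    logits_adv = model(x_adv)
    loss_nat = F.cross_entropy(logits_clean, y)
    # KL divergence pulls the adversarial predictive distribution toward the clean one
    loss_rob = F.kl_div(F.log_softmax(logits_adv, dim=1),
                        F.softmax(logits_clean, dim=1),
                        reduction='batchmean')
    return loss_nat + beta * loss_rob
```

When x_adv equals x the KL term vanishes and the loss reduces to plain cross-entropy, which is what decouples the robustness objective from clean classification.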
Tool Input
1{
2 "op": "replace",
3 "filename": "torchattacks/bench/custom_adv_train.py",
4 "start_line": 10,
5 "end_line": 54,
6 "content": "class AdversarialTrainer:\n \"\"\"Dual-Objective TRADES with Misclassification-Aware Regularization (DOT-MAR).\n Novel: (1) dual CE+KL inner attack, (2) confidence-weighted adv CE outer loss.\"\"\"\n\n def __init__(self, model, eps, alpha, attack_steps, num_classes, **kwargs):\n self.model = model\n self.eps = eps\n self.alpha = alpha\n self.attack_steps = attack_steps\n self.num_classes = num_classes\n self.beta = 6.0 # TRADES KL weight\n self.lam = 1.0 # misclassification-aware adv CE weight\n\n def train_step(self, images, labels, optimizer):\n self.model.train()\n logits_clean = self.model(images)\n prob_clean = F.softmax(logits_clean.detach(), dim=1)\n # Inner maximization: dual-objective PGD (CE + KL) for stronger attacks\n self.model.eval()\n adv = images.detach() + torch.empty_like(images).uniform_(-self.eps, self.eps)\n adv = torch.clamp(adv, 0.0, 1.0)\n for _ in range(self.attack_steps):\n adv.requires_grad_(True)\n logits_adv = self.model(adv)\n loss_inner = F.cross_entropy(logits_adv, labels) + F.kl_div(\n F.log_softmax(logits_adv, dim=1), prob_clean, reduction='batchmean')\n grad = torch.autograd.grad(loss_inner, adv)[0]\n adv = adv.detach() + self.alpha * grad.sign()\n delta = torch.clamp(adv - images, -self.eps, self.eps)\n adv = torch.clamp(images + delta, 0.0, 1.0).detach()\n # Outer minimization: clean CE + TRADES KL + misclassification-aware adv CE\n self.model.train()\n logits_clean = self.model(images)\n logits_adv = self.model(adv)\n loss_clean = F.cross_entropy(logits_clean, labels)\n loss_kl = F.kl_div(F.log_softmax(logits_adv, dim=1),\n F.softmax(logits_clean.detach(), dim=1), reduction='batchmean')\n # Weight adv CE by (1 - p(y|x_clean)): focus on uncertain/hard samples\n p_y = prob_clean[torch.arange(len(labels), device=labels.device), labels]\n loss_adv = (F.cross_entropy(logits_adv, labels, reduction='none') * (1 - p_y)).mean()\n loss = loss_clean + self.beta * loss_kl + self.lam * loss_adv\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n return {'loss': loss.item()}\n\n"
7}
OK: Replaced lines 10..54 with 46 line(s) in torchattacks/bench/custom_adv_train.py. Editable range: 10–55.
[Current file: torchattacks/bench/custom_adv_train.py | editable: 10–55 | total: 58 lines]
7| # ═══════════════════════════════════════════════════════════════════
8| # EDITABLE — implement AdversarialTrainer below
9| # ═══════════════════════════════════════════════════════════════════
10| class AdversarialTrainer:
11|     """Dual-Objective TRADES with Misclassification-Aware Regularization (DOT-MAR).
12|     Novel: (1) dual CE+KL inner attack, (2) confidence-weighted adv CE outer loss."""
13|
14|     def __init__(self, model, eps, alpha, attack_steps, num_classes, **kwargs):
15|         self.model = model
16|         self.eps = eps
17|         self.alpha = alpha
18|         self.attack_steps = attack_steps
19|         self.num_classes = num_classes
20|         self.beta = 6.0  # TRADES KL weight
21|         self.
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)
## Seed 42
### SmallCNN-MNIST (scripts/SmallCNN_MNIST.sh)
Training progress (SmallCNN-MNIST):
TRAIN_METRICS epoch=1 loss=3.1377 lr=0.009961
TRAIN_METRICS epoch=10 loss=0.4807 lr=0.006545
TRAIN_METRICS epoch=20 loss=0.2490 lr=0.000955
Results (SmallCNN-MNIST): clean_acc=0.9905, robust_acc_fgsm=0.9666, robust_acc_pgd=0.9340
### PreActResNet18-C10 (scripts/PreActResNet18_C10.sh)
Training progress (PreActResNet18-C10):
TRAIN_METRICS epoch=40 loss=2.6465 lr=0.050000
TRAIN_METRICS epoch=50 loss=2.5836 lr=0.030866
TRAIN_METRICS epoch=60 loss=2.4486 lr=0.014645
TRAIN_METRICS epoch=70 loss=2.1012 lr=0.003806
TRAIN_METRICS epoch=80 loss=1.6572 lr=0.000000
Results (PreActResNet18-C10): clean_acc=0.8578, robust_acc_fgsm=0.5656, robust_acc_pgd=0.4674
## Seed 123
### SmallCNN-MNIST (scripts/SmallCNN_MNIST.sh)
Training progress (SmallCNN-MNIST):
TRAIN_METRICS epoch=1 loss=2.9383 lr=0.009961
TRAIN_METRICS epoch=10 loss=0.4539
Tool Input
1{
2 "op": "replace",
3 "filename": "torchattacks/bench/custom_adv_train.py",
4 "start_line": 10,
5 "end_line": 54,
6 "content": "class AdversarialTrainer:\n \"\"\"Hybrid TRADES: boosted KL + adversarial CE with dual-objective attack.\"\"\"\n\n def __init__(self, model, eps, alpha, attack_steps, num_classes, **kwargs):\n self.model = model\n self.eps = eps\n self.alpha = alpha\n self.attack_steps = attack_steps\n self.num_classes = num_classes\n self.beta = 6.0 # KL regularization weight\n\n def train_step(self, images, labels, optimizer):\n self.model.train()\n logits_clean = self.model(images)\n p_clean = F.softmax(logits_clean, dim=1)\n loss_nat = F.cross_entropy(logits_clean, labels)\n # Hybrid PGD attack: maximize KL + CE for stronger adversarial examples\n self.model.eval()\n adv = images.detach() + torch.empty_like(images).uniform_(-self.eps, self.eps)\n adv = torch.clamp(adv, 0.0, 1.0)\n for _ in range(self.attack_steps):\n adv.requires_grad_(True)\n la = self.model(adv)\n lk = F.kl_div(F.log_softmax(la, dim=1), p_clean.detach(),\n reduction='batchmean') + F.cross_entropy(la, labels)\n g = torch.autograd.grad(lk, adv)[0]\n adv = adv.detach() + self.alpha * g.sign()\n adv = torch.clamp(images + torch.clamp(adv - images, -self.eps, self.eps),\n 0.0, 1.0).detach()\n # Outer loss: natural CE + boosted KL + adversarial CE\n self.model.train()\n logits_adv = self.model(adv)\n p_y = p_clean.detach().gather(1, labels.unsqueeze(1)).squeeze(1)\n kl_per = F.kl_div(F.log_softmax(logits_adv, dim=1), p_clean.detach(),\n reduction='none').sum(1)\n boost = 1.0 + (1.0 - p_y) # up-weight uncertain/misclassified samples\n loss_rob = (boost * kl_per).mean()\n loss_adv = F.cross_entropy(logits_adv, labels)\n loss = loss_nat + self.beta * loss_rob + loss_adv\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n return {'loss': loss.item()}\n\n"
7}
OK: Replaced lines 10..54 with 44 line(s) in torchattacks/bench/custom_adv_train.py. Editable range: 10–53.
[Current file: torchattacks/bench/custom_adv_train.py | editable: 10–53 | total: 56 lines]
7| # ═══════════════════════════════════════════════════════════════════
8| # EDITABLE — implement AdversarialTrainer below
9| # ═══════════════════════════════════════════════════════════════════
10| class AdversarialTrainer:
11|     """Hybrid TRADES: boosted KL + adversarial CE with dual-objective attack."""
12|
13|     def __init__(self, model, eps, alpha, attack_steps, num_classes, **kwargs):
14|         self.model = model
15|         self.eps = eps
16|         self.alpha = alpha
17|         self.attack_steps = attack_steps
18|         self.num_classes = num_classes
19|         self.beta = 6.0  # KL regularization weight
20|
21|     def train_step(self, images, labels, optimizer):
22|         self.model.train()
23|
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)
## Seed 42
### SmallCNN-MNIST (scripts/SmallCNN_MNIST.sh)
Training progress (SmallCNN-MNIST):
TRAIN_METRICS epoch=1 loss=3.6646 lr=0.009961
TRAIN_METRICS epoch=10 loss=0.5905 lr=0.006545
TRAIN_METRICS epoch=20 loss=0.2841 lr=0.000955
Results (SmallCNN-MNIST): clean_acc=0.9906, robust_acc_fgsm=0.9649, robust_acc_pgd=0.9314
### PreActResNet18-C10 (scripts/PreActResNet18_C10.sh)
Training progress (PreActResNet18-C10):
TRAIN_METRICS epoch=40 loss=3.4286 lr=0.050000
TRAIN_METRICS epoch=50 loss=3.3401 lr=0.030866
TRAIN_METRICS epoch=60 loss=3.1586 lr=0.014645
TRAIN_METRICS epoch=70 loss=2.7238 lr=0.003806
TRAIN_METRICS epoch=80 loss=2.1851 lr=0.000000
Results (PreActResNet18-C10): clean_acc=0.8565, robust_acc_fgsm=0.5666, robust_acc_pgd=0.4665
## Seed 123
### SmallCNN-MNIST (scripts/SmallCNN_MNIST.sh)
Training progress (SmallCNN-MNIST):
TRAIN_METRICS epoch=1 loss=3.4087 lr=0.009961
TRAIN_METRICS epoch=10 loss=0.5736
Tool Input
1{
2 "op": "replace",
3 "filename": "torchattacks/bench/custom_adv_train.py",
4 "start_line": 10,
5 "end_line": 53,
6 "content": "class AdversarialTrainer:\n \"\"\"Curriculum-Attack TRADES with confidence-weighted KL and AWP.\"\"\"\n def __init__(self, model, eps, alpha, attack_steps, num_classes, **kwargs):\n self.model = model; self.eps = eps; self.alpha = alpha\n self.attack_steps = attack_steps; self.num_classes = num_classes\n self.beta = 6.0; self.awp_gamma = 0.005\n def _loss(self, imgs, adv, labels):\n lc = self.model(imgs); la = self.model(adv)\n pc = F.softmax(lc, dim=1).detach()\n py = pc.gather(1, labels.unsqueeze(1)).squeeze(1)\n kl = F.kl_div(F.log_softmax(la, dim=1), pc, reduction='none').sum(1)\n return F.cross_entropy(lc, labels) + self.beta * ((1.0 + 0.5 * (1.0 - py)) * kl).mean()\n def train_step(self, images, labels, optimizer):\n self.model.train(); lc = self.model(images); pc = F.softmax(lc, dim=1)\n self.model.eval()\n adv = images.detach() + torch.empty_like(images).uniform_(-self.eps, self.eps)\n adv = torch.clamp(adv, 0.0, 1.0); mid = self.attack_steps // 2\n for step in range(self.attack_steps):\n adv.requires_grad_(True); la = self.model(adv)\n if step < mid:\n loss_a = F.kl_div(F.log_softmax(la, dim=1), pc.detach(), reduction='batchmean')\n else:\n loss_a = F.cross_entropy(la, labels)\n g = torch.autograd.grad(loss_a, adv)[0]; adv = adv.detach() + self.alpha * g.sign()\n adv = torch.clamp(images + torch.clamp(adv - images, -self.eps, self.eps), 0.0, 1.0).detach()\n self.model.train()\n loss1 = self._loss(images, adv, labels)\n params = [p for p in self.model.parameters() if p.requires_grad]\n grads = torch.autograd.grad(loss1, params); vs = []\n for p, g in zip(params, grads):\n v = self.awp_gamma * g if len(p.shape) > 1 else torch.zeros_like(p)\n p.data.add_(v); vs.append(v)\n optimizer.zero_grad(); loss2 = self._loss(images, adv, labels); loss2.backward()\n for p, v in zip(params, vs): p.data.sub_(v)\n optimizer.step()\n return {'loss': loss2.item()}\n\n"
7}
OK: Replaced lines 10..53 with 37 line(s) in torchattacks/bench/custom_adv_train.py. Editable range: 10–46.
[Current file: torchattacks/bench/custom_adv_train.py | editable: 10–46 | total: 49 lines]
7| # ═══════════════════════════════════════════════════════════════════
8| # EDITABLE — implement AdversarialTrainer below
9| # ═══════════════════════════════════════════════════════════════════
10| class AdversarialTrainer:
11|     """Curriculum-Attack TRADES with confidence-weighted KL and AWP."""
12|     def __init__(self, model, eps, alpha, attack_steps, num_classes, **kwargs):
13|         self.model = model; self.eps = eps; self.alpha = alpha
14|         self.attack_steps = attack_steps; self.num_classes = num_classes
15|         self.beta = 6.0; self.awp_gamma = 0.005
16|     def _loss(self, imgs, adv, labels):
17|         lc = self.model(imgs); la = self.model(adv)
18|         pc = F.softmax(lc, dim=1).detach()
19|         py
[Test #2] (1 test remaining; call submit(n=2) to submit this result)
### SmallCNN-MNIST (scripts/SmallCNN_MNIST.sh)
Training progress (SmallCNN-MNIST):
TRAIN_METRICS epoch=1 loss=2.2933 lr=0.009961
TRAIN_METRICS epoch=10 loss=0.3888 lr=0.006545
TRAIN_METRICS epoch=20 loss=0.2324 lr=0.000955
Results (SmallCNN-MNIST): clean_acc=0.9909, robust_acc_fgsm=0.9634, robust_acc_pgd=0.9310
### PreActResNet18-C10 (scripts/PreActResNet18_C10.sh)
Training progress (PreActResNet18-C10):
TRAIN_METRICS epoch=40 loss=1.7156 lr=0.050000
TRAIN_METRICS epoch=50 loss=1.6788 lr=0.030866
TRAIN_METRICS epoch=60 loss=1.5996 lr=0.014645
TRAIN_METRICS epoch=70 loss=1.4502 lr=0.003806
TRAIN_METRICS epoch=80 loss=1.2351 lr=0.000000
Results (PreActResNet18-C10): clean_acc=0.8743, robust_acc_fgsm=0.5464, robust_acc_pgd=0.4458
[Leaderboard] Results saved: {'clean_acc_SmallCNN_MNIST': 0.9909, 'robust_acc_fgsm_SmallCNN_MNIST': 0.9634, 'robust_acc_pgd_SmallCNN_MNIST': 0.931, 'elapsed_SmallCNN-MNIST': 491, 'clean_acc_P
Tool Input
1{
2 "n": 1
3}
[submit] Submitting result from test #1 as final.
## Seed 42
### SmallCNN-MNIST (scripts/SmallCNN_MNIST.sh)
Training progress (SmallCNN-MNIST):
TRAIN_METRICS epoch=1 loss=3.6646 lr=0.009961
TRAIN_METRICS epoch=10 loss=0.5905 lr=0.006545
TRAIN_METRICS epoch=20 loss=0.2841 lr=0.000955
Results (SmallCNN-MNIST): clean_acc=0.9906, robust_acc_fgsm=0.9649, robust_acc_pgd=0.9314
### PreActResNet18-C10 (scripts/PreActResNet18_C10.sh)
Training progress (PreActResNet18-C10):
TRAIN_METRICS epoch=40 loss=3.4286 lr=0.050000
TRAIN_METRICS epoch=50 loss=3.3401 lr=0.030866
TRAIN_METRICS epoch=60 loss=3.1586 lr=0.014645
TRAIN_METRICS epoch=70 loss=2.7238 lr=0.003806
TRAIN_METRICS epoch=80 loss=2.1851 lr=0.000000
Results (PreActResNet18-C10): clean_acc=0.8565, robust_acc_fgsm=0.5666, robust_acc_pgd=0.4665
### PreActResNet18-C100 (scripts/PreActResNet18_C100.sh)
Training progress (PreActResNet18-C100):
TRAIN_METRICS epoch=40 loss=6.6722 lr=0.050000
TRAIN_METRICS epoch=50 loss=6.4446 lr=0.030866
TRAIN