security-adversarial-training
Description
Adversarial Training for Model Robustness
Research Question
How can we design better adversarial training methods that improve model robustness against L_inf adversarial attacks?
Background
Adversarial training is the most effective approach for improving neural network robustness against adversarial examples. The standard method (Madry et al., 2018) trains on PGD-generated adversarial examples using cross-entropy loss, but suffers from a trade-off between clean accuracy and robust accuracy. Advanced methods like TRADES and MART address this through different loss formulations that decouple the robustness objective from clean classification.
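The inner maximization in Madry et al.'s method is projected gradient descent under an L_inf constraint. A minimal sketch of that attack step (the function name `pgd_attack` is ours, not part of the benchmark API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps, alpha, steps):
    """Generate L_inf PGD adversarial examples with a random start."""
    model.eval()
    # Random initialization inside the eps-ball, clipped to valid pixel range
    delta = torch.empty_like(images).uniform_(-eps, eps)
    x_adv = torch.clamp(images + delta, 0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), labels)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend along the gradient sign, then project back into the eps-ball
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, images - eps), images + eps)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    model.train()
    return x_adv.detach()
```

PGD-AT then simply trains with cross-entropy on `pgd_attack(...)` outputs; TRADES and MART replace that loss with formulations that trade off clean and robust accuracy explicitly.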
Task
Implement a novel adversarial training method in bench/custom_adv_train.py by modifying the AdversarialTrainer class. Your method should improve robust accuracy against white-box L_inf attacks while maintaining reasonable clean accuracy.
Interface
You must implement the AdversarialTrainer class with two methods:
- `__init__(self, model, eps, alpha, attack_steps, num_classes, **kwargs)`: Initialize your trainer.
  - `model`: The neural network to train (`nn.Module`).
  - `eps`: L_inf perturbation budget (0.3 for MNIST, 8/255 for CIFAR).
  - `alpha`: Step size for the inner PGD attack.
  - `attack_steps`: Number of PGD steps for adversarial example generation.
  - `num_classes`: Number of output classes (10 or 100).
- `train_step(self, images, labels, optimizer) -> dict`: Perform one training step.
  - `images`: Clean images, shape `(N, C, H, W)`, values in `[0, 1]`.
  - `labels`: Ground-truth labels, shape `(N,)`.
  - `optimizer`: SGD optimizer (lr, momentum, weight_decay already configured).
  - Returns: dict with at least a `'loss'` key (float).
The training loop, learning rate schedule (cosine annealing), model architecture, and data loading are handled externally. You only control the adversarial training procedure within each step.
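For orientation, a minimal trainer that satisfies this interface and reproduces the `pgdat` baseline (PGD-AT, Madry et al., 2018) could look like the sketch below. This is an illustrative starting point, not the benchmark's reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdversarialTrainer:
    """Baseline PGD adversarial trainer matching the required interface."""

    def __init__(self, model, eps, alpha, attack_steps, num_classes, **kwargs):
        self.model = model
        self.eps = eps
        self.alpha = alpha
        self.attack_steps = attack_steps
        self.num_classes = num_classes

    def _pgd(self, images, labels):
        # Inner maximization: L_inf PGD with a random start in the eps-ball
        x_adv = images + torch.empty_like(images).uniform_(-self.eps, self.eps)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
        for _ in range(self.attack_steps):
            x_adv = x_adv.detach().requires_grad_(True)
            loss = F.cross_entropy(self.model(x_adv), labels)
            grad = torch.autograd.grad(loss, x_adv)[0]
            x_adv = x_adv.detach() + self.alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, images - self.eps),
                              images + self.eps)
            x_adv = torch.clamp(x_adv, 0.0, 1.0)
        return x_adv.detach()

    def train_step(self, images, labels, optimizer):
        # Generate adversarial examples in eval mode (freezes BN statistics),
        # then take the outer minimization step on them
        self.model.eval()
        x_adv = self._pgd(images, labels)
        self.model.train()
        optimizer.zero_grad()
        loss = F.cross_entropy(self.model(x_adv), labels)
        loss.backward()
        optimizer.step()
        return {'loss': loss.item()}
```

A novel method would typically change the loss in `train_step` (e.g. a regularized objective) or the inner attack, while keeping the same signatures.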
Evaluation
After training, models are evaluated on:
- Clean accuracy: Accuracy on unperturbed test images.
- Robust accuracy (FGSM): Accuracy under 1-step FGSM attack.
- Robust accuracy (PGD-50): Accuracy under 50-step PGD attack (primary metric).
Four scenarios (model + dataset):
- SmallCNN on MNIST (eps = 0.3)
- PreActResNet-18 on CIFAR-10 (eps = 8/255)
- VGG-11-BN on CIFAR-10 (eps = 8/255)
- PreActResNet-18 on CIFAR-100 (eps = 8/255)
Higher robust accuracy (PGD-50) across all scenarios is better.
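The PGD robust-accuracy metric can be sketched as below; `robust_accuracy` is a hypothetical helper written for illustration, not the benchmark's actual evaluator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def robust_accuracy(model, loader, eps, alpha, steps):
    """Accuracy under a `steps`-step L_inf PGD attack (steps=50 for PGD-50)."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        # Attack each batch, then score predictions on the adversarial inputs
        x_adv = images + torch.empty_like(images).uniform_(-eps, eps)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
        for _ in range(steps):
            x_adv = x_adv.detach().requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), labels)
            grad = torch.autograd.grad(loss, x_adv)[0]
            x_adv = x_adv.detach() + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, images - eps), images + eps)
            x_adv = torch.clamp(x_adv, 0.0, 1.0)
        with torch.no_grad():
            preds = model(x_adv).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```

FGSM is the special case `steps=1` with `alpha=eps` and no random start; clean accuracy skips the attack loop entirely.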
Baselines
- `standard`: Vanilla training (no adversarial examples). High clean accuracy, ~0% robust accuracy.
- `pgdat`: PGD Adversarial Training (Madry et al., 2018). Trains on PGD adversarial examples with CE loss.
- `trades`: TRADES (Zhang et al., 2019). Balances clean and robust accuracy via KL-divergence regularization.
- `mart`: MART (Wang et al., 2020). Misclassification-aware regularization that focuses on hard examples.
- `awp`: AWP + TRADES (Wu et al., 2020). Adversarial weight perturbation on top of TRADES; the current SOTA.
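To make the `trades` baseline concrete: TRADES minimizes a clean cross-entropy term plus a KL term between predictions on clean and adversarial inputs, where the adversarial input maximizes that same KL. A minimal sketch (the name `trades_loss` and the default `beta=6.0` are our illustrative choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def trades_loss(model, x, y, eps, alpha, steps, beta=6.0):
    """TRADES-style loss: CE(clean) + beta * KL(adv || clean)."""
    model.eval()
    p_clean = F.softmax(model(x), dim=1).detach()
    # Inner maximization: find x_adv that maximizes KL to the clean prediction
    x_adv = torch.clamp(x + 0.001 * torch.randn_like(x), 0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        kl = F.kl_div(F.log_softmax(model(x_adv), dim=1), p_clean,
                      reduction='batchmean')
        grad = torch.autograd.grad(kl, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    model.train()
    logits = model(x)
    # Natural loss plus beta-weighted robustness regularizer
    loss_nat = F.cross_entropy(logits, y)
    loss_rob = F.kl_div(F.log_softmax(model(x_adv.detach()), dim=1),
                        F.softmax(logits, dim=1), reduction='batchmean')
    return loss_nat + beta * loss_rob
```

Larger `beta` trades clean accuracy for robustness; MART and AWP build further refinements on top of this decomposition.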
Code
```python
"""Custom adversarial training method for MLS-Bench."""

import torch
import torch.nn as nn
import torch.nn.functional as F

# ═══════════════════════════════════════════════════════════════════
# EDITABLE — implement AdversarialTrainer below
# ═══════════════════════════════════════════════════════════════════
class AdversarialTrainer:
    """
    Adversarial training method.

    The agent should modify this class to implement a better adversarial
    training procedure that improves model robustness against L_inf attacks.
    """
```
```python
"""Training and evaluation harness for adversarial training task."""

import argparse
import random

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from custom_adv_train import AdversarialTrainer
from models import get_model
```
```python
"""Model architecture definitions for adversarial training task."""

import torch.nn as nn
import torch.nn.functional as F


class SmallCNN(nn.Module):
    """Small CNN for MNIST (28x28, 1 channel)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 1024)
        self.fc2 = nn.Linear(1024, num_classes)
```
Results
| Model | Type | clean acc SmallCNN MNIST ↑ | robust acc fgsm SmallCNN MNIST ↑ | robust acc pgd SmallCNN MNIST ↑ | clean acc PreActResNet18 C10 ↑ | robust acc fgsm PreActResNet18 C10 ↑ | robust acc pgd PreActResNet18 C10 ↑ | clean acc PreActResNet18 C100 ↑ | robust acc fgsm PreActResNet18 C100 ↑ | robust acc pgd PreActResNet18 C100 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| awp | baseline | 0.987 | 0.964 | 0.937 | 0.793 | 0.488 | 0.437 | 0.522 | 0.258 | 0.228 |
| standard | baseline | 0.992 | 0.160 | 0.000 | 0.950 | 0.397 | 0.000 | 0.768 | 0.074 | 0.000 |
| trades | baseline | 0.987 | 0.959 | 0.931 | 0.791 | 0.490 | 0.438 | 0.520 | 0.252 | 0.223 |
| anthropic/claude-opus-4.6 | vanilla | 0.991 | 0.964 | 0.932 | 0.858 | 0.564 | 0.462 | 0.595 | 0.282 | 0.224 |
| google/gemini-3.1-pro-preview | vanilla | 0.114 | 0.114 | 0.114 | 0.100 | 0.100 | 0.100 | 0.314 | 0.135 | 0.108 |
| gpt-5.4-pro | vanilla | 0.989 | 0.968 | 0.946 | 0.866 | 0.582 | 0.495 | 0.633 | 0.325 | 0.274 |
| anthropic/claude-opus-4.6 | agent | 0.991 | 0.965 | 0.932 | 0.855 | 0.564 | 0.466 | 0.593 | 0.288 | 0.234 |
| google/gemini-3.1-pro-preview | agent | 0.986 | 0.960 | 0.935 | 0.821 | 0.516 | 0.467 | 0.540 | 0.252 | 0.222 |
| gpt-5.4-pro | agent | 0.988 | 0.968 | 0.942 | 0.859 | 0.588 | 0.507 | 0.604 | 0.340 | 0.301 |
| gpt-5.4-pro | agent | 0.990 | 0.969 | 0.943 | 0.862 | 0.583 | 0.512 | 0.610 | 0.349 | 0.313 |