security-adversarial-training

Tags: Adversarial ML, torchattacks, rigorous codebase

Description

Adversarial Training for Model Robustness

Research Question

How can we design better adversarial training methods that enhance model robustness against L_inf adversarial attacks?

Background

Adversarial training is widely regarded as the most effective approach for improving neural network robustness against adversarial examples. The standard method (Madry et al., 2018) trains on PGD-generated adversarial examples using cross-entropy loss, but suffers from a trade-off between clean accuracy and robust accuracy. Advanced methods such as TRADES and MART address this through loss formulations that decouple the robustness objective from clean classification.

Task

Implement a novel adversarial training method in bench/custom_adv_train.py by modifying the AdversarialTrainer class. Your method should improve robust accuracy against white-box L_inf attacks while maintaining reasonable clean accuracy.

Interface

You must implement the AdversarialTrainer class with two methods:

  • __init__(self, model, eps, alpha, attack_steps, num_classes, **kwargs): Initialize your trainer.

    • model: The neural network to train (nn.Module).
    • eps: L_inf perturbation budget (0.3 for MNIST, 8/255 for CIFAR).
    • alpha: Step size for inner PGD attack.
    • attack_steps: Number of PGD steps for adversarial example generation.
    • num_classes: Number of output classes (10 or 100).
  • train_step(self, images, labels, optimizer) -> dict: Perform one training step.

    • images: Clean images, shape (N, C, H, W), values in [0, 1].
    • labels: Ground truth labels, shape (N,).
    • optimizer: SGD optimizer (lr, momentum, weight_decay already configured).
    • Returns: dict with at least 'loss' key (float).

The training loop, learning rate schedule (cosine annealing), model architecture, and data loading are handled externally. You only control the adversarial training procedure within each step.
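The interface above can be satisfied by a minimal PGD-AT baseline (Madry et al., 2018). This is a sketch under assumed conventions, not the harness's reference implementation: the random start inside the L_inf ball and switching to eval mode during the inner attack (to freeze batch-norm statistics) are common choices, not requirements stated by the task.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdversarialTrainer:
    """Minimal PGD-AT sketch matching the required interface."""

    def __init__(self, model, eps, alpha, attack_steps, num_classes, **kwargs):
        self.model = model
        self.eps = eps
        self.alpha = alpha
        self.attack_steps = attack_steps
        self.num_classes = num_classes

    def _pgd(self, images, labels):
        # Random start inside the L_inf ball, then signed-gradient ascent steps.
        x_adv = images + torch.empty_like(images).uniform_(-self.eps, self.eps)
        x_adv = x_adv.clamp(0.0, 1.0).detach()
        for _ in range(self.attack_steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(self.model(x_adv), labels)
            grad = torch.autograd.grad(loss, x_adv)[0]
            x_adv = x_adv.detach() + self.alpha * grad.sign()
            # Project back onto the eps-ball around the clean images, then to [0, 1].
            x_adv = torch.min(torch.max(x_adv, images - self.eps), images + self.eps)
            x_adv = x_adv.clamp(0.0, 1.0)
        return x_adv.detach()

    def train_step(self, images, labels, optimizer):
        self.model.eval()            # attack with frozen batch-norm statistics
        x_adv = self._pgd(images, labels)
        self.model.train()
        optimizer.zero_grad()
        loss = F.cross_entropy(self.model(x_adv), labels)
        loss.backward()
        optimizer.step()
        return {"loss": loss.item()}
```

Methods like TRADES or MART would change only the loss computed in `train_step` (and the objective maximized in the inner attack), leaving this interface intact.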

Evaluation

After training, models are evaluated on:

  • Clean accuracy: Accuracy on unperturbed test images.
  • Robust accuracy (FGSM): Accuracy under 1-step FGSM attack.
  • Robust accuracy (PGD-50): Accuracy under 50-step PGD attack (primary metric).

Four scenarios (model + dataset):

  • SmallCNN on MNIST (eps = 0.3)
  • PreActResNet-18 on CIFAR-10 (eps = 8/255)
  • VGG-11-BN on CIFAR-10 (eps = 8/255)
  • PreActResNet-18 on CIFAR-100 (eps = 8/255)

Higher robust accuracy (PGD-50) across all scenarios is better.
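For intuition, the PGD-50 metric can be sketched as follows. This is an assumed hand-rolled version (the harness may well use the torchattacks library mentioned in the tags, and its exact step size and restart settings are not given here); FGSM corresponds to a single step with step size eps and no random start.

```python
import torch
import torch.nn.functional as F


def pgd_attack(model, images, labels, eps, alpha, steps):
    """Untargeted L_inf PGD with random start (PGD-50 uses steps=50)."""
    x_adv = (images + torch.empty_like(images).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), labels)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Stay within the eps-ball around the clean images and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, images - eps), images + eps).clamp(0.0, 1.0)
    return x_adv.detach()


def robust_accuracy(model, loader, eps, alpha, steps, device="cpu"):
    """Fraction of test examples still classified correctly under the attack."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y, eps, alpha, steps)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```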

Baselines

  • standard: Vanilla training (no adversarial examples). High clean accuracy, ~0% robust accuracy.
  • pgdat: PGD Adversarial Training (Madry et al., 2018). Trains on PGD adversarial examples with CE loss.
  • trades: TRADES (Zhang et al., 2019). Balances clean and robust accuracy via KL divergence regularization.
  • mart: MART (Wang et al., 2020). Misclassification-aware regularization that focuses on hard examples.
  • awp: AWP + TRADES (Wu et al., 2020). Adversarial weight perturbation on top of TRADES — current SOTA.
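For reference, the loss the trades and awp baselines build on can be sketched as below. This is a simplified version of the TRADES objective (Zhang et al., 2019): cross-entropy on clean inputs plus a KL term pulling adversarial predictions toward clean ones; `beta` is the clean/robust trade-off weight (6.0 is a common choice, assumed here). Note that in full TRADES the inner attack maximizes this KL term rather than cross-entropy.

```python
import torch
import torch.nn.functional as F


def trades_loss(model, x_natural, x_adv, y, beta=6.0):
    """CE on clean inputs + beta * KL(p_adv || p_clean), averaged over the batch."""
    logits_nat = model(x_natural)
    logits_adv = model(x_adv)
    natural_loss = F.cross_entropy(logits_nat, y)
    # KL divergence between adversarial and clean predictive distributions.
    robust_loss = F.kl_div(
        F.log_softmax(logits_adv, dim=1),
        F.softmax(logits_nat, dim=1),
        reduction="batchmean",
    )
    return natural_loss + beta * robust_loss
```

When `x_adv == x_natural` the KL term vanishes and the loss reduces to plain cross-entropy, which is why TRADES degrades gracefully toward standard training as the attack weakens.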

Code

custom_adv_train.py
```python
"""Custom adversarial training method for MLS-Bench."""

import torch
import torch.nn as nn
import torch.nn.functional as F

# ═══════════════════════════════════════════════════════════════════
# EDITABLE — implement AdversarialTrainer below
# ═══════════════════════════════════════════════════════════════════
class AdversarialTrainer:
    """
    Adversarial training method.

    The agent should modify this class to implement a better adversarial
    training procedure that improves model robustness against L_inf attacks.
    """
```
run_adv_train.py
```python
"""Training and evaluation harness for adversarial training task."""

import argparse
import random

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from custom_adv_train import AdversarialTrainer
from models import get_model
```
models.py
```python
"""Model architecture definitions for adversarial training task."""

import torch.nn as nn
import torch.nn.functional as F


class SmallCNN(nn.Module):
    """Small CNN for MNIST (28x28, 1 channel)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 1024)
        self.fc2 = nn.Linear(1024, num_classes)
```

Results

Columns report clean, FGSM, and PGD-50 accuracy for three scenarios: SmallCNN on MNIST, PreActResNet-18 on CIFAR-10 (C10), and PreActResNet-18 on CIFAR-100 (C100).

| Model | Type | Clean (MNIST) | FGSM (MNIST) | PGD (MNIST) | Clean (C10) | FGSM (C10) | PGD (C10) | Clean (C100) | FGSM (C100) | PGD (C100) |
|---|---|---|---|---|---|---|---|---|---|---|
| awp | baseline | 0.987 | 0.964 | 0.937 | 0.793 | 0.488 | 0.437 | 0.522 | 0.258 | 0.228 |
| standard | baseline | 0.992 | 0.160 | 0.000 | 0.950 | 0.397 | 0.000 | 0.768 | 0.074 | 0.000 |
| trades | baseline | 0.987 | 0.959 | 0.931 | 0.791 | 0.490 | 0.438 | 0.520 | 0.252 | 0.223 |
| anthropic/claude-opus-4.6 | vanilla | 0.991 | 0.964 | 0.932 | 0.858 | 0.564 | 0.462 | 0.595 | 0.282 | 0.224 |
| google/gemini-3.1-pro-preview | vanilla | 0.114 | 0.114 | 0.114 | 0.100 | 0.100 | 0.100 | 0.314 | 0.135 | 0.108 |
| gpt-5.4-pro | vanilla | 0.989 | 0.968 | 0.946 | 0.866 | 0.582 | 0.495 | 0.633 | 0.325 | 0.274 |
| anthropic/claude-opus-4.6 | agent | 0.991 | 0.965 | 0.932 | 0.855 | 0.564 | 0.466 | 0.593 | 0.288 | 0.234 |
| google/gemini-3.1-pro-preview | agent | 0.986 | 0.960 | 0.935 | 0.821 | 0.516 | 0.467 | 0.540 | 0.252 | 0.222 |
| gpt-5.4-pro | agent | 0.988 | 0.968 | 0.942 | 0.859 | 0.588 | 0.507 | 0.604 | 0.340 | 0.301 |
| gpt-5.4-pro | agent | 0.990 | 0.969 | 0.943 | 0.862 | 0.583 | 0.512 | 0.610 | 0.349 | 0.313 |

Agent Conversations