security-backdoor-defense

Adversarial ML · pytorch-vision · rigorous codebase

Description

Backdoor Defense via Poisoned-Sample Scoring

Research Question

How can we design a better poisoned-sample scoring rule that identifies backdoored training examples while preserving clean utility after filtering and retraining?

Background

Backdoor attacks implant a trigger pattern into a subset of training examples and relabel them to an attacker-chosen target class. Standard models trained on this data retain high clean accuracy while also predicting the target label whenever the trigger appears. Many defenses try to identify suspicious points using feature statistics, confidence patterns, or clustering structure before retraining on the filtered set.

Task

Implement a stronger backdoor defense in bench/backdoor/custom_backdoor_defense.py. The fixed harness will:

  1. Construct a poisoned training set for a fixed trigger pattern (full dataset, no subsampling).
  2. Train a victim model on the poisoned data for 100 epochs (SGD + CosineAnnealingLR).
  3. Extract features from the penultimate layer and logits for the entire training set.
  4. Call your defense to assign suspicion scores to training examples.
  5. Remove the highest-scoring examples (matching the poison fraction) and retrain on the filtered set for 100 epochs.
  6. Evaluate clean accuracy and attack success rate on triggered test inputs.

Your goal is to reduce backdoor ASR without sacrificing too much clean accuracy.
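The filtering step (5) is a simple top-k removal: the harness removes the highest-scoring examples up to the poison fraction. A minimal sketch of that logic (function name and exact rounding are illustrative assumptions, not the harness's actual API):

```python
import numpy as np

def filter_indices(scores: np.ndarray, poison_fraction: float) -> np.ndarray:
    """Return sorted indices of samples to KEEP after removing the
    top-scoring fraction. `scores` is 1D; higher = more suspicious."""
    n = len(scores)
    n_remove = int(np.ceil(poison_fraction * n))
    # argsort ascending: the last n_remove positions hold the most
    # suspicious samples, which are dropped.
    ranked = np.argsort(scores)
    keep = ranked[: n - n_remove]
    return np.sort(keep)

scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2])
print(filter_indices(scores, 0.4))  # drops the two highest scores -> [0 2 4]
```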

Editable Interface

You must implement:

class BackdoorDefense:
    def fit(self, features, labels, poison_fraction, **kwargs):
        ...

    def score_samples(self, features, logits):
        ...
  • features: feature matrix of shape (N, D) from a fixed penultimate layer.
  • labels: training labels after poisoning.
  • poison_fraction: approximate fraction of poisoned points in the training data.
  • logits: model logits of shape (N, C).
  • Return value from score_samples: 1D suspicion scores, higher means more suspicious.

The model architecture, poison injection process, filtering budget, and retraining schedule are fixed.
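One minimal scoring rule that satisfies this interface is class-conditional distance from the feature centroid, a simplified relative of the activation-clustering baseline. This is a sketch under the stated interface, not the harness's reference implementation:

```python
import numpy as np

class BackdoorDefense:
    """Sketch: score each sample by its distance to the centroid of its
    own class in feature space. Poisoned samples are relabeled to the
    target class, so they often sit far from that class's clean centroid."""

    def fit(self, features, labels, poison_fraction, **kwargs):
        self.labels = np.asarray(labels)
        feats = np.asarray(features, dtype=float)
        # One centroid per (post-poisoning) class label.
        self.centroids = {
            c: feats[self.labels == c].mean(axis=0)
            for c in np.unique(self.labels)
        }
        return self

    def score_samples(self, features, logits):
        # Higher score = farther from own-class centroid = more suspicious.
        feats = np.asarray(features, dtype=float)
        return np.array([
            np.linalg.norm(f - self.centroids[c])
            for f, c in zip(feats, self.labels)
        ])
```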

Evaluation

Three benchmark settings are evaluated with research-scale training:

  • resnet20-cifar10-badnets — ResNet-20 on full CIFAR-10, BadNets trigger, 5% poison fraction
  • vgg16bn-cifar100-blend — VGG-16-BN on full CIFAR-100, Blend trigger, 5% poison fraction
  • mobilenetv2-fmnist-badnets — MobileNetV2 on full FashionMNIST, BadNets trigger, 8% poison fraction

All models train for 100 epochs with SGD (lr=0.1, momentum=0.9, weight_decay=5e-4) and cosine annealing schedule.
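In PyTorch terms, that fixed training configuration corresponds to the following optimizer/scheduler setup (the `Linear` model is a stand-in for the fixed architectures; the real training loop is in the harness):

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder for ResNet-20 / VGG-16-BN / MobileNetV2
epochs = 100

optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4
)
# Cosine annealing over the full 100-epoch budget.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one training epoch over the (filtered) poisoned set ...
    scheduler.step()

# After 100 scheduler steps the learning rate has annealed toward 0.
```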

Reported metrics:

  • clean_acc: clean test accuracy after defense
  • asr: attack success rate on trigger-patched test data
  • poison_recall: fraction of true poisoned points removed by the defense
  • defense_score: aggregate score used for ranking, higher is better

Primary objective: maximize defense_score.
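The aggregate `defense_score` formula is not specified here, but the two attack-facing metrics follow their standard definitions, which can be sketched as below (function names are illustrative; `attack_success_rate` assumes the usual convention of excluding test inputs whose true class is already the target):

```python
import numpy as np

def poison_recall(removed_idx, poisoned_idx):
    """Fraction of truly poisoned training points the defense removed."""
    removed, poisoned = set(removed_idx), set(poisoned_idx)
    return len(removed & poisoned) / len(poisoned)

def attack_success_rate(preds, true_labels, target_class):
    """Among trigger-patched test inputs whose true class is not the
    target, the fraction predicted as the target class."""
    preds = np.asarray(preds)
    true_labels = np.asarray(true_labels)
    mask = true_labels != target_class
    return float(np.mean(preds[mask] == target_class))

print(poison_recall([1, 3, 5], [1, 2, 3, 4]))              # 2 of 4 removed -> 0.5
print(attack_success_rate([0, 0, 1, 0], [1, 2, 0, 3], 0))  # 3 of 3 non-target hit -> 1.0
```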

Baselines

  • confidence_filter: ranks samples by target-label confidence
  • spectral_signature: scores by leading singular-vector outlier magnitude
  • activation_clustering: class-conditional cluster-distance heuristic
  • fine_pruning: feature-magnitude pruning style heuristic used as a stronger reference
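The spectral-signature idea behind that baseline can be sketched as follows: within each class, center the features and score each sample by the squared magnitude of its projection onto the leading right singular vector, where poisoned outliers tend to concentrate. A sketch, not the benchmark's exact baseline code:

```python
import numpy as np

def spectral_scores(features, labels):
    """Per-class outlier score: squared projection of each centered
    feature vector onto its class's top right singular vector."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    scores = np.zeros(len(features))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centered = features[idx] - features[idx].mean(axis=0)
        # Leading singular direction of the class's centered features.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        scores[idx] = (centered @ vt[0]) ** 2
    return scores
```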

Code

custom_backdoor_defense.py
"""Editable backdoor defense for MLS-Bench."""

import numpy as np
import torch

# ============================================================
# EDITABLE
# ============================================================
class BackdoorDefense:
    """Sample-scoring defense for poisoned-example filtering.

    Given penultimate-layer features and model logits from a
    backdoor-poisoned training set, score each sample so that
    higher scores indicate more suspicious (likely poisoned)
    examples. The fixed harness will remove the top-scoring
run_backdoor_defense.py
"""Fixed evaluation harness for security-backdoor-defense.

Train research-scale vision models (ResNet-20, VGG-16-BN, MobileNetV2) on
CIFAR-10/100/FashionMNIST with backdoor poisoning, then evaluate a custom
backdoor defense that scores and removes suspicious training samples.

FIXED: Model architectures, data pipeline, training loop, poison injection.
EDITABLE: BackdoorDefense class in custom_backdoor_defense.py.

Usage:
    python run_backdoor_defense.py --arch resnet20 --dataset cifar10 \
        --data-root /data/cifar --trigger badnets --poison-fraction 0.05 \
        --epochs 100 --seed 42
"""

Results

Columns give clean_acc / asr / poison_recall / defense_score for each setting: R20 = resnet20-cifar10-badnets, VGG = vgg16bn-cifar100-blend, MNv2 = mobilenetv2-fmnist-badnets. Baselines appear twice (two runs); dashes mark runs without results.

| Model | Type | R20 clean | R20 asr | R20 recall | R20 score | VGG clean | VGG asr | VGG recall | VGG score | MNv2 clean | MNv2 asr | MNv2 recall | MNv2 score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| activation_clustering | baseline | 0.920 | 0.958 | 0.500 | 0.488 | 0.602 | 1.000 | 0.052 | 0.218 | 0.940 | 1.000 | 0.001 | 0.314 |
| activation_clustering | baseline | 0.916 | 0.966 | 0.003 | 0.318 | 0.010 | 1.000 | 0.002 | 0.004 | 0.946 | 1.000 | 0.255 | 0.400 |
| confidence_filter | baseline | 0.920 | 0.972 | 0.332 | 0.427 | 0.669 | 1.000 | 0.065 | 0.245 | 0.947 | 1.000 | 0.265 | 0.404 |
| confidence_filter | baseline | 0.909 | 0.972 | 0.000 | 0.312 | 0.696 | 1.000 | 0.000 | 0.232 | 0.941 | 1.000 | 0.000 | 0.314 |
| fine_pruning | baseline | 0.916 | 0.969 | 0.001 | 0.316 | 0.667 | 1.000 | 0.762 | 0.477 | 0.948 | 1.000 | 0.193 | 0.380 |
| fine_pruning | baseline | 0.918 | 0.967 | 0.207 | 0.386 | 0.707 | 1.000 | 0.004 | 0.237 | 0.943 | 1.000 | 0.050 | 0.331 |
| spectral_signature | baseline | 0.913 | 0.969 | 0.094 | 0.346 | 0.689 | 1.000 | 0.846 | 0.512 | 0.947 | 1.000 | 0.000 | 0.316 |
| spectral_signature | baseline | 0.919 | 0.968 | 0.304 | 0.418 | 0.606 | 1.000 | 0.000 | 0.202 | 0.941 | 1.000 | 0.011 | 0.317 |
| anthropic/claude-opus-4.6 | vanilla | 0.918 | 0.964 | 0.188 | 0.381 | 0.699 | 1.000 | 0.047 | 0.248 | 0.950 | 1.000 | 0.088 | 0.346 |
| deepseek-reasoner | vanilla | — | — | — | — | — | — | — | — | — | — | — | — |
| deepseek-reasoner | vanilla | 0.915 | 0.968 | 0.266 | 0.404 | 0.709 | 1.000 | 0.000 | 0.236 | 0.938 | 1.000 | 0.040 | 0.326 |
| deepseek-reasoner | vanilla | — | — | — | — | — | — | — | — | — | — | — | — |
| google/gemini-3.1-pro-preview | vanilla | 0.918 | 0.971 | 0.002 | 0.317 | 0.696 | 1.000 | 0.000 | 0.232 | 0.943 | 1.000 | 0.774 | 0.573 |
| openai/gpt-5.4-pro | vanilla | 0.918 | 0.946 | 0.680 | 0.551 | 0.663 | 0.999 | 0.801 | 0.488 | 0.948 | 1.000 | 0.188 | 0.378 |
| qwen3.6-plus:free | vanilla | 0.918 | 0.962 | 0.380 | 0.446 | 0.696 | 1.000 | 0.000 | 0.232 | 0.946 | 1.000 | 0.042 | 0.329 |
| anthropic/claude-opus-4.6 | agent | 0.923 | 0.948 | 0.923 | 0.632 | 0.010 | 1.000 | 0.909 | 0.306 | 0.946 | 1.000 | 0.253 | 0.400 |
| deepseek-reasoner | agent | 0.912 | 0.964 | 0.007 | 0.318 | 0.712 | 1.000 | 0.000 | 0.237 | 0.940 | 1.000 | 0.000 | 0.313 |
| google/gemini-3.1-pro-preview | agent | 0.919 | 0.942 | 0.895 | 0.624 | 0.696 | 1.000 | 0.000 | 0.232 | 0.945 | 1.000 | 0.521 | 0.489 |
| openai/gpt-5.4-pro | agent | 0.915 | 0.886 | 0.837 | 0.622 | 0.707 | 1.000 | 0.840 | 0.516 | 0.945 | 1.000 | 0.665 | 0.537 |
| qwen3.6-plus:free | agent | 0.918 | 0.962 | 0.380 | 0.446 | 0.696 | 1.000 | 0.000 | 0.232 | 0.946 | 1.000 | 0.042 | 0.329 |

Agent Conversations