# security-backdoor-defense

## Description

Backdoor Defense via Poisoned-Sample Scoring

## Research Question
How can we design a better poisoned-sample scoring rule that identifies backdoored training examples while preserving clean utility after filtering and retraining?
## Background
Backdoor attacks implant a trigger pattern into a subset of training examples and relabel them to an attacker-chosen target class. Standard models trained on this data retain high clean accuracy while also predicting the target label whenever the trigger appears. Many defenses try to identify suspicious points using feature statistics, confidence patterns, or clustering structure before retraining on the filtered set.
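As a concrete illustration of the injection step, a BadNets-style attack might stamp a small solid patch into a corner of a subset of training images and flip their labels to the target class. The patch size, location, and target class below are illustrative assumptions, not the harness's fixed trigger:

```python
import numpy as np

def poison_badnets(images, labels, poison_fraction, target_class, rng,
                   patch_size=3):
    """Stamp a white corner patch on a random subset and relabel to target.

    images: float array (N, H, W, C) in [0, 1]; labels: int array (N,).
    Returns poisoned copies plus the indices that were poisoned.
    """
    images = images.copy()
    labels = labels.copy()
    n_poison = int(len(images) * poison_fraction)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Trigger: a solid white patch in the bottom-right corner.
    images[idx, -patch_size:, -patch_size:, :] = 1.0
    labels[idx] = target_class
    return images, labels, idx

rng = np.random.default_rng(0)
imgs = rng.random((100, 32, 32, 3))
labs = rng.integers(0, 10, size=100)
p_imgs, p_labs, idx = poison_badnets(imgs, labs, 0.05, target_class=0, rng=rng)
```

A model trained on such data learns to associate the patch with the target label while behaving normally on clean inputs, which is why the defenses below look for statistical irregularities rather than visual artifacts.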
## Task
Implement a stronger backdoor defense in bench/backdoor/custom_backdoor_defense.py. The fixed harness will:
- Construct a poisoned training set for a fixed trigger pattern (full dataset, no subsampling).
- Train a victim model on the poisoned data for 100 epochs (SGD + CosineAnnealingLR).
- Extract features from the penultimate layer and logits for the entire training set.
- Call your defense to assign suspicion scores to training examples.
- Remove the highest-scoring examples (matching the poison fraction) and retrain on the filtered set for 100 epochs.
- Evaluate clean accuracy and attack success rate on triggered test inputs.
Your goal is to reduce backdoor ASR without sacrificing too much clean accuracy.
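The removal step above amounts to a top-k filter on the returned suspicion scores. A minimal sketch of that filtering logic (the real harness's bookkeeping is fixed and not shown here):

```python
import numpy as np

def filter_by_score(scores, poison_fraction):
    """Return a boolean keep-mask that drops the top-scoring fraction."""
    n = len(scores)
    n_remove = int(round(n * poison_fraction))
    # Indices of the n_remove most suspicious samples.
    remove_idx = np.argsort(scores)[::-1][:n_remove]
    keep = np.ones(n, dtype=bool)
    keep[remove_idx] = False
    return keep

scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3])
keep = filter_by_score(scores, poison_fraction=0.4)
# Drops the two highest-scoring samples (indices 1 and 3).
```

Because the budget matches the poison fraction, a defense only needs to rank poisoned points above clean ones; the absolute scale of the scores does not matter.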
## Editable Interface
You must implement:
```python
class BackdoorDefense:
    def fit(self, features, labels, poison_fraction, **kwargs):
        ...

    def score_samples(self, features, logits):
        ...
```
- `features`: feature matrix of shape `(N, D)` from a fixed penultimate layer.
- `labels`: training labels after poisoning.
- `poison_fraction`: approximate fraction of poisoned points in the training data.
- `logits`: model logits of shape `(N, C)`.
- Return value from `score_samples`: 1D suspicion scores; higher means more suspicious.
The model architecture, poison injection process, filtering budget, and retraining schedule are fixed.
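A minimal implementation of this interface might score each sample by its distance to its class centroid in feature space. This is a baseline-quality sketch, not a recommended solution; `fit` here simply caches the labels:

```python
import numpy as np

class BackdoorDefense:
    """Score samples by class-conditional feature-space deviation."""

    def fit(self, features, labels, poison_fraction, **kwargs):
        self.labels = np.asarray(labels)
        self.poison_fraction = poison_fraction
        return self

    def score_samples(self, features, logits):
        features = np.asarray(features)
        scores = np.zeros(len(features))
        # Distance from each sample to its own class's feature centroid.
        for c in np.unique(self.labels):
            mask = self.labels == c
            centroid = features[mask].mean(axis=0)
            scores[mask] = np.linalg.norm(features[mask] - centroid, axis=1)
        return scores

# Toy usage: the far-away point gets the highest suspicion score.
feats = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [10.0, 10.0]])
defense = BackdoorDefense().fit(feats, labels=[0, 0, 0, 0], poison_fraction=0.25)
scores = defense.score_samples(feats, logits=None)
```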
## Evaluation
Three benchmark settings are evaluated with research-scale training:
- `resnet20-cifar10-badnets`: ResNet-20 on full CIFAR-10, BadNets trigger, 5% poison fraction
- `vgg16bn-cifar100-blend`: VGG-16-BN on full CIFAR-100, Blend trigger, 5% poison fraction
- `mobilenetv2-fmnist-badnets`: MobileNetV2 on full FashionMNIST, BadNets trigger, 8% poison fraction
All models train for 100 epochs with SGD (lr=0.1, momentum=0.9, weight_decay=5e-4) and cosine annealing schedule.
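Under cosine annealing with no restarts, the learning rate at epoch t is eta_min + (eta_0 - eta_min)(1 + cos(pi * t / T)) / 2. A quick sketch of that schedule, assuming eta_min = 0 (the default of PyTorch's `CosineAnnealingLR` when `eta_min` is unset):

```python
import math

def cosine_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (no restarts)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (
        1 + math.cos(math.pi * epoch / total_epochs))

# With base lr 0.1 over 100 epochs:
# epoch 0 -> 0.1, epoch 50 -> 0.05, epoch 100 -> 0.0
```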
Reported metrics:
- `clean_acc`: clean test accuracy after defense
- `asr`: attack success rate on trigger-patched test data
- `poison_recall`: fraction of true poisoned points removed by the defense
- `defense_score`: aggregate score used for ranking; higher is better
Primary objective: maximize defense_score.
## Baselines
- `confidence_filter`: ranks samples by target-label confidence
- `spectral_signature`: scores by leading singular-vector outlier magnitude
- `activation_clustering`: class-conditional cluster-distance heuristic
- `fine_pruning`: feature-magnitude pruning-style heuristic used as a stronger reference
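The spectral-signature idea (after Tran et al.) is to center each class's features, project them onto the leading singular vector, and treat large squared projections as suspicious. A hedged sketch of that scoring rule; the benchmark's exact baseline implementation may differ:

```python
import numpy as np

def spectral_signature_scores(features, labels):
    """Per-class outlier score along the top singular direction."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    scores = np.zeros(len(features))
    for c in np.unique(labels):
        mask = labels == c
        centered = features[mask] - features[mask].mean(axis=0)
        # Leading right singular vector of the centered class features.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        scores[mask] = (centered @ vt[0]) ** 2
    return scores
```

The intuition is that poisoned points of a class share the trigger's feature signature, so they form a coherent subpopulation that dominates the top principal direction of that class.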
## Code
1"""Editable backdoor defense for MLS-Bench."""23import numpy as np4import torch56# ============================================================7# EDITABLE8# ============================================================9class BackdoorDefense:10"""Sample-scoring defense for poisoned-example filtering.1112Given penultimate-layer features and model logits from a13backdoor-poisoned training set, score each sample so that14higher scores indicate more suspicious (likely poisoned)15examples. The fixed harness will remove the top-scoring
1"""Fixed evaluation harness for security-backdoor-defense.23Train research-scale vision models (ResNet-20, VGG-16-BN, MobileNetV2) on4CIFAR-10/100/FashionMNIST with backdoor poisoning, then evaluate a custom5backdoor defense that scores and removes suspicious training samples.67FIXED: Model architectures, data pipeline, training loop, poison injection.8EDITABLE: BackdoorDefense class in custom_backdoor_defense.py.910Usage:11python run_backdoor_defense.py --arch resnet20 --dataset cifar10 \12--data-root /data/cifar --trigger badnets --poison-fraction 0.05 \13--epochs 100 --seed 4214"""15
## Results
| Model | Type | clean acc resnet20 cifar10 badnets ↑ | asr resnet20 cifar10 badnets ↓ | poison recall resnet20 cifar10 badnets ↑ | defense score resnet20 cifar10 badnets ↑ | clean acc vgg16bn cifar100 blend ↑ | asr vgg16bn cifar100 blend ↓ | poison recall vgg16bn cifar100 blend ↑ | defense score vgg16bn cifar100 blend ↑ | clean acc mobilenetv2 fmnist badnets ↑ | asr mobilenetv2 fmnist badnets ↓ | poison recall mobilenetv2 fmnist badnets ↑ | defense score mobilenetv2 fmnist badnets ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| activation_clustering | baseline | 0.920 | 0.958 | 0.500 | 0.488 | 0.602 | 1.000 | 0.052 | 0.218 | 0.940 | 1.000 | 0.001 | 0.314 |
| activation_clustering | baseline | 0.916 | 0.966 | 0.003 | 0.318 | 0.010 | 1.000 | 0.002 | 0.004 | 0.946 | 1.000 | 0.255 | 0.400 |
| confidence_filter | baseline | 0.920 | 0.972 | 0.332 | 0.427 | 0.669 | 1.000 | 0.065 | 0.245 | 0.947 | 1.000 | 0.265 | 0.404 |
| confidence_filter | baseline | 0.909 | 0.972 | 0.000 | 0.312 | 0.696 | 1.000 | 0.000 | 0.232 | 0.941 | 1.000 | 0.000 | 0.314 |
| fine_pruning | baseline | 0.916 | 0.969 | 0.001 | 0.316 | 0.667 | 1.000 | 0.762 | 0.477 | 0.948 | 1.000 | 0.193 | 0.380 |
| fine_pruning | baseline | 0.918 | 0.967 | 0.207 | 0.386 | 0.707 | 1.000 | 0.004 | 0.237 | 0.943 | 1.000 | 0.050 | 0.331 |
| spectral_signature | baseline | 0.913 | 0.969 | 0.094 | 0.346 | 0.689 | 1.000 | 0.846 | 0.512 | 0.947 | 1.000 | 0.000 | 0.316 |
| spectral_signature | baseline | 0.919 | 0.968 | 0.304 | 0.418 | 0.606 | 1.000 | 0.000 | 0.202 | 0.941 | 1.000 | 0.011 | 0.317 |
| anthropic/claude-opus-4.6 | vanilla | 0.918 | 0.964 | 0.188 | 0.381 | 0.699 | 1.000 | 0.047 | 0.248 | 0.950 | 1.000 | 0.088 | 0.346 |
| deepseek-reasoner | vanilla | - | - | - | - | - | - | - | - | - | - | - | - |
| deepseek-reasoner | vanilla | 0.915 | 0.968 | 0.266 | 0.404 | 0.709 | 1.000 | 0.000 | 0.236 | 0.938 | 1.000 | 0.040 | 0.326 |
| google/gemini-3.1-pro-preview | vanilla | 0.918 | 0.971 | 0.002 | 0.317 | 0.696 | 1.000 | 0.000 | 0.232 | 0.943 | 1.000 | 0.774 | 0.573 |
| openai/gpt-5.4-pro | vanilla | 0.918 | 0.946 | 0.680 | 0.551 | 0.663 | 0.999 | 0.801 | 0.488 | 0.948 | 1.000 | 0.188 | 0.378 |
| qwen3.6-plus:free | vanilla | 0.918 | 0.962 | 0.380 | 0.446 | 0.696 | 1.000 | 0.000 | 0.232 | 0.946 | 1.000 | 0.042 | 0.329 |
| anthropic/claude-opus-4.6 | agent | 0.923 | 0.948 | 0.923 | 0.632 | 0.010 | 1.000 | 0.909 | 0.306 | 0.946 | 1.000 | 0.253 | 0.400 |
| deepseek-reasoner | agent | 0.912 | 0.964 | 0.007 | 0.318 | 0.712 | 1.000 | 0.000 | 0.237 | 0.940 | 1.000 | 0.000 | 0.313 |
| google/gemini-3.1-pro-preview | agent | 0.919 | 0.942 | 0.895 | 0.624 | 0.696 | 1.000 | 0.000 | 0.232 | 0.945 | 1.000 | 0.521 | 0.489 |
| openai/gpt-5.4-pro | agent | 0.915 | 0.886 | 0.837 | 0.622 | 0.707 | 1.000 | 0.840 | 0.516 | 0.945 | 1.000 | 0.665 | 0.537 |
| qwen3.6-plus:free | agent | 0.918 | 0.962 | 0.380 | 0.446 | 0.696 | 1.000 | 0.000 | 0.232 | 0.946 | 1.000 | 0.042 | 0.329 |