# security-backdoor-defense

## Description

Backdoor Defense via Poisoned-Sample Scoring

## Research Question
How can we design a better poisoned-sample scoring rule that identifies backdoored training examples while preserving clean utility after filtering and retraining?
## Background
Backdoor attacks implant a trigger pattern into a subset of training examples and relabel them to an attacker-chosen target class. Standard models trained on this data retain high clean accuracy while also predicting the target label whenever the trigger appears. Many defenses try to identify suspicious points using feature statistics, confidence patterns, or clustering structure before retraining on the filtered set.
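As a concrete illustration of the injection step, a BadNets-style attack might stamp a small solid patch into a corner of a subset of training images and flip their labels to the target class. The patch size, location, and target class below are illustrative assumptions, not the harness's fixed trigger:

```python
import numpy as np

def poison_badnets(images, labels, poison_fraction, target_class, rng,
                   patch_size=3):
    """Stamp a white corner patch on a random subset and relabel to target.

    images: float array (N, H, W, C) in [0, 1]; labels: int array (N,).
    Returns poisoned copies plus the indices that were poisoned.
    """
    images = images.copy()
    labels = labels.copy()
    n_poison = int(len(images) * poison_fraction)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Trigger: a solid white patch in the bottom-right corner.
    images[idx, -patch_size:, -patch_size:, :] = 1.0
    labels[idx] = target_class
    return images, labels, idx

rng = np.random.default_rng(0)
imgs = rng.random((100, 32, 32, 3))
labs = rng.integers(0, 10, size=100)
p_imgs, p_labs, idx = poison_badnets(imgs, labs, 0.05, target_class=0, rng=rng)
```

A model trained on such data learns to associate the patch with the target label while behaving normally on clean inputs, which is why the defenses below look for statistical irregularities rather than visual artifacts.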
## Task
Implement a stronger backdoor defense in bench/backdoor/custom_backdoor_defense.py. The fixed harness will:
- Construct a poisoned training set for a fixed trigger pattern (full dataset, no subsampling).
- Train a victim model on the poisoned data for 100 epochs (SGD + CosineAnnealingLR).
- Extract features from the penultimate layer and logits for the entire training set.
- Call your defense to assign suspicion scores to training examples.
- Remove the highest-scoring examples (matching the poison fraction) and retrain on the filtered set for 100 epochs.
- Evaluate clean accuracy and attack success rate on triggered test inputs.
Your goal is to reduce backdoor ASR without sacrificing too much clean accuracy.
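The removal step above amounts to a top-k filter on the returned suspicion scores. A minimal sketch of that filtering logic (the real harness's bookkeeping is fixed and not shown here):

```python
import numpy as np

def filter_by_score(scores, poison_fraction):
    """Return a boolean keep-mask that drops the top-scoring fraction."""
    n = len(scores)
    n_remove = int(round(n * poison_fraction))
    # Indices of the n_remove most suspicious samples.
    remove_idx = np.argsort(scores)[::-1][:n_remove]
    keep = np.ones(n, dtype=bool)
    keep[remove_idx] = False
    return keep

scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3])
keep = filter_by_score(scores, poison_fraction=0.4)
# Drops the two highest-scoring samples (indices 1 and 3).
```

Because the budget matches the poison fraction, a defense only needs to rank poisoned points above clean ones; the absolute scale of the scores does not matter.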
## Editable Interface
You must implement:
```python
class BackdoorDefense:
    def fit(self, features, labels, poison_fraction, **kwargs):
        ...

    def score_samples(self, features, logits):
        ...
```
- `features`: feature matrix of shape `(N, D)` from a fixed penultimate layer.
- `labels`: training labels after poisoning.
- `poison_fraction`: approximate fraction of poisoned points in the training data.
- `logits`: model logits of shape `(N, C)`.
- Return value from `score_samples`: 1D suspicion scores; higher means more suspicious.
The model architecture, poison injection process, filtering budget, and retraining schedule are fixed.
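A minimal implementation of this interface might score each sample by its distance to its class centroid in feature space. This is a baseline-quality sketch, not a recommended solution; `fit` here simply caches the labels:

```python
import numpy as np

class BackdoorDefense:
    """Score samples by class-conditional feature-space deviation."""

    def fit(self, features, labels, poison_fraction, **kwargs):
        self.labels = np.asarray(labels)
        self.poison_fraction = poison_fraction
        return self

    def score_samples(self, features, logits):
        features = np.asarray(features)
        scores = np.zeros(len(features))
        # Distance from each sample to its own class's feature centroid.
        for c in np.unique(self.labels):
            mask = self.labels == c
            centroid = features[mask].mean(axis=0)
            scores[mask] = np.linalg.norm(features[mask] - centroid, axis=1)
        return scores

# Toy usage: the far-away point gets the highest suspicion score.
feats = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [10.0, 10.0]])
defense = BackdoorDefense().fit(feats, labels=[0, 0, 0, 0], poison_fraction=0.25)
scores = defense.score_samples(feats, logits=None)
```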
## Evaluation
Three benchmark settings are evaluated with research-scale training:
- `resnet20-cifar10-badnets`: ResNet-20 on full CIFAR-10, BadNets trigger, 5% poison fraction
- `vgg16bn-cifar100-blend`: VGG-16-BN on full CIFAR-100, Blend trigger, 5% poison fraction
- `mobilenetv2-fmnist-badnets`: MobileNetV2 on full FashionMNIST, BadNets trigger, 8% poison fraction
All models train for 100 epochs with SGD (lr=0.1, momentum=0.9, weight_decay=5e-4) and cosine annealing schedule.
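Under cosine annealing with no restarts, the learning rate at epoch t is eta_min + (eta_0 - eta_min)(1 + cos(pi * t / T)) / 2. A quick sketch of that schedule, assuming eta_min = 0 (the default of PyTorch's `CosineAnnealingLR` when `eta_min` is unset):

```python
import math

def cosine_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (no restarts)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (
        1 + math.cos(math.pi * epoch / total_epochs))

# With base lr 0.1 over 100 epochs:
# epoch 0 -> 0.1, epoch 50 -> 0.05, epoch 100 -> 0.0
```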
Reported metrics:
- `clean_acc`: clean test accuracy after defense
- `asr`: attack success rate on trigger-patched test data
- `poison_recall`: fraction of true poisoned points removed by the defense
- `defense_score`: aggregate score used for ranking; higher is better
Primary objective: maximize defense_score.
## Baselines
- `confidence_filter`: ranks samples by target-label confidence
- `spectral_signature`: scores by leading singular-vector outlier magnitude
- `activation_clustering`: class-conditional cluster-distance heuristic
- `fine_pruning`: feature-magnitude pruning-style heuristic used as a stronger reference
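The spectral-signature idea (after Tran et al.) is to center each class's features, project them onto the leading singular vector, and treat large squared projections as suspicious. A hedged sketch of that scoring rule; the benchmark's exact baseline implementation may differ:

```python
import numpy as np

def spectral_signature_scores(features, labels):
    """Per-class outlier score along the top singular direction."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    scores = np.zeros(len(features))
    for c in np.unique(labels):
        mask = labels == c
        centered = features[mask] - features[mask].mean(axis=0)
        # Leading right singular vector of the centered class features.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        scores[mask] = (centered @ vt[0]) ** 2
    return scores
```

The intuition is that poisoned points of a class share the trigger's feature signature, so they form a coherent subpopulation that dominates the top principal direction of that class.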
## Code
1"""Editable backdoor defense for MLS-Bench."""23import numpy as np4import torch56# ============================================================7# EDITABLE8# ============================================================9class BackdoorDefense:10"""Sample-scoring defense for poisoned-example filtering.1112Given penultimate-layer features and model logits from a13backdoor-poisoned training set, score each sample so that14higher scores indicate more suspicious (likely poisoned)15examples. The fixed harness will remove the top-scoring
1"""Fixed evaluation harness for security-backdoor-defense.23Train research-scale vision models (ResNet-20, VGG-16-BN, MobileNetV2) on4CIFAR-10/100/FashionMNIST with backdoor poisoning, then evaluate a custom5backdoor defense that scores and removes suspicious training samples.67FIXED: Model architectures, data pipeline, training loop, poison injection.8EDITABLE: BackdoorDefense class in custom_backdoor_defense.py.910Usage:11python run_backdoor_defense.py --arch resnet20 --dataset cifar10 \12--data-root /data/cifar --trigger badnets --poison-fraction 0.05 \13--epochs 100 --seed 4214"""15
## Results
| Model | Type | clean acc resnet20 cifar10 badnets ↑ | asr resnet20 cifar10 badnets ↓ | poison recall resnet20 cifar10 badnets ↑ | defense score resnet20 cifar10 badnets ↑ | clean acc vgg16bn cifar100 blend ↑ | asr vgg16bn cifar100 blend ↓ | poison recall vgg16bn cifar100 blend ↑ | defense score vgg16bn cifar100 blend ↑ | clean acc mobilenetv2 fmnist badnets ↑ | asr mobilenetv2 fmnist badnets ↓ | poison recall mobilenetv2 fmnist badnets ↑ | defense score mobilenetv2 fmnist badnets ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| activation_clustering | baseline | 0.920 | 0.958 | 0.500 | 0.488 | 0.602 | 1.000 | 0.052 | 0.218 | 0.940 | 1.000 | 0.001 | 0.314 |
| activation_clustering | baseline | 0.916 | 0.966 | 0.003 | 0.318 | 0.010 | 1.000 | 0.002 | 0.004 | 0.946 | 1.000 | 0.255 | 0.400 |
| confidence_filter | baseline | 0.920 | 0.972 | 0.332 | 0.427 | 0.669 | 1.000 | 0.065 | 0.245 | 0.947 | 1.000 | 0.265 | 0.404 |
| confidence_filter | baseline | 0.909 | 0.972 | 0.000 | 0.312 | 0.696 | 1.000 | 0.000 | 0.232 | 0.941 | 1.000 | 0.000 | 0.314 |
| fine_pruning | baseline | 0.916 | 0.969 | 0.001 | 0.316 | 0.667 | 1.000 | 0.762 | 0.477 | 0.948 | 1.000 | 0.193 | 0.380 |
| fine_pruning | baseline | 0.918 | 0.967 | 0.207 | 0.386 | 0.707 | 1.000 | 0.004 | 0.237 | 0.943 | 1.000 | 0.050 | 0.331 |
| spectral_signature | baseline | 0.913 | 0.969 | 0.094 | 0.346 | 0.689 | 1.000 | 0.846 | 0.512 | 0.947 | 1.000 | 0.000 | 0.316 |
| spectral_signature | baseline | 0.919 | 0.968 | 0.304 | 0.418 | 0.606 | 1.000 | 0.000 | 0.202 | 0.941 | 1.000 | 0.011 | 0.317 |
| anthropic/claude-opus-4.6 | vanilla | 0.918 | 0.964 | 0.188 | 0.381 | 0.699 | 1.000 | 0.047 | 0.248 | 0.950 | 1.000 | 0.088 | 0.346 |
| deepseek-reasoner | vanilla | - | - | - | - | - | - | - | - | - | - | - | - |
| deepseek-reasoner | vanilla | 0.915 | 0.968 | 0.266 | 0.404 | 0.709 | 1.000 | 0.000 | 0.236 | 0.938 | 1.000 | 0.040 | 0.326 |
| google/gemini-3.1-pro-preview | vanilla | 0.918 | 0.971 | 0.002 | 0.317 | 0.696 | 1.000 | 0.000 | 0.232 | 0.943 | 1.000 | 0.774 | 0.573 |
| openai/gpt-5.4-pro | vanilla | 0.918 | 0.946 | 0.680 | 0.551 | 0.663 | 0.999 | 0.801 | 0.488 | 0.948 | 1.000 | 0.188 | 0.378 |
| qwen3.6-plus:free | vanilla | 0.918 | 0.962 | 0.380 | 0.446 | 0.696 | 1.000 | 0.000 | 0.232 | 0.946 | 1.000 | 0.042 | 0.329 |
| anthropic/claude-opus-4.6 | agent | 0.923 | 0.948 | 0.923 | 0.632 | 0.010 | 1.000 | 0.909 | 0.306 | 0.946 | 1.000 | 0.253 | 0.400 |
| deepseek-reasoner | agent | 0.912 | 0.964 | 0.007 | 0.318 | 0.712 | 1.000 | 0.000 | 0.237 | 0.940 | 1.000 | 0.000 | 0.313 |
| google/gemini-3.1-pro-preview | agent | 0.919 | 0.942 | 0.895 | 0.624 | 0.696 | 1.000 | 0.000 | 0.232 | 0.945 | 1.000 | 0.521 | 0.489 |
| openai/gpt-5.4-pro | agent | 0.915 | 0.886 | 0.837 | 0.622 | 0.707 | 1.000 | 0.840 | 0.516 | 0.945 | 1.000 | 0.665 | 0.537 |
| qwen3.6-plus:free | agent | 0.918 | 0.962 | 0.380 | 0.446 | 0.696 | 1.000 | 0.000 | 0.232 | 0.946 | 1.000 | 0.042 | 0.329 |