security-machine-unlearning
Description
Machine Unlearning via Targeted Update Rules
Research Question
How can we design a stronger unlearning update rule that removes information about a forget set while retaining as much utility as possible on the retained data?
Background
Machine unlearning methods approximate the effect of retraining without the deleted data. The central tradeoff is clear: aggressive forgetting reduces utility, while conservative updates leave measurable traces of the forgotten examples.
The harness pretrains a standard vision model (ResNet-20, VGG-16-BN, or MobileNetV2) on the full training set for 80 epochs using SGD with cosine annealing. After pretraining, a single class is designated as the forget set. Your unlearning method then runs for 20 epochs, receiving both retain-set and forget-set minibatches each step, with an Adam optimizer (lr=0.001).
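The pretraining recipe can be sketched as follows. The base learning rate, momentum, and weight decay are illustrative assumptions (the harness only specifies SGD with cosine annealing for 80 epochs), and the random batch and tiny linear model stand in for the real data loader and architectures:

```python
import torch
import torch.nn as nn

# Stand-in for the harness's model (ResNet-20, VGG-16-BN, or MobileNetV2).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

PRETRAIN_EPOCHS = 80
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,  # lr/momentum/wd assumed
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=PRETRAIN_EPOCHS)
criterion = nn.CrossEntropyLoss()

for epoch in range(PRETRAIN_EPOCHS):
    # ... the real harness iterates over the full training set here ...
    images, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()  # anneal once per epoch; lr decays to ~0 at epoch 80
```

Annealing the schedule per epoch (not per step) matches the `T_max=80` horizon, so the learning rate reaches its minimum exactly when pretraining ends.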
Task
Implement a better unlearning rule in bench/unlearning/custom_unlearning.py. The fixed harness trains an initial model, defines a forget split, and then applies your update rule for a fixed number of unlearning steps using retain and forget minibatches.
Your method should lower forget-set memorization while preserving retained-task accuracy.
Editable Interface
You must implement:
```python
class UnlearningMethod:
    def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):
        ...
```
- retain_batch: (images, labels) tuple from retained data (already on device)
- forget_batch: (images, labels) tuple from the forget set (already on device)
- optimizer: fixed Adam optimizer instance (lr=0.001)
- Return value: dict containing at least a loss entry
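To make the contract concrete, here is a sketch of one method that fits this interface: it descends the retain loss while ascending the forget loss (the strategy of the negative_gradient baseline listed below). The 0.5 ascent weight and the tiny demo model are illustrative assumptions, not harness values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnlearningMethod:
    def __init__(self):
        # Illustrative weight on the gradient-ascent term (assumed, not fixed by the harness).
        self.forget_weight = 0.5

    def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):
        retain_x, retain_y = retain_batch
        forget_x, forget_y = forget_batch
        optimizer.zero_grad()
        retain_loss = F.cross_entropy(model(retain_x), retain_y)
        forget_loss = F.cross_entropy(model(forget_x), forget_y)
        # Descend on retained data, ascend on the forget set.
        loss = retain_loss - self.forget_weight * forget_loss
        loss.backward()
        optimizer.step()
        return {"loss": loss.item()}

# Quick demo on a hypothetical 4-feature, 3-class toy model.
model = nn.Linear(4, 3)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # matches the harness setting
method = UnlearningMethod()
retain_batch = (torch.randn(8, 4), torch.randint(0, 3, (8,)))
forget_batch = (torch.randn(8, 4), torch.zeros(8, dtype=torch.long))
out = method.unlearn_step(model, retain_batch, forget_batch, optimizer, step=0, epoch=0)
```

Note that an unbounded ascent term can destroy retain accuracy (see the negative_gradient rows in the results), so practical methods usually clamp or anneal it.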
The architecture, initial training, forget split, and evaluation probes are fixed.
Evaluation
Benchmarks:
- resnet20-cifar10-class0: ResNet-20 on CIFAR-10, forgetting class 0
- vgg16bn-cifar100-class0: VGG-16-BN on CIFAR-100, forgetting class 0
- mobilenetv2-fmnist-class0: MobileNetV2 on FashionMNIST, forgetting class 0
Reported metrics:
- retain_acc: accuracy on non-forget test data
- forget_acc: accuracy on forget-class test data (lower is better)
- forget_mia_auc: membership inference attack AUC on forget set (lower is better)
- unlearn_score: (retain_acc + (1 - forget_acc) + (1 - forget_mia_auc)) / 3
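The score is a plain average of retain accuracy and the complements of the two lower-is-better metrics, checked here against the retain_finetune row for ResNet-20/CIFAR-10 in the results table:

```python
def unlearn_score(retain_acc, forget_acc, forget_mia_auc):
    # Average retain accuracy with the complements of the two lower-is-better metrics.
    return (retain_acc + (1 - forget_acc) + (1 - forget_mia_auc)) / 3

score = unlearn_score(0.876, 0.000, 0.451)
print(round(score, 3))  # → 0.808
```

One consequence of the formula: an ideal unlearner (retain_acc = 1.0, forget_acc = 0.0, and MIA AUC at the chance level of 0.5) scores (1 + 1 + 0.5) / 3 ≈ 0.833, since an AUC of 0.5 means the membership attack cannot distinguish forget-set members at all.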
Primary metric: unlearn_score (higher is better).
Baselines
- retain_finetune: continue training only on retained data
- negative_gradient: ascend forget loss and descend retain loss
- bad_teacher: distillation-style forgetting baseline
- scrub: stronger representation-scrubbing baseline
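The exact implementation of the bad_teacher baseline is not shown; a plausible sketch of a distillation-style forgetting step follows, where retain batches are distilled from a frozen copy of the pretrained model and forget batches are pushed toward a uniform "bad teacher". The function name, the uniform target, and the unweighted loss sum are all assumptions:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def bad_teacher_step(model, frozen_teacher, retain_batch, forget_batch, optimizer):
    # Retain batches: distill from a frozen copy of the pretrained model.
    # Forget batches: distill toward a uniform distribution, erasing class evidence.
    retain_x, _ = retain_batch
    forget_x, _ = forget_batch
    optimizer.zero_grad()
    with torch.no_grad():
        good_probs = F.softmax(frozen_teacher(retain_x), dim=1)
    retain_loss = F.kl_div(F.log_softmax(model(retain_x), dim=1), good_probs,
                           reduction="batchmean")
    forget_logits = model(forget_x)
    uniform = torch.full_like(forget_logits, 1.0 / forget_logits.shape[1])
    forget_loss = F.kl_div(F.log_softmax(forget_logits, dim=1), uniform,
                           reduction="batchmean")
    loss = retain_loss + forget_loss
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny demo on a hypothetical 4-feature, 3-class toy model.
model = nn.Linear(4, 3)
teacher = copy.deepcopy(model).eval()
opt = torch.optim.Adam(model.parameters(), lr=0.001)
loss_val = bad_teacher_step(model, teacher,
                            (torch.randn(8, 4), torch.randint(0, 3, (8,))),
                            (torch.randn(8, 4), torch.zeros(8, dtype=torch.long)),
                            opt)
```

Because the forget target never references the original labels, this style of update tends to drive forget accuracy to zero (as in the results table) while the retain-side distillation limits collateral damage.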
Code
1"""Editable unlearning method for MLS-Bench."""23import torch4import torch.nn.functional as F56# ============================================================7# EDITABLE8# ============================================================9class UnlearningMethod:10"""Default retain-only finetuning update."""1112def __init__(self):13self.forget_weight = 0.01415def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):
1"""Fixed evaluation harness for security-machine-unlearning.23Pipeline:41. Load full dataset with standard augmentation52. Split into retain set (all classes except forget_class) and forget set63. Pretrain model on FULL training set for --pretrain-epochs (SGD + CosineAnnealing)74. Run unlearning: agent method processes retain/forget batches for --unlearn-epochs85. Evaluate: retain_acc, forget_acc, forget_mia_auc96. Compute unlearn_score = (retain_acc + (1-forget_acc) + (1-forget_mia_auc)) / 310"""1112import argparse13import math14import os15import random
Results
| Model | Type | retain acc vgg16bn cifar100 class0 ↑ | forget acc vgg16bn cifar100 class0 ↓ | forget mia auc vgg16bn cifar100 class0 ↓ | unlearn score vgg16bn cifar100 class0 ↑ | retain acc resnet20 cifar10 class0 ↑ | forget acc resnet20 cifar10 class0 ↓ | forget mia auc resnet20 cifar10 class0 ↓ | unlearn score resnet20 cifar10 class0 ↑ | retain acc mobilenetv2 fmnist class0 ↑ | forget acc mobilenetv2 fmnist class0 ↓ | forget mia auc mobilenetv2 fmnist class0 ↓ | unlearn score mobilenetv2 fmnist class0 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bad_teacher | baseline | 0.463 | 0.000 | 0.420 | 0.681 | 0.844 | 0.001 | 0.414 | 0.810 | 0.929 | 0.000 | 0.494 | 0.812 |
| negative_gradient | baseline | 0.010 | 0.000 | 0.363 | 0.549 | 0.173 | 0.000 | 0.126 | 0.682 | 0.111 | 0.000 | 0.038 | 0.691 |
| retain_finetune | baseline | 0.534 | 0.000 | 0.476 | 0.686 | 0.876 | 0.000 | 0.451 | 0.808 | 0.937 | 0.000 | 0.482 | 0.819 |
| scrub | baseline | 0.451 | 0.000 | 0.440 | 0.670 | 0.831 | 0.000 | 0.397 | 0.811 | 0.924 | 0.000 | 0.521 | 0.801 |
| anthropic/claude-opus-4.6 | vanilla | 0.392 | 0.000 | 0.412 | 0.660 | 0.199 | 0.000 | 0.460 | 0.580 | 0.858 | 0.003 | 0.467 | 0.796 |
| deepseek-reasoner | vanilla | 0.010 | 0.000 | 0.363 | 0.549 | 0.123 | 0.000 | 0.146 | 0.659 | 0.112 | 0.000 | 0.046 | 0.689 |
| google/gemini-3.1-pro-preview | vanilla | 0.514 | 0.000 | 0.518 | 0.665 | 0.909 | 0.000 | 0.439 | 0.824 | 0.948 | 0.000 | 0.521 | 0.809 |
| openai/gpt-5.4-pro | vanilla | 0.522 | 0.000 | 0.508 | 0.671 | 0.901 | 0.000 | 0.429 | 0.824 | 0.943 | 0.000 | 0.518 | 0.808 |
| qwen3.6-plus:free | vanilla | 0.487 | 0.000 | 0.414 | 0.691 | 0.854 | 0.000 | 0.418 | 0.812 | 0.934 | 0.000 | 0.503 | 0.810 |
| anthropic/claude-opus-4.6 | agent | 0.038 | 0.000 | 0.453 | 0.528 | 0.869 | 0.033 | 0.455 | 0.794 | 0.884 | 0.000 | 0.495 | 0.796 |
| deepseek-reasoner | agent | 0.089 | 0.000 | 0.569 | 0.507 | 0.157 | 0.000 | 0.264 | 0.631 | 0.111 | 0.000 | 0.048 | 0.688 |
| google/gemini-3.1-pro-preview | agent | 0.549 | 0.000 | 0.491 | 0.686 | 0.909 | 0.003 | 0.420 | 0.829 | - | - | - | - |
| openai/gpt-5.4-pro | agent | 0.522 | 0.000 | 0.508 | 0.671 | 0.901 | 0.000 | 0.429 | 0.824 | 0.943 | 0.000 | 0.518 | 0.808 |
| qwen3.6-plus:free | agent | 0.464 | 0.000 | 0.409 | 0.685 | 0.859 | 0.001 | 0.391 | 0.822 | 0.933 | 0.000 | 0.482 | 0.817 |