security-machine-unlearning
Description
Machine Unlearning via Targeted Update Rules
Research Question
How can we design a stronger unlearning update rule that removes information about a forget set while retaining as much utility as possible on the retained data?
Background
Machine unlearning methods approximate the effect of retraining without the deleted data. The central tradeoff is clear: aggressive forgetting reduces utility, while conservative updates leave measurable traces of the forgotten examples.
The harness pretrains a standard vision model (ResNet-20, VGG-16-BN, or MobileNetV2) on the full training set for 80 epochs using SGD with cosine annealing. After pretraining, a single class is designated as the forget set. Your unlearning method then runs for 20 epochs, receiving both retain-set and forget-set minibatches each step, with an Adam optimizer (lr=0.001).
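The pretraining recipe can be sketched as follows. The base learning rate, momentum, and weight decay are illustrative assumptions (the harness only specifies SGD with cosine annealing for 80 epochs), and the random batch and tiny linear model stand in for the real data loader and architectures:

```python
import torch
import torch.nn as nn

# Stand-in for the harness's model (ResNet-20, VGG-16-BN, or MobileNetV2).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

PRETRAIN_EPOCHS = 80
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,  # lr/momentum/wd assumed
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=PRETRAIN_EPOCHS)
criterion = nn.CrossEntropyLoss()

for epoch in range(PRETRAIN_EPOCHS):
    # ... the real harness iterates over the full training set here ...
    images, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()  # anneal once per epoch; lr decays to ~0 at epoch 80
```

Annealing the schedule per epoch (not per step) matches the `T_max=80` horizon, so the learning rate reaches its minimum exactly when pretraining ends.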
Task
Implement a better unlearning rule in bench/unlearning/custom_unlearning.py. The fixed harness trains an initial model, defines a forget split, and then applies your update rule for a fixed number of unlearning steps using retain and forget minibatches.
Your method should lower forget-set memorization while preserving retained-task accuracy.
Editable Interface
You must implement:
```python
class UnlearningMethod:
    def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):
        ...
```
- retain_batch: (images, labels) tuple from retained data (already on device)
- forget_batch: (images, labels) tuple from the forget set (already on device)
- optimizer: fixed Adam optimizer instance (lr=0.001)
- Return value: dict containing at least a loss entry
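To make the contract concrete, here is a sketch of one method that fits this interface: it descends the retain loss while ascending the forget loss (the strategy of the negative_gradient baseline listed below). The 0.5 ascent weight and the tiny demo model are illustrative assumptions, not harness values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnlearningMethod:
    def __init__(self):
        # Illustrative weight on the gradient-ascent term (assumed, not fixed by the harness).
        self.forget_weight = 0.5

    def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):
        retain_x, retain_y = retain_batch
        forget_x, forget_y = forget_batch
        optimizer.zero_grad()
        retain_loss = F.cross_entropy(model(retain_x), retain_y)
        forget_loss = F.cross_entropy(model(forget_x), forget_y)
        # Descend on retained data, ascend on the forget set.
        loss = retain_loss - self.forget_weight * forget_loss
        loss.backward()
        optimizer.step()
        return {"loss": loss.item()}

# Quick demo on a hypothetical 4-feature, 3-class toy model.
model = nn.Linear(4, 3)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # matches the harness setting
method = UnlearningMethod()
retain_batch = (torch.randn(8, 4), torch.randint(0, 3, (8,)))
forget_batch = (torch.randn(8, 4), torch.zeros(8, dtype=torch.long))
out = method.unlearn_step(model, retain_batch, forget_batch, optimizer, step=0, epoch=0)
```

Note that an unbounded ascent term can destroy retain accuracy (see the negative_gradient rows in the results), so practical methods usually clamp or anneal it.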
The architecture, initial training, forget split, and evaluation probes are fixed.
Evaluation
Benchmarks:
- resnet20-cifar10-class0: ResNet-20 on CIFAR-10, forgetting class 0
- vgg16bn-cifar100-class0: VGG-16-BN on CIFAR-100, forgetting class 0
- mobilenetv2-fmnist-class0: MobileNetV2 on FashionMNIST, forgetting class 0
Reported metrics:
- retain_acc: accuracy on non-forget test data
- forget_acc: accuracy on forget-class test data (lower is better)
- forget_mia_auc: membership inference attack AUC on forget set (lower is better)
- unlearn_score: (retain_acc + (1 - forget_acc) + (1 - forget_mia_auc)) / 3
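The score is a plain average of retain accuracy and the complements of the two lower-is-better metrics, checked here against the retain_finetune row for ResNet-20/CIFAR-10 in the results table:

```python
def unlearn_score(retain_acc, forget_acc, forget_mia_auc):
    # Average retain accuracy with the complements of the two lower-is-better metrics.
    return (retain_acc + (1 - forget_acc) + (1 - forget_mia_auc)) / 3

score = unlearn_score(0.876, 0.000, 0.451)
print(round(score, 3))  # → 0.808
```

One consequence of the formula: an ideal unlearner (retain_acc = 1.0, forget_acc = 0.0, and MIA AUC at the chance level of 0.5) scores (1 + 1 + 0.5) / 3 ≈ 0.833, since an AUC of 0.5 means the membership attack cannot distinguish forget-set members at all.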
Primary metric: unlearn_score (higher is better).
Baselines
- retain_finetune: continue training only on retained data
- negative_gradient: ascend forget loss and descend retain loss
- bad_teacher: distillation-style forgetting baseline
- scrub: stronger representation-scrubbing baseline
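The exact implementation of the bad_teacher baseline is not shown; a plausible sketch of a distillation-style forgetting step follows, where retain batches are distilled from a frozen copy of the pretrained model and forget batches are pushed toward a uniform "bad teacher". The function name, the uniform target, and the unweighted loss sum are all assumptions:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def bad_teacher_step(model, frozen_teacher, retain_batch, forget_batch, optimizer):
    # Retain batches: distill from a frozen copy of the pretrained model.
    # Forget batches: distill toward a uniform distribution, erasing class evidence.
    retain_x, _ = retain_batch
    forget_x, _ = forget_batch
    optimizer.zero_grad()
    with torch.no_grad():
        good_probs = F.softmax(frozen_teacher(retain_x), dim=1)
    retain_loss = F.kl_div(F.log_softmax(model(retain_x), dim=1), good_probs,
                           reduction="batchmean")
    forget_logits = model(forget_x)
    uniform = torch.full_like(forget_logits, 1.0 / forget_logits.shape[1])
    forget_loss = F.kl_div(F.log_softmax(forget_logits, dim=1), uniform,
                           reduction="batchmean")
    loss = retain_loss + forget_loss
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny demo on a hypothetical 4-feature, 3-class toy model.
model = nn.Linear(4, 3)
teacher = copy.deepcopy(model).eval()
opt = torch.optim.Adam(model.parameters(), lr=0.001)
loss_val = bad_teacher_step(model, teacher,
                            (torch.randn(8, 4), torch.randint(0, 3, (8,))),
                            (torch.randn(8, 4), torch.zeros(8, dtype=torch.long)),
                            opt)
```

Because the forget target never references the original labels, this style of update tends to drive forget accuracy to zero (as in the results table) while the retain-side distillation limits collateral damage.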
Code
1"""Editable unlearning method for MLS-Bench."""23import torch4import torch.nn.functional as F56# ============================================================7# EDITABLE8# ============================================================9class UnlearningMethod:10"""Default retain-only finetuning update."""1112def __init__(self):13self.forget_weight = 0.01415def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):
1"""Fixed evaluation harness for security-machine-unlearning.23Pipeline:41. Load full dataset with standard augmentation52. Split into retain set (all classes except forget_class) and forget set63. Pretrain model on FULL training set for --pretrain-epochs (SGD + CosineAnnealing)74. Run unlearning: agent method processes retain/forget batches for --unlearn-epochs85. Evaluate: retain_acc, forget_acc, forget_mia_auc96. Compute unlearn_score = (retain_acc + (1-forget_acc) + (1-forget_mia_auc)) / 310"""1112import argparse13import math14import os15import random
Results
| Model | Type | retain acc vgg16bn cifar100 class0 ↑ | forget acc vgg16bn cifar100 class0 ↓ | forget mia auc vgg16bn cifar100 class0 ↓ | unlearn score vgg16bn cifar100 class0 ↑ | retain acc resnet20 cifar10 class0 ↑ | forget acc resnet20 cifar10 class0 ↓ | forget mia auc resnet20 cifar10 class0 ↓ | unlearn score resnet20 cifar10 class0 ↑ | retain acc mobilenetv2 fmnist class0 ↑ | forget acc mobilenetv2 fmnist class0 ↓ | forget mia auc mobilenetv2 fmnist class0 ↓ | unlearn score mobilenetv2 fmnist class0 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bad_teacher | baseline | 0.463 | 0.000 | 0.420 | 0.681 | 0.844 | 0.001 | 0.414 | 0.810 | 0.929 | 0.000 | 0.494 | 0.812 |
| negative_gradient | baseline | 0.010 | 0.000 | 0.363 | 0.549 | 0.173 | 0.000 | 0.126 | 0.682 | 0.111 | 0.000 | 0.038 | 0.691 |
| retain_finetune | baseline | 0.534 | 0.000 | 0.476 | 0.686 | 0.876 | 0.000 | 0.451 | 0.808 | 0.937 | 0.000 | 0.482 | 0.819 |
| scrub | baseline | 0.451 | 0.000 | 0.440 | 0.670 | 0.831 | 0.000 | 0.397 | 0.811 | 0.924 | 0.000 | 0.521 | 0.801 |
| anthropic/claude-opus-4.6 | vanilla | 0.392 | 0.000 | 0.412 | 0.660 | 0.199 | 0.000 | 0.460 | 0.580 | 0.858 | 0.003 | 0.467 | 0.796 |
| deepseek-reasoner | vanilla | 0.010 | 0.000 | 0.363 | 0.549 | 0.123 | 0.000 | 0.146 | 0.659 | 0.112 | 0.000 | 0.046 | 0.689 |
| google/gemini-3.1-pro-preview | vanilla | 0.514 | 0.000 | 0.518 | 0.665 | 0.909 | 0.000 | 0.439 | 0.824 | 0.948 | 0.000 | 0.521 | 0.809 |
| openai/gpt-5.4-pro | vanilla | 0.522 | 0.000 | 0.508 | 0.671 | 0.901 | 0.000 | 0.429 | 0.824 | 0.943 | 0.000 | 0.518 | 0.808 |
| qwen3.6-plus:free | vanilla | 0.487 | 0.000 | 0.414 | 0.691 | 0.854 | 0.000 | 0.418 | 0.812 | 0.934 | 0.000 | 0.503 | 0.810 |
| anthropic/claude-opus-4.6 | agent | 0.038 | 0.000 | 0.453 | 0.528 | 0.869 | 0.033 | 0.455 | 0.794 | 0.884 | 0.000 | 0.495 | 0.796 |
| deepseek-reasoner | agent | 0.089 | 0.000 | 0.569 | 0.507 | 0.157 | 0.000 | 0.264 | 0.631 | 0.111 | 0.000 | 0.048 | 0.688 |
| google/gemini-3.1-pro-preview | agent | 0.549 | 0.000 | 0.491 | 0.686 | 0.909 | 0.003 | 0.420 | 0.829 | - | - | - | - |
| openai/gpt-5.4-pro | agent | 0.522 | 0.000 | 0.508 | 0.671 | 0.901 | 0.000 | 0.429 | 0.824 | 0.943 | 0.000 | 0.518 | 0.808 |
| qwen3.6-plus:free | agent | 0.464 | 0.000 | 0.409 | 0.685 | 0.859 | 0.001 | 0.391 | 0.822 | 0.933 | 0.000 | 0.482 | 0.817 |