Agent Conversation: google/gemini-3.1-pro-preview
Task: security-machine-unlearning
Machine Unlearning via Targeted Update Rules
Research Question
How can we design a stronger unlearning update rule that removes information about a forget set while retaining as much utility as possible on the retained data?
Background
Machine unlearning methods approximate the effect of retraining without the deleted data. The central tradeoff is clear: aggressive forgetting reduces utility, while conservative updates leave measurable traces of the forgotten examples.
Task
Implement a better unlearning rule in bench/unlearning/custom_unlearning.py. The fixed harness trains an initial model, defines a forget split, and then applies your update rule for a fixed number of unlearning steps using retain and forget minibatches.
Your method should lower forget-set memorization while preserving retained-task accuracy.
Editable Interface
You must implement:
class UnlearningMethod:
def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):
...
- retain_batch: minibatch sampled from retained data
- forget_batch: minibatch sampled from the forget set
- optimizer: fixed optimizer instance
- Return value: dict containing at least loss
The architecture, initial training, forget split, and evaluation probes are fixed.
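As a rough mental model, the fixed harness can be pictured as the loop below. This is only a sketch: the loader and epoch-count names (retain_loader, forget_loader, num_unlearn_epochs) are illustrative, not the benchmark's actual API.

    # Hypothetical driver loop; the real harness is fixed and not shown in this transcript.
    method = UnlearningMethod()
    for epoch in range(num_unlearn_epochs):
        for step, (retain_batch, forget_batch) in enumerate(
                zip(retain_loader, forget_loader)):
            stats = method.unlearn_step(model, retain_batch, forget_batch,
                                        optimizer, step, epoch)
            assert "loss" in stats  # the returned dict must expose at least "loss"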
Evaluation
Benchmarks:
- mnist-class0
- fmnist-class0
- cifar10-class0
Reported metrics:
- retain_acc
- forget_acc
- forget_mia_auc
- unlearn_score
Primary metric: unlearn_score (higher is better).
Baselines
- retain_finetune: continue training only on retained data
- negative_gradient: ascend forget loss and descend retain loss (see the sketch below)
- bad_teacher: distillation-style forgetting baseline
- scrub: stronger representation-scrubbing baseline
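For concreteness, here is a minimal sketch of what a negative_gradient-style step could look like under the interface above. The baseline's actual implementation is not shown in this transcript, so the sign convention and the forget_weight coefficient are assumptions.

    import torch.nn.functional as F

    class NegativeGradientSketch:
        """Hypothetical negative-gradient update: descend retain loss, ascend forget loss."""

        def __init__(self, forget_weight=1.0):
            self.forget_weight = forget_weight  # assumed value, not taken from the benchmark

        def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):
            retain_x, retain_y = retain_batch
            forget_x, forget_y = forget_batch
            retain_loss = F.cross_entropy(model(retain_x), retain_y)
            forget_loss = F.cross_entropy(model(forget_x), forget_y)
            # Subtracting the forget term turns descent on the total loss
            # into ascent on the forget loss.
            loss = retain_loss - self.forget_weight * forget_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return {"loss": loss.item()}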
opacus/bench/unlearning/custom_unlearning.py [EDITABLE — entire file only]
1: """Editable unlearning method for MLS-Bench."""
2:
3: import torch
4: import torch.nn.functional as F
5:
6: # ============================================================
7: # EDITABLE
8: # ============================================================
9: class UnlearningMethod:
10: """Default retain-only finetuning update."""
11:
12: def __init__(self):
13: self.forget_weight = 0.0
14:
15: def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):
16: retain_x, retain_y = retain_batch
17: logits = model(retain_x)
18: loss = F.cross_entropy(logits, retain_y)
19: optimizer.zero_grad()
20: loss.backward()
21: optimizer.step()
22: return {"loss": loss.item()}
23: # ============================================================
24: # END EDITABLE
25: # ============================================================
retain_finetune baseline — editable region [READ-ONLY — ref
Tool Input
1{
2 "end_line": 22,
3 "start_line": 9,
4 "filename": "opacus/bench/unlearning/custom_unlearning.py",
5 "op": "replace",
6 "content": "class UnlearningMethod:\n \"\"\"Uniform over other classes to completely eliminate the true class probability gracefully.\"\"\"\n\n def __init__(self):\n self.forget_weight = 1.0\n\n def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):\n retain_x, retain_y = retain_batch\n forget_x, forget_y = forget_batch\n\n # Retain task: fit normally\n retain_logits = model(retain_x)\n retain_loss = F.cross_entropy(retain_logits, retain_y)\n\n # Forget task: force to uniform over ALL OTHER classes\n forget_logits = model(forget_x)\n num_classes = forget_logits.size(1)\n log_probs = F.log_softmax(forget_logits, dim=1)\n \n # Zero out the true class log-probability\n mask = torch.ones_like(forget_logits)\n mask.scatter_(1, forget_y.unsqueeze(1), 0.0)\n \n # Average over (num_classes - 1)\n forget_loss = -(log_probs * mask).sum(dim=1) / (num_classes - 1)\n forget_loss = forget_loss.mean()\n\n loss = retain_loss + self.forget_weight * forget_loss\n\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n\n return {\"loss\": loss.item(), \"retain_loss\": retain_loss.item(), \"forget_loss\": forget_loss.item()}",
7 "after_line": 1
8}
OK: Replaced lines 9..22 with 34 line(s) in opacus/bench/unlearning/custom_unlearning.py. Editable range: entire file.
[Current file: opacus/bench/unlearning/custom_unlearning.py | editable: 1–45 | total: 45 lines]
1 """Editable unlearning method for MLS-Bench."""
2
3 import torch
4 import torch.nn.functional as F
5
6 # ============================================================
7 # EDITABLE
8 # ============================================================
9 class UnlearningMethod:
10 """Uniform over other classes to completely eliminate the true class probability gracefully."""
11
12 def __init__(self):
13 self.forget_weight = 1.0
14
15 def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):
16 retain_x, retain_y = retain_batch
17 forget_x, forget_y = forget_batch
18
19 # Retain task: fit normally
20
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)
## Seed 42
### mnist-class0 (scripts/mnist_class0.sh)
TRAIN_METRICS phase=pretrain epoch=1 loss=2.2913
TRAIN_METRICS phase=pretrain epoch=2 loss=1.8422
TRAIN_METRICS phase=pretrain epoch=3 loss=1.2384
TRAIN_METRICS phase=pretrain epoch=4 loss=0.6924
TRAIN_METRICS phase=unlearn epoch=1 loss=5.3688
TRAIN_METRICS phase=unlearn epoch=2 loss=3.8308
mnist-class0: retain_acc=0.7047, forget_acc=0.0000, forget_mia_auc=0.4072, unlearn_score=0.7658
### fmnist-class0 (scripts/fmnist_class0.sh)
TRAIN_METRICS phase=pretrain epoch=1 loss=2.1444
TRAIN_METRICS phase=pretrain epoch=2 loss=1.2923
TRAIN_METRICS phase=pretrain epoch=3 loss=0.9683
TRAIN_METRICS phase=pretrain epoch=4 loss=0.8313
TRAIN_METRICS phase=unlearn epoch=1 loss=5.8478
TRAIN_METRICS phase=unlearn epoch=2 loss=4.0048
fmnist-class0: retain_acc=0.4572, forget_acc=0.0000, forget_mia_auc=0.4775, unlearn_score=0.6599
### cifar10-class0 (scripts/cifar10_class0.sh)
TR
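For reference, the forget objective introduced in this first edit is the mean negative log-probability over the non-true classes:

    \mathcal{L}_{\text{forget}} = -\frac{1}{K-1} \sum_{c \neq y} \log p_\theta(c \mid x)

Subject to \sum_c p_\theta(c \mid x) = 1, this is minimized at p_\theta(y \mid x) = 0 with p_\theta(c \mid x) = 1/(K-1) for every c \neq y, i.e. a uniform distribution over the other classes. That matches the forget_acc=0.0000 and below-chance forget_mia_auc above, obtained at a visible cost in retain_acc.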
Tool Input
1{
2 "end_line": 40,
3 "after_line": 1,
4 "start_line": 9,
5 "content": "class UnlearningMethod:\n \"\"\"Next-Best Class Unlearning: maps forget examples to the most likely incorrect class.\"\"\"\n\n def __init__(self):\n self.forget_weight = 1.0\n\n def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):\n retain_x, retain_y = retain_batch\n forget_x, forget_y = forget_batch\n\n # Retain task\n retain_logits = model(retain_x)\n retain_loss = F.cross_entropy(retain_logits, retain_y)\n\n # Forget task: force to most likely alternative class\n forget_logits = model(forget_x)\n with torch.no_grad():\n masked_logits = forget_logits.clone()\n # Mask out the true class with negative infinity\n masked_logits.scatter_(1, forget_y.unsqueeze(1), -float('inf'))\n # Predict the next most likely class\n alt_labels = masked_logits.argmax(dim=1)\n \n forget_loss = F.cross_entropy(forget_logits, alt_labels)\n\n loss = retain_loss + self.forget_weight * forget_loss\n\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n\n return {\"loss\": loss.item(), \"retain_loss\": retain_loss.item(), \"forget_loss\": forget_loss.item()}",
6 "op": "replace",
7 "filename": "opacus/bench/unlearning/custom_unlearning.py"
8}
OK: Replaced lines 9..40 with 32 line(s) in opacus/bench/unlearning/custom_unlearning.py. Editable range: entire file.
[Current file: opacus/bench/unlearning/custom_unlearning.py | editable: 1–45 | total: 45 lines]
1 """Editable unlearning method for MLS-Bench."""
2
3 import torch
4 import torch.nn.functional as F
5
6 # ============================================================
7 # EDITABLE
8 # ============================================================
9 class UnlearningMethod:
10 """Next-Best Class Unlearning: maps forget examples to the most likely incorrect class."""
11
12 def __init__(self):
13 self.forget_weight = 1.0
14
15 def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):
16 retain_x, retain_y = retain_batch
17 forget_x, forget_y = forget_batch
18
19 # Retain task
20 retain_logit
[Test #2] (1 test remaining; call submit(n=2) to submit this result)
### mnist-class0 (scripts/mnist_class0.sh)
TRAIN_METRICS phase=pretrain epoch=1 loss=2.2914
TRAIN_METRICS phase=pretrain epoch=2 loss=1.8466
TRAIN_METRICS phase=pretrain epoch=3 loss=1.2840
TRAIN_METRICS phase=pretrain epoch=4 loss=0.7437
TRAIN_METRICS phase=unlearn epoch=1 loss=1.5661
TRAIN_METRICS phase=unlearn epoch=2 loss=1.0425
mnist-class0: retain_acc=0.7485, forget_acc=0.0000, forget_mia_auc=0.6209, unlearn_score=0.7092
### fmnist-class0 (scripts/fmnist_class0.sh)
TRAIN_METRICS phase=pretrain epoch=1 loss=2.1445
TRAIN_METRICS phase=pretrain epoch=2 loss=1.2782
TRAIN_METRICS phase=pretrain epoch=3 loss=0.9637
TRAIN_METRICS phase=pretrain epoch=4 loss=0.8266
TRAIN_METRICS phase=unlearn epoch=1 loss=1.3593
TRAIN_METRICS phase=unlearn epoch=2 loss=1.2177
fmnist-class0: retain_acc=0.6367, forget_acc=0.0000, forget_mia_auc=0.5074, unlearn_score=0.7098
### cifar10-class0 (scripts/cifar10_class0.sh)
TRAIN_METRICS ph
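In symbols, this second edit's forget target is a hard pseudo-label: the model's own most likely class other than the true one (z denotes the logits):

    \hat{y} = \arg\max_{c \neq y} z_c, \qquad \mathcal{L}_{\text{forget}} = \mathrm{CE}(z, \hat{y})

Because \hat{y} is recomputed from the current model under torch.no_grad() at every step, the target can drift as unlearning proceeds. Relative to the uniform target, this keeps retain_acc higher but pushes forget_mia_auc above 0.5 on mnist-class0, suggesting the confident substitute label leaves a more detectable trace.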
Tool Input
1{
2 "content": "class UnlearningMethod:\n \"\"\"Reverse Cross-Entropy Unlearning: -log(1 - p_y). smoothly self-anneals as p_y -> 0.\"\"\"\n\n def __init__(self):\n self.forget_weight = 1.0\n\n def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):\n retain_x, retain_y = retain_batch\n forget_x, forget_y = forget_batch\n\n retain_logits = model(retain_x)\n retain_loss = F.cross_entropy(retain_logits, retain_y)\n\n forget_logits = model(forget_x)\n forget_probs = torch.softmax(forget_logits.double(), dim=1)\n forget_true_probs = forget_probs.gather(1, forget_y.view(-1, 1)).squeeze(1)\n # 1-p_y naturally caps the gradient. Using double prevents 1-p_y from being exactly 0 for reasonable logits.\n forget_loss = -torch.log(1.0 - forget_true_probs + 1e-12).mean().float()\n\n loss = retain_loss + self.forget_weight * forget_loss\n\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n\n return {\"loss\": loss.item(), \"retain_loss\": retain_loss.item(), \"forget_loss\": forget_loss.item()}",
3 "after_line": 1,
4 "op": "replace",
5 "start_line": 9,
6 "end_line": 40,
7 "filename": "opacus/bench/unlearning/custom_unlearning.py"
8}
OK: Replaced lines 9..40 with 26 line(s) in opacus/bench/unlearning/custom_unlearning.py. Editable range: entire file.
[Current file: opacus/bench/unlearning/custom_unlearning.py | editable: 1–39 | total: 39 lines]
1 """Editable unlearning method for MLS-Bench."""
2
3 import torch
4 import torch.nn.functional as F
5
6 # ============================================================
7 # EDITABLE
8 # ============================================================
9 class UnlearningMethod:
10 """Reverse Cross-Entropy Unlearning: -log(1 - p_y). smoothly self-anneals as p_y -> 0."""
11
12 def __init__(self):
13 self.forget_weight = 1.0
14
15 def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):
16 retain_x, retain_y = retain_batch
17 forget_x, forget_y = forget_batch
18
19 retain_logits = model(retain_x)
20
[Test #3] (0 tests remaining)
## Seed 42
### mnist-class0 (scripts/mnist_class0.sh)
TRAIN_METRICS phase=pretrain epoch=1 loss=2.2914
TRAIN_METRICS phase=pretrain epoch=2 loss=1.8462
TRAIN_METRICS phase=pretrain epoch=3 loss=1.2810
TRAIN_METRICS phase=pretrain epoch=4 loss=0.7332
TRAIN_METRICS phase=unlearn epoch=1 loss=1.0490
TRAIN_METRICS phase=unlearn epoch=2 loss=0.7028
mnist-class0: retain_acc=0.7764, forget_acc=0.0000, forget_mia_auc=0.6030, unlearn_score=0.7245
### fmnist-class0 (scripts/fmnist_class0.sh)
TRAIN_METRICS phase=pretrain epoch=1 loss=2.1443
TRAIN_METRICS phase=pretrain epoch=2 loss=1.2959
TRAIN_METRICS phase=pretrain epoch=3 loss=0.9681
TRAIN_METRICS phase=pretrain epoch=4 loss=0.8331
TRAIN_METRICS phase=unlearn epoch=1 loss=0.9093
TRAIN_METRICS phase=unlearn epoch=2 loss=0.7806
fmnist-class0: retain_acc=0.7100, forget_acc=0.0000, forget_mia_auc=0.4865, unlearn_score=0.7412
### cifar10-class0 (scripts/cifar10_class0.sh)
TRAIN_METRICS phase=pretrain epoch=1 loss=
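The "self-anneals" claim in this docstring checks out analytically. With p = softmax(z) and \mathcal{L} = -\log(1 - p_y), the gradient with respect to the true logit is

    \frac{\partial \mathcal{L}}{\partial z_y} = \frac{1}{1 - p_y} \cdot p_y (1 - p_y) = p_y

so the push on the true logit shrinks in proportion to p_y and vanishes as p_y \to 0. By contrast, plain gradient ascent on cross-entropy has true-logit gradient magnitude 1 - p_y, which approaches 1 as p_y \to 0 and keeps driving the logit toward -\infty.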
Tool Input
1{
2 "op": "replace",
3 "filename": "bench/unlearning/custom_unlearning.py",
4 "end_line": 24,
5 "after_line": 8,
6 "content": "class UnlearningMethod:\n \"\"\"Targeted Knowledge Distillation with Masked Forget Labels.\"\"\"\n\n def __init__(self):\n self.orig_model = None\n self.temperature = 2.0 \n self.forget_weight = 1.0\n self.retain_weight = 1.0\n\n def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):\n if self.orig_model is None:\n import copy\n self.orig_model = copy.deepcopy(model)\n self.orig_model.eval()\n for p in self.orig_model.parameters():\n p.requires_grad = False\n\n retain_x, retain_y = retain_batch\n forget_x, forget_y = forget_batch\n\n # Retain step: CE + KD\n retain_logits = model(retain_x)\n with torch.no_grad():\n orig_retain_logits = self.orig_model(retain_x)\n \n loss_ce = F.cross_entropy(retain_logits, retain_y)\n loss_kd_retain = F.kl_div(\n F.log_softmax(retain_logits / self.temperature, dim=1),\n F.softmax(orig_retain_logits / self.temperature, dim=1),\n reduction='batchmean'\n ) * (self.temperature ** 2)\n retain_loss = 0.5 * loss_ce + 0.5 * loss_kd_retain\n\n # Forget step: Match orig model but without the true class\n forget_logits = model(forget_x)\n with torch.no_grad():\n orig_forget_logits = self.orig_model(forget_x).clone()\n orig_forget_logits.scatter_(1, forget_y.unsqueeze(1), -1e4) # Mask out forget class\n forget_target_probs = F.softmax(orig_forget_logits / self.temperature, dim=1)\n\n forget_loss = F.kl_div(\n F.log_softmax(forget_logits / self.temperature, dim=1),\n forget_target_probs,\n reduction='batchmean'\n ) * (self.temperature ** 2)\n\n loss = self.retain_weight * retain_loss + self.forget_weight * forget_loss\n\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n\n return {\"loss\": loss.item(), \"retain_loss\": retain_loss.item(), \"forget_loss\": forget_loss.item()}",
7 "start_line": 9
8}
ERROR: Package 'bench' is not in allowed packages
Tool Input
1{
2 "start_line": 9,
3 "content": "class UnlearningMethod:\n \"\"\"Targeted Knowledge Distillation with Masked Forget Labels.\"\"\"\n\n def __init__(self):\n self.orig_model = None\n self.temperature = 2.0 \n self.forget_weight = 1.0\n self.retain_weight = 1.0\n\n def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):\n if self.orig_model is None:\n import copy\n self.orig_model = copy.deepcopy(model)\n self.orig_model.eval()\n for p in self.orig_model.parameters():\n p.requires_grad = False\n\n retain_x, retain_y = retain_batch\n forget_x, forget_y = forget_batch\n\n # Retain step: CE + KD\n retain_logits = model(retain_x)\n with torch.no_grad():\n orig_retain_logits = self.orig_model(retain_x)\n \n loss_ce = F.cross_entropy(retain_logits, retain_y)\n loss_kd_retain = F.kl_div(\n F.log_softmax(retain_logits / self.temperature, dim=1),\n F.softmax(orig_retain_logits / self.temperature, dim=1),\n reduction='batchmean'\n ) * (self.temperature ** 2)\n retain_loss = 0.5 * loss_ce + 0.5 * loss_kd_retain\n\n # Forget step: Match orig model but without the true class\n forget_logits = model(forget_x)\n with torch.no_grad():\n orig_forget_logits = self.orig_model(forget_x).clone()\n orig_forget_logits.scatter_(1, forget_y.unsqueeze(1), -1e4) # Mask out forget class\n forget_target_probs = F.softmax(orig_forget_logits / self.temperature, dim=1)\n\n forget_loss = F.kl_div(\n F.log_softmax(forget_logits / self.temperature, dim=1),\n forget_target_probs,\n reduction='batchmean'\n ) * (self.temperature ** 2)\n\n loss = self.retain_weight * retain_loss + self.forget_weight * forget_loss\n\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n\n return {\"loss\": loss.item(), \"retain_loss\": retain_loss.item(), \"forget_loss\": forget_loss.item()}",
4 "end_line": 24,
5 "after_line": 8,
6 "op": "replace",
7 "filename": "pytorch-vision/bench/unlearning/custom_unlearning.py"
8}
OK: Replaced lines 9..24 with 53 line(s) in pytorch-vision/bench/unlearning/custom_unlearning.py. Editable range: entire file.
[Current file: pytorch-vision/bench/unlearning/custom_unlearning.py | editable: 1–62 | total: 62 lines]
1 """Editable unlearning method for MLS-Bench."""
2
3 import torch
4 import torch.nn.functional as F
5
6 # ============================================================
7 # EDITABLE
8 # ============================================================
9 class UnlearningMethod:
10 """Targeted Knowledge Distillation with Masked Forget Labels."""
11
12 def __init__(self):
13 self.orig_model = None
14 self.temperature = 2.0
15 self.forget_weight = 1.0
16 self.retain_weight = 1.0
17
18 def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):
19 if self.orig_model is None:
20
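This fourth edit keeps a frozen deepcopy of the pre-unlearning model as its own teacher. On the forget batch the true class is masked out of the teacher's logits before softening, so the student is distilled toward the teacher's relative beliefs over the remaining classes:

    q = \mathrm{softmax}\big((z^{\text{teacher}} + m)/T\big), \quad m_y = -10^4, \; m_{c \neq y} = 0
    \mathcal{L}_{\text{forget}} = T^2 \, \mathrm{KL}\big(q \,\|\, \mathrm{softmax}(z/T)\big)

The T^2 factor is the standard distillation correction for the 1/T^2 gradient scaling introduced by the temperature; note that F.kl_div with a log-probability input and a probability target computes KL(target || input), matching the direction written here.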
Tool Input
1{
2 "end_line": 24,
3 "content": "class UnlearningMethod:\n \"\"\"Uses KD on retain set to preserve utility, and soft-relabeling on forget set.\"\"\"\n\n def __init__(self):\n self.ref_model = None\n self.T = 2.0\n self.alpha = 1.0 # weight for forget soft relabeling\n self.beta = 0.5 # weight for retain KD\n self.gamma = 0.05 # weight for logit constraint\n\n def unlearn_step(self, model, retain_batch, forget_batch, optimizer, step, epoch):\n if self.ref_model is None:\n import copy\n self.ref_model = copy.deepcopy(model)\n self.ref_model.eval()\n for param in self.ref_model.parameters():\n param.requires_grad = False\n\n retain_x, retain_y = retain_batch\n forget_x, forget_y = forget_batch\n\n # 1. Distillation on Retain Set\n with torch.no_grad():\n ref_retain_logits = self.ref_model(retain_x)\n \n retain_logits = model(retain_x)\n \n loss_retain_kd = F.kl_div(\n F.log_softmax(retain_logits / self.T, dim=1),\n F.softmax(ref_retain_logits / self.T, dim=1),\n reduction='batchmean'\n ) * (self.T ** 2)\n \n loss_retain_ce = F.cross_entropy(retain_logits, retain_y)\n \n # 2. Soft Relabeling on Forget Set\n with torch.no_grad():\n ref_forget_logits = self.ref_model(forget_x)\n masked_logits = ref_forget_logits.clone()\n masked_logits[torch.arange(forget_x.size(0)), forget_y] = -1e4\n forget_target_probs = F.softmax(masked_logits / self.T, dim=1)\n \n eps = 0.05\n num_classes = masked_logits.size(1)\n forget_target_probs = forget_target_probs * (1 - eps) + eps / num_classes\n \n forget_logits = model(forget_x)\n \n loss_forget = F.kl_div(\n F.log_softmax(forget_logits / self.T, dim=1),\n forget_target_probs,\n reduction='batchmean'\n ) * (self.T ** 2)\n \n loss_reg = (forget_logits ** 2).mean()\n \n loss = loss_retain_ce + self.beta * loss_retain_kd + self.alpha * loss_forget + self.gamma * loss_reg\n \n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n \n return {\n \"loss\": loss.item(),\n \"loss_retain_ce\": loss_retain_ce.item(),\n \"loss_retain_kd\": loss_retain_kd.item(),\n \"loss_forget\": loss_forget.item()\n }",
4 "filename": "bench/unlearning/custom_unlearning.py",
5 "op": "replace",
6 "start_line": 9
7}
ERROR: Package 'bench' is not in allowed packages
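For the record, the composite objective this final edit defines (not applied, due to the package-path error above) is, with \tilde{q} the label-smoothed masked-teacher target and p_T the temperature-softened student distribution:

    \mathcal{L} = \mathrm{CE}(z^{\text{retain}}, y) + \beta \, T^2 \, \mathrm{KL}\big(q^{\text{ref}}_T \,\|\, p_T\big) + \alpha \, T^2 \, \mathrm{KL}\big(\tilde{q} \,\|\, p_T\big) + \gamma \, \overline{(z^{\text{forget}})^2}
    \tilde{q} = (1 - \epsilon) \, q^{\text{masked}} + \epsilon / K

with \alpha = 1.0, \beta = 0.5, \gamma = 0.05, \epsilon = 0.05, and T = 2.0 as set in __init__. The \gamma term shrinks the forget-set logits toward zero, which plausibly dampens confidence-based membership signals. The edit itself failed only because the filename dropped the pytorch-vision/ package prefix, the same error as the earlier attempt.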