optimization-bilevel
Description
Optimization Bilevel
Research Question
Can you improve a fixed bilevel-optimization benchmark based on Shen and Chen's penalty-based bilevel gradient descent experiments by selecting a better supported method and tuning only paper-style strategy hyperparameters?
What You Can Modify
Edit only penalized-bilevel-gradient-descent/mlsbench/custom_strategy.py inside the editable block containing:
- `get_toy_strategy()`
- `get_hyperclean_strategy(net)`
These functions may only choose among the supported methods already implemented in the fixed driver:
- Toy mode: `v_pbgd`, `g_pbgd`
- Data hyper-cleaning mode: `v_pbgd`, `g_pbgd`, `rhg`, `t_rhg`
You should only change strategy-level choices already present in the paper/codebase, such as:
- method selection
- learning rates
- penalty schedule (`gamma_init`, `gamma_max`, `gamma_argmax_step`)
- inner / outer iteration counts
- RHG truncation depth (`K`) and inner-loop length (`T`)
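As a rough illustration of the penalty-schedule knobs, a linear ramp that reaches `gamma_max` at `gamma_argmax_step` is one natural reading of these names; the exact shape the driver uses, and the default values below, are assumptions for this sketch:

```python
def gamma_schedule(step, gamma_init=0.1, gamma_max=10.0, gamma_argmax_step=1000):
    """Linearly ramp the penalty weight from gamma_init to gamma_max.

    After gamma_argmax_step outer steps the schedule stays at gamma_max.
    The linear shape and the default values are illustrative assumptions.
    """
    if step >= gamma_argmax_step:
        return gamma_max
    frac = step / gamma_argmax_step
    return gamma_init + frac * (gamma_max - gamma_init)
```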
Do not rewrite the driver, dataset split, pollution protocol, metrics, or model architectures.
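To make the editable surface concrete, here is a hypothetical sketch of the two hooks in `custom_strategy.py`. The return format (a plain dict of hyperparameters) and every key name are assumptions; only the method names come from the supported-method lists above:

```python
# Hypothetical sketch of the two editable hooks. The actual keys the
# fixed driver expects may differ; only the method names
# (v_pbgd / g_pbgd / rhg / t_rhg) are taken from the task description.

def get_toy_strategy():
    return {
        "method": "v_pbgd",          # toy mode supports v_pbgd or g_pbgd
        "lr": 0.01,
        "gamma_init": 0.1,
        "gamma_max": 10.0,
        "gamma_argmax_step": 1000,
    }

def get_hyperclean_strategy(net):
    # net identifies the model (linear classifier or 2-layer MLP);
    # a strategy may branch on it to tune each model separately.
    return {
        "method": "v_pbgd",          # or g_pbgd / rhg / t_rhg
        "outer_lr": 0.01,
        "inner_lr": 0.1,
        "K": 10,                     # RHG truncation depth (t_rhg only)
        "T": 50,                     # inner-loop length
    }
```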
Fixed Setup
Toy / Numerical Verification
- Problem definition follows Section 5.1 / 6.1 of the paper
- `x` is projected to `[0, 3]`
- 1000 random initial points are sampled as in the official toy script
- Primary metric: `convergence_steps`
- Secondary metrics: `success_rate`, `final_residual`, `runtime_sec`
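One plausible way these toy metrics aggregate over the 1000 runs is sketched below; the convergence tolerance, the step cap, and the averaging convention are assumptions, not the driver's actual values:

```python
def toy_metrics(residual_trajectories, tol=1e-2, max_steps=20000):
    """Aggregate per-run residual histories into the toy-mode metrics.

    A run "converges" at the first step its residual drops below tol;
    runs that never do are charged max_steps. Both tol and max_steps
    are illustrative assumptions.
    """
    steps, successes = [], 0
    for residuals in residual_trajectories:
        hit = next((i for i, r in enumerate(residuals) if r < tol), None)
        if hit is None:
            steps.append(max_steps)
        else:
            steps.append(hit)
            successes += 1
    n = len(residual_trajectories)
    return {
        "convergence_steps": sum(steps) / n,
        "success_rate": successes / n,
        "final_residual": sum(r[-1] for r in residual_trajectories) / n,
    }
```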
Data Hyper-Cleaning
- MNIST split: 5000 train / 5000 validation / 10000 test
- Pollution rate: 50%
- Pollution logic follows the released official code
- Models: linear classifier and 2-layer MLP (`784 -> 300 -> 10`, sigmoid hidden layer)
- Primary metric: `test_accuracy`
- Secondary metrics: `f1_score`, cleaner precision / recall, runtime to best accuracy
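The cleaner precision / recall metrics score how well the learned example weights identify the uncorrupted half of the training set. A minimal sketch, assuming weights are binarized at a threshold and "positive" means "predicted clean" (both assumptions about the driver's convention):

```python
def cleaner_metrics(pollution_flags, clean_weights, threshold=0.5):
    """Precision / recall / F1 of the cleaner at spotting clean examples.

    pollution_flags[i] is True if example i's label was corrupted;
    clean_weights[i] is the cleaner's learned weight for example i.
    Binarizing at `threshold` is an illustrative assumption.
    """
    tp = fp = fn = 0
    for polluted, w in zip(pollution_flags, clean_weights):
        predicted_clean = w > threshold
        actually_clean = not polluted
        if predicted_clean and actually_clean:
            tp += 1
        elif predicted_clean:
            fp += 1
        elif actually_clean:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```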
Reference Files
The following official source files are provided read-only for fidelity:
- penalized-bilevel-gradient-descent/V-PBGD/toy/toy.py
- penalized-bilevel-gradient-descent/V-PBGD/data-hyper-cleaning/data_hyper_clean.py
- penalized-bilevel-gradient-descent/G-PBGD/data_hyper_clean_gpbgd.py
- penalized-bilevel-gradient-descent/RHG/data_hyper_clean_rhg.py
- penalized-bilevel-gradient-descent/RHG/hypergrad/hypergradients.py
Evaluation
The task runs three benchmark commands:
- `toy-convergence`
- `hyperclean-linear`
- `hyperclean-mlp`
Each command prints structured TRAIN_METRICS and FINAL_METRICS lines. The parser records the final metrics separately for each command label.
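The recording step could look like the sketch below. The exact line format is not specified here, so a `FINAL_METRICS <label> <json>` layout is assumed purely for illustration:

```python
import json
import re

# Hypothetical line format: "FINAL_METRICS <label> <json-payload>".
# The real driver's format may differ; this only illustrates keeping
# the final metrics separate for each command label.
FINAL_RE = re.compile(r"^FINAL_METRICS\s+(\S+)\s+(\{.*\})$")

def parse_final_metrics(lines):
    results = {}
    for line in lines:
        m = FINAL_RE.match(line.strip())
        if m:
            label, payload = m.group(1), json.loads(m.group(2))
            results[label] = payload  # later lines overwrite earlier ones
    return results
```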
Code
```python
"""Optimization-bilevel scaffold for MLS-Bench.

The fixed driver reproduces the numerical verification and data hyper-cleaning
experiments from Shen and Chen, "On Penalty-based Bilevel Gradient Descent
Method" (ICML 2023 / Mathematical Programming 2025) while exposing only the
method choice and official hyperparameters as editable strategy hooks.
"""

from __future__ import annotations

import argparse
import json
import math
import os
import random
```
Additional context files (read-only):
penalized-bilevel-gradient-descent/RHG/hypergrad/hypergradients.py
Results
| Model | Type | convergence steps (toy) ↓ | final residual (toy) ↓ | final projected grad (toy) ↓ | success rate (toy) ↑ | test accuracy (linear) ↑ | f1 score (linear) ↑ | cleaner precision (linear) ↑ | cleaner recall (linear) ↑ | test accuracy (mlp) ↑ | f1 score (mlp) ↑ | cleaner precision (mlp) ↑ | cleaner recall (mlp) ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| g_pbgd | baseline | 303.686 | 0.081 | 0.000 | 1.000 | 89.837 | 80.629 | 0.839 | 0.776 | 92.380 | 90.820 | 0.889 | 0.928 |
| rhg | baseline | 260.712 | 0.030 | 0.000 | 1.000 | 84.633 | 89.547 | 0.832 | 0.969 | 84.790 | 89.339 | 0.822 | 0.979 |
| t_rhg | baseline | 260.712 | 0.030 | 0.000 | 1.000 | 84.613 | 89.059 | 0.828 | 0.964 | 84.790 | 89.348 | 0.824 | 0.976 |
| v_pbgd | baseline | 260.712 | 0.030 | 0.000 | 1.000 | 90.097 | 91.722 | 0.885 | 0.952 | 91.480 | 92.050 | 0.887 | 0.956 |
| anthropic/claude-opus-4.6 | vanilla | 147.298 | 0.033 | 0.000 | 1.000 | - | - | - | - | 92.410 | 91.030 | 0.890 | 0.932 |
| deepseek-reasoner | vanilla | 261.256 | 0.030 | 0.000 | 1.000 | 90.080 | 91.811 | 0.884 | 0.955 | 91.480 | 92.050 | 0.887 | 0.956 |
| google/gemini-3.1-pro-preview | vanilla | 147.298 | 0.033 | 0.000 | 1.000 | 90.090 | 91.771 | 0.882 | 0.957 | 92.190 | 90.852 | 0.888 | 0.930 |
| openai/gpt-5.4-pro | vanilla | 261.256 | 0.030 | 0.000 | 1.000 | 90.080 | 91.811 | 0.884 | 0.955 | 92.380 | 90.820 | 0.889 | 0.928 |
| qwen3.6-plus:free | vanilla | 20000.000 | 0.547 | 16.683 | 0.000 | 89.500 | 60.473 | 0.831 | 0.475 | 91.450 | 91.103 | 0.886 | 0.938 |
| anthropic/claude-opus-4.6 | agent | 147.298 | 0.033 | 0.000 | 1.000 | 90.100 | 91.782 | 0.883 | 0.956 | 92.640 | 91.738 | 0.890 | 0.946 |
| deepseek-reasoner | agent | 7363.511 | 0.151 | 3.812 | 0.634 | 89.580 | 91.393 | 0.865 | 0.968 | 91.370 | 91.223 | 0.886 | 0.940 |
| google/gemini-3.1-pro-preview | agent | 3374.788 | 0.142 | 3.663 | 0.835 | - | - | - | - | 93.190 | 89.631 | 0.890 | 0.902 |
| openai/gpt-5.4-pro | agent | 261.256 | 0.030 | 0.000 | 1.000 | 90.080 | 91.811 | 0.884 | 0.955 | 92.550 | 91.154 | 0.890 | 0.934 |
| qwen3.6-plus:free | agent | 261.256 | 0.030 | 0.000 | 1.000 | - | - | - | - | 92.380 | 90.820 | 0.889 | 0.928 |
| qwen3.6-plus:free | agent | 129.169 | 0.067 | 0.000 | 1.000 | 90.080 | 91.811 | 0.884 | 0.955 | 92.380 | 90.820 | 0.889 | 0.928 |