optimization-bilevel
Description
Optimization Bilevel
Research Question
Can you improve a fixed bilevel-optimization benchmark based on Shen and Chen's penalty-based bilevel gradient descent experiments by selecting a better supported method and tuning only paper-style strategy hyperparameters?
What You Can Modify
Edit only penalized-bilevel-gradient-descent/mlsbench/custom_strategy.py inside the editable block containing:
- `get_toy_strategy()`
- `get_hyperclean_strategy(net)`
These functions may only choose among the supported methods already implemented in the fixed driver:
- Toy mode: `v_pbgd`, `g_pbgd`
- Data hyper-cleaning mode: `v_pbgd`, `g_pbgd`, `rhg`, `t_rhg`
You should only change strategy-level choices already present in the paper/codebase, such as:
- method selection
- learning rates
- penalty schedule (`gamma_init`, `gamma_max`, `gamma_argmax_step`)
- inner / outer iteration counts
- RHG truncation depth (`K`) and inner-loop length (`T`)
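As a rough illustration of the penalty-schedule knobs, a linear ramp that reaches `gamma_max` at `gamma_argmax_step` is one natural reading of these names; the exact shape the driver uses, and the default values below, are assumptions for this sketch:

```python
def gamma_schedule(step, gamma_init=0.1, gamma_max=10.0, gamma_argmax_step=1000):
    """Linearly ramp the penalty weight from gamma_init to gamma_max.

    After gamma_argmax_step outer steps the schedule stays at gamma_max.
    The linear shape and the default values are illustrative assumptions.
    """
    if step >= gamma_argmax_step:
        return gamma_max
    frac = step / gamma_argmax_step
    return gamma_init + frac * (gamma_max - gamma_init)
```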
Do not rewrite the driver, dataset split, pollution protocol, metrics, or model architectures.
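To make the editable surface concrete, here is a hypothetical sketch of the two hooks in `custom_strategy.py`. The return format (a plain dict of hyperparameters) and every key name are assumptions; only the method names come from the supported-method lists above:

```python
# Hypothetical sketch of the two editable hooks. The actual keys the
# fixed driver expects may differ; only the method names
# (v_pbgd / g_pbgd / rhg / t_rhg) are taken from the task description.

def get_toy_strategy():
    return {
        "method": "v_pbgd",          # toy mode supports v_pbgd or g_pbgd
        "lr": 0.01,
        "gamma_init": 0.1,
        "gamma_max": 10.0,
        "gamma_argmax_step": 1000,
    }

def get_hyperclean_strategy(net):
    # net identifies the model (linear classifier or 2-layer MLP);
    # a strategy may branch on it to tune each model separately.
    return {
        "method": "v_pbgd",          # or g_pbgd / rhg / t_rhg
        "outer_lr": 0.01,
        "inner_lr": 0.1,
        "K": 10,                     # RHG truncation depth (t_rhg only)
        "T": 50,                     # inner-loop length
    }
```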
Fixed Setup
Toy / Numerical Verification
- Problem definition follows Section 5.1 / 6.1 of the paper
- `x` is projected to `[0, 3]`
- 1000 random initial points are sampled as in the official toy script
- Primary metric: `convergence_steps`
- Secondary metrics: `success_rate`, `final_residual`, `runtime_sec`
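One plausible way these toy metrics aggregate over the 1000 runs is sketched below; the convergence tolerance, the step cap, and the averaging convention are assumptions, not the driver's actual values:

```python
def toy_metrics(residual_trajectories, tol=1e-2, max_steps=20000):
    """Aggregate per-run residual histories into the toy-mode metrics.

    A run "converges" at the first step its residual drops below tol;
    runs that never do are charged max_steps. Both tol and max_steps
    are illustrative assumptions.
    """
    steps, successes = [], 0
    for residuals in residual_trajectories:
        hit = next((i for i, r in enumerate(residuals) if r < tol), None)
        if hit is None:
            steps.append(max_steps)
        else:
            steps.append(hit)
            successes += 1
    n = len(residual_trajectories)
    return {
        "convergence_steps": sum(steps) / n,
        "success_rate": successes / n,
        "final_residual": sum(r[-1] for r in residual_trajectories) / n,
    }
```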
Data Hyper-Cleaning
- MNIST split: 5000 train / 5000 validation / 10000 test
- Pollution rate: 50%
- Pollution logic follows the released official code
- Models: linear classifier and 2-layer MLP (`784 -> 300 -> 10`, sigmoid hidden layer)
- Primary metric: `test_accuracy`
- Secondary metrics: `f1_score`, cleaner precision / recall, runtime to best accuracy
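The cleaner precision / recall metrics score how well the learned example weights identify the uncorrupted half of the training set. A minimal sketch, assuming weights are binarized at a threshold and "positive" means "predicted clean" (both assumptions about the driver's convention):

```python
def cleaner_metrics(pollution_flags, clean_weights, threshold=0.5):
    """Precision / recall / F1 of the cleaner at spotting clean examples.

    pollution_flags[i] is True if example i's label was corrupted;
    clean_weights[i] is the cleaner's learned weight for example i.
    Binarizing at `threshold` is an illustrative assumption.
    """
    tp = fp = fn = 0
    for polluted, w in zip(pollution_flags, clean_weights):
        predicted_clean = w > threshold
        actually_clean = not polluted
        if predicted_clean and actually_clean:
            tp += 1
        elif predicted_clean:
            fp += 1
        elif actually_clean:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```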
Reference Files
The following official source files are provided read-only for fidelity:
- penalized-bilevel-gradient-descent/V-PBGD/toy/toy.py
- penalized-bilevel-gradient-descent/V-PBGD/data-hyper-cleaning/data_hyper_clean.py
- penalized-bilevel-gradient-descent/G-PBGD/data_hyper_clean_gpbgd.py
- penalized-bilevel-gradient-descent/RHG/data_hyper_clean_rhg.py
- penalized-bilevel-gradient-descent/RHG/hypergrad/hypergradients.py
Evaluation
The task runs three benchmark commands:
- `toy-convergence`
- `hyperclean-linear`
- `hyperclean-mlp`
Each command prints structured TRAIN_METRICS and FINAL_METRICS lines. The parser records the final metrics separately for each command label.
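The recording step could look like the sketch below. The exact line format is not specified here, so a `FINAL_METRICS <label> <json>` layout is assumed purely for illustration:

```python
import json
import re

# Hypothetical line format: "FINAL_METRICS <label> <json-payload>".
# The real driver's format may differ; this only illustrates keeping
# the final metrics separate for each command label.
FINAL_RE = re.compile(r"^FINAL_METRICS\s+(\S+)\s+(\{.*\})$")

def parse_final_metrics(lines):
    results = {}
    for line in lines:
        m = FINAL_RE.match(line.strip())
        if m:
            label, payload = m.group(1), json.loads(m.group(2))
            results[label] = payload  # later lines overwrite earlier ones
    return results
```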
Code
```python
"""Optimization-bilevel scaffold for MLS-Bench.

The fixed driver reproduces the numerical verification and data hyper-cleaning
experiments from Shen and Chen, "On Penalty-based Bilevel Gradient Descent
Method" (ICML 2023 / Mathematical Programming 2025) while exposing only the
method choice and official hyperparameters as editable strategy hooks.
"""

from __future__ import annotations

import argparse
import json
import math
import os
import random
```
Additional context files (read-only):
penalized-bilevel-gradient-descent/RHG/hypergrad/hypergradients.py
Results
| Model | Type | convergence steps (toy) ↓ | final residual (toy) ↓ | final projected grad (toy) ↓ | success rate (toy) ↑ | test accuracy (linear) ↑ | f1 score (linear) ↑ | cleaner precision (linear) ↑ | cleaner recall (linear) ↑ | test accuracy (mlp) ↑ | f1 score (mlp) ↑ | cleaner precision (mlp) ↑ | cleaner recall (mlp) ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| g_pbgd | baseline | 303.686 | 0.081 | 0.000 | 1.000 | 89.837 | 80.629 | 0.839 | 0.776 | 92.380 | 90.820 | 0.889 | 0.928 |
| rhg | baseline | 260.712 | 0.030 | 0.000 | 1.000 | 84.633 | 89.547 | 0.832 | 0.969 | 84.790 | 89.339 | 0.822 | 0.979 |
| t_rhg | baseline | 260.712 | 0.030 | 0.000 | 1.000 | 84.613 | 89.059 | 0.828 | 0.964 | 84.790 | 89.348 | 0.824 | 0.976 |
| v_pbgd | baseline | 260.712 | 0.030 | 0.000 | 1.000 | 90.097 | 91.722 | 0.885 | 0.952 | 91.480 | 92.050 | 0.887 | 0.956 |
| anthropic/claude-opus-4.6 | vanilla | 147.298 | 0.033 | 0.000 | 1.000 | - | - | - | - | 92.410 | 91.030 | 0.890 | 0.932 |
| deepseek-reasoner | vanilla | 261.256 | 0.030 | 0.000 | 1.000 | 90.080 | 91.811 | 0.884 | 0.955 | 91.480 | 92.050 | 0.887 | 0.956 |
| google/gemini-3.1-pro-preview | vanilla | 147.298 | 0.033 | 0.000 | 1.000 | 90.090 | 91.771 | 0.882 | 0.957 | 92.190 | 90.852 | 0.888 | 0.930 |
| openai/gpt-5.4-pro | vanilla | 261.256 | 0.030 | 0.000 | 1.000 | 90.080 | 91.811 | 0.884 | 0.955 | 92.380 | 90.820 | 0.889 | 0.928 |
| qwen3.6-plus:free | vanilla | 20000.000 | 0.547 | 16.683 | 0.000 | 89.500 | 60.473 | 0.831 | 0.475 | 91.450 | 91.103 | 0.886 | 0.938 |
| anthropic/claude-opus-4.6 | agent | 147.298 | 0.033 | 0.000 | 1.000 | 90.100 | 91.782 | 0.883 | 0.956 | 92.640 | 91.738 | 0.890 | 0.946 |
| deepseek-reasoner | agent | 7363.511 | 0.151 | 3.812 | 0.634 | 89.580 | 91.393 | 0.865 | 0.968 | 91.370 | 91.223 | 0.886 | 0.940 |
| google/gemini-3.1-pro-preview | agent | 3374.788 | 0.142 | 3.663 | 0.835 | - | - | - | - | 93.190 | 89.631 | 0.890 | 0.902 |
| openai/gpt-5.4-pro | agent | 261.256 | 0.030 | 0.000 | 1.000 | 90.080 | 91.811 | 0.884 | 0.955 | 92.550 | 91.154 | 0.890 | 0.934 |
| qwen3.6-plus:free | agent | 261.256 | 0.030 | 0.000 | 1.000 | - | - | - | - | 92.380 | 90.820 | 0.889 | 0.928 |
| qwen3.6-plus:free | agent | 129.169 | 0.067 | 0.000 | 1.000 | 90.080 | 91.811 | 0.884 | 0.955 | 92.380 | 90.820 | 0.889 | 0.928 |