# Optimization Parity

## Description
## Research Question
Can you improve a fixed two-layer MLP's ability to learn sparse parity by designing only its initialization, training dataset, and AdamW hyperparameters?
## What You Can Modify
Edit the scaffold file `pytorch-examples/optimization_parity/custom_strategy.py`, only inside the editable block containing:

- `init_model(model, config)`
- `make_dataset(secret, config, seed)`
- `get_optimizer_config(config)`
The benchmark is evaluated on three configurations: (N=32, K=8), (N=50, K=8), and (N=64, K=8), all with W=512.
## Fixed Setup

- Task: `y = (sum_{i in S} x_i) mod 2` for a hidden secret subset `S`
- Inputs: binary vectors `x in {0,1}^N`
- Model: `Linear(N, W) -> ReLU -> Linear(W, 1) -> Sigmoid`
- Optimizer type: AdamW
- Loss: binary cross-entropy
- Batch size: 128
- Training budget: up to 100,000 steps, reshuffling every epoch
- Evaluation: 10 hidden secrets, 10 random epoch-orderings per secret, mean held-out test accuracy over all 100 runs
## Interface Notes

- `init_model(...)` must not depend on the hidden secret.
- `make_dataset(...)` may use the provided secret and must return either `(x, y)` or `{"x": x, "y": y}`.
- `x` must have shape `[num_examples, N]` with binary values only.
- `y` must have shape `[num_examples]` (or `[num_examples, 1]`) with binary labels.
- Training dataset size must stay `<= 12_800_000` examples.
- `get_optimizer_config(...)` must return `lr`, `wd`, `beta1`, and `beta2`.
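A minimal sketch of the three hooks, assuming a PyTorch harness. The `config.n` attribute, dataset size, initialization scheme, and hyperparameter values below are illustrative assumptions, not the benchmark's defaults or a winning strategy:

```python
import torch

def init_model(model, config):
    # Must not depend on the hidden secret. As an example, re-initialize
    # every weight matrix with a 1/sqrt(fan_in) Gaussian.
    for p in model.parameters():
        if p.dim() == 2:
            torch.nn.init.normal_(p, std=1.0 / p.shape[1] ** 0.5)

def make_dataset(secret, config, seed):
    # May use the secret; labels follow the fixed parity rule.
    g = torch.Generator().manual_seed(seed)
    n = getattr(config, "n", 32)  # input dimension N (assumed attribute name)
    x = torch.randint(0, 2, (100_000, n), generator=g).float()
    y = x[:, list(secret)].sum(dim=1) % 2
    return x, y  # shapes: [num_examples, N], [num_examples]

def get_optimizer_config(config):
    # Must return exactly these four AdamW keys.
    return {"lr": 1e-3, "wd": 0.01, "beta1": 0.9, "beta2": 0.999}
```

The returned dataset respects the binary-values and shape constraints above; the `<= 12_800_000`-example budget leaves substantial room for larger or curriculum-style datasets.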
## Metric

The leaderboard metric is `test_accuracy` (also emitted as `score`), the mean test accuracy across all 100 training runs.
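Concretely, the reported number is an unweighted mean over the 10 secrets x 10 orderings = 100 runs (the per-run accuracies below are made up for illustration):

```python
# Each entry is one run's held-out test accuracy; the metric is their mean.
run_accuracies = [0.96, 1.00, 0.94, 0.98]  # illustrative values only
score = sum(run_accuracies) / len(run_accuracies)
```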
## Code

`custom_strategy.py` (excerpt):

```python
"""Optimization-parity scaffold for MLS-Bench.

The fixed evaluation samples hidden sparse parity functions and asks the agent
to control only:
1. model initialization
2. training-data generation
3. AdamW hyperparameters
"""

from __future__ import annotations

import argparse
import json
import math
import random
```
## Results
| Model | Type | test accuracy n32-k8 ↑ | score n32-k8 ↑ | test accuracy n50-k8 ↑ | score n50-k8 ↑ | test accuracy n64-k8 ↑ | score n64-k8 ↑ |
|---|---|---|---|---|---|---|---|
| default | baseline | 0.747 | 0.747 | 0.500 | 0.500 | 0.502 | 0.502 |
| multi_epoch | baseline | 0.506 | 0.506 | 0.500 | 0.500 | 0.504 | 0.504 |
| nowd | baseline | 0.861 | 0.861 | 0.501 | 0.501 | 0.501 | 0.501 |
| anthropic/claude-opus-4.6 | vanilla | 0.507 | 0.507 | 0.500 | 0.500 | 0.499 | 0.499 |
| deepseek-reasoner | vanilla | 0.508 | 0.508 | 0.498 | 0.498 | 0.499 | 0.499 |
| google/gemini-3.1-pro-preview | vanilla | 1.000 | 1.000 | 0.969 | 0.969 | 0.985 | 0.985 |
| openai/gpt-5.4-pro | vanilla | 0.969 | 0.969 | 0.969 | 0.969 | 0.969 | 0.969 |
| qwen3.6-plus:free | vanilla | - | - | 0.501 | 0.501 | 0.501 | 0.501 |
| anthropic/claude-opus-4.6 | agent | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| deepseek-reasoner | agent | 0.678 | 0.678 | 0.500 | 0.500 | 0.500 | 0.500 |
| google/gemini-3.1-pro-preview | agent | 1.000 | 1.000 | 0.984 | 0.984 | 0.853 | 0.853 |
| openai/gpt-5.4-pro | agent | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| qwen3.6-plus:free | agent | 0.500 | 0.500 | 0.499 | 0.499 | 0.501 | 0.501 |