optimization-parity

Tags: Optimization, pytorch-examples, rigorous codebase

Description

Optimization Parity

Research Question

Can you improve a fixed two-layer MLP's ability to learn sparse parity by designing only its initialization, training dataset, and AdamW hyperparameters?

What You Can Modify

Edit the scaffold file pytorch-examples/optimization_parity/custom_strategy.py only inside the editable block containing:

  1. init_model(model, config)
  2. make_dataset(secret, config, seed)
  3. get_optimizer_config(config)

The benchmark is evaluated on three configurations: (N=32, K=8), (N=50, K=8), and (N=64, K=8), all with W=512.
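As a minimal sketch of the third hook, `get_optimizer_config` only has to return the four AdamW hyperparameters; the `config` keys used here (e.g. `"n"`) are assumptions for illustration, not part of the documented interface:

```python
# Hypothetical sketch of the optimizer-config hook. The interface requires
# that lr, wd, beta1, and beta2 are returned; reading "n" from `config`
# is an illustrative assumption.
def get_optimizer_config(config):
    n = config.get("n", 32)  # assumed key for the input dimension N
    return {
        "lr": 1e-3 if n <= 32 else 5e-4,  # e.g. smaller steps for wider inputs
        "wd": 0.0,    # the nowd baseline beating default suggests trying no decay
        "beta1": 0.9,
        "beta2": 0.999,
    }

cfg = get_optimizer_config({"n": 64})
print(sorted(cfg))
```

Returning a plain dict keeps the hook independent of any framework object; the harness reads the four values out by name.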

Fixed Setup

  • Task: y = (sum_{i in S} x_i) mod 2 for a hidden secret subset S
  • Inputs: binary vectors x in {0,1}^N
  • Model: Linear(N, W) -> ReLU -> Linear(W, 1) -> Sigmoid
  • Optimizer type: AdamW
  • Loss: binary cross-entropy
  • Batch size: 128
  • Training budget: up to 100000 steps, reshuffling every epoch
  • Evaluation: 10 hidden secrets, 10 random epoch-orderings per secret, mean held-out test accuracy over all 100 runs
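The label rule above can be checked in a few lines of plain Python; the secret subset here is an arbitrary example for illustration, not one of the hidden evaluation secrets:

```python
# Label rule from the task: y = (sum of x_i over the secret index set S) mod 2.
def parity_label(x, secret):
    return sum(x[i] for i in secret) % 2

x = [1, 0, 1, 1, 0, 0, 1, 0]   # binary input vector, N = 8 for illustration
S = {0, 2, 3}                   # example secret subset (K = 3 here)
print(parity_label(x, S))       # 1 + 1 + 1 = 3, so the label is 1
```

Note that flipping any single bit indexed by S flips the label, which is what makes sparse parity hard for gradient-based learners.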

Interface Notes

  • init_model(...) must not depend on the hidden secret.
  • make_dataset(...) may use the provided secret and must return either (x, y) or {"x": x, "y": y}.
  • x must have shape [num_examples, N] with binary values only.
  • y must have shape [num_examples] (or [num_examples, 1]) with binary labels.
  • Training dataset size must stay <= 12_800_000 examples.
  • get_optimizer_config(...) must return lr, wd, beta1, and beta2.
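A minimal `make_dataset` sketch that satisfies these shape, value, and size constraints, assuming NumPy is available and that `config` exposes the input width under an `"n"` key (an assumption for illustration):

```python
import numpy as np

def make_dataset(secret, config, seed):
    # Hypothetical sketch: uniform random binary inputs with exact parity labels.
    n = config["n"]                       # assumed config key for N
    num_examples = 100_000                # well under the 12_800_000 cap
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(num_examples, n))   # binary values only
    y = x[:, sorted(secret)].sum(axis=1) % 2         # (sum over S) mod 2
    return x, y                           # shapes: [num_examples, n], [num_examples]

x, y = make_dataset({1, 4, 7}, {"n": 32}, seed=0)
```

Returning the tuple `(x, y)` matches the first accepted return form; `{"x": x, "y": y}` would be equally valid.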

Metric

The leaderboard metric is test_accuracy (also emitted as score), the mean test accuracy across all 100 training runs.

Code

custom_strategy.py
"""Optimization-parity scaffold for MLS-Bench.

The fixed evaluation samples hidden sparse parity functions and asks the agent
to control only:
 1. model initialization
 2. training-data generation
 3. AdamW hyperparameters
"""

from __future__ import annotations

import argparse
import json
import math
import random

Results

| Model | Type | test accuracy n32-k8 | score n32-k8 | test accuracy n50-k8 | score n50-k8 | test accuracy n64-k8 | score n64-k8 |
|---|---|---|---|---|---|---|---|
| default | baseline | 0.747 | 0.747 | 0.500 | 0.500 | 0.502 | 0.502 |
| multi_epoch | baseline | 0.506 | 0.506 | 0.500 | 0.500 | 0.504 | 0.504 |
| nowd | baseline | 0.861 | 0.861 | 0.501 | 0.501 | 0.501 | 0.501 |
| anthropic/claude-opus-4.6 | vanilla | 0.507 | 0.507 | 0.500 | 0.500 | 0.499 | 0.499 |
| deepseek-reasoner | vanilla | 0.508 | 0.508 | 0.498 | 0.498 | 0.499 | 0.499 |
| google/gemini-3.1-pro-preview | vanilla | 1.000 | 1.000 | 0.969 | 0.969 | 0.985 | 0.985 |
| openai/gpt-5.4-pro | vanilla | 0.969 | 0.969 | 0.969 | 0.969 | 0.969 | 0.969 |
| qwen3.6-plus:free | vanilla | - | - | 0.501 | 0.501 | 0.501 | 0.501 |
| anthropic/claude-opus-4.6 | agent | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| deepseek-reasoner | agent | 0.678 | 0.678 | 0.500 | 0.500 | 0.500 | 0.500 |
| google/gemini-3.1-pro-preview | agent | 1.000 | 1.000 | 0.984 | 0.984 | 0.853 | 0.853 |
| openai/gpt-5.4-pro | agent | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| qwen3.6-plus:free | agent | 0.500 | 0.500 | 0.499 | 0.499 | 0.501 | 0.501 |

Agent Conversations