optimization-parity

Tags: Optimization, pytorch-examples, rigorous codebase

Description

Optimization Parity

Research Question

Can you improve a fixed two-layer MLP's ability to learn sparse parity by designing only its initialization, training dataset, and AdamW hyperparameters?

What You Can Modify

Edit the scaffold file pytorch-examples/optimization_parity/custom_strategy.py only inside the editable block containing:

  1. init_model(model, config)
  2. make_dataset(secret, config, seed)
  3. get_optimizer_config(config)

The benchmark is evaluated on three configurations: (N=32, K=8), (N=50, K=8), and (N=64, K=8), all with W=512.
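As a minimal sketch of the third hook, `get_optimizer_config` only has to return the four AdamW hyperparameters; the `config` keys used here (e.g. `"n"`) are assumptions for illustration, not part of the documented interface:

```python
# Hypothetical sketch of the optimizer-config hook. The interface requires
# that lr, wd, beta1, and beta2 are returned; reading "n" from `config`
# is an illustrative assumption.
def get_optimizer_config(config):
    n = config.get("n", 32)  # assumed key for the input dimension N
    return {
        "lr": 1e-3 if n <= 32 else 5e-4,  # e.g. smaller steps for wider inputs
        "wd": 0.0,    # the nowd baseline beating default suggests trying no decay
        "beta1": 0.9,
        "beta2": 0.999,
    }

cfg = get_optimizer_config({"n": 64})
print(sorted(cfg))
```

Returning a plain dict keeps the hook independent of any framework object; the harness reads the four values out by name.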

Fixed Setup

  • Task: y = (sum_{i in S} x_i) mod 2 for a hidden secret subset S
  • Inputs: binary vectors x in {0,1}^N
  • Model: Linear(N, W) -> ReLU -> Linear(W, 1) -> Sigmoid
  • Optimizer type: AdamW
  • Loss: binary cross-entropy
  • Batch size: 128
  • Training budget: up to 100000 steps, reshuffling every epoch
  • Evaluation: 10 hidden secrets, 10 random epoch-orderings per secret, mean held-out test accuracy over all 100 runs
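The label rule above can be checked in a few lines of plain Python; the secret subset here is an arbitrary example for illustration, not one of the hidden evaluation secrets:

```python
# Label rule from the task: y = (sum of x_i over the secret index set S) mod 2.
def parity_label(x, secret):
    return sum(x[i] for i in secret) % 2

x = [1, 0, 1, 1, 0, 0, 1, 0]   # binary input vector, N = 8 for illustration
S = {0, 2, 3}                   # example secret subset (K = 3 here)
print(parity_label(x, S))       # 1 + 1 + 1 = 3, so the label is 1
```

Note that flipping any single bit indexed by S flips the label, which is what makes sparse parity hard for gradient-based learners.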

Interface Notes

  • init_model(...) must not depend on the hidden secret.
  • make_dataset(...) may use the provided secret and must return either (x, y) or {"x": x, "y": y}.
  • x must have shape [num_examples, N] with binary values only.
  • y must have shape [num_examples] (or [num_examples, 1]) with binary labels.
  • Training dataset size must stay <= 12_800_000 examples.
  • get_optimizer_config(...) must return lr, wd, beta1, and beta2.
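A minimal `make_dataset` sketch that satisfies these shape, value, and size constraints, assuming NumPy is available and that `config` exposes the input width under an `"n"` key (an assumption for illustration):

```python
import numpy as np

def make_dataset(secret, config, seed):
    # Hypothetical sketch: uniform random binary inputs with exact parity labels.
    n = config["n"]                       # assumed config key for N
    num_examples = 100_000                # well under the 12_800_000 cap
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(num_examples, n))   # binary values only
    y = x[:, sorted(secret)].sum(axis=1) % 2         # (sum over S) mod 2
    return x, y                           # shapes: [num_examples, n], [num_examples]

x, y = make_dataset({1, 4, 7}, {"n": 32}, seed=0)
```

Returning the tuple `(x, y)` matches the first accepted return form; `{"x": x, "y": y}` would be equally valid.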

Metric

The leaderboard metric is test_accuracy (also emitted as score), the mean test accuracy across all 100 training runs.

Code

custom_strategy.py
"""Optimization-parity scaffold for MLS-Bench.

The fixed evaluation samples hidden sparse parity functions and asks the agent
to control only:
 1. model initialization
 2. training-data generation
 3. AdamW hyperparameters
"""

from __future__ import annotations

import argparse
import json
import math
import random

Results

| Model | Type | test accuracy n32-k8 | score n32-k8 | test accuracy n50-k8 | score n50-k8 | test accuracy n64-k8 | score n64-k8 |
|---|---|---|---|---|---|---|---|
| default | baseline | 0.747 | 0.747 | 0.500 | 0.500 | 0.502 | 0.502 |
| multi_epoch | baseline | 0.506 | 0.506 | 0.500 | 0.500 | 0.504 | 0.504 |
| nowd | baseline | 0.861 | 0.861 | 0.501 | 0.501 | 0.501 | 0.501 |
| anthropic/claude-opus-4.6 | vanilla | 0.507 | 0.507 | 0.500 | 0.500 | 0.499 | 0.499 |
| deepseek-reasoner | vanilla | 0.508 | 0.508 | 0.498 | 0.498 | 0.499 | 0.499 |
| google/gemini-3.1-pro-preview | vanilla | 1.000 | 1.000 | 0.969 | 0.969 | 0.985 | 0.985 |
| openai/gpt-5.4-pro | vanilla | 0.969 | 0.969 | 0.969 | 0.969 | 0.969 | 0.969 |
| qwen3.6-plus:free | vanilla | - | - | 0.501 | 0.501 | 0.501 | 0.501 |
| anthropic/claude-opus-4.6 | agent | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| deepseek-reasoner | agent | 0.678 | 0.678 | 0.500 | 0.500 | 0.500 | 0.500 |
| google/gemini-3.1-pro-preview | agent | 1.000 | 1.000 | 0.984 | 0.984 | 0.853 | 0.853 |
| openai/gpt-5.4-pro | agent | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| qwen3.6-plus:free | agent | 0.500 | 0.500 | 0.499 | 0.499 | 0.501 | 0.501 |

Agent Conversations