Agent Conversation: google/gemini-3.1-pro-preview

Model: google/gemini-3.1-pro-preview
Total steps: 6
Messages: 22
User (Step 0)

Task: optimization-parity

Optimization Parity

Research Question

Can you improve a fixed two-layer MLP's ability to learn sparse parity by designing only its initialization, training dataset, and AdamW hyperparameters?

What You Can Modify

Edit the scaffold file pytorch-examples/optimization_parity/custom_strategy.py only inside the editable block containing:

  1. init_model(model, config)
  2. make_dataset(secret, config, seed)
  3. get_optimizer_config(config)

The benchmark is evaluated on three configurations: (N=32, K=8), (N=50, K=8), and (N=64, K=8), all with W=512.

Fixed Setup

  • Task: y = (sum_{i in S} x_i) mod 2 for a hidden secret subset S
  • Inputs: binary vectors x in {0,1}^N
  • Model: Linear(N, W) -> ReLU -> Linear(W, 1) -> Sigmoid
  • Optimizer type: AdamW
  • Loss: binary cross-entropy
  • Batch size: 128
  • Training budget: up to 100000 steps, reshuffling every epoch
  • Evaluation: 10 hidden secrets, 10 random epoch-orderings per secret, mean held-out test accuracy over all 100 runs
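The parity target above can be made concrete with a short PyTorch sketch (`parity_labels` is a hypothetical helper name for illustration; the scaffold's own helper may be named differently):

```python
import torch

def parity_labels(x: torch.Tensor, secret: tuple[int, ...]) -> torch.Tensor:
    """y = (sum of the secret coordinates of x) mod 2, as float labels for BCE."""
    idx = torch.tensor(secret, dtype=torch.long)
    return (x[:, idx].sum(dim=1) % 2).to(torch.float32)

# Example: secret S = {0, 2} over N = 4 features.
x = torch.tensor([[1, 0, 1, 1],
                  [0, 0, 1, 0]], dtype=torch.float32)
y = parity_labels(x, (0, 2))  # row 0: 1+1 -> 0; row 1: 0+1 -> 1
```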

Interface Notes

  • init_model(...) must not depend on the hidden secret.
  • make_dataset(...) may use the provided secret and must return either (x, y) or {"x": x, "y": y}.
  • x must have shape [num_examples, N] with binary values only.
  • y must have shape [num_examples] (or [num_examples, 1]) with binary labels.
  • Training dataset size must stay <= 12_800_000 examples.
  • get_optimizer_config(...) must return lr, wd, beta1, and beta2.
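A minimal strategy satisfying this contract might look like the following sketch. The hyperparameter values are placeholders, and for self-containment `n_features` is passed directly rather than through the scaffold's `TaskConfig`:

```python
import torch

def make_dataset(secret, n_features, seed, num_examples=1024):
    # Uniform random binary inputs; labels are the parity of the secret subset.
    g = torch.Generator().manual_seed(seed)
    x = torch.randint(0, 2, (num_examples, n_features), generator=g).to(torch.float32)
    y = (x[:, torch.tensor(secret, dtype=torch.long)].sum(dim=1) % 2).to(torch.float32)
    return x, y  # shapes: [num_examples, N] and [num_examples]

def get_optimizer_config(config=None):
    # The driver requires exactly these four keys.
    return {"lr": 1e-3, "wd": 1e-2, "beta1": 0.9, "beta2": 0.999}
```

This baseline is what any stronger strategy competes against: uniform random inputs give the well-known exponentially small gradient signal for sparse parity, which is why the hints below emphasize dataset design.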

Metric

The leaderboard metric is test_accuracy (also emitted as score), the mean test accuracy across all 100 training runs.

Hints

  • The dataset generator knows the sampled secret even though the agent writing the code does not.
  • Balanced or structured coverage of the secret coordinates can matter more than raw dataset size.
  • The benchmark driver already handles training, evaluation, reproducibility, and logging.
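The "balanced coverage" hint can be made concrete: enumerate all 2^K assignments of the secret coordinates and repeat them, so every parity pattern appears equally often in the training set. This is a sketch of the idea only, not the scaffold's code:

```python
import torch

K = 3  # illustrative secret size
# All 2^K binary assignments of the K secret coordinates, shape [2^K, K].
patterns = torch.cartesian_prod(*[torch.tensor([0, 1])] * K)
# Repeat so each assignment appears the same number of times.
grid = patterns.repeat(4, 1)  # shape [4 * 2^K, K]
```

Writing `grid` into the secret columns of an otherwise random binary matrix yields a dataset whose secret-coordinate marginals are exactly balanced, which removes one source of gradient noise relative to i.i.d. sampling.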

pytorch-examples/optimization_parity/custom_strategy.py [EDITABLE — lines 220–255 only]

     1: """Optimization-parity scaffold for MLS-Bench.
     2: 
     3: The fixed evaluation samples hidden sparse parity functions and asks the agent
     4: to control only:
     5:   1. model initialization
     6:   2. training-data generation
     7:   3. AdamW hyperparameters
     8: """
     9: 
    10: from __future__ import annotations
    11: 
    12: import argparse
    13: import json
    14: import math
    15: import random
    16: from dataclasses import asdict, dataclass, replace
    17: from pathlib import Path
    18: 
    19: import torch
    20: from torch import nn
    21: 
    22: 
    23: # =====================================================================
    24: # FIXED: Benchmark configuration
    25: # =====================================================================
    26: @dataclass(frozen=True)
    27: class TaskConfig:
    28:     n_features: int = 32
    29:     secret_size: int = 8
    30:     hidd
Assistant (Step 1: edit)
Tool Input
{
  "end_line": 255,
  "op": "replace",
  "content": "def init_model(model: nn.Sequential, config: TaskConfig) -> None:\n    \"\"\"Initialize the fixed two-layer MLP without using the hidden secret.\"\"\"\n    for layer in model:\n        if isinstance(layer, nn.Linear):\n            gain = nn.init.calculate_gain(\"relu\") if layer is model[0] else 1.0\n            nn.init.xavier_uniform_(layer.weight, gain=gain)\n            nn.init.zeros_(layer.bias)\n\ndef make_dataset(\n    secret: tuple[int, ...],\n    config: TaskConfig,\n    seed: int,\n) -> tuple[torch.Tensor, torch.Tensor]:\n    \"\"\"Return a dataset with perfectly balanced secret features and random noise for the rest.\"\"\"\n    generator = torch.Generator().manual_seed(seed)\n    \n    K = len(secret)\n    num_patterns = 1 << K\n    \n    # Create 102,400 examples\n    num_examples = num_patterns * 400\n    \n    x = torch.randint(\n        low=0,\n        high=2,\n        size=(num_examples, config.n_features),\n        generator=generator,\n        dtype=torch.float32,\n    )\n    \n    # Generate all secret patterns\n    patterns = torch.arange(num_patterns, dtype=torch.int64).unsqueeze(1)\n    patterns = (patterns >> torch.arange(K, dtype=torch.int64)) & 1\n    patterns = patterns.to(torch.float32)\n    patterns = patterns.repeat(num_examples // num_patterns, 1)\n    \n    secret_tensor = torch.tensor(secret, dtype=torch.long)\n    x[:, secret_tensor] = patterns\n    \n    y = parity_labels(x, secret)\n    return x, y\n\ndef get_optimizer_config(config: TaskConfig) -> dict[str, float]:\n    \"\"\"Return AdamW hyperparameters for the fixed training loop.\"\"\"\n    return {\n        \"lr\": 5e-3,\n        \"wd\": 1e-3,\n        \"beta1\": 0.9,\n        \"beta2\": 0.999,\n    }\n",
  "start_line": 220,
  "filename": "pytorch-examples/optimization_parity/custom_strategy.py",
  "after_line": 219
}
Tool Result (Step 1)
OK: Replaced lines 220..255 with 50 line(s) in pytorch-examples/optimization_parity/custom_strategy.py. Editable range: 220–269.

[Current file: pytorch-examples/optimization_parity/custom_strategy.py | editable: 220–269 | total: 553 lines]
   217| # =====================================================================
   218| # EDITABLE: init_model, make_dataset, get_optimizer_config
   219| # =====================================================================
   220  def init_model(model: nn.Sequential, config: TaskConfig) -> None:
   221      """Initialize the fixed two-layer MLP without using the hidden secret."""
   222      for layer in model:
   223          if isinstance(layer, nn.Linear):
   224              gain = nn.init.calculate_gain("relu") if layer is model[0] else 1.0
   225              nn.init.xavier_uniform_(layer.weight, gain=gain)
   226              nn.init.zeros_(layer.bias)
   227  
   228  def make_dataset(
   229      secret: tuple[int, ...],
   230      con
Assistant (Step 2: test)
Assistant (Step 1: edit)
Tool Input
{
  "op": "replace",
  "end_line": 257,
  "start_line": 220,
  "filename": "pytorch-examples/optimization_parity/custom_strategy.py",
  "content": "def test_script():\n    pass",
  "after_line": 1
}
Tool Result (Step 1)
ERROR: Lines 220..257 exceed the editable range. You may only edit lines 220–255.

[Current file: pytorch-examples/optimization_parity/custom_strategy.py | editable: 220–255 | total: 539 lines]
   217| # =====================================================================
   218| # EDITABLE: init_model, make_dataset, get_optimizer_config
   219| # =====================================================================
   220  def init_model(model: nn.Sequential, config: TaskConfig) -> None:
   221      """Initialize the fixed two-layer MLP without using the hidden secret."""
   222      for layer in model:
   223          if isinstance(layer, nn.Linear):
   224              gain = nn.init.calculate_gain("relu") if layer is model[0] else 1.0
   225              nn.init.xavier_uniform_(layer.weight, gain=gain)
   226              nn.init.zeros_(layer.bias)
   227  
   228  
   229  def make_dataset(
   230      secret: tuple[int, ...],
   231      config: TaskConfig,
   232      seed: int
Assistant (Step 2: edit)
Tool Input
{
  "op": "replace",
  "start_line": 220,
  "after_line": 219,
  "end_line": 255,
  "filename": "pytorch-examples/optimization_parity/custom_strategy.py",
  "content": "def init_model(model: nn.Sequential, config: TaskConfig) -> None:\n    \"\"\"Initialize the fixed two-layer MLP.\"\"\"\n    for layer in model:\n        if isinstance(layer, nn.Linear):\n            gain = nn.init.calculate_gain(\"relu\") if layer is model[0] else 1.0\n            nn.init.xavier_uniform_(layer.weight, gain=gain)\n            nn.init.zeros_(layer.bias)\n\ndef make_dataset(\n    secret: tuple[int, ...],\n    config: TaskConfig,\n    seed: int,\n) -> tuple[torch.Tensor, torch.Tensor]:\n    \"\"\"Perfect Cartesian product for secret bits to eliminate gradient noise.\"\"\"\n    generator = torch.Generator().manual_seed(seed)\n    K = len(secret)\n    # 3_840_000 is 30,000 steps * 128 batch_size.\n    # 3_840_000 // 256 = 15000.\n    num_patterns = 15000 \n    \n    # Random non-secret features\n    x_base = torch.randint(\n        low=0,\n        high=2,\n        size=(num_patterns, 1, config.n_features),\n        generator=generator,\n        dtype=torch.int64,\n    )\n    \n    # All combinations for secret bits\n    # It's an iterator of size 2^K x K\n    import itertools\n    combs_list = list(itertools.product([0, 1], repeat=K))\n    combs = torch.tensor(combs_list, dtype=torch.int64) # [2^K, K]\n    \n    x = x_base.repeat(1, 1 << K, 1) # [num_patterns, 2^K, N]\n    combs = combs.unsqueeze(0).repeat(num_patterns, 1, 1) # [num_patterns, 2^K, K]\n    \n    for i, sec_idx in enumerate(secret):\n        x[:, :, sec_idx] = combs[:, :, i]\n        \n    x = x.view(-1, config.n_features).to(torch.float32)\n    y = parity_labels(x, secret)\n    return x, y\n\ndef get_optimizer_config(config: TaskConfig) -> dict[str, float]:\n    return {\n        \"lr\": 2e-3,\n        \"wd\": 1e-1,\n        \"beta1\": 0.9,\n        \"beta2\": 0.999,\n    }"
}
Tool Result (Step 2)
OK: Replaced lines 220..255 with 52 line(s) in pytorch-examples/optimization_parity/custom_strategy.py. Editable range: 220–271.

[Current file: pytorch-examples/optimization_parity/custom_strategy.py | editable: 220–271 | total: 555 lines]
   217| # =====================================================================
   218| # EDITABLE: init_model, make_dataset, get_optimizer_config
   219| # =====================================================================
   220  def init_model(model: nn.Sequential, config: TaskConfig) -> None:
   221      """Initialize the fixed two-layer MLP."""
   222      for layer in model:
   223          if isinstance(layer, nn.Linear):
   224              gain = nn.init.calculate_gain("relu") if layer is model[0] else 1.0
   225              nn.init.xavier_uniform_(layer.weight, gain=gain)
   226              nn.init.zeros_(layer.bias)
   227  
   228  def make_dataset(
   229      secret: tuple[int, ...],
   230      config: TaskConfig,
   231      see
Assistant (Step 3: test)
Tool Result (Step 3)
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)

### n32-k8 (scripts/eval.sh)
[COMMAND FAILED exit=1]
TASK_CONFIG N=32 K=8 W=512 num_hidden_secrets=5 num_orderings=3 test_set_size=16384 batch_size=128 max_steps=30000
DATASET_METRICS secret=0 num_examples=3840000 positive_rate=0.500000
Traceback (most recent call last):
  File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/optimization-parity/google_gemini-3.1-pro-preview_20260405_233300/pytorch-examples/optimization_parity/custom_strategy.py", line 555, in <module>
    main()
  File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/optimization-parity/google_gemini-3.1-pro-preview_20260405_233300/pytorch-examples/optimization_parity/custom_strategy.py", line 546, in main
    summary = run_benchmark(config=config, seed=args.seed, device=device)
  File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/optimization-parity/google_gemini-3.1-pro-preview_20260405_233300/pytorch-examples/optimization_parity/custom_strategy.py"
Assistant (Step 1: edit)
Tool Input
{
  "start_line": 220,
  "filename": "pytorch-examples/optimization_parity/custom_strategy.py",
  "end_line": 255,
  "after_line": 219,
  "op": "replace",
  "content": "def init_model(model: nn.Sequential, config: TaskConfig) -> None:\n    \"\"\"Initialize the fixed two-layer MLP without using the hidden secret.\"\"\"\n    for layer in model:\n        if isinstance(layer, nn.Linear):\n            gain = nn.init.calculate_gain(\"relu\") if layer is model[0] else 1.0\n            nn.init.xavier_uniform_(layer.weight, gain=gain)\n            if layer is model[0]:\n                nn.init.normal_(layer.bias, mean=0.0, std=1.0)\n            else:\n                nn.init.zeros_(layer.bias)\n\n\ndef make_dataset(\n    secret: tuple[int, ...],\n    config: TaskConfig,\n    seed: int,\n) -> tuple[torch.Tensor, torch.Tensor]:\n    \"\"\"Return a reproducible training dataset for one hidden secret.\"\"\"\n    generator = torch.Generator().manual_seed(seed)\n    S = len(secret)\n\n    # 1. Perfectly balanced uniform subset\n    repeats = 1000  # 1000 * 256 = 256,000 examples\n    grid = torch.cartesian_prod(*[torch.tensor([0, 1])] * S)\n    grid = grid.repeat(repeats, 1)\n    \n    x_unif = torch.randint(0, 2, size=(repeats * (2**S), config.n_features), generator=generator)\n    for idx, sec_idx in enumerate(secret):\n        x_unif[:, sec_idx] = grid[:, idx]\n\n    # 2. Skewed subset to provide gradient signal (p=0.1 on secret features)\n    num_skewed = 256_000\n    x_skew = torch.randint(0, 2, size=(num_skewed, config.n_features), generator=generator)\n    skewed_vals = (torch.rand(num_skewed, S, generator=generator) < 0.1).int()\n    for idx, sec_idx in enumerate(secret):\n        x_skew[:, sec_idx] = skewed_vals[:, idx]\n\n    x = torch.cat([x_unif, x_skew], dim=0).to(torch.float32)\n    y = parity_labels(x, secret)\n    return x, y\n\n\ndef get_optimizer_config(config: TaskConfig) -> dict[str, float]:\n    \"\"\"Return AdamW hyperparameters for the fixed training loop.\"\"\"\n    return {\n        \"lr\": 2e-3,\n        \"wd\": 1e-2,\n        \"beta1\": 0.9,\n        \"beta2\": 0.999,\n    }"
}
Tool Result (Step 1)
OK: Replaced lines 220..255 with 50 line(s) in pytorch-examples/optimization_parity/custom_strategy.py. Editable range: 220–269.

[Current file: pytorch-examples/optimization_parity/custom_strategy.py | editable: 220–269 | total: 553 lines]
   217| # =====================================================================
   218| # EDITABLE: init_model, make_dataset, get_optimizer_config
   219| # =====================================================================
   220  def init_model(model: nn.Sequential, config: TaskConfig) -> None:
   221      """Initialize the fixed two-layer MLP without using the hidden secret."""
   222      for layer in model:
   223          if isinstance(layer, nn.Linear):
   224              gain = nn.init.calculate_gain("relu") if layer is model[0] else 1.0
   225              nn.init.xavier_uniform_(layer.weight, gain=gain)
   226              if layer is model[0]:
   227                  nn.init.normal_(layer.bias, mean=0.0, std=1.0)
   228              e
Assistant (Step 2: test)
Tool Result (Step 2)
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)


## Seed 42

### n32-k8 (scripts/eval.sh)
Training progress:
TRAIN_METRICS secret=4 order=2 step=7000 loss=0.024595 acc=0.998781
TRAIN_METRICS secret=4 order=2 step=7250 loss=0.022150 acc=0.999594
TRAIN_METRICS secret=4 order=2 step=7500 loss=0.019202 acc=0.999750
TRAIN_METRICS secret=4 order=2 step=7750 loss=0.016531 acc=0.999938
TRAIN_METRICS secret=4 order=2 step=8000 loss=0.014315 acc=1.000000

Recent runs:
RUN_METRICS secret=3 order=1 steps=9250 test_accuracy=1.000000
RUN_METRICS secret=3 order=2 steps=11250 test_accuracy=1.000000
RUN_METRICS secret=4 order=0 steps=10750 test_accuracy=1.000000
RUN_METRICS secret=4 order=1 steps=9750 test_accuracy=0.998901
RUN_METRICS secret=4 order=2 steps=8000 test_accuracy=1.000000

Final metrics:
FINAL_METRICS test_accuracy=0.999878 score=0.999878 test_accuracy_std=0.000297 mean_steps=11916.666667 num_runs=15

### n50-k8 (scripts/eval_n50_k8.sh)
Training progress:
TRAIN_METR
Assistant (Step 3: edit)
Tool Input
{
  "filename": "pytorch-examples/optimization_parity/custom_strategy.py",
  "end_line": 269,
  "after_line": 219,
  "content": "def init_model(model: nn.Sequential, config: TaskConfig) -> None:\n    \"\"\"Initialize the fixed two-layer MLP without using the hidden secret.\"\"\"\n    for layer in model:\n        if isinstance(layer, nn.Linear):\n            gain = nn.init.calculate_gain(\"relu\") if layer is model[0] else 1.0\n            nn.init.xavier_uniform_(layer.weight, gain=gain)\n            if layer is model[0]:\n                nn.init.normal_(layer.bias, mean=0.0, std=1.0)\n            else:\n                nn.init.zeros_(layer.bias)\n\n\ndef make_dataset(\n    secret: tuple[int, ...],\n    config: TaskConfig,\n    seed: int,\n) -> tuple[torch.Tensor, torch.Tensor]:\n    \"\"\"Return a reproducible training dataset for one hidden secret.\"\"\"\n    generator = torch.Generator().manual_seed(seed)\n    S = len(secret)\n\n    # 1. Perfectly balanced uniform subset\n    repeats = 4000  # 4000 * 256 = 1,024,000 examples\n    grid = torch.cartesian_prod(*[torch.tensor([0, 1])] * S)\n    grid = grid.repeat(repeats, 1)\n    \n    x_unif = torch.randint(0, 2, size=(repeats * (2**S), config.n_features), generator=generator)\n    for idx, sec_idx in enumerate(secret):\n        x_unif[:, sec_idx] = grid[:, idx]\n\n    # 2. Skewed subset to provide gradient signal (p=0.1 on secret features, 0 elsewhere)\n    num_skewed = 1_024_000\n    x_skew = torch.zeros((num_skewed, config.n_features), dtype=torch.int64)\n    skewed_vals = (torch.rand(num_skewed, S, generator=generator) < 0.1).int()\n    for idx, sec_idx in enumerate(secret):\n        x_skew[:, sec_idx] = skewed_vals[:, idx]\n\n    x = torch.cat([x_unif, x_skew], dim=0).to(torch.float32)\n    y = parity_labels(x, secret)\n    return x, y\n\n\ndef get_optimizer_config(config: TaskConfig) -> dict[str, float]:\n    \"\"\"Return AdamW hyperparameters for the fixed training loop.\"\"\"\n    return {\n        \"lr\": 4e-3,\n        \"wd\": 2e-2,\n        \"beta1\": 0.9,\n        \"beta2\": 0.999,\n    }",
  "start_line": 220,
  "op": "replace"
}
Tool Result (Step 3)
OK: Replaced lines 220..269 with 50 line(s) in pytorch-examples/optimization_parity/custom_strategy.py. Editable range: 220–269.

[Current file: pytorch-examples/optimization_parity/custom_strategy.py | editable: 220–269 | total: 553 lines]
   217| # =====================================================================
   218| # EDITABLE: init_model, make_dataset, get_optimizer_config
   219| # =====================================================================
   220  def init_model(model: nn.Sequential, config: TaskConfig) -> None:
   221      """Initialize the fixed two-layer MLP without using the hidden secret."""
   222      for layer in model:
   223          if isinstance(layer, nn.Linear):
   224              gain = nn.init.calculate_gain("relu") if layer is model[0] else 1.0
   225              nn.init.xavier_uniform_(layer.weight, gain=gain)
   226              if layer is model[0]:
   227                  nn.init.normal_(layer.bias, mean=0.0, std=1.0)
   228              e
Assistant (Step 4: test)
Tool Result (Step 4)
[Test #2] (1 test remaining; call submit(n=2) to submit this result)

### n32-k8 (scripts/eval.sh)
[output file not found: /srv/home/bohanlyu/MLS-Bench/logs/optimization-parity/google_gemini-3.1-pro-preview_20260405_234119/20260406_010738/group_1_0/n32-k8_s42.out]

### n50-k8 (scripts/eval_n50_k8.sh)
Training progress:
TRAIN_METRICS secret=4 order=2 step=7500 loss=0.024611 acc=0.998375
TRAIN_METRICS secret=4 order=2 step=7750 loss=0.022386 acc=0.999031
TRAIN_METRICS secret=4 order=2 step=8000 loss=0.020291 acc=0.999375
TRAIN_METRICS secret=4 order=2 step=8250 loss=0.019581 acc=0.999500
TRAIN_METRICS secret=4 order=2 step=8500 loss=0.018682 acc=0.999406

Recent runs:
RUN_METRICS secret=3 order=1 steps=9000 test_accuracy=1.000000
RUN_METRICS secret=3 order=2 steps=11750 test_accuracy=0.999451
RUN_METRICS secret=4 order=0 steps=11500 test_accuracy=0.999756
RUN_METRICS secret=4 order=1 steps=12000 test_accuracy=0.999939
RUN_METRICS secret=4 order=2 steps=8500 test_accuracy=0.998047

Final 
Assistant (Step 5: edit)
Tool Input
{
  "filename": "pytorch-examples/optimization_parity/custom_strategy.py",
  "op": "replace",
  "after_line": 219,
  "end_line": 269,
  "start_line": 220,
  "content": "def init_model(model: nn.Sequential, config: TaskConfig) -> None:\n    \"\"\"Initialize the fixed two-layer MLP without using the hidden secret.\"\"\"\n    import math\n    for layer in model:\n        if isinstance(layer, nn.Linear):\n            if layer is model[0]:\n                gain = math.sqrt(config.n_features / 32.0) * math.sqrt(2.0)\n                nn.init.xavier_uniform_(layer.weight, gain=gain)\n                nn.init.normal_(layer.bias, mean=0.0, std=1.0)\n            else:\n                nn.init.xavier_uniform_(layer.weight, gain=1.0)\n                nn.init.zeros_(layer.bias)\n\n\ndef make_dataset(\n    secret: tuple[int, ...],\n    config: TaskConfig,\n    seed: int,\n) -> tuple[torch.Tensor, torch.Tensor]:\n    \"\"\"Return a reproducible training dataset for one hidden secret.\"\"\"\n    generator = torch.Generator().manual_seed(seed)\n    S = len(secret)\n\n    # 1. Perfectly balanced uniform subset\n    repeats = 4000  # 4000 * 256 = 1,024,000 examples\n    grid = torch.cartesian_prod(*[torch.tensor([0, 1])] * S)\n    grid = grid.repeat(repeats, 1)\n    \n    x_unif = torch.randint(0, 2, size=(repeats * (2**S), config.n_features), generator=generator)\n    for idx, sec_idx in enumerate(secret):\n        x_unif[:, sec_idx] = grid[:, idx]\n\n    # 2. Skewed subset to provide gradient signal (p=0.1 on secret features, 0 elsewhere)\n    num_skewed = 2_048_000\n    x_skew = torch.zeros((num_skewed, config.n_features), dtype=torch.int64)\n    skewed_vals = (torch.rand(num_skewed, S, generator=generator) < 0.1).int()\n    for idx, sec_idx in enumerate(secret):\n        x_skew[:, sec_idx] = skewed_vals[:, idx]\n\n    x = torch.cat([x_unif, x_skew], dim=0).to(torch.float32)\n    y = parity_labels(x, secret)\n    return x, y\n\n\ndef get_optimizer_config(config: TaskConfig) -> dict[str, float]:\n    \"\"\"Return AdamW hyperparameters for the fixed training loop.\"\"\"\n    return {\n        \"lr\": 5e-3,\n        \"wd\": 2e-2,\n        \"beta1\": 0.9,\n        \"beta2\": 0.999,\n    }"
}
Tool Result (Step 5)
OK: Replaced lines 220..269 with 52 line(s) in pytorch-examples/optimization_parity/custom_strategy.py. Editable range: 220–271.

[Current file: pytorch-examples/optimization_parity/custom_strategy.py | editable: 220–271 | total: 555 lines]
   217| # =====================================================================
   218| # EDITABLE: init_model, make_dataset, get_optimizer_config
   219| # =====================================================================
   220  def init_model(model: nn.Sequential, config: TaskConfig) -> None:
   221      """Initialize the fixed two-layer MLP without using the hidden secret."""
   222      import math
   223      for layer in model:
   224          if isinstance(layer, nn.Linear):
   225              if layer is model[0]:
   226                  gain = math.sqrt(config.n_features / 32.0) * math.sqrt(2.0)
   227                  nn.init.xavier_uniform_(layer.weight, gain=gain)
   228                  nn.init.normal_(layer.bias, mean=0.0, std=1.