dl-weight-initialization
Description
DL Weight Initialization Strategy Design
Research Question
Design a novel weight initialization strategy for deep convolutional neural networks that improves convergence speed and final test accuracy across different architectures and datasets.
Background
Weight initialization is fundamental to training deep neural networks. Poor initialization leads to vanishing/exploding gradients, slow convergence, or suboptimal generalization. Classic methods include:
- Kaiming/He (2015): Accounts for the ReLU nonlinearity; weights drawn from N(0, 2/fan_out), i.e. std = sqrt(2/fan_out)
- Orthogonal (2014): Preserves gradient norms via orthogonal matrices
- Fixup (2019): Scales the convolutions inside each residual branch by L^(-1/2), where L is the number of residual blocks, to control variance accumulation across depth; the variant used as a baseline here also zero-initializes the last BN weight in each block so residual branches start near identity
However, these methods each address only one aspect of initialization. There is room to design strategies that jointly account for residual connections, batch normalization's re-scaling effect, depth-dependent scaling, and the interaction between different layer types (convolution vs classifier).
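As a concrete reference point, the three classic schemes above can each be applied to a single convolution in a few lines of PyTorch. This is an illustrative sketch, not the benchmark's code; the layer shape and the block count `L` are made-up values.

```python
import torch
import torch.nn as nn

# Illustrative layer; shapes are arbitrary, not from the benchmark.
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)

# Kaiming/He: variance 2/fan compensates for ReLU zeroing half the activations.
nn.init.kaiming_normal_(conv.weight, mode="fan_in", nonlinearity="relu")

# Orthogonal: torch flattens the 4-D kernel to a matrix, orthogonalizes it,
# and reshapes back, approximately preserving norms through the layer.
nn.init.orthogonal_(conv.weight)

# Fixup-style depth scaling: shrink a residual-branch conv by L^(-1/2).
L = 9  # assumed number of residual blocks per stage (e.g. ResNet-56)
nn.init.kaiming_normal_(conv.weight, mode="fan_in", nonlinearity="relu")
with torch.no_grad():
    conv.weight.mul_(L ** -0.5)
```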
What You Can Modify
The `initialize_weights(model, config)` function (lines 147-180) in `custom_init.py`. This function receives the fully constructed model and a config dict, and must initialize all parameters.
You can modify:
- How `nn.Conv2d` weights are initialized (distribution, fan-in/fan-out, gain)
- How `nn.BatchNorm2d` parameters (weight/bias) are initialized
- How `nn.Linear` weights and biases are initialized
- Per-layer or depth-dependent scaling strategies
- Special handling for residual shortcut projections vs main-path convolutions
- Any data-independent initialization logic (no training data access)
The config dict provides: arch (str), num_classes (int), depth (int = number of Conv2d + Linear layers). You can also iterate over model.named_modules() or model.named_parameters().
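The function signature and the config keys above are enough to sketch a submission. The following is a hypothetical example only: the per-module rules and the `depth ** -0.25` damping factor are illustrative assumptions, not a recommended strategy.

```python
import math
import torch
import torch.nn as nn

def initialize_weights(model: nn.Module, config: dict) -> None:
    """Hypothetical depth-aware initialization sketch.

    The signature and config keys ('arch', 'num_classes', 'depth') follow the
    task description; the scaling rule itself is an assumption for illustration.
    """
    depth = config.get("depth", 1)
    for name, m in model.named_modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
            with torch.no_grad():
                # Assumed depth-dependent damping to curb variance growth.
                m.weight.mul_(depth ** -0.25)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.ones_(m.weight)
            nn.init.zeros_(m.bias)
        elif isinstance(m, nn.Linear):
            # Smaller classifier scale so initial logits stay flat across classes.
            nn.init.normal_(m.weight, std=1.0 / math.sqrt(m.in_features))
            nn.init.zeros_(m.bias)
```

Usage would mirror the benchmark's call site, e.g. `initialize_weights(model, {"arch": "resnet56", "num_classes": 100, "depth": 55})`.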
Evaluation
- Metric: Best test accuracy (%, higher is better)
- Architectures & datasets:
- ResNet-56 on CIFAR-100 (deep residual, 100 classes)
- VGG-16-BN on CIFAR-100 (deep non-residual with BatchNorm, 100 classes)
- MobileNetV2 on FashionMNIST (lightweight inverted-residual, 10 classes) — hidden, evaluated on final submission only
- Training: SGD (lr=0.1, momentum=0.9, wd=5e-4), cosine annealing, 200 epochs
- Data augmentation: RandomCrop(32, pad=4) + RandomHorizontalFlip
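The fixed training recipe above maps directly onto standard PyTorch objects. A minimal sketch, using a stand-in module in place of the real network (the augmentation transforms would come from torchvision and are omitted to keep this self-contained):

```python
import torch

# Stand-in for the real network; only the optimizer/scheduler wiring is shown.
model = torch.nn.Linear(10, 10)

# SGD with the stated hyperparameters: lr=0.1, momentum=0.9, wd=5e-4.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4
)

# Cosine annealing over the full 200-epoch budget (lr decays 0.1 -> 0).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
```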
Code
```python
"""CV Weight Initialization Benchmark.

Train vision models (ResNet, VGG, MobileNetV2) on CIFAR-10/100/FashionMNIST to evaluate
weight initialization strategies.

FIXED: Model architectures, data pipeline, training loop.
EDITABLE: initialize_weights() function.

Usage:
    python custom_init.py --arch resnet20 --dataset cifar10 --seed 42
"""

import argparse
import math
import os
```
Results
| Model | Type | test acc resnet56-cifar100 ↑ | test acc vgg16bn-cifar100 ↑ | test acc mobilenetv2-fmnist ↑ |
|---|---|---|---|---|
| fixup | baseline | 72.370 | 74.370 | 94.480 |
| kaiming_normal | baseline | 72.070 | 73.380 | 94.490 |
| orthogonal | baseline | 72.080 | 72.830 | 93.880 |
| anthropic/claude-opus-4.6 | vanilla | 72.700 | 74.350 | 94.850 |
| deepseek-reasoner | vanilla | 72.610 | 74.440 | 94.540 |
| google/gemini-3.1-pro-preview | vanilla | 72.990 | 74.210 | 94.500 |
| openai/gpt-5.4 | vanilla | 72.600 | 74.460 | 94.510 |
| qwen/qwen3.6-plus | vanilla | 72.920 | 74.250 | 94.610 |
| anthropic/claude-opus-4.6 | agent | 72.700 | 74.350 | 94.850 |
| deepseek-reasoner | agent | 72.610 | 74.440 | 94.540 |
| google/gemini-3.1-pro-preview | agent | 72.990 | 74.210 | 94.500 |
| openai/gpt-5.4 | agent | 72.550 | 74.890 | 94.520 |
| qwen/qwen3.6-plus | agent | 72.920 | 74.250 | 94.610 |