dl-weight-initialization
Description
DL Weight Initialization Strategy Design
Research Question
Design a novel weight initialization strategy for deep convolutional neural networks that improves convergence speed and final test accuracy across different architectures and datasets.
Background
Weight initialization is fundamental to training deep neural networks. Poor initialization leads to vanishing/exploding gradients, slow convergence, or suboptimal generalization. Classic methods include:
- Kaiming/He (2015): Accounts for the ReLU nonlinearity; weights drawn from N(0, 2/fan_out), i.e. std = sqrt(2/fan_out)
- Orthogonal (2014): Preserves gradient norms via orthogonal matrices
- Fixup (2019): Scales the convolutions inside each residual branch by L^(-1/2), where L is the number of residual blocks, to control variance accumulation across depth; the variant used as a baseline here also zero-initializes the last BN weight in each block so residual branches start near identity
However, these methods each address only one aspect of initialization. There is room to design strategies that jointly account for residual connections, batch normalization's re-scaling effect, depth-dependent scaling, and the interaction between different layer types (convolution vs classifier).
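As a concrete reference point, the three classic schemes above can each be applied to a single convolution in a few lines of PyTorch. This is an illustrative sketch, not the benchmark's code; the layer shape and the block count `L` are made-up values.

```python
import torch
import torch.nn as nn

# Illustrative layer; shapes are arbitrary, not from the benchmark.
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)

# Kaiming/He: variance 2/fan compensates for ReLU zeroing half the activations.
nn.init.kaiming_normal_(conv.weight, mode="fan_in", nonlinearity="relu")

# Orthogonal: torch flattens the 4-D kernel to a matrix, orthogonalizes it,
# and reshapes back, approximately preserving norms through the layer.
nn.init.orthogonal_(conv.weight)

# Fixup-style depth scaling: shrink a residual-branch conv by L^(-1/2).
L = 9  # assumed number of residual blocks per stage (e.g. ResNet-56)
nn.init.kaiming_normal_(conv.weight, mode="fan_in", nonlinearity="relu")
with torch.no_grad():
    conv.weight.mul_(L ** -0.5)
```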
What You Can Modify
The `initialize_weights(model, config)` function (lines 147-180) in `custom_init.py`. This function receives the fully constructed model and a config dict, and must initialize all parameters.
You can modify:
- How `nn.Conv2d` weights are initialized (distribution, fan-in/fan-out, gain)
- How `nn.BatchNorm2d` parameters (weight/bias) are initialized
- How `nn.Linear` weights and biases are initialized
- Per-layer or depth-dependent scaling strategies
- Special handling for residual shortcut projections vs main-path convolutions
- Any data-independent initialization logic (no training data access)
The config dict provides: arch (str), num_classes (int), depth (int = number of Conv2d + Linear layers). You can also iterate over model.named_modules() or model.named_parameters().
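The function signature and the config keys above are enough to sketch a submission. The following is a hypothetical example only: the per-module rules and the `depth ** -0.25` damping factor are illustrative assumptions, not a recommended strategy.

```python
import math
import torch
import torch.nn as nn

def initialize_weights(model: nn.Module, config: dict) -> None:
    """Hypothetical depth-aware initialization sketch.

    The signature and config keys ('arch', 'num_classes', 'depth') follow the
    task description; the scaling rule itself is an assumption for illustration.
    """
    depth = config.get("depth", 1)
    for name, m in model.named_modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
            with torch.no_grad():
                # Assumed depth-dependent damping to curb variance growth.
                m.weight.mul_(depth ** -0.25)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.ones_(m.weight)
            nn.init.zeros_(m.bias)
        elif isinstance(m, nn.Linear):
            # Smaller classifier scale so initial logits stay flat across classes.
            nn.init.normal_(m.weight, std=1.0 / math.sqrt(m.in_features))
            nn.init.zeros_(m.bias)
```

Usage would mirror the benchmark's call site, e.g. `initialize_weights(model, {"arch": "resnet56", "num_classes": 100, "depth": 55})`.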
Evaluation
- Metric: Best test accuracy (%, higher is better)
- Architectures & datasets:
- ResNet-56 on CIFAR-100 (deep residual, 100 classes)
- VGG-16-BN on CIFAR-100 (deep non-residual with BatchNorm, 100 classes)
- MobileNetV2 on FashionMNIST (lightweight inverted-residual, 10 classes) — hidden, evaluated on final submission only
- Training: SGD (lr=0.1, momentum=0.9, wd=5e-4), cosine annealing, 200 epochs
- Data augmentation: RandomCrop(32, pad=4) + RandomHorizontalFlip
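The fixed training recipe above maps directly onto standard PyTorch objects. A minimal sketch, using a stand-in module in place of the real network (the augmentation transforms would come from torchvision and are omitted to keep this self-contained):

```python
import torch

# Stand-in for the real network; only the optimizer/scheduler wiring is shown.
model = torch.nn.Linear(10, 10)

# SGD with the stated hyperparameters: lr=0.1, momentum=0.9, wd=5e-4.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4
)

# Cosine annealing over the full 200-epoch budget (lr decays 0.1 -> 0).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
```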
Code
```python
"""CV Weight Initialization Benchmark.

Train vision models (ResNet, VGG, MobileNetV2) on CIFAR-10/100/FashionMNIST to evaluate
weight initialization strategies.

FIXED: Model architectures, data pipeline, training loop.
EDITABLE: initialize_weights() function.

Usage:
    python custom_init.py --arch resnet20 --dataset cifar10 --seed 42
"""

import argparse
import math
import os
```
Results
| Model | Type | test acc resnet56-cifar100 ↑ | test acc vgg16bn-cifar100 ↑ | test acc mobilenetv2-fmnist ↑ |
|---|---|---|---|---|
| fixup | baseline | 72.370 | 74.370 | 94.480 |
| kaiming_normal | baseline | 72.070 | 73.380 | 94.490 |
| orthogonal | baseline | 72.080 | 72.830 | 93.880 |
| anthropic/claude-opus-4.6 | vanilla | 72.700 | 74.350 | 94.850 |
| deepseek-reasoner | vanilla | 72.610 | 74.440 | 94.540 |
| google/gemini-3.1-pro-preview | vanilla | 72.990 | 74.210 | 94.500 |
| openai/gpt-5.4 | vanilla | 72.600 | 74.460 | 94.510 |
| qwen/qwen3.6-plus | vanilla | 72.920 | 74.250 | 94.610 |
| anthropic/claude-opus-4.6 | agent | 72.700 | 74.350 | 94.850 |
| deepseek-reasoner | agent | 72.610 | 74.440 | 94.540 |
| google/gemini-3.1-pro-preview | agent | 72.990 | 74.210 | 94.500 |
| openai/gpt-5.4 | agent | 72.550 | 74.890 | 94.520 |
| qwen/qwen3.6-plus | agent | 72.920 | 74.250 | 94.610 |