cv-pooling-aggregation

Computer Vision · pytorch-vision · rigorous codebase

Description

CV Global Pooling / Feature Aggregation Design

Research Question

Design a novel global pooling or feature aggregation strategy for image classification that improves test accuracy across different CNN architectures and datasets.

Background

Global pooling is the final spatial aggregation step in modern CNNs, reducing feature maps from [B, C, H, W] to [B, C] before the classifier head. The standard approach is Global Average Pooling (GAP), which computes the spatial mean per channel. While simple and effective, GAP discards spatial structure and treats all positions equally. Alternative strategies include:

  • Global Max Pooling (GMP): Selects the strongest activation per channel, capturing the most salient features but ignoring distribution information.
  • Generalized Mean (GeM) Pooling (Radenovic et al., 2018): Learnable power-mean that interpolates between average and max pooling.
  • Average + Max: Element-wise combination of GAP and GMP, capturing both mean-field and peak statistics.

There is room to design pooling strategies that better capture the spatial statistics of feature maps, adapt to different architectures, or learn task-specific aggregation patterns.
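The baselines above are each a few lines of tensor code. As an illustrative sketch (not the benchmark's own implementation), here is GeM pooling, which recovers GAP at p=1 and approaches GMP as p grows:

```python
import torch


def gem_pool(x: torch.Tensor, p: float = 3.0, eps: float = 1e-6) -> torch.Tensor:
    """Generalized-mean pooling: [B, C, H, W] -> [B, C].

    p=1 reduces to global average pooling; p -> infinity approaches
    global max pooling. In practice p is often a learnable parameter.
    """
    # Clamping keeps the fractional power well-defined for non-positive activations.
    return x.clamp(min=eps).pow(p).mean(dim=(-2, -1)).pow(1.0 / p)
```

A learnable variant wraps `p` in an `nn.Parameter` so the network can interpolate between the two regimes during training.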

What You Can Modify

The CustomPool class (lines 31-48) in custom_pool.py. This class receives a 4D tensor [B, C, H, W] and must return a 2D tensor [B, C].

You can modify:

  • The aggregation function (mean, max, learned weights, attention, higher-order statistics)
  • Whether to use learnable parameters
  • How spatial information is summarized (single-point, multi-scale, distribution-based)
  • Channel-wise or spatial-wise weighting mechanisms
  • Any combination of the above

Constraints:

  • Input: [B, C, H, W] tensor (C varies by architecture: 64 for ResNet-56, 512 for VGG-16-BN, 1280 for MobileNetV2)
  • Output: [B, C] tensor (must match input channel dimension exactly)
  • Must work with variable spatial sizes (8×8 for ResNet on CIFAR, 1×1 for VGG after max-pools, 1×1 for MobileNetV2)
  • No access to training data or labels within the pooling layer
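A minimal sketch of a module satisfying this interface (hypothetical; not the actual contents of lines 31-48). It mixes average and max statistics with one learnable weight and works for any spatial size, including the 1×1 maps produced by VGG-16-BN and MobileNetV2:

```python
import torch
import torch.nn as nn


class CustomPool(nn.Module):
    """Illustrative pooling module: [B, C, H, W] -> [B, C]."""

    def __init__(self):
        super().__init__()
        # Single learnable scalar mixing avg and max statistics;
        # sigmoid keeps the mixing weight in (0, 1).
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=(-2, -1))   # [B, C] mean-field statistic
        mx = x.amax(dim=(-2, -1))    # [B, C] peak statistic
        w = torch.sigmoid(self.alpha)
        return w * avg + (1.0 - w) * mx
```

Reducing over `dim=(-2, -1)` rather than hard-coding the spatial size is what keeps the module valid for both 8×8 and 1×1 feature maps; on a 1×1 map, avg and max coincide and the module degenerates gracefully to a squeeze.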

Evaluation

  • Metric: Best test accuracy (%, higher is better)
  • Architectures & datasets:
    • ResNet-56 on CIFAR-100 (deep residual, 100 classes; final feature map 8×8, C=64)
    • VGG-16-BN on CIFAR-100 (deep non-residual with BatchNorm, 100 classes; final feature map 1×1 after max-pools, C=512)
    • MobileNetV2 on FashionMNIST (lightweight inverted-residual, 10 classes) — hidden, evaluated on final submission only
  • Training: SGD (lr=0.1, momentum=0.9, wd=5e-4), cosine annealing, 200 epochs
  • Data augmentation: RandomCrop(32, pad=4) + RandomHorizontalFlip
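This fixed recipe corresponds to a standard PyTorch setup. A sketch with a stand-in model (the real training loop, data pipeline, and architectures live in custom_pool.py and are not editable):

```python
import torch

# Stand-in model; the benchmark uses ResNet-56, VGG-16-BN, or MobileNetV2.
model = torch.nn.Linear(10, 100)

# Hyperparameters match the stated recipe: lr=0.1, momentum=0.9, wd=5e-4.
opt = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4
)
# Cosine annealing over the full 200-epoch budget, decaying lr to 0.
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=200)

for epoch in range(200):
    # ... one epoch of batched forward/backward/opt.step() would go here ...
    opt.step()     # a real loop steps per batch; shown once per epoch for brevity
    sched.step()   # decay the learning rate once per epoch
```

Any learnable parameters inside CustomPool are registered on the module and therefore picked up by the same optimizer and weight-decay schedule as the rest of the network.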

Code

custom_pool.py
  """CV Pooling / Feature Aggregation Benchmark.

  Train vision models (ResNet, VGG, MobileNetV2) on CIFAR-10/100/FashionMNIST to evaluate
  global pooling and feature aggregation strategies.

  FIXED: Model architectures, data pipeline, training loop.
  EDITABLE: CustomPool class.

  Usage:
      python custom_pool.py --arch resnet20 --dataset cifar10 --seed 42
  """

  import argparse
  import math
  import os

Results

Best test accuracy (%) per architecture/dataset:

Model                          Type      resnet56-cifar100  vgg16bn-cifar100  mobilenetv2-fmnist
avg_max                        baseline  71.060             72.550            94.520
gem                            baseline  72.390             74.020            94.810
global_max                     baseline  69.960             74.430            94.270
anthropic/claude-opus-4.6      vanilla   70.100             1.000             94.190
deepseek-reasoner              vanilla   72.660             1.000             94.500
google/gemini-3.1-pro-preview  vanilla   72.250             74.390            94.660
openai/gpt-5.4                 vanilla   71.140             1.000             94.470
qwen/qwen3.6-plus              vanilla   71.670             1.000             94.000
anthropic/claude-opus-4.6      agent     71.790             73.220            93.980
deepseek-reasoner              agent     71.230             71.660            94.430
google/gemini-3.1-pro-preview  agent     72.250             74.390            94.660
openai/gpt-5.4                 agent     71.290             73.210            94.610
qwen/qwen3.6-plus              agent     71.670             1.000             94.000

Agent Conversations