cv-pooling-aggregation
Description
CV Global Pooling / Feature Aggregation Design
Research Question
Design a novel global pooling or feature aggregation strategy for image classification that improves test accuracy across different CNN architectures and datasets.
Background
Global pooling is the final spatial aggregation step in modern CNNs, reducing feature maps from [B, C, H, W] to [B, C] before the classifier head. The standard approach is Global Average Pooling (GAP), which computes the spatial mean per channel. While simple and effective, GAP discards spatial structure and treats all positions equally. Alternative strategies include:
- Global Max Pooling (GMP): Selects the strongest activation per channel, capturing the most salient features but ignoring distribution information.
- Generalized Mean (GeM) Pooling (Radenovic et al., 2018): Learnable power-mean that interpolates between average and max pooling.
- Average + Max: Element-wise combination of GAP and GMP, capturing both mean-field and peak statistics.
There is room to design pooling strategies that better capture the spatial statistics of feature maps, adapt to different architectures, or learn task-specific aggregation patterns.
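As a concrete reference point, GeM pooling from the list above can be sketched in a few lines of PyTorch. This is a minimal illustration (initial exponent p=3 and the clamping epsilon are conventional choices, not values specified by this benchmark):

```python
import torch
import torch.nn as nn

class GeM(nn.Module):
    """Generalized Mean (GeM) pooling: a learnable power-mean over spatial
    positions. p=1 recovers average pooling; p -> infinity approaches max pooling."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(float(p)))  # learnable exponent
        self.eps = eps

    def forward(self, x):  # x: [B, C, H, W]
        # Clamp so the fractional power is well-defined, then take the power-mean.
        x = x.clamp(min=self.eps).pow(self.p)
        return x.mean(dim=(2, 3)).pow(1.0 / self.p)  # [B, C]

pool = GeM()
feats = torch.rand(2, 64, 8, 8)
print(pool(feats).shape)  # torch.Size([2, 64])
```

Because p is an `nn.Parameter`, it is updated by the same SGD step as the rest of the network, letting each model settle its own point on the average-to-max spectrum.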
What You Can Modify
The CustomPool class (lines 31-48) in custom_pool.py. This class receives a 4D tensor [B, C, H, W] and must return a 2D tensor [B, C].
You can modify:
- The aggregation function (mean, max, learned weights, attention, higher-order statistics)
- Whether to use learnable parameters
- How spatial information is summarized (single-point, multi-scale, distribution-based)
- Channel-wise or spatial-wise weighting mechanisms
- Any combination of the above
Constraints:
- Input: [B, C, H, W] tensor (C varies by architecture: 64 for ResNet-56, 512 for VGG-16-BN, 1280 for MobileNetV2)
- Output: [B, C] tensor (must match input channel dimension exactly)
- Must work with variable spatial sizes (8×8 for ResNet on CIFAR, 1×1 for VGG after max-pools, 1×1 for MobileNetV2)
- No access to training data or labels within the pooling layer
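A submission satisfying these constraints might look like the following sketch: a hypothetical `CustomPool` that learns a convex blend of average and max pooling (the actual class in custom_pool.py may differ; the blend-weight design here is purely illustrative):

```python
import torch
import torch.nn as nn

class CustomPool(nn.Module):
    """Illustrative CustomPool: maps [B, C, H, W] -> [B, C] via a learnable
    convex combination of mean-field (average) and peak (max) statistics.
    No parameters depend on H or W, so any spatial size (incl. 1x1) works."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.0))  # blend logit, learned

    def forward(self, x):  # x: [B, C, H, W]
        avg = x.mean(dim=(2, 3))        # spatial mean per channel
        mx = x.amax(dim=(2, 3))         # spatial max per channel
        a = torch.sigmoid(self.alpha)   # keep the blend weight in (0, 1)
        return a * avg + (1 - a) * mx   # [B, C]
```

Note that for the 1×1 feature maps (VGG-16-BN, MobileNetV2) average and max coincide, so any gains from the blend would come from the 8×8 ResNet-56 case.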
Evaluation
- Metric: Best test accuracy (%, higher is better)
- Architectures & datasets:
- ResNet-56 on CIFAR-100 (deep residual, 100 classes; final feature map 8×8, C=64)
- VGG-16-BN on CIFAR-100 (deep non-residual with BatchNorm, 100 classes; final feature map 1×1 after max-pools, C=512)
- MobileNetV2 on FashionMNIST (lightweight inverted-residual, 10 classes) — hidden, evaluated on final submission only
- Training: SGD (lr=0.1, momentum=0.9, wd=5e-4), cosine annealing, 200 epochs
- Data augmentation: RandomCrop(32, pad=4) + RandomHorizontalFlip
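The training recipe above corresponds to a standard PyTorch setup; a minimal sketch (with a placeholder module standing in for the benchmark architectures) looks like:

```python
import torch

# Placeholder model for illustration; the benchmark uses ResNet-56,
# VGG-16-BN, or MobileNetV2 here.
model = torch.nn.Linear(10, 10)

# SGD with the stated hyperparameters, cosine-annealed over 200 epochs.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    # ... one training epoch over augmented CIFAR / FashionMNIST batches
    #     (RandomCrop(32, padding=4) + RandomHorizontalFlip) ...
    scheduler.step()
```

With `T_max=200` the learning rate follows a single cosine half-period, decaying from 0.1 to 0 over the full run.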
Code
```python
"""CV Pooling / Feature Aggregation Benchmark.

Train vision models (ResNet, VGG, MobileNetV2) on CIFAR-10/100/FashionMNIST to evaluate
global pooling and feature aggregation strategies.

FIXED: Model architectures, data pipeline, training loop.
EDITABLE: CustomPool class.

Usage:
    python custom_pool.py --arch resnet20 --dataset cifar10 --seed 42
"""

import argparse
import math
import os
```
Results
| Model | Type | Test acc (%) ResNet-56 / CIFAR-100 ↑ | Test acc (%) VGG-16-BN / CIFAR-100 ↑ | Test acc (%) MobileNetV2 / FashionMNIST ↑ |
|---|---|---|---|---|
| avg_max | baseline | 71.060 | 72.550 | 94.520 |
| gem | baseline | 72.390 | 74.020 | 94.810 |
| global_max | baseline | 69.960 | 74.430 | 94.270 |
| anthropic/claude-opus-4.6 | vanilla | 70.100 | 1.000 | 94.190 |
| deepseek-reasoner | vanilla | 72.660 | 1.000 | 94.500 |
| google/gemini-3.1-pro-preview | vanilla | 72.250 | 74.390 | 94.660 |
| openai/gpt-5.4 | vanilla | 71.140 | 1.000 | 94.470 |
| qwen/qwen3.6-plus | vanilla | 71.670 | 1.000 | 94.000 |
| anthropic/claude-opus-4.6 | agent | 71.790 | 73.220 | 93.980 |
| deepseek-reasoner | agent | 71.230 | 71.660 | 94.430 |
| google/gemini-3.1-pro-preview | agent | 72.250 | 74.390 | 94.660 |
| openai/gpt-5.4 | agent | 71.290 | 73.210 | 94.610 |
| qwen/qwen3.6-plus | agent | 71.670 | 1.000 | 94.000 |