optimization-gradient-compression
Description
Gradient Compression for Communication-Efficient Distributed Training
Research Question
Design a gradient compression operator that reduces communication cost in distributed training while maintaining convergence quality (test accuracy).
Background
In distributed data-parallel training, gradient communication is often the bottleneck. Workers compute local gradients, which must be aggregated (e.g., via all-reduce) before the optimizer step. Gradient compression reduces the volume of data communicated by applying lossy compression to gradients before transmission.
Three main families of compression exist:
- Sparsification: Keep only a subset of gradient elements (e.g., TopK selects the largest magnitudes)
- Quantization: Reduce the precision of gradient values (e.g., QSGD uses stochastic rounding to discrete levels)
- Low-rank approximation: Approximate gradient matrices with low-rank factors (e.g., PowerSGD)
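As an illustration of the quantization family, here is a minimal QSGD-style stochastic-rounding sketch. The function name `qsgd_quantize` is illustrative, and numpy stands in for framework tensors; the rounding is randomized so that the quantized gradient is an unbiased estimate of the input.

```python
import numpy as np

def qsgd_quantize(grad, num_levels=4, rng=np.random.default_rng(0)):
    """QSGD-style stochastic rounding onto `num_levels` uniform levels per sign.

    Illustrative sketch: each |g_i| / ||g|| is rounded up or down to an
    adjacent level with probability equal to its fractional position,
    which makes the quantizer unbiased in expectation.
    """
    norm = np.linalg.norm(grad)
    if norm == 0:
        return np.zeros_like(grad)
    # Position of each normalized magnitude on the level grid.
    scaled = np.abs(grad) / norm * num_levels
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional part (unbiased).
    levels = lower + (rng.random(grad.shape) < (scaled - lower))
    return np.sign(grad) * levels * norm / num_levels
```

In a real compressor only the norm, signs, and integer levels would be transmitted; the float reconstruction above is what the receiver would compute.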
A key challenge is that naive compression introduces bias or variance that degrades convergence. Error feedback (accumulating the compression residual and adding it back to the next iteration's gradient) is a widely used technique to correct this.
Task
Modify the Compressor class in custom_compressor.py. Your compressor must implement:
- __init__(self, compress_ratio): Initialize with a target compression ratio (0.01 = 100x compression)
- compress(self, tensor, name): Compress a gradient tensor, returning (compressed_tensors, ctx)
- decompress(self, compressed_tensors, ctx): Reconstruct the gradient
The compressor may maintain internal state (e.g., error feedback residuals) across calls. The name parameter identifies parameters for per-parameter state tracking.
Interface
```python
class Compressor:
    def __init__(self, compress_ratio=0.01): ...
    def compress(self, tensor, name) -> (list[Tensor], ctx): ...
    def decompress(self, compressed_tensors, ctx) -> Tensor: ...
```
- compress_ratio: Fraction of gradient elements/information to retain (0.01 = keep 1%)
- compressed_tensors: List of tensors that would be communicated over the network
- ctx: Local context (not communicated) needed for decompression
- The decompressed tensor must have the same shape as the original input
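A minimal compressor satisfying this interface might look like the sketch below: TopK sparsification with per-parameter error feedback. This is one possible baseline under stated assumptions, not the reference implementation; numpy arrays stand in for torch tensors, and the flat index array plays the role of a second communicated tensor.

```python
import numpy as np

class Compressor:
    """TopK sparsification with error feedback (illustrative sketch)."""
    def __init__(self, compress_ratio=0.01):
        self.compress_ratio = compress_ratio
        self.residual = {}  # per-parameter error-feedback state, keyed by name

    def compress(self, tensor, name):
        shape = tensor.shape
        # Add back the residual left over from the previous round.
        flat = tensor.ravel() + self.residual.get(name, 0.0)
        k = max(1, int(flat.size * self.compress_ratio))
        idx = np.argpartition(np.abs(flat), -k)[-k:]
        values = flat[idx]
        # Error feedback: remember everything that is not transmitted.
        res = flat.copy()
        res[idx] = 0.0
        self.residual[name] = res
        # (compressed_tensors, ctx): values and indices travel; shape stays local.
        return [values, idx], shape

    def decompress(self, compressed_tensors, ctx):
        values, idx = compressed_tensors
        out = np.zeros(int(np.prod(ctx)))
        out[idx] = values
        return out.reshape(ctx)
```

Note that only `values` and `idx` are placed in `compressed_tensors` (they would be communicated), while the original shape is kept in `ctx`, which stays on the worker.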
Evaluation
Compressors are trained and evaluated in three settings at 100x compression (compress_ratio=0.01):
- ResNet-20 / CIFAR-10 (0.27M params): Small model, standard benchmark
- VGG-11-BN / CIFAR-100 (9.8M params): Larger model, harder 100-class problem
- ResNet-56 / CIFAR-10 (0.85M params): Deeper model, tests scalability
Metric: best test accuracy (higher is better). All settings use SGD with momentum, cosine LR schedule, and 200 training epochs.
Code
```python
"""Gradient Compression for Communication-Efficient Distributed Training.

Self-contained benchmark: trains standard vision models on CIFAR datasets
using data-parallel SGD with a pluggable gradient compressor.

The script simulates distributed training on a single node by:
1. Computing gradients normally
2. Applying compress() -> decompress() to each gradient (simulating communication)
3. Using the decompressed gradient for the optimizer step

This faithfully measures the effect of gradient compression on convergence
quality, which is the core ML-science question, without requiring multi-node
infrastructure.
"""
```
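The simulated communication loop described in the docstring amounts to something like the following sketch. `simulated_step` is an illustrative name, and plain SGD with numpy arrays stands in for the benchmark's momentum-SGD on torch tensors.

```python
import numpy as np

def simulated_step(params, grads, compressor, lr=0.1):
    """One single-node step mimicking the benchmark's communication loop:
    compress -> decompress each gradient, then apply a plain SGD update."""
    for name, p in params.items():
        comp, ctx = compressor.compress(grads[name], name)
        g_hat = compressor.decompress(comp, ctx)  # what workers would receive
        params[name] = p - lr * g_hat
    return params
```

Because compression and decompression happen back-to-back on one node, the run measures only the optimization effect of lossy gradients, not actual network savings.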
Results
| Model | Type | best acc resnet20-cifar10 ↑ | test loss resnet20-cifar10 ↓ | best acc vgg11-cifar100 ↑ | test loss vgg11-cifar100 ↓ | best acc resnet56-cifar10 ↑ | test loss resnet56-cifar10 ↓ |
|---|---|---|---|---|---|---|---|
| qsgd | baseline | 90.407 | 0.316 | 47.683 | 2.044 | 94.010 | 0.261 |
| signsgd | baseline | 92.517 | 0.272 | 70.767 | 1.451 | 94.143 | 0.257 |
| topk_ef | baseline | 92.247 | 0.276 | 70.243 | 1.470 | 93.850 | 0.258 |
| anthropic/claude-opus-4.6 | vanilla | - | - | 70.370 | 1.492 | 94.183 | 0.240 |
| deepseek-reasoner | vanilla | - | - | 70.117 | 1.481 | 93.925 | 0.249 |
| google/gemini-3.1-pro-preview | vanilla | - | - | 70.370 | 1.459 | 94.163 | 0.244 |
| openai/gpt-5.4-pro | vanilla | 92.580 | 0.277 | 70.400 | 1.444 | 94.073 | 0.256 |
| qwen3.6-plus:free | vanilla | 90.763 | 0.306 | 66.957 | 1.716 | 92.133 | 0.295 |
| anthropic/claude-opus-4.6 | agent | 92.717 | 0.269 | 70.973 | 1.437 | 94.200 | 0.244 |
| deepseek-reasoner | agent | - | - | - | - | - | - |
| google/gemini-3.1-pro-preview | agent | - | - | 70.460 | 1.402 | 94.077 | 0.261 |
| openai/gpt-5.4-pro | agent | 92.787 | 0.266 | 70.720 | 1.450 | 94.117 | 0.257 |
| qwen3.6-plus:free | agent | - | - | 67.527 | 1.706 | 93.520 | 0.266 |