Abstract: Contrastive loss is a powerful approach for representation learning, where larger batch sizes enhance performance by providing more negative samples to better distinguish between similar and dissimilar data. However, scaling batch sizes is constrained by the quadratic growth in GPU memory consumption, primarily due to the full instantiation of the similarity matrix. To address this, we propose a tile-based computation strategy that partitions the contrastive loss calculation into arbitrarily small blocks, avoiding full materialization of the similarity matrix. Furthermore, we introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems, employing ring-based communication at the GPU level to optimize synchronization and fused kernels at the CUDA core level to reduce I/O overhead. Experimental results show that the proposed method scales batch sizes to unprecedented levels. For instance, it enables contrastive training of a CLIP-ViT-L/14 model with a batch size of 4M or 12M using 8 or 32 A800 80GB GPUs without sacrificing any accuracy. Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed. The code will be made publicly available.
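To make the quadratic growth concrete, here is a back-of-the-envelope sketch of how large a fully materialized similarity matrix would be. It assumes 4-byte (fp32) entries, reads "4M" as 2^22 samples, and ignores gradients and all other buffers; the helper name `similarity_matrix_bytes` is hypothetical and not from the paper.

```python
def similarity_matrix_bytes(batch_size, bytes_per_element=4):
    """Bytes needed to hold a dense batch_size x batch_size matrix.

    Assumption: fp32 entries (4 bytes); gradients and activations are ignored.
    """
    return batch_size * batch_size * bytes_per_element


GIB = 2**30

# Illustrative batch sizes: 32k, 128k, and "4M" taken as 2**22.
for b in (2**15, 2**17, 2**22):
    size_gib = similarity_matrix_bytes(b) / GIB
    print(f"batch={b:>9,d}  dense similarity matrix ~ {size_gib:,.0f} GiB")
```

Under these assumptions, a 32x increase in batch size (from 128k to 4M) multiplies the matrix size by 1024x, which is why avoiding full materialization matters more than adding devices.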

What problem does this paper attempt to address?

This paper addresses the GPU memory bottleneck encountered when scaling up the batch size in contrastive learning. Specifically, as the batch size increases, the memory required to compute and store the image-text similarity matrix grows quadratically, which makes further increases in batch size impractical and limits the potential performance improvement.

### Core of the Problem

1. **Memory Bottleneck in Contrastive Loss**: The traditional implementation of contrastive loss requires calculating and storing the entire similarity matrix simultaneously on all devices, so memory consumption rises sharply as the batch size increases.
2. **Limitations of Existing Methods**: Although some existing methods (such as Gradient-Cache and OpenCLIP) attempt to reduce memory consumption through distributed computing, they remain limited in batch size, usually not exceeding 128k.

### Solution

To solve the above problems, the paper proposes a new method named Inf-CL. Its main innovations are:

1. **Tile-Based Contrastive Loss Calculation**: Divide the calculation of the contrastive loss into multiple small tiles to avoid instantiating the entire similarity matrix at once. By iteratively accumulating LSE (log-sum-exp) terms, calculations can be performed on arbitrarily small tiles, thereby significantly reducing memory overhead.

   \[ L_I = -\frac{1}{b} \sum_{i = 1}^{b} \left( x_{i,i} - \log \sum_{j = 1}^{b} e^{x_{i,j}} \right) \]

   where \( x_{i,j} \) is the scaled cosine similarity between the \( i \)-th image and the \( j \)-th text segment, and \( b \) is the batch size.
2. **Multi-Level Tile Strategy**:
   - **Coarse-Grained Cross-GPU Tiles**: Distribute the image and text batches across multiple GPUs. Each GPU performs serial LSE calculations on multiple rows and minimizes communication overhead through asynchronous column communication.
   - **Fine-Grained Intra-GPU Tiles**: Inside each GPU, assign row calculations to multiple CUDA cores and merge iterations into a single kernel to reduce I/O overhead.
3. **Data Offloading Strategy**: To further reduce the memory footprint, the paper introduces a "data offloading" technique, which loads only a small batch of data onto the GPU at each accumulation step, thereby stabilizing data memory usage.

### Experimental Results

Experiments show that Inf-CL significantly reduces memory consumption and supports unprecedented large-batch training while maintaining accuracy. For example, on 32 A800 GPUs, Inf-CL supports a batch size of up to 12M, while traditional methods cannot train at this scale under the same hardware conditions.

### Summary

The main contribution of this paper is an effective tile-based contrastive loss calculation method that breaks through the existing memory bottleneck and supports nearly unlimited batch size expansion. This not only improves the performance of contrastive learning but also opens new possibilities for large-scale model training.
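The tile-based LSE accumulation described under Solution can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: it runs on the CPU, uses raw dot products as a stand-in for the scaled cosine similarities \( x_{i,j} \), and the names `full_loss`, `tiled_loss`, and `tile` are hypothetical. It only shows that merging per-tile log-sum-exp values reproduces the loss \( L_I \) while holding just one tile of a row in memory at a time.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def full_loss(img, txt):
    """Reference: builds each full row of the b x b similarity matrix."""
    b = len(img)
    total = 0.0
    for i in range(b):
        row = [dot(img[i], txt[j]) for j in range(b)]
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += row[i] - lse
    return -total / b


def tiled_loss(img, txt, tile=4):
    """Same loss, but each row is processed one column tile at a time."""
    b = len(img)
    total = 0.0
    for i in range(b):
        lse = -math.inf  # running log-sum-exp over the columns seen so far
        for start in range(0, b, tile):
            stop = min(start + tile, b)
            block = [dot(img[i], txt[j]) for j in range(start, stop)]
            # Shift by the running maximum so the exponentials cannot overflow.
            m = max(max(block), lse)
            acc = math.exp(lse - m) + sum(math.exp(x - m) for x in block)
            lse = m + math.log(acc)
        total += dot(img[i], txt[i]) - lse
    return -total / b


# Toy data (assumption: random Gaussian vectors standing in for embeddings).
random.seed(0)
b, d = 10, 8
img = [[random.gauss(0, 1) for _ in range(d)] for _ in range(b)]
txt = [[random.gauss(0, 1) for _ in range(d)] for _ in range(b)]

print("full :", full_loss(img, txt))
print("tiled:", tiled_loss(img, txt, tile=3))
```

The two printed values should agree up to floating-point rounding for any tile size, including sizes that do not divide the batch evenly. In this sketch the per-row state is a single scalar, which is what lets the tile size be chosen independently of the batch size.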

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
