Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss

Zesen Cheng, Hang Zhang, Kehan Li, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing
2024-10-23
Abstract: Contrastive loss is a powerful approach for representation learning, where larger batch sizes enhance performance by providing more negative samples to better distinguish between similar and dissimilar data. However, scaling batch sizes is constrained by the quadratic growth in GPU memory consumption, primarily due to the full instantiation of the similarity matrix. To address this, we propose a tile-based computation strategy that partitions the contrastive loss calculation into arbitrary small blocks, avoiding full materialization of the similarity matrix. Furthermore, we introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems, employing ring-based communication at the GPU level to optimize synchronization and fused kernels at the CUDA core level to reduce I/O overhead. Experimental results show that the proposed method scales batch sizes to unprecedented levels. For instance, it enables contrastive training of a CLIP-ViT-L/14 model with a batch size of 4M or 12M using 8 or 32 A800 80GB without sacrificing any accuracy. Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed. The code will be made publicly available.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the GPU memory bottleneck encountered when scaling the batch size in contrastive learning. As the batch size increases, the memory required to compute and store the image-text similarity matrix grows quadratically, which makes further increases impractical and limits the potential performance improvement.

### Core of the Problem

1. **Memory Bottleneck in Contrastive Loss**: The traditional implementation of contrastive loss requires calculating and storing the entire similarity matrix at once on all devices, so memory consumption rises sharply as the batch size increases.
2. **Limitations of Existing Methods**: Although some existing methods (such as Gradient-Cache and OpenCLIP) attempt to reduce memory consumption through distributed computing, the batch size they support is still limited, usually to no more than 128k.

### Solution

To solve these problems, the paper proposes a new method named Inf-CL. Its main innovations are:

1. **Tile-Based Contrastive Loss Calculation**: Divide the calculation of the contrastive loss into multiple small tiles to avoid instantiating the entire similarity matrix at once. By iteratively accumulating LSE (log-sum-exp) terms, the calculation can be performed on arbitrarily small tiles, which significantly reduces memory overhead. The image-to-text loss is

   \[ L_I = -\frac{1}{b} \sum_{i = 1}^{b} \left( x_{i,i} - \log \sum_{j = 1}^{b} e^{x_{i,j}} \right) \]

   where \( b \) is the batch size and \( x_{i,j} \) is the scaled cosine similarity between the \( i \)-th image and the \( j \)-th text segment.
2. **Multi-Level Tile Strategy**:
   - **Coarse-Grained Cross-GPU Tiles**: Distribute the image and text batches across multiple GPUs. Each GPU performs serial LSE calculations on multiple rows and minimizes communication overhead through asynchronous column communication.
   - **Fine-Grained Intra-GPU Tiles**: Inside each GPU, assign row calculations to multiple CUDA cores and merge the iterations into a single kernel to reduce I/O overhead.
3. **Data Offloading Strategy**: To further reduce the memory footprint, the paper introduces a "data offloading" technique: only a small batch of data is loaded onto the GPU at each accumulation step, which keeps data memory usage stable.

### Experimental Results

Experiments show that Inf-CL significantly reduces memory consumption and supports training at unprecedented batch sizes while maintaining accuracy. For example, on 32 A800 GPUs, Inf-CL supports a batch size of up to 12M, while traditional methods cannot train at this scale on the same hardware.

### Summary

The main contribution of this paper is an effective tile-based method for calculating contrastive loss that breaks through the existing memory bottleneck and supports nearly unlimited batch size expansion. This not only improves the performance of contrastive learning but also opens new possibilities for large-scale model training.
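
The tile-wise LSE accumulation described under Solution can be illustrated with a small NumPy sketch. This is not the paper's implementation: it runs on a single CPU, covers only the forward pass of the image-to-text loss, and the function names (`full_loss`, `tiled_loss`) and the `tile` and `scale` parameters are hypothetical names chosen for illustration. It assumes the feature rows are already L2-normalized, so the scaled dot product is the scaled cosine similarity.

```python
import numpy as np


def full_loss(img, txt, scale=1.0):
    """Reference: materialize the whole b x b similarity matrix."""
    x = scale * (img @ txt.T)
    m = x.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(x - m).sum(axis=1))
    # L_I = -(1/b) * sum_i (x_ii - LSE_i) = mean(LSE_i - x_ii)
    return float(np.mean(lse - np.diag(x)))


def tiled_loss(img, txt, tile=4, scale=1.0):
    """Same loss, but only one (tile x tile) block exists at a time."""
    b = img.shape[0]
    total = 0.0
    for r in range(0, b, tile):
        rows = img[r:r + tile]
        # Running log-sum-exp for this block of rows; -inf is the empty sum.
        lse = np.full(rows.shape[0], -np.inf)
        for c in range(0, b, tile):
            x = scale * (rows @ txt[c:c + tile].T)
            m = x.max(axis=1)
            tile_lse = m + np.log(np.exp(x - m[:, None]).sum(axis=1))
            # Merge the tile's LSE into the running value stably.
            lse = np.logaddexp(lse, tile_lse)
        # Positive logits x_ii: each image row dotted with its paired text row.
        pos = scale * np.einsum("ij,ij->i", rows, txt[r:r + tile])
        total += float(np.sum(lse - pos))
    return total / b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(10, 8))
    txt = rng.normal(size=(10, 8))
    img /= np.linalg.norm(img, axis=1, keepdims=True)
    txt /= np.linalg.norm(txt, axis=1, keepdims=True)
    print(full_loss(img, txt, scale=10.0), tiled_loss(img, txt, tile=4, scale=10.0))
```

The two functions should agree up to floating-point rounding, while the tiled version never holds more than `tile * tile` similarity values at once. For a sense of scale, a full 4M by 4M similarity matrix in 32-bit floats would have about 1.6e13 entries, or roughly 64 TB, which is why avoiding its materialization matters at the batch sizes quoted above.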