Balancing Pipeline Parallelism with Vocabulary Parallelism

Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan
2024-11-08
Abstract: Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have been done to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms to reduce communication barriers within vocabulary layers. Additionally, we utilize a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. By combining these techniques, our methods effectively balance the computation and parameter memory, with only a small constant activation memory overhead. Notably, when combined with activation memory-balanced schedules like V-Half, our approach achieves perfect balance in both memory and computation. Extensive evaluations demonstrate that our method achieves computation and memory balance regardless of the vocabulary size, resulting in a 5% to 51% improvement in throughput compared to naive approaches, while significantly reducing peak memory usage, especially in large-vocabulary scenarios. Our implementation is open-sourced at https://github.com/sail-sg/VocabularyParallelism.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper aims to address the issue of computational and memory usage imbalance caused by vocabulary layers during pipeline parallel (PP) training of large-scale language models. Specifically:

1. **Computational and Memory Usage Imbalance**: In traditional pipeline parallel methods, vocabulary layers are often concentrated in certain stages, leading to much higher computational and memory loads in these stages compared to others. This imbalance can cause pipeline bubbles, where some stages are idle, thus reducing the utilization of computational resources.
2. **Memory Bottleneck**: Due to the high computational and parameter memory demands of vocabulary layers, especially as the vocabulary size increases, this imbalance further exacerbates memory bottlenecks, limiting the scalability of the model.

### Solution

To address the above issues, the paper proposes a method called Vocabulary Parallelism to balance computational and memory usage through the following steps:

1. **Vocabulary Layer Partitioning**: Evenly distribute the vocabulary layers across all pipeline devices to reduce the computational and memory load of individual stages.
2. **Computation Grouping**: Divide the computation of vocabulary layers into multiple passes and independently schedule these computations on each device, ensuring that dependencies are still met.
3. **Communication Optimization**: Propose two algorithms to reduce communication barriers within the vocabulary layers, thereby minimizing activation memory overhead.
4. **Integration into Existing Pipeline Scheduling**: Use a general method to combine vocabulary parallelism with existing pipeline scheduling, ensuring minimal impact on the original scheduling.

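
The partitioning step can be illustrated with a small single-process sketch. It is not the paper's implementation: the function names are hypothetical, the devices are simulated by slicing one weight matrix, and the two NumPy reductions stand in for the all-reduce communications that a real multi-device setup would need. It also does not model the pipeline passes or scheduling. It only shows that an output layer split evenly along the vocabulary dimension can reproduce the unsharded cross-entropy loss from per-shard maxima and sums.

```python
import numpy as np


def sharded_cross_entropy(x, weight, labels, num_devices):
    """Cross-entropy over a vocabulary split evenly across simulated devices.

    x: (tokens, hidden) activations
    weight: (vocab, hidden) output embedding matrix
    labels: (tokens,) integer target ids
    """
    vocab = weight.shape[0]
    assert vocab % num_devices == 0, "sketch assumes an evenly divisible vocabulary"
    shard = vocab // num_devices

    # Each simulated device computes logits only for its own vocabulary slice.
    local_logits = [
        x @ weight[d * shard:(d + 1) * shard].T for d in range(num_devices)
    ]

    # Stand-in for all-reduce(max): per-token global max for numerical stability.
    global_max = np.max([l.max(axis=1) for l in local_logits], axis=0)

    # Stand-in for all-reduce(sum): per-token global softmax denominator.
    global_sum = np.sum(
        [np.exp(l - global_max[:, None]).sum(axis=1) for l in local_logits],
        axis=0,
    )

    # Only the shard that owns a token's label contributes that label's logit.
    rows = np.arange(x.shape[0])
    label_logit = np.zeros(x.shape[0])
    for d, l in enumerate(local_logits):
        owned = (labels >= d * shard) & (labels < (d + 1) * shard)
        label_logit[owned] = l[rows[owned], labels[owned] - d * shard]

    # -log softmax(label) = log(sum) + max - logit[label]
    return np.log(global_sum) + global_max - label_logit


def reference_cross_entropy(x, weight, labels):
    """Unsharded cross-entropy, used only to check the sharded version."""
    logits = x @ weight.T
    m = logits.max(axis=1, keepdims=True)
    log_z = np.log(np.exp(logits - m).sum(axis=1)) + m[:, 0]
    return log_z - logits[np.arange(x.shape[0]), labels]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(6, 8))
    w = rng.normal(size=(16, 8))
    labels = rng.integers(0, 16, size=6)
    print(np.allclose(sharded_cross_entropy(x, w, labels, 4),
                      reference_cross_entropy(x, w, labels)))
```

In this sketch each shard holds `vocab / num_devices` rows of the weight matrix, so parameter memory and matrix-multiply work are the same on every simulated device. The two reductions mark the points where the shards must communicate, which is where the communication barriers mentioned in step 3 arise.
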
### Experimental Results

Through extensive experiments, the paper demonstrates the effectiveness of its method:

- **Throughput Improvement**: Compared to naive approaches, Vocabulary Parallelism improves throughput by 5% to 51% across different vocabulary sizes.
- **Memory Usage Balance**: Vocabulary Parallelism balances computation and memory usage regardless of vocabulary size and significantly reduces peak memory usage, particularly in large-vocabulary scenarios.

### Conclusion

The Vocabulary Parallelism method proposed in the paper effectively addresses the computational and memory imbalance in pipeline parallel training, improving training efficiency and resource utilization. This matters most in practice when training large-scale language models with large vocabularies.