Abstract: Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have sought to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms to reduce communication barriers within vocabulary layers. Additionally, we utilize a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. By combining these techniques, our methods effectively balance the computation and parameter memory, with only a small constant activation memory overhead. Notably, when combined with activation memory-balanced schedules like V-Half, our approach achieves perfect balance in both memory and computation. Extensive evaluations demonstrate that our method achieves computation and memory balance regardless of the vocabulary size, resulting in a 5% to 51% improvement in throughput compared to naive approaches, while significantly reducing peak memory usage, especially in large-vocabulary scenarios. Our implementation is open-sourced at https://github.com/sail-sg/VocabularyParallelism.
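
To make the imbalance concrete, the following back-of-the-envelope sketch compares per-stage parameter counts for a naive layer split against an even split of the vocabulary layers. The model shape is hypothetical, and the figure of roughly 12 × hidden² parameters per transformer layer, the untied input and output vocabulary layers, and the helper names `stage_params` and `balanced_stage_params` are all assumptions for illustration; none of them come from the paper.

```python
def stage_params(vocab_size, hidden, num_layers, num_stages):
    """Per-stage parameter counts for a naive pipeline split.

    Assumptions (illustrative only): each transformer layer holds about
    12 * hidden**2 parameters, layers divide evenly across stages, and the
    input embedding and output projection are untied, each of size
    vocab_size * hidden.
    """
    per_layer = 12 * hidden * hidden
    vocab_layer = vocab_size * hidden
    layers_per_stage = num_layers // num_stages
    stages = [layers_per_stage * per_layer for _ in range(num_stages)]
    stages[0] += vocab_layer    # input embedding sits on the first stage
    stages[-1] += vocab_layer   # output projection sits on the last stage
    return stages


def balanced_stage_params(vocab_size, hidden, num_layers, num_stages):
    """Same model, with both vocabulary layers split evenly across stages."""
    per_layer = 12 * hidden * hidden
    vocab_share = 2 * vocab_size * hidden // num_stages
    layers_per_stage = num_layers // num_stages
    return [layers_per_stage * per_layer + vocab_share for _ in range(num_stages)]


if __name__ == "__main__":
    # Hypothetical configuration: 256k vocabulary, 32 layers, 8 stages.
    naive = stage_params(256_000, 4096, 32, 8)
    even = balanced_stage_params(256_000, 4096, 32, 8)
    print("naive    max/min:", max(naive) / min(naive))
    print("balanced max/min:", max(even) / min(even))
```

Under these assumptions the first and last stages hold more than twice the parameters of a middle stage, while the even split gives every stage the same count. The sketch covers parameter memory only; it does not model activation memory or compute time.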

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address the issue of computational and memory usage imbalance caused by vocabulary layers during pipeline parallel (PP) training of large-scale language models. Specifically:

1. **Computational and Memory Usage Imbalance**: In traditional pipeline parallel methods, vocabulary layers are often concentrated in certain stages, leading to much higher computational and memory loads in these stages compared to others. This imbalance can cause pipeline bubbles, where some stages are idle, thus reducing the utilization of computational resources.
2. **Memory Bottleneck**: Due to the high computational and parameter memory demands of vocabulary layers, especially as the vocabulary size increases, this imbalance further exacerbates memory bottlenecks, limiting the scalability of the model.

### Solution

To address the above issues, the paper proposes a method called Vocabulary Parallelism to balance computational and memory usage through the following steps:

1. **Vocabulary Layer Partitioning**: Evenly distribute the vocabulary layers across all pipeline devices to reduce the computational and memory load of individual stages.
2. **Computation Grouping**: Divide the computation of vocabulary layers into multiple passes and independently schedule these computations on each device, ensuring that dependencies are still met.
3. **Communication Optimization**: Propose two algorithms to reduce communication barriers within the vocabulary layers, thereby minimizing activation memory overhead.
4. **Integration into Existing Pipeline Scheduling**: Use a general method to combine vocabulary parallelism with existing pipeline scheduling, ensuring minimal impact on the original scheduling.

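
The partitioning and communication steps above can be illustrated with a single-process sketch of a cross-entropy loss computed over a sharded vocabulary. This is a generic baseline, not the paper's optimized algorithms: it shows where the communication barriers come from (one reduction for the global maximum, one for the global sum of exponentials), which is the cost the paper's algorithms aim to reduce. The function names are hypothetical, and the "all-reduce" steps are simulated with Python's built-in `max` and `sum` rather than real collectives.

```python
import math
import random


def full_cross_entropy(logits, target):
    """Reference: cross-entropy over the whole vocabulary on one device."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def sharded_cross_entropy(logits, target, num_shards):
    """Cross-entropy where each shard only sees its slice of the logits.

    The reductions across shards stand in for collective communication;
    on real hardware each one is a synchronization point between devices.
    """
    vocab = len(logits)
    shard_size = math.ceil(vocab / num_shards)
    shards = [logits[i:i + shard_size] for i in range(0, vocab, shard_size)]

    # Barrier 1: local max per shard, then a max-reduction across shards.
    global_max = max(max(s) for s in shards)

    # Barrier 2: local sum of exponentials, then a sum-reduction across shards.
    global_sum = sum(sum(math.exp(x - global_max) for x in s) for s in shards)

    # Only the shard that owns the target token contributes its logit.
    owner, offset = divmod(target, shard_size)
    target_logit = shards[owner][offset]

    return global_max + math.log(global_sum) - target_logit


if __name__ == "__main__":
    random.seed(0)
    logits = [random.uniform(-5, 5) for _ in range(1000)]
    print(full_cross_entropy(logits, 417))
    print(sharded_cross_entropy(logits, 417, num_shards=8))
```

The sharded result matches the single-device reference up to floating-point rounding, because subtracting the global maximum before exponentiating makes the per-shard partial sums directly addable.
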
### Experimental Results

Through extensive experiments, the paper demonstrates the effectiveness of its method:

- **Throughput Improvement**: Compared to naive approaches, Vocabulary Parallelism improves throughput by 5% to 51% across different vocabulary sizes.
- **Memory Usage Balance**: Vocabulary Parallelism achieves better balance in computation and memory usage and significantly reduces peak memory usage, particularly in large-vocabulary scenarios.

### Conclusion

The Vocabulary Parallelism method proposed in the paper effectively addresses the computational and memory imbalance in pipeline-parallel training, improving training efficiency and resource utilization. This has significant practical implications, especially for training large-scale language models.

Balancing Pipeline Parallelism with Vocabulary Parallelism

Pipeline Parallelism with Controllable Memory

Zero Bubble Pipeline Parallelism

Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency

vPipe: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training

Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe

Improving Automatic Parallel Training Via Balanced Memory Workload Optimization

TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training

Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training

System Level Asynchronous Virtual Pipeline on Dynamically and Partially Reconfigurable Architecture

3D Parallelism for Transformers Via Integer Programming

Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models

Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator

Enabling Parallelism Hot Switching for Efficient Training of Large Language Models

PipeMare: Asynchronous Pipeline Parallel DNN Training

Advances of Pipeline Model Parallelism for Deep Learning Training: An Overview

AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers

Layer-Condensed KV Cache for Efficient Inference of Large Language Models