Abstract: Transformer models have emerged as potent solutions to a wide array of multidisciplinary challenges. Yet their deployment is significantly hindered by extensive computational and memory requirements, which makes efficient distributed training methodologies a necessity. Prior research has examined the performance bottlenecks associated with distributed training, aiming to explain them and suggest optimization directions. However, such analyses often overlook three aspects unique to Transformer models: the specialized architecture, the dependency on various distributed strategies, and the requirement to balance computational and memory overhead. This paper aims to bridge this gap by offering a comprehensive examination of the performance bottlenecks inherent in distributed training of Transformer models, leveraging both theoretical analysis and empirical investigation. We propose an analytical framework tailored to these unique aspects of Transformers, facilitating a holistic evaluation of model architectures, distributed strategies, and resource consumption. Based on this framework, we conduct a comparative analysis of theoretical performance and systematically explore how various distributed training strategies fare in real-world scenarios. Most of the experimental results are well explained by the framework's analytical predictions. Notably, our findings suggest an advantage of pipeline parallelism over data parallelism for Transformer models. Moreover, we shed light on some unexpected outcomes, such as the potential for increased total memory overhead due to suboptimal model partitioning within pipeline parallelism. Additionally, we underscore the significance of communication block size and waiting time for further enhancing performance.

On the Performance and Memory Footprint of Distributed Training: An Empirical Study on Transformers
