Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer

Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
2024-08-30
Abstract: Large Language Models (LLMs) with long context capabilities are integral to complex tasks in natural language processing and computational biology, such as text generation and protein sequence analysis. However, training LLMs directly on extremely long contexts demands considerable GPU resources and increased memory, leading to higher costs and greater complexity. Alternative approaches that introduce long context capabilities via downstream finetuning or adaptations impose significant design limitations. In this paper, we propose Fully Pipelined Distributed Transformer (FPDT) for efficiently training long-context LLMs with extreme hardware efficiency. For GPT and Llama models, we achieve a 16x increase in sequence length that can be trained on the same hardware compared to current state-of-the-art solutions. With our dedicated sequence chunk pipeline design, we can now train 8B LLM with 2 million sequence length on only 4 GPUs, while also maintaining over 55% of MFU. Our proposed FPDT is agnostic to existing training techniques and is proven to work efficiently across different LLM models.
Distributed, Parallel, and Cluster Computing, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the resource and cost challenges of training large language models (LLMs) on extremely long context sequences, in fields such as natural language processing (NLP) and computational biology. Specifically:

1. **High resource requirements**: Training LLMs directly on extremely long contexts requires a large amount of GPU compute and memory, which raises both cost and complexity.
2. **Limitations of existing methods**: Methods that add long-context capability through downstream fine-tuning or adaptation impose significant design limitations and cannot fully meet the needs of practical applications.
3. **Performance degradation**: When existing methods process extremely long contexts, model performance often degrades, reducing output quality.

To address these problems, the authors propose the Fully Pipelined Distributed Transformer (FPDT), a method for training long-context LLMs with extreme hardware efficiency.

### Main contributions

1. **Memory footprint analysis**: The paper presents an end-to-end analysis of the memory footprint of LLM training, identifies the memory peaks in common Transformer architectures, and eliminates redundant intermediate buffers in the forward and backward passes.
2. **Fully pipelined distributed Transformer design**: The design builds on DeepSpeed Ulysses and targets sequence lengths of millions of tokens. By combining GPU memory with host CPU memory and prefetching, it achieves training with almost zero overhead.
3. **Reduction of GPU memory occupation**: The GPU memory occupied by activations during training is significantly reduced. A dedicated double-buffering design overlaps almost all prefetching with computation.
4. **High performance**: As shown in Table 1 of the paper, the proposed method supports training an 8B LLM on 2M-token sequences using only 4 GPUs, or a 70B model on 4M-token sequences using 32 GPUs. This is 16 times longer than existing solutions, while achieving more than 55% MFU (Model FLOPs Utilization).
5. **Compatibility**: The method is orthogonal to, and can be combined with, the memory-optimization techniques of the DeepSpeed ZeRO series and PyTorch FSDP. It applies to Transformer models of any size, such as GPT and Llama.

### Related work

1. **Memory-efficient Transformer**: The paper reviews several memory-efficient attention mechanisms, including FlashAttention, low-rank approximation, kernel-based methods, and sparse attention. These methods reduce memory consumption while maintaining computational efficiency.
2. **Long-context training**: The paper reviews key contributions to handling long sequences in the Transformer architecture, including Megatron-SP, Blockwise Parallel Transformer, Ring Attention, and DeepSpeed Ulysses. Each addresses a different aspect of the memory limitations of the standard Transformer, but each also faces practical challenges, such as communication complexity and deployment problems on large-scale clusters.

### Design details

1. **Pipelining and scheduling**: The input tensor (the hidden state) is split into chunks along the sequence dimension, and the chunks are processed under a pipelining and scheduling strategy that lowers the memory peaks in the forward and backward passes.
2. **Double buffering**: Sequence chunks that are not currently in use are stored in host memory. Double buffering compensates for the mismatch between GPU compute throughput and PCIe link bandwidth, so that data transfer does not stall training.
3. **Experimental evaluation**: Experiments on multiple GPU nodes verify the effectiveness and performance advantages of FPDT for training on extremely long sequences.

In summary, FPDT reduces the resource and cost burden of training long-context LLMs and offers a new approach to long-sequence processing in complex tasks.
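
The double-buffering idea described under the design details can be illustrated with a small, CPU-only sketch. This is not the authors' implementation: `split_into_chunks`, `fetch_to_device`, `compute`, and `run_double_buffered` are hypothetical names, a plain Python list stands in for the hidden state, a list copy stands in for the host-to-GPU transfer over PCIe, and a background thread stands in for an asynchronous copy stream. The only point it tries to show is the schedule: while chunk `i` is being computed, chunk `i + 1` is already being fetched, so at most two chunks are resident at once.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(sequence, chunk_len):
    """Split a sequence into consecutive chunks of at most chunk_len items."""
    return [sequence[i:i + chunk_len] for i in range(0, len(sequence), chunk_len)]


def fetch_to_device(host_chunks, index):
    """Hypothetical stand-in for a host-to-GPU copy: returns a copy of one chunk."""
    return list(host_chunks[index])


def compute(chunk):
    """Hypothetical stand-in for per-chunk computation (squares each element)."""
    return [x * x for x in chunk]


def run_double_buffered(host_chunks):
    """Process chunks in order while prefetching the next one in the background.

    Returns (outputs, peak_resident), where peak_resident is the largest number
    of input chunks held or in flight on the "device" at the same time.
    """
    outputs = []
    peak_resident = 0
    if not host_chunks:
        return outputs, peak_resident

    # A single worker thread plays the role of the asynchronous copy engine.
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(fetch_to_device, host_chunks, 0)
        for i in range(len(host_chunks)):
            current = pending.result()  # wait until chunk i has arrived
            resident = 1
            if i + 1 < len(host_chunks):
                # Start fetching chunk i + 1 before computing on chunk i.
                pending = copier.submit(fetch_to_device, host_chunks, i + 1)
                resident = 2
            peak_resident = max(peak_resident, resident)
            outputs.append(compute(current))
            # `current` is rebound on the next iteration, freeing its slot.
    return outputs, peak_resident


if __name__ == "__main__":
    chunks = split_into_chunks(list(range(10)), chunk_len=4)
    results, peak = run_double_buffered(chunks)
    print(results, peak)
```

In a real GPU setting the copy would presumably use pinned host memory and a separate stream, and the outputs would themselves be offloaded rather than kept in a list; the sketch leaves those details out and only mirrors the two-slot schedule.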