Optimizing Distributed Training on Frontier for Large Language Models

Sajal Dash, Isaac Lyngaas, Junqi Yin, Xiao Wang, Romain Egele, Guojing Cong, Feiyi Wang, Prasanna Balaprakash
2023-12-22
Abstract: Large language models (LLMs) have demonstrated remarkable success as foundational models, benefiting various downstream applications through fine-tuning. Recent studies on loss scaling have demonstrated the superior performance of larger LLMs compared to their smaller counterparts. Nevertheless, training LLMs with billions of parameters poses significant challenges and requires considerable computational resources. For example, training a one trillion parameter GPT-style model on 20 trillion tokens requires a staggering 120 million exaflops of computation. This research explores efficient distributed training strategies to extract this computation from Frontier, the world's first exascale supercomputer dedicated to open science. We enable and investigate various model and data parallel training techniques, such as tensor parallelism, pipeline parallelism, and sharded data parallelism, to facilitate training a trillion-parameter model on Frontier. We empirically assess these techniques and their associated parameters to determine their impact on memory footprint, communication latency, and GPU computational efficiency. We analyze the complex interplay among these techniques and find a strategy to combine them to achieve high throughput through hyperparameter tuning. We have identified efficient strategies for training large LLMs of varying sizes through empirical analysis and hyperparameter tuning. For 22 Billion, 175 Billion, and 1 Trillion parameters, we achieved GPU throughputs of $38.38\%$, $36.14\%$, and $31.96\%$, respectively. For the training of the 175 Billion parameter model and the 1 Trillion parameter model, we achieved $100\%$ weak scaling efficiency on 1024 and 3072 MI250X GPUs, respectively. We also achieved strong scaling efficiencies of $89\%$ and $87\%$ for these two models.
Distributed, Parallel, and Cluster Computing; Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to optimize distributed training of large language models (LLMs) on the Frontier supercomputer. LLMs have become foundational models that benefit many downstream tasks through fine-tuning, but training models with billions of parameters requires substantial computational resources. The researchers explore strategies for efficient distributed training on Frontier, the world's first exascale supercomputer dedicated to open science. They enable and study several model-parallel and data-parallel techniques, such as tensor parallelism, pipeline parallelism, and sharded data parallelism, to make training a trillion-parameter model feasible on Frontier. They evaluate how these techniques and their parameters affect memory footprint, communication latency, and GPU computational efficiency, analyze how the techniques interact, and use hyperparameter tuning to find combinations that achieve high throughput. For models with 22 billion, 175 billion, and 1 trillion parameters, they report GPU throughputs of 38.38%, 36.14%, and 31.96%, respectively. For the 175 billion and 1 trillion parameter models, they report 100% weak scaling efficiency on 1024 and 3072 MI250X GPUs, respectively, and strong scaling efficiencies of 89% and 87%. The main contributions are enabling distributed training algorithms and frameworks on AMD hardware through the ROCm software platform, and developing optimized distributed training strategies through hyperparameter search that manage the GPU memory wall and communication latency when training LLMs with billions to trillions of parameters.
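The weak and strong scaling figures quoted above can be read against the standard textbook definitions of scaling efficiency. The sketch below assumes those common definitions; the summary does not state the exact formulas the authors used, and the function names and timings are hypothetical, chosen only to illustrate the arithmetic.

```python
def strong_scaling_efficiency(base_gpus, base_time, gpus, time):
    """Strong scaling: the total problem size is fixed.

    With perfect scaling, the step time shrinks in proportion to the
    GPU count, so the ideal time is base_time * base_gpus / gpus.
    """
    ideal_time = base_time * base_gpus / gpus
    return ideal_time / time


def weak_scaling_efficiency(base_time, time):
    """Weak scaling: the problem size grows with the GPU count.

    With perfect scaling, the step time stays constant.
    """
    return base_time / time


# Hypothetical timings (seconds per step), not measurements from the paper.
strong = strong_scaling_efficiency(base_gpus=512, base_time=10.0, gpus=1024, time=5.5)
weak = weak_scaling_efficiency(base_time=10.0, time=10.0)
print(f"strong: {strong:.1%}, weak: {weak:.1%}")
```

Under these definitions, 100% weak scaling efficiency means the per-step time did not grow as GPUs and workload were increased together, while a strong scaling efficiency below 100% means that adding GPUs to a fixed workload gave less than a proportional speedup.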
Additionally, the research explores specific methods for optimizing these algorithms and frameworks on the AMD GPU architecture, balancing the trade-offs among computation, memory, and communication to improve training efficiency and model accuracy.
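The abstract's figure of 120 million exaflops for a one trillion parameter model trained on 20 trillion tokens can be reproduced with a back-of-the-envelope calculation. The sketch below assumes the widely used approximation of about 6 floating-point operations per parameter per token, which is not stated in the text above, and reads "exaflop" as a count of 10^18 operations. The helper names and the per-GPU peak value in the usage lines are hypothetical placeholders, not hardware specifications from the source.

```python
EXA = 10**18  # one exaflop, taken here as 10^18 floating-point operations


def training_flops(n_params, n_tokens, flops_per_param_token=6):
    """Approximate total training compute.

    Assumption: about 6 operations per parameter per token
    (forward plus backward pass), a common rule of thumb.
    """
    return flops_per_param_token * n_params * n_tokens


def training_days(total_flops, n_gpus, peak_flops_per_gpu, utilization):
    """Wall-clock days at a sustained fraction of peak throughput."""
    sustained_flops_per_second = n_gpus * peak_flops_per_gpu * utilization
    return total_flops / sustained_flops_per_second / 86_400


# One trillion parameters, 20 trillion tokens (figures from the abstract).
total = training_flops(10**12, 20 * 10**12)
print(total // EXA)  # 120000000, i.e. 120 million exaflops

# Hypothetical usage: 1e14 FLOP/s per GPU is a placeholder peak value.
days = training_days(total, n_gpus=3072, peak_flops_per_gpu=1e14, utilization=0.3196)
print(f"{days:.0f} days under these placeholder assumptions")
```

Separating the compute estimate from the time estimate keeps the assumed hardware numbers in one place, so a real peak rate and measured utilization can be substituted without touching the arithmetic.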