Abstract: Large language models (LLMs) have demonstrated remarkable success as foundational models, benefiting various downstream applications through fine-tuning. Recent studies on loss scaling have demonstrated the superior performance of larger LLMs compared to their smaller counterparts. Nevertheless, training LLMs with billions of parameters poses significant challenges and requires considerable computational resources. For example, training a one-trillion-parameter GPT-style model on 20 trillion tokens requires a staggering 120 million exaflops of computation. This research explores efficient distributed training strategies to extract this computation from Frontier, the world's first exascale supercomputer dedicated to open science. We enable and investigate various model and data parallel training techniques, such as tensor parallelism, pipeline parallelism, and sharded data parallelism, to facilitate training a trillion-parameter model on Frontier. We empirically assess these techniques and their associated parameters to determine their impact on memory footprint, communication latency, and GPU computational efficiency. We analyze the complex interplay among these techniques and, through empirical analysis and hyperparameter tuning, identify strategies that combine them to achieve high throughput for LLMs of varying sizes. For 22 billion, 175 billion, and 1 trillion parameters, we achieved GPU throughputs of $38.38\%$, $36.14\%$, and $31.96\%$, respectively. For the training of the 175 billion parameter model and the 1 trillion parameter model, we achieved $100\%$ weak scaling efficiency on 1024 and 3072 MI250X GPUs, respectively. We also achieved strong scaling efficiencies of $89\%$ and $87\%$ for these two models.
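
The compute figure quoted in the abstract can be sanity-checked with a short calculation. The sketch below assumes the widely used approximation of about 6 floating-point operations per parameter per training token; the abstract does not state how its number was derived, so this rule and the helper name `training_flop` are illustrative assumptions only.

```python
# Rough training-compute estimate using the common 6 * N * D approximation.
# Assumption: the abstract does not say which formula produced its figure.

EXA = 1e18  # one exaFLOP expressed in floating-point operations


def training_flop(num_params: float, num_tokens: float) -> float:
    """Approximate total floating-point operations for one training run."""
    return 6.0 * num_params * num_tokens


# Values taken from the abstract: 1 trillion parameters, 20 trillion tokens.
total = training_flop(num_params=1e12, num_tokens=20e12)
exaflop_millions = total / EXA / 1e6

print(f"{total:.2e} FLOP, about {exaflop_millions:.0f} million exaFLOP")
```

Under this assumption the estimate comes to roughly 120 million exaFLOP, which matches the figure in the abstract.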

What problem does this paper attempt to address?

This paper addresses how to optimize the distributed training of large language models (LLMs) on the Frontier supercomputer. LLMs have become foundational models thanks to their success on downstream tasks, especially through fine-tuning, but training models with billions of parameters requires substantial computational resources. The researchers therefore explore strategies for efficient distributed training on Frontier, the world's first exascale supercomputer dedicated to open science.

The study enables and investigates model-parallel and data-parallel techniques, such as tensor parallelism, pipeline parallelism, and sharded data parallelism, to make training a trillion-parameter model on Frontier feasible. It evaluates how these techniques and their parameters affect memory footprint, communication latency, and GPU computational efficiency. It also analyzes how the techniques interact and uses hyperparameter tuning to find combinations that achieve high throughput.

The results show GPU throughputs of 38.38%, 36.14%, and 31.96% for models with 22 billion, 175 billion, and 1 trillion parameters, respectively. The 175 billion and 1 trillion parameter models reached 100% weak scaling efficiency on 1024 and 3072 MI250X GPUs, respectively, and strong scaling efficiencies of 89% and 87%.

The main contributions of the paper include enabling distributed training algorithms and frameworks on AMD hardware through the ROCm software platform, and developing optimized distributed training strategies through hyperparameter search that manage the GPU memory wall and communication latency when training LLMs with billions to trillions of parameters. The research also explores specific methods for tuning these tools on the AMD GPU architecture to balance the trade-offs among computation, memory, and communication, improving training efficiency and model accuracy.
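
The weak and strong scaling efficiencies reported above can be read through their conventional definitions. The sketch below uses those textbook definitions; the summary does not spell out the paper's exact measurement procedure, and the function names and sample numbers are hypothetical, chosen only to illustrate the arithmetic.

```python
# Conventional scaling-efficiency definitions (assumed, not taken from the paper).

def weak_scaling_efficiency(base_per_gpu_rate: float, scaled_per_gpu_rate: float) -> float:
    """Weak scaling: the workload grows with the GPU count, so the ideal
    outcome is an unchanged per-GPU processing rate."""
    return scaled_per_gpu_rate / base_per_gpu_rate


def strong_scaling_efficiency(base_gpus: int, base_time: float,
                              gpus: int, time: float) -> float:
    """Strong scaling: the workload is fixed, so the ideal step time
    shrinks in proportion to the number of GPUs added."""
    ideal_time = base_time * base_gpus / gpus
    return ideal_time / time


# Hypothetical measurements, for illustration only.
weak = weak_scaling_efficiency(base_per_gpu_rate=40.0, scaled_per_gpu_rate=40.0)
strong = strong_scaling_efficiency(base_gpus=512, base_time=100.0,
                                   gpus=1024, time=56.0)

print(f"weak scaling efficiency:   {weak:.0%}")
print(f"strong scaling efficiency: {strong:.0%}")
```

With these made-up inputs, an unchanged per-GPU rate gives 100% weak scaling efficiency, while a step time of 56 s against an ideal of 50 s gives a strong scaling efficiency just under 90%.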

Optimizing Distributed Training on Frontier for Large Language Models
