Abstract: Large language models (LLMs) have demonstrated remarkable success as foundational models, benefiting various downstream applications through fine-tuning. Recent studies on loss scaling have shown that larger LLMs outperform their smaller counterparts. Nevertheless, training LLMs with billions of parameters poses significant challenges and requires considerable computational resources. For example, training a one-trillion-parameter GPT-style model on 20 trillion tokens requires a staggering 120 million exaflops of computation. This research explores efficient distributed training strategies to extract this computation from Frontier, the world's first exascale supercomputer dedicated to open science. We enable and investigate various model and data parallel training techniques, such as tensor parallelism, pipeline parallelism, and sharded data parallelism, to facilitate training a trillion-parameter model on Frontier. We empirically assess these techniques and their associated parameters to determine their impact on memory footprint, communication latency, and GPU computational efficiency. We analyze the complex interplay among these techniques and, through empirical analysis and hyperparameter tuning, identify strategies that combine them to achieve high throughput when training LLMs of varying sizes. For models with 22 billion, 175 billion, and 1 trillion parameters, we achieved GPU throughputs of $38.38\%$, $36.14\%$, and $31.96\%$, respectively. For the 175-billion-parameter and 1-trillion-parameter models, we achieved $100\%$ weak scaling efficiency on 1024 and 3072 MI250X GPUs, respectively. We also achieved strong scaling efficiencies of $89\%$ and $87\%$ for these two models.
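The abstract does not say how the figure of 120 million exaflops is derived. One plausible reading, sketched below, uses the common rule of thumb of roughly 6 floating-point operations per parameter per training token for dense transformer training. That factor of 6, and the names `training_flops`, `EXA`, and `total_exaflop`, are assumptions made for illustration; they are not taken from the source.

```python
# Rough estimate of total training compute for a dense GPT-style model.
# Assumption (not stated in the abstract): about 6 floating-point
# operations per parameter per token (forward plus backward pass).

EXA = 1e18  # one exaFLOP = 10^18 floating-point operations


def training_flops(n_params: float, n_tokens: float,
                   flops_per_param_token: float = 6.0) -> float:
    """Return the approximate total floating-point operations for training."""
    return flops_per_param_token * n_params * n_tokens


# Figures quoted in the abstract: 1 trillion parameters, 20 trillion tokens.
total = training_flops(n_params=1e12, n_tokens=20e12)
total_exaflop = total / EXA

print(f"{total_exaflop:.3g} exaFLOP of total compute")
```

Under this assumption the estimate comes to about 1.2 x 10^8 exaFLOP, which agrees with the 120 million quoted in the abstract. The count is a total amount of work, not a rate.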

Evaluation of pre-training large language models on leadership-class supercomputers

Optimizing Distributed Training on Frontier for Large Language Models

Comparative Study of Large Language Model Architectures on Frontier

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Distributed Training of Large Language Models

From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference

Efficient Large-Scale Language Model Training on GPU Clusters

Large Language Models as Data Preprocessors

vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training

The Landscape and Challenges of HPC Research and LLMs

FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing

Large Language Models (LLMs): Deployment, Tokenomics and Sustainability

Inference Performance Optimization for Large Language Models on CPUs

HPC-GPT: Integrating Large Language Model for High-Performance Computing

Characterization of Large Language Model Development in the Datacenter

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches

An Efficient 2D Method for Training Super-Large Deep Learning Models