FP8-LM: Training FP8 Large Language Models

Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, Peng Cheng

2023-12-19

Abstract: In this paper, we explore FP8 low-bit data formats for efficient training of large language models (LLMs). Our key insight is that most variables in LLM training, such as gradients and optimizer states, can employ low-precision data formats without compromising model accuracy or requiring changes to hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision framework for training LLMs. This framework offers three levels of FP8 utilization to streamline mixed-precision and distributed parallel training for LLMs. It incrementally incorporates 8-bit gradients, optimizer states, and distributed learning. Experimental results show that, when training a GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only achieved a 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer Engine by 37%. This substantially reduces the training costs for large foundation models. Furthermore, our FP8 mixed-precision training methodology is generic. It can be seamlessly applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, offering savings in fine-tuning expenses. Our FP8 low-precision training framework is open-sourced at https://github.com/Azure/MS-AMP (http://aka.ms/MS.AMP).

Machine Learning, Computation and Language

What problem does this paper attempt to address?

The paper targets the high cost of training large language models (LLMs), specifically the cost of compute, memory, and communication. Its core contribution is a mixed-precision framework that uses FP8, an 8-bit floating-point format, to train LLMs efficiently. The authors observe that most variables in LLM training, such as gradients and optimizer states, can use low-precision data formats without affecting model accuracy and without adjusting hyperparameters. Building on this observation, they propose an FP8 automatic mixed-precision framework with three levels of FP8 utilization that simplify mixed-precision and distributed parallel training. The levels incrementally introduce 8-bit gradients, optimizer states, and distributed learning. In experiments training the GPT-175B model on the H100 GPU platform, the framework reduces real memory usage by 39%, runs 75% faster than the widely adopted BF16 framework (i.e., Megatron-LM), and is 37% faster than Nvidia Transformer Engine, which substantially lowers the cost of training large foundation models. The method is also generic: it can be applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, which reduces fine-tuning costs. The research also shows that models trained with FP8 perform comparably to models trained with BF16 in both pre-training and downstream tasks.
In summary, the paper reduces the cost of training large language models by adopting the FP8 low-precision data format, and it demonstrates that this approach is effective and practical.
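
To make the idea of storing gradients in 8 bits more concrete, the sketch below simulates rounding to an FP8 grid in plain Python. It is an illustration only, not the paper's implementation or the MS-AMP API: the summary above does not describe how the framework scales tensors, so the per-tensor scaling shown here is a generic technique, and the choice of the E4M3 variant (4 exponent bits, 3 mantissa bits, largest finite value 448) is an assumption. All function names are hypothetical.

```python
import math

# Assumed format: FP8 E4M3 (4 exponent bits, 3 mantissa bits).
E4M3_MAX = 448.0         # largest finite magnitude
E4M3_MIN_EXP = -6        # exponent of the smallest normal value
E4M3_MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on a simulated E4M3 grid, saturating at the max."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)           # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)   # below 2**-6 the grid spacing stops shrinking
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)
    return math.copysign(round(mag / step) * step, x)


def quantize_with_scale(values):
    """Scale values so the largest magnitude maps to E4M3_MAX, then round each one."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3(v * scale) for v in values], scale


def dequantize(quantized, scale):
    """Undo the scaling to recover approximate original values."""
    return [q / scale for q in quantized]


# Hypothetical small gradient values, far below the smallest E4M3 step.
grads = [1e-5, 2e-5, -3e-5]
naive = [round_to_e4m3(g) for g in grads]      # direct rounding, no scaling
quantized, scale = quantize_with_scale(grads)
restored = dequantize(quantized, scale)
print("without scaling:", naive)
print("with scaling:   ", restored)
```

Without scaling, every value in this example rounds to zero because it is smaller than half the finest grid step. With a per-tensor scale, the values land in the representable range and are recovered with a relative error bounded by half a mantissa step. This is one common reason low-bit formats are paired with a scaling factor.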

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs

Scaling FP8 training to trillion-token LLMs

AFPQ: Asymmetric Floating Point Quantization for LLMs

FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models

To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability

Optimizing Distributed Training on Frontier for Large Language Models

Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer

ProTrain: Efficient LLM Training via Memory-Aware Techniques

Ascend HiFloat8 Format for Deep Learning

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching