FP8-LM: Training FP8 Large Language Models

Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, Peng Cheng
2023-12-19
Abstract: In this paper, we explore FP8 low-bit data formats for efficient training of large language models (LLMs). Our key insight is that most variables, such as gradients and optimizer states, in LLM training can employ low-precision data formats without compromising model accuracy and requiring no changes to hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision framework for training LLMs. This framework offers three levels of FP8 utilization to streamline mixed-precision and distributed parallel training for LLMs. It gradually incorporates 8-bit gradients, optimizer states, and distributed learning in an incremental manner. Experiment results show that, during the training of GPT-175B model on H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer Engine by 37%. This largely reduces the training costs for large foundation models. Furthermore, our FP8 mixed-precision training methodology is generic. It can be seamlessly applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, offering savings in fine-tuning expenses. Our FP8 low-precision training framework is open-sourced at https://github.com/Azure/MS-AMP (http://aka.ms/MS.AMP).
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper addresses the high cost of training large language models (LLMs), specifically the cost in compute, memory usage, and communication overhead. Its core contribution is a new mixed-precision framework that uses FP8 (an 8-bit floating-point format) to train large-scale language models efficiently. The researchers found that most variables in LLM training, such as gradients and optimizer states, can use low-precision data formats without affecting model accuracy and without any change to hyperparameters. Based on this insight, the paper proposes a new FP8 automatic mixed-precision framework with three levels of FP8 utilization, which simplify mixed-precision and distributed parallel training. The levels incrementally incorporate 8-bit gradients, optimizer states, and distributed learning into the training process. Experimental results show that, when training the GPT-175B model on the H100 GPU platform, the framework reduces real memory usage by 39%, runs 75% faster than the widely used BF16 framework (Megatron-LM), and is 37% faster than the Nvidia Transformer Engine. This substantially reduces the training cost of large foundation models. The FP8 mixed-precision training method is also generic: it can be applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, which lowers fine-tuning costs. The paper further reports that models trained with FP8 perform comparably to models trained with BF16, both in pre-training and on downstream tasks.
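To make the idea of storing gradients in 8 bits more concrete, here is a minimal sketch in pure Python. It is an illustration under stated assumptions, not the MS-AMP API and not necessarily the paper's exact method: it assumes the E4M3 variant of FP8 (3 mantissa bits, largest finite value 448, smallest normal exponent -6), saturation on overflow, and a simple per-tensor scaling factor derived from the largest magnitude in the tensor. The names round_to_e4m3, quantize_with_scale, and dequantize are hypothetical.

```python
import math

# Assumed FP8 E4M3 parameters (illustrative, not taken from the summary above).
E4M3_MAX = 448.0        # largest finite magnitude
MANTISSA_BITS = 3       # explicit mantissa bits
MIN_NORMAL_EXP = -6     # exponent of the smallest normal number


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on the assumed E4M3 grid (saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)              # saturate instead of overflowing
    _, e = math.frexp(mag)                   # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)         # clamp: subnormals share one step
    step = 2.0 ** (exp - MANTISSA_BITS)      # grid spacing in this binade
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_with_scale(values):
    """Scale a tensor so its largest magnitude maps to E4M3_MAX, then round."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3(v * scale) for v in values], scale


def dequantize(quantized, scale):
    """Undo the per-tensor scaling to recover approximate original values."""
    return [q / scale for q in quantized]


if __name__ == "__main__":
    grads = [1e-5, -3e-6, 2.5e-4]            # made-up small gradient values
    print("unscaled:", [round_to_e4m3(g) for g in grads])
    q, scale = quantize_with_scale(grads)
    print("scaled  :", dequantize(q, scale))
```

Cast directly, the small values in this example fall below the format's smallest representable magnitude and round to zero. With the scaling factor, each value that stays in the normal range is recovered to within a relative error of 2^-4 (6.25%), which is the round-to-nearest bound for 3 mantissa bits.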
In summary, the paper aims to reduce the cost of training large language models by adopting the FP8 low-precision data format for training, and it demonstrates the effectiveness and practicality of this approach.