Scaling FP8 training to trillion-token LLMs

Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry
2024-09-19
Abstract: We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens -- a 20-fold increase over previous limits. Through these extended training runs, we uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations. We trace these instabilities to outlier amplification by the SwiGLU activation function. Interestingly, we show, both analytically and empirically, that this amplification happens only over prolonged training periods, and link it to a SwiGLU weight alignment process. To address this newly identified issue, we introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering function behavior. We also demonstrate, for the first time, FP8 quantization of both Adam optimizer moments. Combining these innovations, we successfully train a 7B parameter model using FP8 precision on 256 Intel Gaudi2 accelerators, achieving on-par results with the BF16 baseline while delivering up to a $\sim 34\%$ throughput improvement.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the stability and scalability of low-precision (FP8) training for large language models (LLMs). The authors train LLMs in FP8, for the first time, on datasets of up to 2 trillion tokens, 20 times more than in previous studies (up to 100 billion tokens). These long training runs expose instabilities in FP8 training that were not apparent in earlier, shorter studies.

### Main Findings and Contributions

1. **Key Instabilities**:
   - Over long training runs, FP8 training becomes severely unstable. The authors trace the instability to outlier amplification by the SwiGLU activation function.
   - Analysis and experiments show that this amplification becomes significant only after prolonged training, and the authors link it to a SwiGLU weight alignment process.
2. **Smooth-SwiGLU**:
   - To address this newly identified issue, the authors introduce Smooth-SwiGLU, a modification of the SwiGLU activation function that reduces outlier amplification without altering the function's behavior.
   - This modification keeps FP8 training stable over long runs, which makes FP8 precision usable for large-scale LLM training.
3. **FP8 Quantization of the Adam Optimizer**:
   - The authors quantize both Adam optimizer moments to FP8 for the first time, further reducing memory usage during training.
   - The smaller optimizer state lowers the memory cost of training large models.
4. **Experimental Results**:
   - Combining Smooth-SwiGLU with FP8 quantization of the optimizer moments, the authors train a 7-billion-parameter model in FP8 on 256 Intel Gaudi2 accelerators, matching the BF16 baseline while delivering up to a 34% throughput improvement.
   - In zero-shot evaluations, the FP8 model performs comparably to the BF16 baseline.

### Summary

The authors identify and resolve a key instability in long-duration FP8 training, and they obtain both a throughput gain (up to about 34%) and a smaller optimizer memory footprint when training large language models. These results provide a reference point for future large-scale low-precision training.
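
The outlier mechanism and the smoothing idea summarized above can be illustrated with a small sketch. It is not the authors' implementation. It assumes the standard per-channel SwiGLU form, $(w_1^\top x)\,\mathrm{Swish}(w_2^\top x)$, and uses a hypothetical uniform per-tensor quantizer (`fake_quant`) as a stand-in for FP8. The per-channel scale, chosen here as the reciprocal of each channel's magnitude, is likewise an assumption made for illustration. When the two weight vectors of a channel are aligned and large, the output grows roughly quadratically because $\mathrm{Swish}(z) \approx z$ for large $z$. That single outlier then sets the quantization range and smaller channels round to zero. Scaling each channel before quantization and undoing the scale afterwards leaves the function unchanged while keeping the small channels representable.

```python
import math

def swish(z):
    """Swish (SiLU): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu(x, w1, w2):
    """Per-channel SwiGLU: (w1_i . x) * swish(w2_i . x)."""
    out = []
    for row1, row2 in zip(w1, w2):
        linear = sum(a * b for a, b in zip(row1, x))
        gate = sum(a * b for a, b in zip(row2, x))
        out.append(linear * swish(gate))
    return out

def fake_quant(values, levels=127):
    """Hypothetical stand-in quantizer (NOT real FP8): a uniform grid
    whose range is set by the largest magnitude in the tensor."""
    amax = max(abs(v) for v in values) or 1.0
    step = amax / levels
    return [round(v / step) * step for v in values]

def smooth_fake_quant(values, scales):
    """Scale each channel, quantize, then undo the scale.
    Without the quantizer this is exactly the identity."""
    scaled = [s * v for s, v in zip(scales, values)]
    quantized = fake_quant(scaled)
    return [q / s for s, q in zip(scales, quantized)]

x = [1.0, 2.0]
# Channel 0: w1 and w2 are aligned and large, so the output is roughly 30**2.
# Channel 1: small, unaligned weights.
w1 = [[10.0, 10.0], [0.5, 0.1]]
w2 = [[10.0, 10.0], [0.2, 0.1]]

h = swiglu(x, w1, w2)
# Assumed scale choice: normalize each channel to unit magnitude.
scales = [1.0 / abs(v) for v in h]

plain = fake_quant(h)                  # outlier dominates the range
smooth = smooth_fake_quant(h, scales)  # per-channel scaling first

print("swiglu output:   ", h)
print("plain quantized: ", plain)
print("smooth quantized:", smooth)
```

In this toy setting the small channel is rounded to zero by the plain quantizer and is recovered when per-channel scaling is applied first. In a real training setup the scales would be computed over many tokens rather than a single sample, and the inverse scale could be folded into the following linear layer. Both points are assumptions about how such a scheme might be deployed.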