Abstract: We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens -- a 20-fold increase over previous limits. Through these extended training runs, we uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations. We trace these instabilities to outlier amplification by the SwiGLU activation function. Interestingly, we show, both analytically and empirically, that this amplification happens only over prolonged training periods, and link it to a SwiGLU weight alignment process. To address this newly identified issue, we introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering function behavior. We also demonstrate, for the first time, FP8 quantization of both Adam optimizer moments. Combining these innovations, we successfully train a 7B parameter model using FP8 precision on 256 Intel Gaudi2 accelerators, achieving on-par results with the BF16 baseline while delivering up to a $\sim 34 \%$ throughput improvement.
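
To make the outlier-amplification claim concrete, here is a minimal sketch in plain Python. It assumes the usual single-neuron form of SwiGLU, `(x·w1) * Swish(x·w2)`, and the idealized case where the two weight vectors are exactly aligned (`w1 == w2`). The weight and input values are made up for illustration and are not taken from the paper. Under that assumption a 10x larger pre-activation produces roughly a 100x larger output, whereas Swish alone grows roughly linearly.

```python
import math


def swish(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu_neuron(x, w1, w2):
    """Single SwiGLU neuron: (x . w1) * Swish(x . w2)."""
    linear_branch = sum(xi * wi for xi, wi in zip(x, w1))
    gate_branch = sum(xi * wi for xi, wi in zip(x, w2))
    return linear_branch * swish(gate_branch)


# Hypothetical weights and inputs, chosen only for illustration.
w = [0.5, -1.0, 2.0]
x_small = [1.0, 0.5, 1.0]             # pre-activation x . w = 2.0
x_large = [10.0 * v for v in x_small]  # pre-activation x . w = 20.0

# Swish alone: growth is close to linear in the pre-activation.
swish_ratio = swish(20.0) / swish(2.0)

# Aligned SwiGLU (w1 == w2): growth is close to quadratic.
aligned_ratio = swiglu_neuron(x_large, w, w) / swiglu_neuron(x_small, w, w)

print(f"Swish growth for a 10x larger input:          {swish_ratio:.1f}x")
print(f"Aligned SwiGLU growth for a 10x larger input: {aligned_ratio:.1f}x")
```

Quadratic growth of this kind is what makes a few large activations hard to represent in a format with as little dynamic range as FP8.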

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the stability and scalability of low-precision (FP8) training for large language models (LLMs). Specifically, the authors train LLMs in FP8 precision, for the first time, on datasets of up to 2 trillion tokens, 20 times more than previous studies (up to 100 billion tokens). These long training runs expose instabilities in FP8 training that were not apparent in earlier, shorter studies.

### Main Findings and Contributions

1. **Key Instabilities**:
   - Over long training runs, FP8 training becomes severely unstable, primarily because the SwiGLU activation function amplifies outliers.
   - Through analysis and experiments, the authors show that this outlier amplification becomes significant only after prolonged training, and they link it to a SwiGLU weight alignment process.
2. **Smooth-SwiGLU**:
   - To address this newly identified issue, the authors introduce Smooth-SwiGLU, a modification of the SwiGLU activation function that reduces outlier amplification without altering the function's behavior.
   - This keeps FP8 training stable over long runs, making FP8 precision usable for large-scale LLM training.
3. **FP8 Quantization of the Adam Optimizer**:
   - The authors quantize both moments of the Adam optimizer to FP8 precision for the first time, further reducing memory usage during training.
   - The lower memory footprint makes large-scale LLM development more efficient.
4. **Experimental Results**:
   - Combining Smooth-SwiGLU with FP8 quantization of the optimizer moments, the authors train a 7 billion parameter model in FP8 precision on 256 Intel Gaudi2 accelerators, achieving results comparable to the BF16 baseline with up to a 34% throughput improvement.
   - In zero-shot evaluations, the FP8 model performs comparably to the BF16 baseline, supporting the effectiveness and practicality of the method.

### Summary

With these innovations, the authors overcome key instabilities in FP8 training and obtain throughput gains and memory savings in large-scale language model training. The results provide a reference point and a foundation for future large-scale low-precision training.
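
The summary above does not spell out how Smooth-SwiGLU works, so the following is only a sketch of one function-preserving approach: scale each channel before quantization and undo the scale afterwards. It is not presented as the authors' implementation. Everything here is an assumption made for illustration: the toy quantizer `fake_quantize` is a uniform grid standing in for an FP8 cast (it is not real FP8), the per-channel scale is taken as the inverse of the channel's maximum magnitude, and the activation values are invented.

```python
def fake_quantize(matrix, levels=127):
    """Toy symmetric per-tensor quantizer (a stand-in for an FP8 cast, not real FP8)."""
    max_abs = max(abs(v) for row in matrix for v in row) or 1.0
    step = max_abs / levels
    return [[round(v / step) * step for v in row] for row in matrix]


def smooth_fake_quantize(matrix, levels=127):
    """Scale each channel to unit range, quantize, then undo the scaling.

    Without the quantizer, (h * s) / s equals h up to floating-point rounding,
    so the scaling only changes what the quantizer sees.
    """
    n_channels = len(matrix[0])
    scales = []
    for j in range(n_channels):
        channel_max = max(abs(row[j]) for row in matrix) or 1.0
        scales.append(1.0 / channel_max)
    scaled = [[v * s for v, s in zip(row, scales)] for row in matrix]
    quantized = fake_quantize(scaled, levels)
    return [[q / s for q, s in zip(row, scales)] for row in quantized]


def max_error_per_channel(reference, approx):
    """Largest absolute error in each channel (column)."""
    n_channels = len(reference[0])
    return [
        max(abs(r[j] - a[j]) for r, a in zip(reference, approx))
        for j in range(n_channels)
    ]


# Hypothetical activations: rows are tokens, columns are channels.
# The last channel carries outliers that dominate a per-tensor scale.
h = [
    [0.02, 0.50, 300.0],
    [0.01, -0.25, 150.0],
    [-0.03, 0.75, -75.0],
]

plain_err = max_error_per_channel(h, fake_quantize(h))
smooth_err = max_error_per_channel(h, smooth_fake_quantize(h))

print("per-channel max error, plain per-tensor quantization:", plain_err)
print("per-channel max error, with per-channel smoothing:   ", smooth_err)
```

In this toy setting the outlier channel forces a coarse grid that rounds the small channels to zero, while per-channel scaling keeps them representable. In a real network the inverse scale could presumably be folded into the following linear layer, but that detail is an assumption here as well.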

Scaling FP8 training to trillion-token LLMs
