Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs

Jordan Dotzel, Yuzong Chen, Bahaa Kotb, Sushma Prasad, Gang Wu, Sheng Li, Mohamed S. Abdelfattah, Zhiru Zhang
2024-06-11
Abstract: The increasing size of large language models (LLMs) traditionally requires low-precision integer formats to meet strict latency and power demands. Yet recently, alternative formats such as Normal Float (NF4) have increased model accuracy at the cost of increased chip area. In this work, we first conduct a large-scale analysis of LLM weights and activations across 30 networks and conclude that most distributions follow a Student's t-distribution. We then derive a new theoretically optimal format, Student Float (SF4), that improves over NF4 across modern LLMs, for example increasing the average accuracy on LLaMA2-7B by 0.76% across tasks. Using this format as a high-accuracy reference, we then propose augmenting E2M1 with two variants of supernormal support for higher model accuracy. Finally, we explore the quality and efficiency frontier across 11 datatypes by evaluating their model accuracy and hardware complexity. We discover a Pareto curve composed of INT4, E2M1, and E2M1 with supernormal support, which offers a continuous tradeoff between model accuracy and chip area. For example, E2M1 with supernormal support increases the accuracy of Phi-2 by up to 2.19% with 1.22% area overhead, enabling more LLM-based applications to be run at four bits. The supporting code is hosted at https://github.com/cornell-zhang/llm-datatypes.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the trade-off between accuracy and efficiency in low-precision quantization of large language models (LLMs). Specifically:

1. **Limitations of Traditional Methods**:
   - Traditional low-precision integer formats meet strict latency and power requirements but fall short in model accuracy.
   - Recent floating-point-style formats such as Normal Float (NF4) improve model accuracy but require more chip area.
2. **Need for New Quantization Formats**:
   - From a large-scale analysis of the weight and activation distributions of 30 neural networks, the authors found that most distributions follow a Student's t-distribution.
   - Based on this finding, the authors derived a theoretically optimal new format, Student Float (SF4), and validated it on modern LLMs, where it achieves higher accuracy than NF4.
3. **Methods to Improve Accuracy**:
   - The authors proposed two methods to enhance the E2M1 format (super-range and super-precision) to further improve model accuracy.
   - By evaluating the model accuracy and hardware complexity of different data types, they plotted a quality-efficiency Pareto frontier, which provides continuous trade-off options for 4-bit quantization.

### Main Contributions

1. **Large-Scale Analysis**:
   - Analyzed the weight and activation distributions of 30 neural networks and found that most follow a Student's t-distribution.
2. **Derivation of New Format**:
   - Derived a theoretically optimal data type, Student Float (SF4), and validated its superiority in lookup-table quantization.
3. **Enhanced E2M1 Format**:
   - Proposed two methods to enhance the E2M1 and Additive Powers-of-Two (APoT) data types to improve model accuracy.
4. **Quality-Efficiency Frontier Curve**:
   - Plotted the Pareto frontier of different data types in terms of model accuracy and performance, comparing FP4 and INT4 and discussing several variants of FP4.

### Conclusion

Through these studies, the paper provides new data types and methods for 4-bit quantization that improve model accuracy while keeping hardware overhead low, supporting the efficient deployment of large language models.
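
To make the comparison between 4-bit grids concrete, the following is a minimal sketch of absmax-scaled, round-to-nearest lookup quantization. It is not the authors' code (see the linked repository for that). The E2M1 magnitudes are the standard values for a 1-sign, 2-exponent, 1-mantissa format. Everything else is an assumption made for illustration: the symmetric INT4 grid, per-tensor rather than per-block scaling, the choice of 5 degrees of freedom, and the helper names `quantize`, `sample_student_t`, and `mse`.

```python
import random

# Representable E2M1 magnitudes (1 sign, 2 exponent, 1 mantissa bit).
# +0 and -0 coincide, so the signed grid has 15 distinct values.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_GRID = [-m for m in reversed(E2M1_MAGNITUDES[1:])] + E2M1_MAGNITUDES

# Assumption: a symmetric INT4 grid (-7..7), so both grids have 15 levels.
INT4_GRID = [float(v) for v in range(-7, 8)]


def quantize(values, grid):
    """Scale values so the largest magnitude maps to the largest grid
    magnitude, round each to the nearest grid point, then rescale."""
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return list(values)
    scale = absmax / max(abs(g) for g in grid)
    return [min(grid, key=lambda g: abs(g - v / scale)) * scale for v in values]


def sample_student_t(n, dof, rng):
    """Draw n Student's t samples as Z / sqrt(V / dof), where Z is standard
    normal and V is chi-square(dof), i.e. Gamma(dof / 2, scale=2)."""
    return [
        rng.gauss(0.0, 1.0) / (rng.gammavariate(dof / 2.0, 2.0) / dof) ** 0.5
        for _ in range(n)
    ]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumption: dof=5.0 is an illustrative value, not taken from the paper.
    weights = sample_student_t(4096, dof=5.0, rng=rng)
    for name, grid in (("INT4", INT4_GRID), ("E2M1", E2M1_GRID)):
        print(name, mse(weights, quantize(weights, grid)))
```

Running the script prints the mean squared quantization error of each grid on synthetic heavy-tailed samples. This is only a proxy: the paper evaluates formats by downstream model accuracy and hardware cost, which this sketch does not measure.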