Abstract: The increasing size of large language models (LLMs) traditionally requires low-precision integer formats to meet strict latency and power demands. Yet recently, alternative formats such as Normal Float (NF4) have increased model accuracy at the cost of increased chip area. In this work, we first conduct a large-scale analysis of LLM weights and activations across 30 networks and conclude that most distributions follow a Student's t-distribution. We then derive a new theoretically optimal format, Student Float (SF4), that improves over NF4 across modern LLMs, for example increasing the average accuracy on LLaMA2-7B by 0.76% across tasks. Using this format as a high-accuracy reference, we then propose augmenting E2M1 with two variants of supernormal support for higher model accuracy. Finally, we explore the quality and efficiency frontier across 11 datatypes by evaluating their model accuracy and hardware complexity. We discover a Pareto curve composed of INT4, E2M1, and E2M1 with supernormal support, which offers a continuous tradeoff between model accuracy and chip area. For example, E2M1 with supernormal support increases the accuracy of Phi-2 by up to 2.19% with 1.22% area overhead, enabling more LLM-based applications to be run at four bits. The supporting code is hosted at https://github.com/cornell-zhang/llm-datatypes.
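
For readers unfamiliar with E2M1, the sketch below decodes the 16 code points of a 4-bit float with 1 sign bit, 2 exponent bits, and 1 mantissa bit, then rounds a few values to the nearest representable level. It assumes the common convention of exponent bias 1, subnormals, and no Inf/NaN; the helper names `decode_e2m1` and `quantize` are hypothetical and are not taken from the paper's repository. It does not implement the supernormal variants described in the abstract.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (assumed layout: s | ee | m, exponent bias 1)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal range: magnitude is 0 or 0.5.
        mag = man * 0.5
    else:
        # Normal range: implicit leading 1, one fractional mantissa bit.
        mag = (1.0 + man * 0.5) * 2.0 ** (exp - 1)
    return sign * mag


# +0 and -0 compare equal, so 16 codes yield only 15 distinct values.
E2M1_VALUES = sorted({decode_e2m1(c) for c in range(16)})


def quantize(values, levels):
    """Absmax-scale `values` onto `levels`, then round to the nearest level."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / max(abs(l) for l in levels)
    return [min(levels, key=lambda l: abs(v / scale - l)) * scale for v in values]


if __name__ == "__main__":
    print(E2M1_VALUES)
    print(quantize([6.0, 0.4, -2.6], E2M1_VALUES))
```

The levels are denser near zero than the uniformly spaced INT4 grid, which is one intuition for why floating-point formats can suit bell-shaped weight distributions.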

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the trade-off between accuracy and efficiency that arises in low-precision quantization of large language models (LLMs). Specifically:

1. **Limitations of Traditional Methods**:
   - Traditional low-precision integer formats meet strict latency and power requirements but fall short on model accuracy.
   - Recent alternative formats such as Normal Float (NF4) improve model accuracy but require more chip area.
2. **Need for New Quantization Formats**:
   - By conducting a large-scale analysis of the weight and activation distributions of 30 networks, the authors found that most distributions follow a Student's t-distribution.
   - Based on this finding, the authors derived a theoretically optimal new format, Student Float (SF4), and validated it on modern LLMs, demonstrating higher accuracy than NF4.
3. **Methods to Improve Accuracy**:
   - The authors proposed two variants of supernormal support for the E2M1 format (super-range and super-precision) to further improve model accuracy.
   - By evaluating the model accuracy and hardware complexity of 11 datatypes, they identified a quality-efficiency Pareto frontier that provides a continuous trade-off for 4-bit quantization.

### Main Contributions

1. **Large-Scale Analysis**:
   - Conducted a large-scale analysis of the weight and activation distributions of 30 networks, finding that most distributions follow a Student's t-distribution.
2. **Derivation of New Format**:
   - Derived a theoretically optimal datatype, Student Float (SF4), and validated its superiority in lookup table quantization.
3. **Enhanced E2M1 Format**:
   - Proposed two methods to enhance the E2M1 and Additive Powers-of-Two (APoT) datatypes to improve model accuracy.
4. **Quality-Efficiency Frontier Curve**:
   - Plotted the Pareto frontier of different datatypes in terms of model accuracy and hardware complexity, comparing FP4 and INT4 and discussing several variants of FP4.

### Conclusion

Through these studies, the paper provides new datatypes and methods for 4-bit quantization that improve model accuracy while keeping hardware overhead low, supporting the efficient deployment of large language models.
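
The idea of building a lookup-table format from a fitted distribution can be sketched as follows. This is a simplified illustration, not the paper's exact SF4 construction: it places 16 levels at evenly spaced quantiles of a Student's t-distribution and normalizes them to [-1, 1]. The degrees of freedom `nu=5.0` are a hypothetical choice, the function names are invented for this example, and the quantile function is computed numerically with the standard library only (Simpson's rule plus bisection).

```python
import math


def t_pdf(x: float, nu: float) -> float:
    """Density of Student's t with `nu` degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(x * x / nu))


def t_cdf(x: float, nu: float, steps: int = 400) -> float:
    """CDF via Simpson's rule on [0, |x|], using symmetry around zero."""
    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps  # `steps` must be even for Simpson's rule
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def t_ppf(p: float, nu: float) -> float:
    """Quantile function by bisection on the CDF (valid for moderate p)."""
    lo, hi = -100.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def quantile_levels(bits: int = 4, nu: float = 5.0):
    """Levels at evenly spaced t-quantiles, normalized to [-1, 1].

    Assumption: midpoint probabilities (i + 0.5) / 2**bits. NF4-style
    formats use a different arrangement that includes an exact zero.
    """
    n = 2 ** bits
    qs = [t_ppf((i + 0.5) / n, nu) for i in range(n)]
    scale = max(abs(q) for q in qs)
    return [q / scale for q in qs]


if __name__ == "__main__":
    for level in quantile_levels():
        print(f"{level:+.4f}")
```

Because the t-distribution has heavier tails than the normal distribution, its outer quantiles sit farther from the center, so after normalization the inner levels cluster more tightly around zero than they would for a normal-based table.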

Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs

Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format

New Solutions on LLM Acceleration, Optimization, and Application

Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

TernaryLLM: Ternarized Large Language Model

Demystifying Platform Requirements for Diverse LLM Inference Use Cases

DB-LLM: Accurate Dual-Binarization for Efficient LLMs

LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods

LLM Stability: A detailed analysis with some surprises

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

LLM Processes: Numerical Predictive Distributions Conditioned on Natural Language

AFPQ: Asymmetric Floating Point Quantization for LLMs

FP8-LM: Training FP8 Large Language Models

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs