TernaryLLM: Ternarized Large Language Model

Tianqi Chen, Zhe Li, Weixiang Xu, Zeyu Zhu, Dong Li, Lu Tian, Emad Barsoum, Peisong Wang, Jian Cheng
2024-06-11
Abstract: Large language models (LLMs) have achieved remarkable performance on Natural Language Processing (NLP) tasks, but they are hindered by high computational costs and memory requirements. Ternarization, an extreme form of quantization, offers a solution by reducing memory usage and enabling energy-efficient floating-point additions. However, applying ternarization to LLMs faces challenges stemming from outliers in both weights and activations. In this work, observing asymmetric outliers and non-zero means in weights, we introduce Dual Learnable Ternarization (DLT), which enables both scales and shifts to be learnable. We also propose Outlier-Friendly Feature Knowledge Distillation (OFF) to recover the information lost in extremely low-bit quantization. The proposed OFF can incorporate semantic information and is insensitive to outliers. At the core of OFF is maximizing the mutual information between features in ternarized and floating-point models using cosine similarity. Extensive experiments demonstrate that our TernaryLLM surpasses previous low-bit quantization methods on the standard text generation and zero-shot benchmarks for different LLM families. Specifically, for one of the most powerful open-source models, LLaMA-3, our approach (W1.58A16) outperforms the previous state-of-the-art method (W2A16) by 5.8 in terms of perplexity on C4 and by 8.2% in terms of average accuracy on zero-shot tasks.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the high computational cost and memory requirements that accompany the strong performance of large language models (LLMs) on natural language processing tasks. Specifically, it studies how ternarization can reduce a model's memory usage and enable energy-efficient floating-point additions, making deployment more efficient. Applying ternarization directly to LLMs, however, faces two main challenges:

1. **Asymmetric, non-zero-mean weight distributions**: Within certain groups, LLM weights are clearly asymmetric and have a non-zero mean, which makes conventional symmetric ternarization less effective.
2. **Information loss from extremely low-bit quantization**: Extremely low-bit quantization destroys much of the information in a pre-trained LLM. The feature representation range narrows, dominant channels lose their prominence, and the clustering of semantically related words breaks down.

To address these challenges, the paper proposes two methods:

- **Dual Learnable Ternarization (DLT)**: Makes both the scale and the shift of ternarization learnable, so the quantizer can adapt to the asymmetric, non-zero-mean weight distributions found in LLMs.
- **Outlier-Friendly Feature Knowledge Distillation (OFF)**: Recovers the semantic information of the pre-trained model by maximizing the mutual information between features of the ternarized and floating-point models, using cosine similarity because it is insensitive to outliers.

With these methods, TernaryLLM outperforms previous low-bit quantization methods on standard text generation and zero-shot benchmarks. On LLaMA-3 in particular, the proposed approach (W1.58A16) beats the previous state-of-the-art method (W2A16) by 5.8 in perplexity on C4 and by 8.2% in average accuracy on zero-shot tasks.
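
The two ideas above can be illustrated with a small NumPy sketch. It is not the paper's implementation, and several details are assumptions made for illustration: the dequantized weight is taken to be `alpha * t + gamma` with `t` in {-1, 0, +1}, the shift `gamma` is initialized to the group mean, the threshold uses a 0.7 × mean-absolute-value heuristic, and the distillation term is simplified to one minus the mean cosine similarity. In actual training, `alpha` and `gamma` would be learnable parameters updated by gradient descent; here they are only initialized. The function names are hypothetical.

```python
import numpy as np


def ternarize_group(w, threshold_ratio=0.7):
    """Ternarize one weight group as alpha * t + gamma, t in {-1, 0, +1}.

    gamma (shift) and alpha (scale) are only *initialized* here; a learnable
    variant would treat both as trainable parameters (assumption).
    """
    gamma = w.mean()                                   # shift absorbs the non-zero mean
    centered = w - gamma
    delta = threshold_ratio * np.abs(centered).mean()  # threshold heuristic (assumption)
    t = np.where(centered > delta, 1,
                 np.where(centered < -delta, -1, 0)).astype(np.int8)
    mask = t != 0
    # Scale: mean magnitude of the weights that were not zeroed out.
    alpha = np.abs(centered[mask]).mean() if mask.any() else 0.0
    return t, alpha, gamma


def dequantize(t, alpha, gamma):
    """Reconstruct approximate weights from the ternary codes."""
    return alpha * t + gamma


def cosine_feature_loss(student, teacher, eps=1e-8):
    """1 - mean cosine similarity between matching feature rows.

    A simplified stand-in for a cosine-based feature distillation term:
    it depends only on feature direction, not magnitude.
    """
    num = (student * teacher).sum(axis=-1)
    den = np.linalg.norm(student, axis=-1) * np.linalg.norm(teacher, axis=-1) + eps
    return float(1.0 - (num / den).mean())


# Usage: a weight group with a clearly non-zero mean.
rng = np.random.default_rng(0)
w = rng.normal(loc=0.5, scale=1.0, size=128)
t, alpha, gamma = ternarize_group(w)
w_hat = dequantize(t, alpha, gamma)
print("codes used:", np.unique(t))
print("reconstruction error:", np.sum((w - w_hat) ** 2))

# Features from a hypothetical floating-point (teacher) and ternarized (student) layer.
teacher = rng.normal(size=(4, 16))
student = 10.0 * teacher  # same directions, much larger magnitudes
print("cosine loss:", cosine_feature_loss(student, teacher))
```

Because the loss depends only on direction, rescaling a feature vector leaves it unchanged, which is one way to read the claim that cosine similarity is insensitive to outliers. A squared-error loss on the same pair of features would instead be dominated by the large-magnitude entries.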