TernaryLLM: Ternarized Large Language Model

Tianqi Chen, Zhe Li, Weixiang Xu, Zeyu Zhu, Dong Li, Lu Tian, Emad Barsoum, Peisong Wang, Jian Cheng

2024-06-11

Abstract: Large language models (LLMs) have achieved remarkable performance on Natural Language Processing (NLP) tasks, but they are hindered by high computational costs and memory requirements. Ternarization, an extreme form of quantization, offers a solution by reducing memory usage and enabling energy-efficient floating-point additions. However, applying ternarization to LLMs faces challenges stemming from outliers in both weights and activations. In this work, observing asymmetric outliers and non-zero means in weights, we introduce Dual Learnable Ternarization (DLT), which enables both scales and shifts to be learnable. We also propose Outlier-Friendly Feature Knowledge Distillation (OFF) to recover the information lost in extremely low-bit quantization. The proposed OFF can incorporate semantic information and is insensitive to outliers. At the core of OFF is maximizing the mutual information between features in ternarized and floating-point models using cosine similarity. Extensive experiments demonstrate that our TernaryLLM surpasses previous low-bit quantization methods on the standard text generation and zero-shot benchmarks for different LLM families. Specifically, for one of the most powerful open-source models, LLaMA-3, our approach (W1.58A16) outperforms the previous state-of-the-art method (W2A16) by 5.8 in terms of perplexity on C4 and by 8.2% in terms of average accuracy on zero-shot tasks.
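The label W1.58A16 can be read as roughly 1.58 bits per weight with 16-bit activations. The abstract does not spell out where 1.58 comes from; the short derivation below is the usual information-theoretic reading, with $b$ as an illustrative symbol for bits per weight. A ternary weight takes one of three values, so it carries at most

```latex
b = \log_2 3 \approx 1.585 \ \text{bits per weight}, \qquad
\text{compared with } \log_2 4 = 2 \ \text{bits per weight for a W2 method.}
```

This count ignores the per-group scales and shifts, which add a small storage overhead that depends on the group size.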

Machine Learning

What problem does this paper attempt to address?

This paper addresses the high computational cost and memory requirements of large language models (LLMs), which hinder their deployment despite their remarkable performance on natural language processing tasks. It focuses on ternarization, which reduces the model's memory usage and enables energy-efficient floating-point additions. Directly applying ternarization to LLMs, however, faces two main challenges:

1. **Asymmetric, non-zero-mean weight distributions**: Within certain groups, LLM weights are clearly asymmetric and have a non-zero mean, which makes traditional symmetric ternarization less effective.
2. **Information loss from extremely low-bit quantization**: Extremely low-bit quantization destroys much of the information in a pre-trained LLM: the range of feature representations narrows, dominant channels lose their prominence, and the clustering of semantically related words breaks down.

To address these challenges, the paper proposes two methods:

- **Dual Learnable Ternarization (DLT)**: Makes both the scale and the shift of ternarization learnable, so that the quantizer can adapt to the asymmetric, non-zero-mean weight distributions in LLMs.
- **Outlier-Friendly Feature Knowledge Distillation (OFF)**: Recovers the semantic information of the pre-trained model by maximizing the mutual information between features of the ternarized model and the floating-point model, using cosine similarity because it is insensitive to outliers.

With these methods, TernaryLLM outperforms previous low-bit quantization methods on standard text generation and zero-shot benchmarks. On LLaMA-3 in particular, the W1.58A16 model improves on the previous state-of-the-art W2A16 method by 5.8 in perplexity on C4 and by 8.2% in average accuracy on zero-shot tasks.
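The two components described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the group-mean initialization of the shift, and the 0.7 threshold ratio (a heuristic borrowed from earlier ternary-weight work) are all assumptions made for illustration. In actual training the scale and shift would be learnable parameters updated by gradient descent; here they are only computed once from the weights.

```python
import numpy as np


def ternarize_with_shift(w, threshold_ratio=0.7, use_shift=True):
    """Approximate one weight group as w ~ scale * t + shift, with t in {-1, 0, +1}.

    Hypothetical helper: the initialization rules below are assumptions,
    not taken from the paper.
    """
    w = np.asarray(w, dtype=np.float64)
    # Shift starts at the group mean (assumption); 0.0 gives symmetric ternarization.
    shift = w.mean() if use_shift else 0.0
    centered = w - shift
    # Entries with small magnitude after centering are mapped to 0.
    delta = threshold_ratio * np.abs(centered).mean()
    t = np.sign(centered) * (np.abs(centered) > delta)
    nonzero = t != 0
    # Scale is the mean magnitude of the entries that were kept as +1 or -1.
    scale = np.abs(centered[nonzero]).mean() if nonzero.any() else 0.0
    return t.astype(np.int8), float(scale), float(shift)


def dequantize(t, scale, shift):
    """Reconstruct floating-point weights from ternary codes."""
    return scale * np.asarray(t, dtype=np.float64) + shift


def cosine_feature_loss(student, teacher, eps=1e-8):
    """1 - mean cosine similarity between feature vectors of shape (tokens, hidden)."""
    s = np.asarray(student, dtype=np.float64)
    f = np.asarray(teacher, dtype=np.float64)
    dot = (s * f).sum(axis=-1)
    norms = np.linalg.norm(s, axis=-1) * np.linalg.norm(f, axis=-1)
    return float(1.0 - (dot / (norms + eps)).mean())


# Usage: a synthetic weight group with a non-zero mean.
rng = np.random.default_rng(0)
w = rng.normal(loc=0.05, scale=0.02, size=128)

t, scale, shift = ternarize_with_shift(w)
err_shifted = np.mean((w - dequantize(t, scale, shift)) ** 2)

t0, scale0, shift0 = ternarize_with_shift(w, use_shift=False)
err_symmetric = np.mean((w - dequantize(t0, scale0, shift0)) ** 2)

print("MSE with shift:", err_shifted)
print("MSE without shift:", err_symmetric)
```

On a group like this, whose mean is far from zero, the shifted variant should reconstruct the weights with a lower error than the symmetric one. The cosine loss depends only on the direction of each feature vector, so multiplying a feature vector by a positive constant leaves it unchanged; that is one simple sense in which it is less sensitive to large-magnitude features than a squared-error loss.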

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

DB-LLM: Accurate Dual-Binarization for Efficient LLMs

Direct Quantized Training of Language Models with Stochastic Rounding

ARB-LLM: Alternating Refined Binarizations for Large Language Models

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization

One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments

LQER: Low-Rank Quantization Error Reconstruction for LLMs

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

Evaluating Quantized Large Language Models

OutlierTune: Efficient Channel-Wise Quantization for Large Language Models

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

SqueezeLLM: Dense-and-Sparse Quantization

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

Compensate Quantization Errors: Make Weights Hierarchical to Compensate Each Other

I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models

Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM