When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization

Jacob Nielsen, Lukas Galke, Peter Schneider-Kamp
2024-11-08
Abstract: Contemporary machine learning models, such as language models, are powerful, but come with immense resource requirements both at training and inference time. It has been shown that decoder-only language models can be trained to a competitive state with ternary weights (1.58 bits per weight), facilitating efficient inference. Here, we start our exploration with non-transformer model architectures, investigating 1.58-bit training for multi-layer perceptrons and graph neural networks. Then, we explore 1.58-bit training in other transformer-based language models, namely encoder-only and encoder-decoder models. Our results show that in all of these settings, 1.58-bit training is on par with or sometimes even better than the standard 32/16-bit models.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper explores the effectiveness and feasibility of using 1.58-bit (ternary) quantization training in different types of neural network models. Specifically, the paper addresses the following issues:

1. **Resource Demand Reduction**: Modern machine learning models, especially large language models (LLMs), are powerful but require substantial resources for training and inference. The paper investigates whether 1.58-bit quantization training can significantly reduce memory usage, lower latency, and increase throughput while maintaining model performance.
2. **Feasibility of Low-Precision Quantization Training**: Traditional post-training quantization methods can reduce resource demands but often at the cost of accuracy. The paper explores whether quantization-aware training can directly optimize quantized weights during training, thereby avoiding accuracy loss.
3. **Applicability to Different Model Architectures**: The paper studies 1.58-bit quantization training not only in non-transformer models (such as multilayer perceptrons and graph neural networks) but also extends to transformer models, including encoder-only models, decoder-only models, and encoder-decoder models. The goal is to verify the performance of 1.58-bit quantization training across various model architectures.
4. **Performance vs. Capacity Relationship**: The paper examines the performance variations of 1.58-bit quantization training with different hidden layer sizes and proposes a scaling law. This law suggests that in certain cases, increasing model capacity can compensate for the performance loss due to low precision.
5. **Regularization Effect**: The paper observes that 1.58-bit quantization training may have a regularization effect, preventing or at least delaying overfitting, and thus outperforming 16-bit models in some tasks.
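
To make "1.58-bit (ternary)" concrete, the following is a minimal sketch of one common way to map full-precision weights to the three values {-1, 0, +1}: scale by the mean absolute weight, round, and clip. The summary above does not spell out the paper's exact quantizer, so the absmean scaling, the function names `ternary_quantize` and `dequantize`, and the example weights are all assumptions for illustration. The name "1.58 bits" comes from the fact that a three-valued weight carries log2(3) ≈ 1.585 bits of information.

```python
import math

def ternary_quantize(weights, eps=1e-8):
    """Map a flat list of floats to {-1, 0, +1} plus one shared scale.

    Assumption: absmean scaling (divide by the mean absolute weight),
    then round to the nearest integer and clip to [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Approximate the original weights from ternary values and the scale."""
    return [q * scale for q in quantized]

# Information content of one three-valued weight: log2(3) bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

# Hypothetical example weights, chosen only for illustration.
weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
quantized, scale = ternary_quantize(weights)
approx = dequantize(quantized, scale)

print(quantized)                          # [1, 0, 1, -1, 0, -1]
print(round(BITS_PER_TERNARY_WEIGHT, 3))  # 1.585
```

In quantization-aware training, a step of this kind is typically applied in the forward pass while gradients update an underlying higher-precision copy of the weights; the details of how the paper does this are not given in this summary.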
### Main Contributions

- **Comprehensive Exploration from Simple Tasks to Complex Models**: The paper starts with the classic X-OR problem and gradually extends to multilayer perceptrons, graph neural networks, and transformer models, demonstrating the application of 1.58-bit quantization training in different models.
- **Performance Comparison**: Experimental results show that in various model architectures, the performance of 1.58-bit quantization training is comparable to standard 16/32-bit models, and even better in some cases.
- **Regularization Effect**: The paper finds that 1.58-bit quantization training has a regularization effect, which can prevent overfitting to some extent, especially in large language models.
- **Adjustment of Model Capacity**: The paper proposes a scaling law that suggests increasing model capacity to compensate for the performance loss due to low precision, providing guidance for practical applications.

In summary, through systematic experiments and analysis, this paper demonstrates the effectiveness and feasibility of 1.58-bit quantization training in various model architectures, offering new insights into reducing the resource demands of machine learning models.
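
As a rough illustration of the resource argument, the sketch below compares how many bytes are needed to store a model's weights at different bit widths. It is back-of-the-envelope arithmetic only, not a measurement from the paper: the parameter count is hypothetical, `weight_storage_bytes` is a helper introduced here, and the 2-bit figure assumes a simple packing of four ternary weights per byte. Activations, scale factors, and any higher-precision weight copies kept during training are ignored, so the numbers mainly reflect stored weights at inference time.

```python
import math

def weight_storage_bytes(num_weights, bits_per_weight):
    """Bytes needed to store num_weights values at the given bit width."""
    return math.ceil(num_weights * bits_per_weight / 8)

num_weights = 1_000_000  # hypothetical model size, for illustration only

fp32 = weight_storage_bytes(num_weights, 32)
fp16 = weight_storage_bytes(num_weights, 16)
# Simple practical packing: 2 bits per ternary weight (4 weights per byte).
packed_2bit = weight_storage_bytes(num_weights, 2)
# Information-theoretic lower bound: log2(3) bits per ternary weight.
ideal_ternary = weight_storage_bytes(num_weights, math.log2(3))

print(fp32, fp16, packed_2bit)   # 4000000 2000000 250000
print(fp16 / packed_2bit)        # 8.0
```

Under these assumptions, 2-bit packing shrinks weight storage by a factor of 8 relative to 16-bit weights, and the log2(3) bound would allow a factor of roughly 10.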