Abstract: Contemporary machine learning models, such as language models, are powerful, but come with immense resource requirements both at training and inference time. It has been shown that decoder-only language models can be trained to a competitive state with ternary weights (1.58 bits per weight), facilitating efficient inference. Here, we start our exploration with non-transformer model architectures, investigating 1.58-bit training for multi-layer perceptrons and graph neural networks. Then, we explore 1.58-bit training in other transformer-based language models, namely encoder-only and encoder-decoder models. Our results show that in all of these settings, 1.58-bit training is on par with or sometimes even better than the standard 32/16-bit models.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper explores the effectiveness and feasibility of 1.58-bit (ternary) quantization training in different types of neural network models. Specifically, the paper addresses the following issues:

1. **Resource Demand Reduction**: Modern machine learning models, especially large language models (LLMs), are powerful but require substantial resources for training and inference. The paper investigates whether 1.58-bit quantization training can significantly reduce memory usage, lower latency, and increase throughput while maintaining model performance.
2. **Feasibility of Low-Precision Quantization Training**: Traditional post-training quantization methods can reduce resource demands, but often at the cost of accuracy. The paper explores whether quantization-aware training, which optimizes the quantized weights directly during training, can avoid this accuracy loss.
3. **Applicability to Different Model Architectures**: The paper studies 1.58-bit quantization training in non-transformer models (multilayer perceptrons and graph neural networks) as well as in transformer models, including encoder-only, decoder-only, and encoder-decoder models. The goal is to verify the performance of 1.58-bit quantization training across various model architectures.
4. **Performance vs. Capacity Relationship**: The paper examines how the performance of 1.58-bit quantization training varies with hidden layer size and proposes a scaling law. This law suggests that in certain cases, increasing model capacity can compensate for the performance loss due to low precision.
5. **Regularization Effect**: The paper observes that 1.58-bit quantization training may have a regularization effect, preventing or at least delaying overfitting, and thus outperforming 16-bit models in some tasks.

### Main Contributions

- **Comprehensive Exploration from Simple Tasks to Complex Models**: The paper starts with the classic X-OR problem and gradually extends to multilayer perceptrons, graph neural networks, and transformer models, demonstrating the application of 1.58-bit quantization training in different models.
- **Performance Comparison**: Experimental results show that across various model architectures, the performance of 1.58-bit quantization training is comparable to standard 16/32-bit models, and even better in some cases.
- **Regularization Effect**: The paper finds that 1.58-bit quantization training has a regularization effect, which can prevent overfitting to some extent, especially in large language models.
- **Adjustment of Model Capacity**: The paper proposes a scaling law that suggests increasing model capacity to compensate for the performance loss due to low precision, providing guidance for practical applications.

In summary, through systematic experiments and analysis, this paper demonstrates the effectiveness and feasibility of 1.58-bit quantization training in various model architectures, offering new insights into reducing the resource demands of machine learning models.
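To make "1.58-bit (ternary)" concrete, the following is a minimal sketch of how a weight matrix can be mapped to the three values {-1, 0, +1}. It assumes the absmean scheme described for BitNet b1.58 (scale by the mean absolute weight, then round and clip); the summary above does not spell out the exact quantizer, so the function name `ternary_quantize`, the `eps` parameter, and the per-tensor scale are illustrative assumptions rather than the paper's implementation. The figure 1.58 is simply log2(3), the information content of a three-valued weight.

```python
import math
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Hypothetical helper: absmean ternary quantization (assumed scheme).

    Returns the ternary weights in {-1, 0, +1} and the per-tensor scale
    needed to map them back to the original magnitude.
    """
    scale = np.mean(np.abs(w)) + eps                  # per-tensor absmean scale
    w_ternary = np.clip(np.round(w / scale), -1, 1)   # round, then clip to [-1, 1]
    return w_ternary, scale

# Example: quantize a small random weight matrix.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)

w_q, scale = ternary_quantize(w)
w_hat = w_q * scale                # dequantized approximation of w

# Three possible values per weight carry log2(3) bits of information.
bits_per_weight = math.log2(3)
print(f"bits per weight: {bits_per_weight:.3f}")
print("distinct quantized values:", np.unique(w_q))
```

In quantization-aware training, a quantizer like this is typically applied in the forward pass while gradients update full-precision shadow weights, commonly via a straight-through estimator; that training loop is omitted here.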

When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization

BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

BitNet: Scaling 1-bit Transformers for Large Language Models

OneBit: Towards Extremely Low-bit Large Language Models

Bit Efficient Quantization for Deep Neural Networks

Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization

BitDelta: Your Fine-Tune May Only Be Worth One Bit

Direct Quantized Training of Language Models with Stochastic Rounding

Binarized Neural Machine Translation

Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation

Exploring Extreme Quantization in Spiking Language Models

Integer-Only CNNs with 4 Bit Weights and Bit-Shift Quantization Scales at Full-Precision Accuracy

Arbitrary Bit-width Network: A Joint Layer-Wise Quantization and Adaptive Inference Approach

Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens

Bit-Quantized-Net: an Effective Method for Compressing Deep Neural Networks.

Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?

A Comprehensive Evaluation of Quantization Strategies for Large Language Models