BitNet: Scaling 1-bit Transformers for Large Language Models

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei
2023-10-18
Abstract: The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the difficulty of deploying large language models, in particular their high energy consumption and memory bandwidth requirements. To tackle these problems, the authors propose BitNet, a scalable and stable 1-bit Transformer architecture for large language models. The main contributions are:

1. **BitNet Architecture**: A new component, BitLinear, replaces the nn.Linear layer so that 1-bit weights can be trained from scratch. BitNet combines binary weights with quantized activations while keeping optimizer states and gradients in high precision to preserve stability and accuracy during training.
2. **Quantization Method**: Weights are centered so that their mean is zero before binarization, and a scaling factor is used to reduce the l2 error between the real-valued and the quantized weights. Activations are quantized with the absmax method, which scales activation values into a specified range.
3. **Computational Efficiency**: The paper analyzes the energy consumption of BitNet's arithmetic operations and its memory usage, highlighting a significant reduction in the energy spent on multiplication operations compared to conventional Transformers.
4. **Training Stability**: The straight-through estimator handles the non-differentiability of the quantization function, and a mixed-precision training strategy is adopted: weights and activations are quantized to low precision, while gradients and optimizer states remain in high precision. The paper also notes that larger learning rates help accelerate optimization.
5. **Performance Evaluation**: Experimental results show that BitNet achieves performance competitive with full-precision Transformers on language modeling while significantly reducing memory usage and energy consumption. BitNet also follows a scaling law similar to that of full-precision Transformers, indicating that it can scale to larger model sizes while retaining these efficiency and performance benefits.

In summary, the paper aims to address the memory and energy challenges of deploying large language models by introducing BitNet, which improves energy efficiency while keeping performance competitive with full-precision baselines.
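
To make the quantization method described above more concrete, here is a minimal forward-pass sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`binarize_weights`, `absmax_quantize`, `bit_linear`) are hypothetical, the weight scale is assumed to be the mean absolute value of the weights, activations are assumed to be rounded to 8-bit integers, and any normalization applied before quantization is omitted. Training-time details such as the straight-through estimator and the high-precision latent weights are not shown.

```python
def binarize_weights(weights):
    """Center the weights at zero mean and keep only their sign.

    `weights` is a list of rows, one row per output feature.
    Returns the {-1, +1} matrix and a single scaling factor.
    """
    flat = [w for row in weights for w in row]
    mean = sum(flat) / len(flat)
    # Assumption: the scaling factor is the mean absolute weight.
    beta = sum(abs(w) for w in flat) / len(flat)
    signs = [[1 if w - mean > 0 else -1 for w in row] for row in weights]
    return signs, beta


def absmax_quantize(x, bits=8):
    """Scale activations by their absolute maximum into a signed integer range."""
    q_b = 2 ** (bits - 1)
    gamma = max(abs(v) for v in x)
    if gamma == 0:
        # All-zero input: nothing to scale.
        return [0] * len(x), 1.0
    scale = q_b / gamma
    # Assumption: round to integers, then clip to [-q_b, q_b - 1].
    quantized = [max(-q_b, min(q_b - 1, round(v * scale))) for v in x]
    return quantized, gamma


def bit_linear(x, weights, bits=8):
    """Forward pass of a BitLinear-style layer (sketch, no bias)."""
    signs, beta = binarize_weights(weights)
    quantized, gamma = absmax_quantize(x, bits)
    q_b = 2 ** (bits - 1)
    # With weights in {-1, +1}, each product is just an add or a subtract.
    acc = [sum(s * q for s, q in zip(row, quantized)) for row in signs]
    # Rescale the integer accumulator back to the real-valued range.
    return [a * beta * gamma / q_b for a in acc]


weights = [[0.5, -0.3], [-0.2, 0.4]]
x = [1.0, -2.0]
y = bit_linear(x, weights)
print(y)
```

For this toy input the sketch returns roughly 1.05 and -1.05, whereas the full-precision product is 1.1 and -1.0. The inner accumulation uses only integer additions and subtractions, which is the property behind the energy savings on multiplication operations noted in the summary.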