BitNet a4.8: 4-bit Activations for 1-bit LLMs

Hongyu Wang, Shuming Ma, Furu Wei
2024-11-08
Abstract: Recent research on 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, enabling 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid quantization and sparsification strategy to mitigate the quantization errors introduced by the outlier channels. Specifically, we utilize 4-bit activations for inputs to the attention and feed-forward network layers, while sparsifying intermediate states followed by 8-bit quantization. Extensive experiments demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 with equivalent training costs, while being faster in inference by enabling 4-bit (INT4/FP4) kernels. Additionally, BitNet a4.8 activates only 55% of parameters and supports 3-bit KV cache, further enhancing the efficiency of large-scale LLM deployment and inference.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to reduce the computational cost of large language models (LLMs) during inference while maintaining their performance. It introduces BitNet a4.8, which uses a hybrid quantization and sparsification strategy so that 1-bit LLMs can run with 4-bit activations, further reducing the compute budget and improving inference efficiency.

### Main Problems:

1. **Reducing Inference Cost**: Although current 1-bit LLMs can match the performance of full-precision models with the same number of parameters and training data, they still face high computational costs during inference.
2. **Handling Outliers**: Activations develop outlier dimensions during training, so quantizing them to low precision or sparsifying them can introduce significant quantization errors and degrade performance on downstream tasks.
3. **Improving Sparsity and Quantization Efficiency**: Combining sparsification with quantization reduces computational bottlenecks and improves the model's inference efficiency.

### Solutions:

- **4-bit Activations**: Use 4-bit activations for the inputs to the attention and feed-forward network layers.
- **Sparsification and 8-bit Quantization**: Sparsify the intermediate states, then quantize them to 8 bits.
- **Hybrid Quantization Strategy**: Based on an analysis of the activation distribution of 1-bit LLMs, selectively apply 4-bit quantization or sparsification to mitigate the quantization errors caused by outliers.
- **Two-Stage Training**: Transition from 8-bit activations to 4-bit activations in two stages, requiring only a small number of training tokens to adapt to low-precision activations.

### Experimental Results:

- **Comparable Performance**: BitNet a4.8 achieves performance comparable to BitNet b1.58 at the same training cost.
- **Improved Inference Efficiency**: BitNet a4.8 is faster during inference because it can use 4-bit (INT4/FP4) kernels, activates only 55% of parameters, and supports a 3-bit KV cache, further improving the efficiency of large-scale LLM deployment and inference.

In summary, BitNet a4.8 targets the inference cost and outlier-handling issues of 1-bit LLMs, improving inference efficiency while keeping performance comparable to BitNet b1.58.
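The two activation paths described above (4-bit quantization of layer inputs, and sparsification followed by 8-bit quantization of intermediate states) can be illustrated with a small sketch. This is not the paper's implementation: the summary does not specify the quantizer, so the symmetric absmax scaling, the top-k magnitude sparsifier, the helper names, the toy activation vector, and the 0.5 keep ratio are all assumptions chosen for illustration. The keep ratio in particular is unrelated to the 55% active-parameter figure.

```python
from typing import List, Tuple


def quantize_symmetric(x: List[float], bits: int) -> Tuple[List[int], float]:
    """Quantize one activation vector to signed `bits`-bit integers.

    Assumption: symmetric absmax scaling (hypothetical choice, not from the source).
    """
    qmax = 2 ** (bits - 1) - 1        # 7 for INT4, 127 for INT8
    qmin = -(2 ** (bits - 1))         # -8 for INT4, -128 for INT8
    absmax = max(abs(v) for v in x)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(qmin, min(qmax, round(v / scale))) for v in x]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map quantized integers back to approximate real values."""
    return [v * scale for v in q]


def topk_sparsify(x: List[float], keep_ratio: float) -> List[float]:
    """Keep the largest-magnitude entries and zero out the rest.

    Assumption: top-k by absolute value (hypothetical choice, not from the source).
    """
    k = max(1, int(len(x) * keep_ratio))
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


# Toy activation vector with one outlier channel (index 3).
acts = [0.1, -0.2, 0.05, 8.0, -0.15, 0.3, -0.02, 0.25]

# Path 1: direct 4-bit quantization. The outlier sets the scale, so every
# small value rounds to zero, which is the error the summary warns about.
q4, s4 = quantize_symmetric(acts, bits=4)

# Path 2: sparsify first, then quantize the survivors to 8 bits.
sparse = topk_sparsify(acts, keep_ratio=0.5)
q8, s8 = quantize_symmetric(sparse, bits=8)

print("INT4:", q4)
print("sparse + INT8:", q8)
```

With an outlier present, the 4-bit path loses all small-magnitude entries, while the sparsified 8-bit path keeps the largest entries with finer resolution. That contrast is the motivation for choosing quantization or sparsification per layer according to its activation distribution.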