Abstract: Recent research on 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, enabling 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid quantization and sparsification strategy to mitigate the quantization errors introduced by the outlier channels. Specifically, we utilize 4-bit activations for inputs to the attention and feed-forward network layers, while sparsifying intermediate states followed by 8-bit quantization. Extensive experiments demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 with equivalent training costs, while being faster in inference by enabling 4-bit (INT4/FP4) kernels. Additionally, BitNet a4.8 activates only 55% of parameters and supports 3-bit KV cache, further enhancing the efficiency of large-scale LLM deployment and inference.

What problem does this paper attempt to address?

The main problem this paper addresses is reducing the computational cost of large language models (LLMs) at inference time while maintaining their performance. Specifically, the paper introduces BitNet a4.8, which uses a hybrid quantization and sparsification strategy to let 1-bit LLMs run with 4-bit activations, further reducing the computational budget and improving inference efficiency.

### Main Problems:

1. **Reducing Inference Cost**: Although current 1-bit LLMs can match the performance of full-precision models with the same number of parameters and training data, they still face high computational costs during inference.
2. **Handling Outliers**: Outlier dimensions that emerge in activations during training make low-precision or sparse activations prone to significant quantization errors and degraded performance on downstream tasks.
3. **Improving Sparsity and Quantization Efficiency**: Combining sparsification and quantization reduces computational bottlenecks and improves the model's inference efficiency.

### Solutions:

- **4-bit Activations**: Use 4-bit activations for the inputs to the attention and feed-forward network layers.
- **Sparsification and 8-bit Quantization**: Sparsify intermediate states, then quantize them to 8 bits.
- **Hybrid Quantization Strategy**: Based on an analysis of the activation distributions of 1-bit LLMs, selectively apply 4-bit quantization or sparsification to mitigate quantization errors caused by outliers.
- **Two-Stage Training**: Transition from 8-bit activations to 4-bit activations in two stages, requiring only a small number of training tokens to adapt to low-precision activations.

### Experimental Results:

- **Comparable Performance**: BitNet a4.8 achieves performance comparable to BitNet b1.58 at the same training cost.
- **Improved Inference Efficiency**: BitNet a4.8 is faster during inference because it can use 4-bit (INT4/FP4) kernels, activates only 55% of parameters, and supports a 3-bit KV cache, further improving the deployment and inference efficiency of large-scale LLMs.

In summary, BitNet a4.8 addresses the high inference cost and outlier handling issues of 1-bit LLMs, improving inference efficiency while keeping performance comparable to BitNet b1.58.
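The hybrid strategy summarized above can be illustrated with a small sketch. It is not the paper's implementation: the summary does not specify the quantizer or the sparsification rule, so the code assumes a symmetric per-vector absmax quantizer and magnitude-based top-K sparsification. The function names (`quantize_absmax`, `topk_sparsify`), the sample vectors, and the `keep_ratio` of 0.5 are all hypothetical and chosen only for illustration.

```python
def quantize_absmax(values, bits):
    """Symmetric absmax quantization to signed `bits`-bit integer codes.

    Assumption: one scale per vector; the paper's exact quantizer may differ.
    """
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit, 127 for 8-bit
    qmin = -(2 ** (bits - 1))       # -8 for 4-bit, -128 for 8-bit
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0.0:
        return [0] * len(values), 1.0
    scale = peak / qmax
    codes = [max(qmin, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def topk_sparsify(values, keep_ratio):
    """Keep the largest-magnitude entries and zero out the rest (assumed rule)."""
    k = max(1, int(len(values) * keep_ratio))
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


# Hypothetical layer input: no strong outliers, so 4-bit codes are used directly.
layer_input = [0.12, -0.40, 0.05, 0.33, -0.21, 0.08, -0.02, 0.27]
input_codes, input_scale = quantize_absmax(layer_input, bits=4)

# Hypothetical intermediate state with two outlier channels:
# sparsify first, then quantize the surviving entries to 8 bits.
intermediate = [0.01, -0.03, 6.50, 0.02, -0.04, 0.00, -4.20, 0.05]
sparse = topk_sparsify(intermediate, keep_ratio=0.5)
inter_codes, inter_scale = quantize_absmax(sparse, bits=8)

print(input_codes)
print(sparse)
print(inter_codes)
```

The design point the sketch tries to convey is that a single large outlier inflates the absmax scale, which leaves very few 4-bit levels for the remaining values. Zeroing small entries and using 8 bits for outlier-heavy intermediate states keeps more resolution where it matters, while the better-behaved layer inputs can use 4 bits.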

BitNet a4.8: 4-bit Activations for 1-bit LLMs
