DB-LLM: Accurate Dual-Binarization for Efficient LLMs

Hong Chen, Chengtao Lv, Liang Ding, Haotong Qin, Xiabin Zhou, Yifu Ding, Xuebo Liu, Min Zhang, Jinyang Guo, Xianglong Liu, Dacheng Tao
DOI: https://doi.org/10.18653/v1/2024.findings-acl.516
2024-01-01
Abstract: Large language models (LLMs) have significantly advanced the field of natural language processing, but their expensive memory and computation consumption impedes practical deployment. Quantization has emerged as one of the most effective methods for improving the computational efficiency of LLMs. However, existing ultra-low-bit quantization always causes severe accuracy drops. In this paper, we empirically reveal the micro and macro characteristics of ultra-low-bit quantization and present a novel Dual-Binarization method for LLMs, namely DB-LLM. At the micro level, we take both the accuracy advantage of 2-bit width and the efficiency advantage of binarization into account, introducing Flexible Dual Binarization (FDB). By splitting 2-bit quantized weights into two independent sets of binaries, FDB ensures the accuracy of representations and introduces flexibility, utilizing the efficient bitwise operations of binarization while retaining the inherent high sparsity of ultra-low-bit quantization. At the macro level, we find that the predictions of an LLM are distorted after quantization, and that this distortion takes the form of deviations related to the ambiguity of samples. We propose the Deviation-Aware Distillation (DAD) method, enabling the model to focus differently on various samples. Comprehensive experiments show that DB-LLM not only significantly surpasses the current state-of-the-art (SoTA) in ultra-low-bit quantization (e.g., perplexity decreased from 9.64 to 7.23), but also achieves an additional 20% reduction in computational consumption compared to the SoTA method under the same bit-width. Our code will be released soon.
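The abstract's idea of splitting 2-bit quantized weights into two sets of binaries can be illustrated with a plain bit-plane decomposition. The sketch below is not the paper's FDB implementation: it assumes an asymmetric uniform 2-bit quantizer and binaries in {0, 1}, and every function name (`quantize_2bit`, `split_into_binaries`, `reconstruct`) is hypothetical. It only shows that a 2-bit code can be written as two binary matrices, each with its own scale, without changing the represented values.

```python
import numpy as np

def quantize_2bit(w):
    """Assumed quantizer: uniform asymmetric 2-bit, codes in {0, 1, 2, 3}."""
    offset = w.min()
    scale = (w.max() - offset) / 3.0
    codes = np.clip(np.round((w - offset) / scale), 0, 3).astype(np.uint8)
    return codes, scale, offset

def split_into_binaries(codes):
    """Split each 2-bit code into a high bit and a low bit (two binary matrices)."""
    b_hi = (codes >> 1) & 1
    b_lo = codes & 1
    return b_hi, b_lo

def reconstruct(b_hi, b_lo, alpha_hi, alpha_lo, offset):
    """Dequantize from the two binary matrices, each with its own scale."""
    return alpha_hi * b_hi + alpha_lo * b_lo + offset

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))          # toy float64 weight matrix

codes, scale, offset = quantize_2bit(w)
b_hi, b_lo = split_into_binaries(codes)

w_2bit = scale * codes + offset      # ordinary 2-bit dequantization
# With alpha_hi = 2 * scale and alpha_lo = scale the split is exact.
w_dual = reconstruct(b_hi, b_lo, 2 * scale, scale, offset)

print(np.allclose(w_2bit, w_dual))
```

In this sketch the two scales start at `2 * scale` and `scale`, which reproduces the 2-bit grid exactly. Treating them as two independent parameters that could be tuned separately is one possible reading of the "flexibility" the abstract mentions, and is an assumption here rather than something the abstract specifies.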