1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs

Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei
2024-10-23
Abstract: Recent advances in 1-bit Large Language Models (LLMs), such as BitNet and BitNet b1.58, present a promising approach to enhancing the efficiency of LLMs in terms of speed and energy consumption. These developments also enable local LLM deployment across a broad range of devices. In this work, we introduce bitnet.cpp, a tailored software stack designed to unlock the full potential of 1-bit LLMs. Specifically, we develop a set of kernels to support fast and lossless inference of ternary BitNet b1.58 LLMs on CPUs. Extensive experiments demonstrate that bitnet.cpp achieves significant speedups, ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs, across various model sizes. The code is available at https://github.com/microsoft/BitNet.
Computation and Language
What problem does this paper attempt to address?