Abstract: Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from *capability* to *availability*, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce **BitStack**, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. Our approach iteratively decomposes weight matrices while considering the significance of each parameter, resulting in an approximately 1-bit per parameter residual block in each decomposition iteration. These blocks are sorted and stacked in storage as basic transmission units, with different quantities loaded based on current memory availability. Extensive experiments across a wide range of tasks demonstrate that, despite offering fine-grained size control, BitStack consistently matches or surpasses strong quantization baselines, particularly at extreme compression ratios. To the best of our knowledge, this is the first decomposition-based method that effectively bridges the gap to practical compression techniques like quantization. Code is available at https://github.com/xinghaow99/BitStack.
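
As a rough illustration of the loading scheme the abstract describes, the sketch below picks how many residual blocks to load for a given memory budget. It assumes the blocks are already sorted by importance and that loading always takes a prefix of that order. The names `ResidualBlock` and `blocks_to_load` and the 1 MiB block size are hypothetical and are not taken from the BitStack code base.

```python
from dataclasses import dataclass

MIB = 2**20  # bytes per mebibyte


@dataclass(frozen=True)
class ResidualBlock:
    """One stored residual block (hypothetical structure for illustration)."""
    name: str        # illustrative identifier only
    size_bytes: int  # storage footprint of the block


def blocks_to_load(sorted_blocks, budget_bytes):
    """Return the longest prefix of the sorted stack that fits in the budget.

    Assumes `sorted_blocks` is ordered from most to least important, so a
    prefix is always the best choice for a given budget.
    """
    loaded, used = [], 0
    for block in sorted_blocks:
        if used + block.size_bytes > budget_bytes:
            break  # stop at the first block that no longer fits
        loaded.append(block)
        used += block.size_bytes
    return loaded, used


if __name__ == "__main__":
    # Six illustrative blocks of 1 MiB each, already sorted by importance.
    stack = [ResidualBlock(f"block{i}", MIB) for i in range(6)]
    for budget_mib in (1.5, 3.5, 10):
        loaded, used = blocks_to_load(stack, int(budget_mib * MIB))
        print(budget_mib, [b.name for b in loaded], used // MIB)
```

Because only a prefix is ever resident, shrinking the budget means dropping blocks from the tail and growing it means reading the next blocks from storage, which keeps the transfer between memory and storage small.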

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses the memory limits that constrain deploying large language models (LLMs) on local devices. Although scaling laws have enhanced the capabilities of LLMs, the main bottleneck has shifted from capability to availability, especially in environments with limited and variable memory. Traditional compression methods such as quantization usually require a predefined compression ratio, and each setting requires a separate compression run, which complicates deployment when available memory changes.

To solve this, the paper proposes BitStack, a novel, training-free weight compression method that trades memory usage against model performance at megabyte granularity. By leveraging weight decomposition, BitStack adjusts the model size dynamically while keeping transfers between running memory and storage devices to a minimum. Experiments across a wide range of tasks show that, despite offering fine-grained size control, BitStack matches or surpasses strong quantization baselines, particularly at extreme compression ratios.

### Main Contributions

1. **Identifying Challenges**: The paper highlights the challenges of deploying LLMs in variable memory environments, which existing model compression methods cannot handle.
2. **Proposing BitStack**: It proposes a training-free, decomposition-based weight compression method, BitStack, that lets modern LLMs trade memory for performance at the megabyte level.
3. **Experimental Validation**: Extensive experiments on Llama 2, Llama 3, and Llama 3.1 models (ranging from 7/8B to 70B parameters) demonstrate that BitStack can match or surpass widely adopted quantization baselines such as GPTQ and AWQ at extreme compression ratios.

### Method Overview

The core idea of BitStack is to achieve fine-grained memory management through iterative decomposition of weight matrices. The main steps are:

1. **Weight Decomposition**: Singular value decomposition (SVD) factors the weight matrices into smaller matrices, retaining only the most important components.
2. **Activation-Aware Decomposition**: To account for the variance across activation channels, the weight matrices are scaled by a row vector before decomposition, which reduces the compression error.
3. **Iterative Absolute Value Decomposition**: In each iteration, the weight matrix (or the remaining residual) is split into its sign matrix and its absolute-value matrix, which preserves more information.
4. **Residual Block Sorting**: Perplexity on a small calibration set measures the impact of each residual block on overall performance, and this determines the loading order.

### Experimental Results

1. **Performance under Extreme Compression Ratios**: BitStack performs especially well at extreme compression ratios. On the 7/8B models it significantly outperforms GPTQ and AWQ at the 2-bit and 3-bit levels.
2. **Performance under Low Compression Ratios**: At lower compression ratios, BitStack remains comparable to the quantization baselines. On larger models such as the 70B model, it surpasses the baselines without group quantization at all compression ratios.
3. **Evaluation on Instruction-Tuned Models**: On instruction-tuned models (such as Llama 3.1 Instruct 8B and 70B), BitStack generates reasonable answers at different compression ratios, while AWQ often fails to generate coherent text at high compression ratios.

In summary, BitStack provides an effective solution for deploying LLMs efficiently in environments with limited and variable memory resources.
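
The iterative absolute value decomposition from the method overview can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it omits the activation-aware scaling and the block sorting, stores the sign matrix as floats rather than packed bits, and uses hypothetical names (`decompose_once`, `iterative_decompose`, `reconstruct`) and hypothetical settings for the rank `k` and the number of blocks.

```python
import numpy as np


def decompose_once(w, k):
    """Split w into a sign matrix and a rank-k SVD approximation of |w|."""
    sign = np.where(w >= 0, 1.0, -1.0)  # about 1 bit per parameter if packed
    u, s, vt = np.linalg.svd(np.abs(w), full_matrices=False)
    u_k = u[:, :k] * s[:k]  # fold the top-k singular values into U
    vt_k = vt[:k, :]
    return sign, u_k, vt_k


def iterative_decompose(w, n_blocks, k):
    """Repeatedly decompose the residual, producing one block per iteration."""
    residual = np.array(w, dtype=np.float64)
    blocks = []
    for _ in range(n_blocks):
        sign, u_k, vt_k = decompose_once(residual, k)
        blocks.append((sign, u_k, vt_k))
        # Subtract this block's contribution; the rest goes to the next pass.
        residual = residual - sign * (u_k @ vt_k)
    return blocks


def reconstruct(blocks, n_loaded, shape):
    """Approximate the weight matrix from the first n_loaded blocks."""
    out = np.zeros(shape)
    for sign, u_k, vt_k in blocks[:n_loaded]:
        out += sign * (u_k @ vt_k)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 64))  # stand-in for a real weight matrix
    blocks = iterative_decompose(w, n_blocks=4, k=4)
    for n in range(len(blocks) + 1):
        err = np.linalg.norm(w - reconstruct(blocks, n, w.shape))
        print(n, err / np.linalg.norm(w))
```

In this sketch, loading more blocks never increases the reconstruction error, because each pass removes a best rank-`k` approximation of the residual's absolute values. That property is what lets the number of loaded blocks act as the size control.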

BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

Foundations of Large Language Model Compression -- Part 1: Weight Quantization

On the Compressibility of Quantized Large Language Models

OneBit: Towards Extremely Low-bit Large Language Models

Direct Quantized Training of Language Models with Stochastic Rounding

Aggressive Post-Training Compression on Extremely Large Language Models

LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment

SqueezeLLM: Dense-and-Sparse Quantization

Compressing Large Language Models by Joint Sparsification and Quantization

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

Compensate Quantization Errors: Make Weights Hierarchical to Compensate Each Other

LCQ: Low-Rank Codebook based Quantization for Large Language Models

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?