One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments

Ke Yi, Yuhui Xu, Heng Chang, Chen Tang, Yuan Meng, Tong Zhang, Jia Li
2024-05-31
Abstract: Large Language Models (LLMs) have advanced rapidly but face significant memory demands. While quantization has shown promise for LLMs, current methods typically require lengthy training to alleviate the performance degradation from quantization loss. However, deploying LLMs across diverse scenarios with different resource constraints, e.g., servers and personal computers, requires repeated training per application, which amplifies the lengthy training problem. Given that, it is advantageous to train a once-for-all (OFA) supernet capable of yielding diverse optimal subnets for downstream applications through one-shot training. Nonetheless, the scale of current language models impedes efficiency and amplifies interference from weight sharing between subnets. We make an initial attempt to extend the once-for-all framework to large language models. Specifically, we decouple shared weights to eliminate the interference and incorporate Low-Rank adapters for training efficiency. Furthermore, we observe an imbalanced allocation of training resources under traditional uniform sampling. A non-parametric scheduler is introduced to adjust the sampling rate for each quantization configuration, achieving a more balanced allocation among subnets with varying demands. We validate the approach on LLaMA2 families, and downstream evaluation confirms our ability to maintain high performance while significantly reducing deployment time when faced with multiple scenarios.
Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the efficient deployment of large language models (LLMs) under differing resource constraints. It focuses on three points:

1. **High memory requirements**: LLMs have large storage and computational costs. For example, a LLaMA model with 7 billion parameters needs roughly 28 GB of GPU memory just to hold its weights in 32-bit floating point (4 bytes per parameter).
2. **High cost of quantization training**: Quantization compresses the model and reduces computational cost, but existing methods usually require lengthy training to recover the accuracy lost to quantization. When an LLM must be deployed in several application scenarios, each scenario needs its own round of quantization training, which multiplies the training time.
3. **Multiple sub-networks from a single training run**: To cut the training cost of multi-scenario deployment, the paper adopts the once-for-all (OFA) approach: a single fine-tuning run produces sub-networks suited to different resource constraints.

### Main contributions

1. **Once-for-all training for LLMs**: The paper makes an initial attempt to apply the once-for-all paradigm to large language models, producing sub-networks for different scenarios from one fine-tuning run and thereby reducing the training cost of multi-scenario deployment.
2. **Interference-free fine-tuning**: Decoupling the weights shared between quantization configurations removes the interference between them, and low-rank adapters keep training efficient.
3. **Resource-balanced sampling strategy**: A resource-balanced sampling strategy distributes training resources fairly among sub-networks with different resource requirements, avoiding the training imbalance caused by traditional uniform sampling.

### Method overview

1. **Post-training quantization**: Reduce memory cost by quantizing the pre-trained weights into low-bit representations.
2. **Layer-wise mixed-precision supernet**: Build a supernet in which each layer can take one of several quantization bit-widths; each path through the supernet is one mixed-precision LLM.
3. **Interference-free fine-tuning**: Avoid interference between configurations by decoupling shared weights and training low-rank adapters.
4. **Resource-balanced sampling strategy**: Adjust the sampling rate of each quantization configuration dynamically so that sub-networks with different resource requirements receive a fair share of training.

### Experimental results

1. **MMLU benchmark**: Sub-networks obtained from a single LLM-QFA fine-tuning run match or exceed the baseline methods under each bit-width constraint.
2. **Common Sense QA benchmarks**: LLM-QFA shows a clear advantage at medium bit-width (3 bits), especially on LLaMA2-13B, where accuracy improves by 3.5% over QA-LoRA.
3. **Resource efficiency**: Compared with traditional quantization methods, LLM-QFA significantly reduces the time cost of multi-scenario deployment while maintaining high performance.
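The post-training quantization and layer-wise mixed-precision steps in the method overview can be illustrated with a minimal Python sketch. It is not the paper's implementation: `quantize_group`, `dequantize_group`, and `weight_memory_gb` are hypothetical helper names, the quantizer is a generic asymmetric round-to-nearest scheme that may differ from the one the authors use, and the 32 equally sized layers are an assumed toy layout.

```python
from typing import List, Sequence, Tuple


def quantize_group(weights: Sequence[float], bits: int) -> Tuple[List[int], float, float]:
    """Asymmetric round-to-nearest quantization of one group of weights.

    Returns integer codes in [0, 2**bits - 1] plus the scale and offset
    needed to map the codes back to floats. (Generic scheme, assumed here.)
    """
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    # Clamp to guard against floating-point overshoot at the top of the range.
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes: Sequence[int], scale: float, offset: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + offset for c in codes]


def weight_memory_gb(params_per_layer: Sequence[float], layer_bits: Sequence[int]) -> float:
    """Weight storage in GB (10**9 bytes) for one layer-wise bit-width choice.

    Ignores scales, offsets, activations, and the KV cache.
    """
    total_bits = sum(p * b for p, b in zip(params_per_layer, layer_bits))
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    group = [-0.8, -0.1, 0.05, 0.3, 0.9]
    for bits in (2, 3, 4):
        codes, scale, offset = quantize_group(group, bits)
        restored = dequantize_group(codes, scale, offset)
        worst = max(abs(w - r) for w, r in zip(group, restored))
        print(f"{bits}-bit codes={codes} worst reconstruction error={worst:.4f}")

    # Assumed toy model: 32 layers with 2e8 parameters each.
    layers = [2e8] * 32
    all_four_bit = [4] * 32
    mixed = [4] * 16 + [2] * 16  # one path through the mixed-precision supernet
    print("all 4-bit:", weight_memory_gb(layers, all_four_bit), "GB")
    print("mixed 4/2-bit:", weight_memory_gb(layers, mixed), "GB")
```

Each list of per-layer bit-widths corresponds to one sub-network, so choosing a deployment target amounts to choosing a list whose memory estimate fits the device.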
### Conclusion

The LLM-QFA framework proposed in the paper reduces the training cost of deploying large language models across multiple scenarios: a single fine-tuning run yields sub-networks suited to different resource constraints while maintaining high performance.
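The imbalance attributed to uniform sampling can be seen by counting configurations. If each layer independently picks a bit-width from {2, 3, 4}, most configurations have an average bit-width near 3, and the extreme sub-networks are almost never drawn. The sketch below counts configurations per total bit budget and contrasts plain uniform sampling with a simple budget-first sampler. The budget-first sampler is an illustrative stand-in, not the paper's non-parametric scheduler, and `build_ways`, `sample_uniform`, and `sample_balanced` are hypothetical names.

```python
import random
from collections import Counter
from typing import List, Sequence

BIT_CHOICES = (2, 3, 4)  # assumed candidate bit-widths per layer


def build_ways(num_layers: int, choices: Sequence[int] = BIT_CHOICES) -> List[Counter]:
    """ways[k][t] = number of k-layer configurations whose bit-widths sum to t."""
    ways = [Counter({0: 1})]
    for _ in range(num_layers):
        nxt: Counter = Counter()
        for total, n in ways[-1].items():
            for b in choices:
                nxt[total + b] += n
        ways.append(nxt)
    return ways


def sample_uniform(num_layers: int, rng: random.Random,
                   choices: Sequence[int] = BIT_CHOICES) -> List[int]:
    """Traditional sampling: every layer picks its bit-width independently."""
    return [rng.choice(choices) for _ in range(num_layers)]


def sample_balanced(num_layers: int, rng: random.Random,
                    choices: Sequence[int] = BIT_CHOICES) -> List[int]:
    """Pick a total bit budget uniformly, then a configuration with that budget.

    Illustrative stand-in only: every reachable budget is equally likely, and
    configurations sharing a budget are equally likely among themselves.
    """
    ways = build_ways(num_layers, choices)
    remaining = rng.choice(sorted(ways[num_layers]))
    config = []
    for k in range(num_layers, 0, -1):
        # Weight each bit-width by how many completions it leaves open.
        weights = [ways[k - 1][remaining - b] for b in choices]
        b = rng.choices(choices, weights=weights)[0]
        config.append(b)
        remaining -= b
    return config


if __name__ == "__main__":
    layers = 32  # assumed layer count for illustration
    counts = build_ways(layers)[layers]
    total = 3 ** layers
    print("share of configs with average bit-width exactly 3:", counts[3 * layers] / total)
    print("share of configs that are all 2-bit:", counts[2 * layers] / total)

    rng = random.Random(0)
    uniform_avg = Counter(round(sum(sample_uniform(layers, rng)) / layers, 1) for _ in range(5000))
    balanced_avg = Counter(round(sum(sample_balanced(layers, rng)) / layers, 1) for _ in range(5000))
    print("uniform sampling, average bit-width histogram:", sorted(uniform_avg.items()))
    print("budget-first sampling, average bit-width histogram:", sorted(balanced_avg.items()))
```

Under uniform sampling the all-2-bit configuration has probability 3^-32 for 32 layers, whereas the budget-first sampler gives every budget the same chance of being trained. Other weighting schemes are possible; this one only makes the imbalance visible.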