Abstract: Large Language Models (LLMs) have advanced rapidly but face significant memory demands. While quantization has shown promise for LLMs, current methods typically require lengthy training to alleviate the performance degradation from quantization loss. However, deploying LLMs across diverse scenarios with different resource constraints, e.g., servers and personal computers, requires repeated training per application, which amplifies the lengthy training problem. Given that, it is advantageous to train a once-for-all (OFA) supernet capable of yielding diverse optimal subnets for downstream applications through one-shot training. Nonetheless, the scale of current language models impedes efficiency and amplifies interference from weight sharing between subnets. We make an initial attempt to extend the once-for-all framework to large language models. Specifically, we decouple shared weights to eliminate the interference and incorporate low-rank adapters for training efficiency. Furthermore, we observe an imbalanced allocation of training resources under traditional uniform sampling. A non-parametric scheduler is introduced to adjust the sampling rate for each quantization configuration, achieving a more balanced allocation among subnets with varying demands. We validate the approach on the LLaMA2 family, and downstream evaluation confirms our ability to maintain high performance while significantly reducing deployment time when faced with multiple scenarios.

What problem does this paper attempt to address?

The problem this paper addresses is the efficient deployment of large language models (LLMs) under different resource constraints. Specifically, it focuses on the following points:

1. **High memory requirements**: LLMs have huge storage and computational costs. For example, a LLaMA model with 70 billion parameters requires at least 280GB of GPU memory for inference.
2. **High cost of quantization training**: Although quantization can compress the model and reduce computational cost, existing quantization methods usually require lengthy training to alleviate the performance degradation caused by quantization loss. When LLMs must be deployed in different application scenarios, each scenario requires its own quantization training, which further magnifies the training-time problem.
3. **Generating multiple sub-networks in one training run**: To reduce the training cost of multi-scenario deployment, the paper proposes a once-for-all (OFA) method that produces multiple optimal sub-networks, each suited to a different resource constraint, through a single fine-tuning run.

### Main contributions

1. **Introduce the once-for-all training paradigm**: The paper applies the once-for-all training paradigm to large language models for the first time, generating multiple sub-networks suitable for different scenarios through a single fine-tuning run and thereby reducing the training cost of multi-scenario deployment.
2. **Non-interfering fine-tuning**: Decoupling the weights of different configurations removes the interference between quantization configurations, and low-rank adapters improve training efficiency.
3. **Resource-balanced sampling strategy**: A resource-balanced sampling strategy distributes training resources fairly among sub-networks with different resource requirements, avoiding the training imbalance caused by traditional uniform sampling.

### Method overview

1. **Post-training quantization**: Reduce memory costs by quantizing pre-trained weights into low-bit representations.
2. **Layer-wise mixed-precision supernet**: Construct a layer-wise mixed-precision supernet from different quantization bit-width configurations, in which each path represents a mixed-precision LLM.
3. **Non-interfering fine-tuning**: Avoid interference between configurations by decoupling shared weights and introducing low-rank adapters.
4. **Resource-balanced sampling strategy**: Dynamically adjust the sampling strategy so that sub-networks with different resource requirements receive fair training opportunities.

### Experimental results

1. **MMLU benchmark**: On MMLU, the sub-networks that LLM-QFA produces from a single fine-tuning run perform comparably to or better than the baseline methods under different bit-width constraints.
2. **Common Sense QA benchmark**: On Common Sense QA, LLM-QFA shows significant advantages at medium bit-width (3 bits), especially on the LLaMA2-13B model, with an accuracy improvement of 3.5% compared to QA-LoRA.
3. **Resource efficiency**: Compared with traditional quantization methods, LLM-QFA significantly reduces the time cost of multi-scenario deployment while maintaining high performance.

### Conclusion

The LLM-QFA framework proposed in the paper reduces the training cost of deploying large language models across multiple scenarios by generating multiple sub-networks suited to different resource constraints through a single fine-tuning run, while maintaining high performance.
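
The imbalance that motivates the resource-balanced sampling strategy can be illustrated with a small simulation. The sketch below is not the paper's scheduler: the bit-width choices `(2, 3, 4)`, the layer count, and the function names `sample_uniform`, `sample_balanced`, and `low_budget_share` are all assumptions made for illustration. It compares independent per-layer uniform sampling with a simple alternative that first draws the total bit budget uniformly and then spreads it over the layers.

```python
import random

BIT_CHOICES = (2, 3, 4)  # assumed candidate bit-widths (contiguous integers)
NUM_LAYERS = 32          # assumed number of quantized layers


def sample_uniform(rng):
    """Uniform sampling: every layer picks its bit-width independently."""
    return [rng.choice(BIT_CHOICES) for _ in range(NUM_LAYERS)]


def sample_balanced(rng):
    """Illustrative alternative: draw the total bit budget uniformly,
    then hand out the extra bits to randomly chosen layers."""
    low, high = min(BIT_CHOICES), max(BIT_CHOICES)
    extra = rng.randint(0, (high - low) * NUM_LAYERS)  # inclusive bounds
    config = [low] * NUM_LAYERS
    while extra > 0:
        layer = rng.randrange(NUM_LAYERS)
        if config[layer] < high:  # skip layers already at the maximum
            config[layer] += 1
            extra -= 1
    return config


def low_budget_share(sampler, threshold=2.5, trials=5000, seed=0):
    """Fraction of sampled configurations whose mean bit-width is at or
    below `threshold`, i.e. how often low-memory subnets get trained."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        config = sampler(rng)
        if sum(config) / NUM_LAYERS <= threshold:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print("uniform :", low_budget_share(sample_uniform))
    print("balanced:", low_budget_share(sample_balanced))
```

Under these assumptions, independent per-layer sampling concentrates almost all configurations near the middle average bit-width, so subnets at the low-memory end are rarely visited, whereas drawing the budget first gives every budget level a similar share of training steps.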

One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments
