L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

Hyesung Jeon, Yulhwa Kim, Jae-joon Kim

2024-10-28

Abstract: Due to the high memory and computational costs associated with large language models (LLMs), model compression techniques such as quantization, which reduces inference costs, and parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA), which reduce training costs, have gained significant popularity. This trend has spurred active research into quantization-aware PEFT techniques, aimed at maintaining model accuracy while minimizing memory overhead during both inference and training. Previous quantization-aware PEFT methods typically follow a two-step approach: first, applying post-training quantization (PTQ) to model weights, followed by PEFT on the quantized model. However, recovering from the quantization error introduced by PTQ through fine-tuning has proven challenging. Additionally, most PTQ-based PEFT methods result in a mixture of low-precision quantized weights and high-precision adapter weights, limiting the efficiency of full quantization during inference. While a previous method attempted to address these issues, it still suffers from limited adaptability due to the constrained LoRA parameter structure required to produce fully-quantized models. To overcome these challenges, we propose L4Q, a method that integrates Quantization-Aware Training (QAT) with LoRA to effectively reduce quantization error. By employing a memory-optimized layer design, L4Q significantly reduces QAT's memory overhead while producing fully-quantized weights, enabling effective adaptation to downstream tasks. Our experiments demonstrate that this combined approach to quantization and fine-tuning achieves superior accuracy compared to decoupled fine-tuning schemes, particularly in sub-4-bit quantization, positioning L4Q as an efficient QAT solution. Using the LLaMA model families and instructional datasets, we showcase L4Q's capabilities in language tasks and few-shot learning.

Machine Learning, Computation and Language

What problem does this paper attempt to address?

The paper addresses how to perform quantization-aware fine-tuning of large language models (LLMs) efficiently, reducing memory and computational costs while maintaining high accuracy. Existing methods typically adopt a two-step approach: first applying Post-Training Quantization (PTQ) to the model weights, then performing Parameter-Efficient Fine-Tuning (PEFT) on the quantized model. This approach has the following issues:

1. **Quantization error is difficult to recover**: Fine-tuning struggles to recover the quantization error introduced by PTQ.
2. **Mixed precision issue**: Most PTQ-based PEFT methods produce a mixture of low-precision quantized weights and high-precision adapter weights, which limits the efficiency of fully quantized inference.
3. **Limited adaptability**: Some methods impose strict constraints on the LoRA parameter structure to obtain fully quantized models, which limits fine-tuning capability.

To address these issues, the paper proposes L4Q (Low-rank adaptive Learning quantization for LLMs), which combines Quantization-Aware Training (QAT) with Low-Rank Adaptation (LoRA) to reduce quantization error and produce fully quantized models. The main contributions of L4Q are:

1. **Fully quantized linear layer design**: The original weights and LoRA parameters are merged before quantization, so only quantized weights are used during inference.
2. **Memory-efficient QAT**: Weight gradients are computed locally, which reduces memory overhead.
3. **Efficient LoRA training**: The weight gradients already computed for training the QAT parameters are reused to train the LoRA parameters.
4. **Joint optimization of quantization and LoRA parameters**: Optimizing both sets of parameters together improves the accuracy of the fully quantized model.

Experimental results show that L4Q reduces quantization error and improves both model accuracy and inference speed, especially at 4-bit and lower precision.
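The merge-then-quantize idea behind the fully quantized linear layer can be illustrated with a small forward-pass sketch. This is not the authors' implementation. The helper names (`matmul`, `merge_lora`, `quantize_row`, `dequantize_row`), the per-row min-max quantizer, and the toy matrices are all assumptions chosen for illustration. The summary above says L4Q learns its quantization parameters jointly with LoRA, which this sketch does not attempt, and no gradients are computed here.

```python
# Illustrative sketch only: merge LoRA into the base weight, then quantize
# the merged matrix. All names and the min-max quantizer are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w0, lora_b, lora_a, scale=1.0):
    """Return W0 + scale * (B @ A) as a full-precision matrix."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

def quantize_row(row, bits):
    """Uniform min-max quantization of one row to integer codes in [0, 2**bits - 1]."""
    qmax = 2 ** bits - 1
    lo, hi = min(row), max(row)
    step = (hi - lo) / qmax or 1.0  # fall back to 1.0 for a constant row
    codes = [min(qmax, max(0, round((v - lo) / step))) for v in row]
    return codes, step, lo

def dequantize_row(codes, step, lo):
    """Map integer codes back to approximate real values."""
    return [c * step + lo for c in codes]

# Toy 2x4 base weight with a rank-1 LoRA update (B is 2x1, A is 1x4).
w0 = [[0.10, -0.40, 0.25, 0.80],
      [-0.60, 0.30, 0.05, -0.15]]
lora_b = [[0.5], [-0.2]]
lora_a = [[0.1, 0.2, -0.3, 0.4]]

# Merge first, then quantize: inference needs only the integer codes plus
# one (step, offset) pair per row, with no separate high-precision adapter.
merged = merge_lora(w0, lora_b, lora_a)
quantized = [quantize_row(row, bits=3) for row in merged]

for codes, step, lo in quantized:
    print(codes, dequantize_row(codes, step, lo))
```

In the decoupled PTQ-then-PEFT setting, `w0` alone would be quantized and `lora_b`, `lora_a` would remain as separate full-precision matrices at inference time, which is the mixed-precision issue described above.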

Low-Rank Quantization-Aware Training for LLMs

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

ApiQ: Finetuning of 2-Bit Quantized Large Language Model

LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

QEFT: Quantization for Efficient Fine-Tuning of LLMs

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources

AffineQuant: Affine Transformation Quantization for Large Language Models

RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

RPTQ: Reorder-based Post-training Quantization for Large Language Models

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

Evaluating Quantized Large Language Models

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models