L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

Hyesung Jeon, Yulhwa Kim, Jae-joon Kim
2024-10-28
Abstract: Due to the high memory and computational costs associated with large language models (LLMs), model compression techniques such as quantization, which reduces inference costs, and parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA), which reduce training costs, have gained significant popularity. This trend has spurred active research into quantization-aware PEFT techniques, aimed at maintaining model accuracy while minimizing memory overhead during both inference and training. Previous quantization-aware PEFT methods typically follow a two-step approach: first, applying post-training quantization (PTQ) to model weights, followed by PEFT on the quantized model. However, recovering from the quantization error introduced by PTQ through fine-tuning has proven challenging. Additionally, most PTQ-based PEFT methods result in a mixture of low-precision quantized weights and high-precision adapter weights, limiting the efficiency of full quantization during inference. While a previous method attempted to address these issues, it still suffers from limited adaptability due to the constrained LoRA parameter structure required to produce fully-quantized models. To overcome these challenges, we propose L4Q, a method that integrates Quantization-Aware Training (QAT) with LoRA to effectively reduce quantization error. By employing a memory-optimized layer design, L4Q significantly reduces QAT's memory overhead while producing fully-quantized weights, enabling effective adaptation to downstream tasks. Our experiments demonstrate that this combined approach to quantization and fine-tuning achieves superior accuracy compared to decoupled fine-tuning schemes, particularly in sub-4-bit quantization, positioning L4Q as an efficient QAT solution. Using the LLaMA model families and instructional datasets, we showcase L4Q's capabilities in language tasks and few-shot learning.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The problem this paper attempts to address is how to perform quantization-aware fine-tuning of large language models (LLMs) efficiently, reducing the model's memory and computational costs while maintaining high accuracy. Existing methods typically adopt a two-step approach: first applying Post-Training Quantization (PTQ) to the model weights, and then performing Parameter-Efficient Fine-Tuning (PEFT) on the quantized model. This approach has the following issues:

1. **Quantization error is difficult to recover**: Fine-tuning struggles to recover the quantization error introduced by PTQ.
2. **Mixed precision issue**: Most PTQ-based PEFT methods ultimately produce mixed-precision models (low-precision quantized weights plus high-precision adapter weights), which forfeits the inference efficiency of a fully quantized model.
3. **Limited adaptability**: Some methods impose strict constraints on the LoRA parameter structure to achieve fully quantized models, which limits the fine-tuning capability.

To address these issues, the paper proposes a new method, L4Q (Low-rank adaptive Learning quantization for LLMs), which combines Quantization-Aware Training (QAT) with Low-Rank Adaptation (LoRA) to reduce quantization error and generate fully quantized models. The main contributions of L4Q include:

1. **Fully quantized linear layer design**: Merging the original weights and LoRA parameters before quantization ensures that only quantized weights are used during inference.
2. **Memory-efficient QAT**: Computing weight gradients locally reduces QAT's memory overhead.
3. **Efficient LoRA training**: Reusing the weight gradients already computed during QAT parameter training makes LoRA training more efficient.
4. **Joint optimization of quantization and LoRA parameters**: Optimizing the quantization parameters and LoRA parameters together improves the accuracy of fully quantized models.
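The merge-then-quantize idea and the gradient reuse described above can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the uniform quantizer with a per-row `scale` and `bias`, the min/max initialization, the layer shapes, and the `alpha` scaling are all assumptions chosen for illustration, and the straight-through estimator is used as a common stand-in for the rounding gradient.

```python
import numpy as np

def fake_quantize(w, scale, bias, n_bits):
    """Uniform quantize-dequantize (assumed form): returns integer codes and the dequantized weights."""
    q_max = 2 ** n_bits - 1
    raw = (w - bias) / scale
    codes = np.clip(np.round(raw), 0, q_max)
    # Straight-through mask: gradients pass only where the value was not clamped.
    ste_mask = (raw >= 0) & (raw <= q_max)
    return codes, codes * scale + bias, ste_mask

rng = np.random.default_rng(0)
d_out, d_in, rank, n_bits = 8, 16, 2, 4   # hypothetical layer shape and bit width
alpha = 1.0                               # hypothetical LoRA scaling factor

w0 = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
lora_b = rng.normal(size=(d_out, rank)) * 0.1  # trainable LoRA factors
lora_a = rng.normal(size=(rank, d_in)) * 0.1

# Merge the adapter into the weight BEFORE quantizing, so that inference
# needs only the integer codes plus scale and bias (no separate adapter).
w_merged = w0 + alpha * lora_b @ lora_a

# Assumed per-output-row min/max initialization of the quantization parameters.
bias = w_merged.min(axis=1, keepdims=True)
scale = (w_merged.max(axis=1, keepdims=True) - bias) / (2 ** n_bits - 1)

codes, w_hat, ste_mask = fake_quantize(w_merged, scale, bias, n_bits)

# Forward pass through the fully quantized layer.
x = rng.normal(size=(4, d_in))
y = x @ w_hat.T

# Backward pass for a dummy loss L = sum(y).
grad_y = np.ones_like(y)
grad_w = (grad_y.T @ x) * ste_mask   # dL/dW via the straight-through estimator

# The LoRA gradients reuse the same weight gradient.
grad_a = alpha * lora_b.T @ grad_w   # dL/dA, shape (rank, d_in)
grad_b = alpha * grad_w @ lora_a.T   # dL/dB, shape (d_out, rank)
```

In a two-step PTQ-then-PEFT pipeline, by contrast, `w0` would be quantized alone and `alpha * lora_b @ lora_a` would stay in high precision next to it, which is the mixed-precision situation listed among the issues above.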
Experimental results show that L4Q reduces quantization error and improves both model accuracy and inference speed, especially at 4-bit and lower precision.