KD-LoRA: A Hybrid Approach to Efficient Fine-Tuning with LoRA and Knowledge Distillation

Rambod Azimi, Rishav Rishav, Marek Teichmann, Samira Ebrahimi Kahou
2024-10-28
Abstract: Large language models (LLMs) have demonstrated remarkable performance across various downstream tasks. However, the high computational and memory requirements of LLMs are a major bottleneck. To address this, parameter-efficient fine-tuning (PEFT) methods such as low-rank adaptation (LoRA) have been proposed to reduce computational costs while ensuring minimal loss in performance. Additionally, knowledge distillation (KD) has been a popular choice for obtaining compact student models from teacher models. In this work, we present KD-LoRA, a novel fine-tuning method that combines LoRA with KD. Our results demonstrate that KD-LoRA achieves performance comparable to full fine-tuning (FFT) and LoRA while significantly reducing resource requirements. Specifically, KD-LoRA retains 98% of LoRA's performance on the GLUE benchmark, while being 40% more compact. Additionally, KD-LoRA reduces GPU memory usage by 30% compared to LoRA, while decreasing inference time by 30% compared to both FFT and LoRA. We evaluate KD-LoRA across three encoder-only models: BERT, RoBERTa, and DeBERTaV3. Code is available at https://github.com/rambodazimi/KD-LoRA.
Computation and Language, Artificial Intelligence, Machine Learning
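To make the abstract's "parameter-efficient" claim concrete, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning and under LoRA, which replaces the update to a d_out x d_in matrix with the product of two low-rank factors of rank r. The function name `lora_param_counts` and the rank of 8 are illustrative assumptions, not values taken from the paper; 768 is the hidden size of BERT-base.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one weight matrix.

    Full fine-tuning updates every entry of the d_out x d_in matrix.
    LoRA freezes that matrix and trains two factors instead:
    B with shape (d_out, rank) and A with shape (rank, d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Illustrative values only: a 768 x 768 projection with an assumed rank of 8.
full, lora = lora_param_counts(768, 768, 8)
print(full, lora, f"{lora / full:.2%}")  # prints: 589824 12288 2.08%
```

The count covers one matrix only; how many matrices receive adapters, and at what rank, depends on the configuration used in the paper.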
What problem does this paper attempt to address?
This paper addresses the high computational and memory cost of fine-tuning large language models (LLMs). Although LLMs perform well on a wide range of downstream tasks, their large parameter counts make fine-tuning and deployment expensive, which is a major bottleneck in practice. Parameter-efficient fine-tuning (PEFT) methods such as low-rank adaptation (LoRA) reduce the cost of fine-tuning while largely preserving performance, and knowledge distillation (KD) is widely used to transfer knowledge from a large teacher model to a more compact student model. The paper proposes KD-LoRA, which combines LoRA with KD to reach performance comparable to full fine-tuning (FFT) and LoRA at a significantly lower resource cost. In the reported experiments, KD-LoRA retains 98% of LoRA's performance on the GLUE benchmark while being 40% more compact, uses 30% less GPU memory than LoRA, and cuts inference time by 30% compared to both FFT and LoRA. These savings make KD-LoRA well suited to resource-constrained deployments. In short, the paper's goal is a fine-tuning method that combines LoRA and KD to lower the computational and memory cost of adapting LLMs.
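As a rough illustration of the distillation half of this combination, the sketch below computes a standard knowledge-distillation loss for one example: a weighted sum of the cross-entropy on the hard label and the temperature-scaled KL divergence between teacher and student distributions. This is the textbook formulation, assumed here for illustration; the summary above does not state the exact loss, weighting, or temperature used by KD-LoRA, so `alpha`, `temperature`, and the function names are hypothetical. In a KD-LoRA-style setup, only the student's LoRA parameters would be updated by such a loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Standard distillation loss for a single example (assumed form).

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    soft-target KL term, scaled by temperature**2 as is conventional.
    """
    # Cross-entropy of the student against the ground-truth label.
    ce = -math.log(softmax(student_logits)[label])

    # KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    return alpha * ce + (1 - alpha) * temperature**2 * kl


# Hypothetical logits for a 3-class task.
loss = kd_loss([0.1, 0.2, 0.3], [3.0, 0.0, -3.0], label=0)
print(loss)
```

When the student's logits match the teacher's, the KL term vanishes and only the hard-label cross-entropy remains.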