KD-LoRA: A Hybrid Approach to Efficient Fine-Tuning with LoRA and Knowledge Distillation

Rambod Azimi, Rishav Rishav, Marek Teichmann, Samira Ebrahimi Kahou

2024-10-28

Abstract: Large language models (LLMs) have demonstrated remarkable performance across various downstream tasks. However, the high computational and memory requirements of LLMs are a major bottleneck. To address this, parameter-efficient fine-tuning (PEFT) methods such as low-rank adaptation (LoRA) have been proposed to reduce computational costs while ensuring minimal loss in performance. Additionally, knowledge distillation (KD) has been a popular choice for obtaining compact student models from teacher models. In this work, we present KD-LoRA, a novel fine-tuning method that combines LoRA with KD. Our results demonstrate that KD-LoRA achieves performance comparable to full fine-tuning (FFT) and LoRA while significantly reducing resource requirements. Specifically, KD-LoRA retains 98% of LoRA's performance on the GLUE benchmark, while being 40% more compact. Additionally, KD-LoRA reduces GPU memory usage by 30% compared to LoRA, while decreasing inference time by 30% compared to both FFT and LoRA. We evaluate KD-LoRA across three encoder-only models: BERT, RoBERTa, and DeBERTaV3. Code is available at https://github.com/rambodazimi/KD-LoRA.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses the high computational and memory cost of fine-tuning large language models (LLMs). LLMs perform well across many downstream tasks, but their large parameter counts make fine-tuning and deployment expensive, which is a major bottleneck in practice. Parameter-efficient fine-tuning (PEFT) methods such as low-rank adaptation (LoRA) reduce this cost while keeping the loss in performance minimal, and knowledge distillation (KD) is widely used to transfer knowledge from a large teacher model into a more compact student model. The paper proposes KD-LoRA, which combines the two, aiming for performance comparable to full fine-tuning (FFT) and LoRA at significantly lower resource cost. In experiments on the GLUE benchmark with BERT, RoBERTa, and DeBERTaV3, KD-LoRA retains 98% of LoRA's performance while being 40% more compact, uses 30% less GPU memory than LoRA, and cuts inference time by 30% relative to both FFT and LoRA. These savings make KD-LoRA well suited to deployment in resource-constrained environments.
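To make the combination of LoRA and KD concrete, here is a minimal sketch of the two ingredients. It is not the authors' implementation: the summary does not state the exact objective, so the sketch assumes the standard distillation loss (hard-label cross-entropy blended with a temperature-scaled KL term) and assumes the LoRA adapters are trained on the student. The names `kd_loss` and `lora_trainable_params` and the parameters `alpha` and `temperature` are hypothetical and chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Assumed objective: alpha * CE(student, label)
    + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    # Hard-label term: cross-entropy of the student at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-label term: match the teacher's softened distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = temperature ** 2 * kl_divergence(p_teacher, p_student)
    return alpha * hard + (1 - alpha) * soft


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters when a d_out x d_in weight is frozen and only
    the low-rank factors B (d_out x rank) and A (rank x d_in) are updated."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    print(kd_loss([2.0, 0.5, -1.0], [3.0, 0.2, -2.0], label=0))
    print(lora_trainable_params(768, 768, rank=8), "vs", 768 * 768)
```

In this sketch, only the low-rank factors of the smaller student would receive gradients from `kd_loss`, which is one way to read how the method can reduce both training memory and the size of the deployed model.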

Bayesian-LoRA: LoRA based Parameter Efficient Fine-Tuning using Optimal Quantization levels and Rank Values trough Differentiable Bayesian Gates

LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report

DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models

LoRA-GA: Low-Rank Adaptation with Gradient Approximation

IncreLoRA: Incremental Parameter Allocation Method for Parameter-Efficient Fine-tuning

PeriodicLoRA: Breaking the Low-Rank Bottleneck in LoRA Optimization

LoLDU: Low-Rank Adaptation via Lower-Diag-Upper Decomposition for Parameter-Efficient Fine-Tuning

DoRA: Weight-Decomposed Low-Rank Adaptation

HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning

VB-LoRA: Extreme Parameter Efficient Fine-Tuning with Vector Banks

RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation

TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition

ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation

mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs

PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation

ALoRA: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models

DLoRA: Distributed Parameter-Efficient Fine-Tuning Solution for Large Language Model

MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning

LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization