Abstract: Fine-tuning is often necessary to enhance the adaptability of Large Language Models (LLM) to downstream tasks. Nonetheless, the process of updating billions of parameters demands significant computational resources and training time, which poses a substantial obstacle to the widespread application of large-scale models in various scenarios. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) has emerged as a prominent paradigm in recent research. However, current PEFT approaches that employ a limited set of global parameters (such as LoRA, which adds low-rank approximation matrices to all weights) face challenges in flexibly combining different computational modules in downstream tasks. In this work, we introduce a novel PEFT method: MoELoRA. We consider LoRA as Mixture of Experts (MoE), and to mitigate the random routing phenomenon observed in MoE, we propose the utilization of contrastive learning to encourage experts to learn distinct features. We conducted experiments on 11 tasks in math reasoning and common-sense reasoning benchmarks. With the same number of parameters, our approach outperforms LoRA significantly. In math reasoning, MoELoRA achieved an average performance that was 4.2% higher than LoRA, and demonstrated competitive performance compared to the 175B GPT-3.5 on several benchmarks.

What problem does this paper attempt to address?

The paper addresses the high computing and training-time cost of fine-tuning large language models (LLMs). Existing parameter-efficient fine-tuning (PEFT) methods such as LoRA update weights by adding low-rank matrices, but they have difficulty flexibly combining different computational modules for downstream tasks. To solve this problem, the paper proposes MoELoRA, a new PEFT method that treats LoRA as a Mixture of Experts (MoE) and uses contrastive learning to encourage each "expert" to learn different features.

MoELoRA activates only the LoRA modules selected by the gating network during training and inference, so that only the "experts" relevant to a specific task take part in gradient updates or forward inference. For contrastive learning, the paper treats outputs of the same expert as positive samples and outputs of different experts as negative samples, which encourages experts to learn distinct features. Experimental results show that MoELoRA outperforms LoRA on mathematical reasoning and commonsense reasoning tasks and performs comparably to the 175B GPT-3.5 on some benchmarks.

In summary, the main contributions of the paper are:

1. Introducing MoELoRA, a new PEFT method that treats LoRA as an MoE and dynamically combines multiple LoRA modules to adapt to downstream task requirements.
2. Applying contrastive learning to address random routing in the MoE architecture, encouraging experts to learn different features.
3. Reporting experiments on 11 datasets in which MoELoRA outperforms LoRA on all tasks, which supports the effectiveness of contrastive learning in downstream tasks.

Future work may include exploring the redefinition of commonsense tasks as knowledge-editing tasks and training different LoRA modules for each expert.
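The routing and contrastive ideas described above can be illustrated with a small, dependency-free Python sketch. This is not the authors' implementation: the class `MoELoRALayer`, the function `expert_contrastive_loss`, top-k selection, cosine similarity, the InfoNCE-style loss form, and the temperature value are all assumptions chosen for illustration. The summary only states that the gating network selects which LoRA modules are active and that same-expert outputs are positives while different-expert outputs are negatives.

```python
import math
import random

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(zs):
    mx = max(zs)
    es = [math.exp(z - mx) for z in zs]
    total = sum(es)
    return [e / total for e in es]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class MoELoRALayer:
    """Hypothetical layer: frozen weight W plus a gated mixture of LoRA experts."""

    def __init__(self, d_in, d_out, rank, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)            # frozen pretrained weight
        self.gate = rand_matrix(num_experts, d_in, rng)   # linear gating network
        # Each expert is a low-rank pair (B_i, A_i). Real LoRA usually starts B at
        # zero; random values are used here only so the demo output is non-trivial.
        self.A = [rand_matrix(rank, d_in, rng) for _ in range(num_experts)]
        self.B = [rand_matrix(d_out, rank, rng) for _ in range(num_experts)]
        self.top_k = top_k

    def forward(self, x):
        logits = matvec(self.gate, x)
        # Assumption: keep the top_k experts with the largest gate logits.
        chosen = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
        chosen = chosen[: self.top_k]
        weights = softmax([logits[i] for i in chosen])
        y = matvec(self.W, x)
        expert_outputs = {}
        for g, i in zip(weights, chosen):
            delta = matvec(self.B[i], matvec(self.A[i], x))  # B_i A_i x
            expert_outputs[i] = delta
            y = [yi + g * di for yi, di in zip(y, delta)]
        return y, expert_outputs

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)

def expert_contrastive_loss(samples, temperature=0.1):
    """InfoNCE-style loss over (expert_id, output_vector) pairs.

    Outputs from the same expert act as positives, outputs from other
    experts as negatives. Anchors without any positive are skipped.
    """
    losses = []
    for i, (ei, vi) in enumerate(samples):
        sims = [(ej, math.exp(cosine(vi, vj) / temperature))
                for j, (ej, vj) in enumerate(samples) if j != i]
        pos = sum(s for ej, s in sims if ej == ei)
        total = sum(s for _, s in sims)
        if pos > 0:
            losses.append(-math.log(pos / total))
    return sum(losses) / len(losses) if losses else 0.0

layer = MoELoRALayer(d_in=4, d_out=3, rank=2, num_experts=4, top_k=2)
y, outs = layer.forward([1.0, 0.5, -0.3, 0.2])
print("active experts:", sorted(outs))

separated = [(0, [1.0, 0.0]), (0, [0.9, 0.1]), (1, [0.0, 1.0]), (1, [0.1, 0.9])]
entangled = [(0, [1.0, 0.0]), (1, [0.9, 0.1]), (0, [0.0, 1.0]), (1, [0.1, 0.9])]
print("loss (separated):", expert_contrastive_loss(separated))
print("loss (entangled):", expert_contrastive_loss(entangled))
```

In this toy setup the loss is lower when each expert's outputs cluster together and apart from other experts' outputs, which is the behavior the contrastive objective is meant to encourage. Only the selected experts contribute to the output, so in a trainable version only they would receive gradients.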

MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models

Higher Layers Need More LoRA Experts

MiLoRA: Efficient Mixture of Low-Rank Adaptation for Large Language Models Fine-tuning

MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

MALoRA: Mixture of Asymmetric Low-Rank Adaptation for Enhanced Multi-Task Learning

When MOE Meets LLMs: Parameter Efficient Fine-tuning for Multi-task Medical Applications

MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning

MoDULA: Mixture of Domain-Specific and Universal LoRA for Multi-Task Learning

Mixture of LoRA Experts

MLAE: Masked LoRA Experts for Visual Parameter-Efficient Fine-Tuning

Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models

MLAE: Masked LoRA Experts for Parameter-Efficient Fine-Tuning

PERFT: Parameter-Efficient Routed Fine-Tuning for Mixture-of-Expert Model

MoDE: Effective Multi-task Parameter Efficient Fine-Tuning with a Mixture of Dyadic Experts

AlphaLoRA: Assigning LoRA Experts Based on Layer Training Quality

MTL-LoRA: Low-Rank Adaptation for Multi-Task Learning

IncreLoRA: Incremental Parameter Allocation Method for Parameter-Efficient Fine-tuning

AdaMoLE: Fine-Tuning Large Language Models with Adaptive Mixture of Low-Rank Adaptation Experts

TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition

GraphLoRA: Empowering LLMs Fine-Tuning via Graph Collaboration of MoE

MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models