MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models

Tongxu Luo, Jiahe Lei, Fangyu Lei, Weihao Liu, Shizhu He, Jun Zhao, Kang Liu
2024-02-20
Abstract: Fine-tuning is often necessary to enhance the adaptability of Large Language Models (LLM) to downstream tasks. Nonetheless, the process of updating billions of parameters demands significant computational resources and training time, which poses a substantial obstacle to the widespread application of large-scale models in various scenarios. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) has emerged as a prominent paradigm in recent research. However, current PEFT approaches that employ a limited set of global parameters (such as LoRA, which adds low-rank approximation matrices to all weights) face challenges in flexibly combining different computational modules in downstream tasks. In this work, we introduce a novel PEFT method: MoELoRA. We consider LoRA as Mixture of Experts (MoE), and to mitigate the random routing phenomenon observed in MoE, we propose the utilization of contrastive learning to encourage experts to learn distinct features. We conducted experiments on 11 tasks in math reasoning and common-sense reasoning benchmarks. With the same number of parameters, our approach outperforms LoRA significantly. In math reasoning, MoELoRA achieved an average performance that was 4.2% higher than LoRA, and demonstrated competitive performance compared to the 175B GPT-3.5 on several benchmarks.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the cost, in computing resources and training time, of fine-tuning large language models (LLMs). Current parameter-efficient fine-tuning (PEFT) methods such as LoRA update weights by adding low-rank matrices, but because they rely on a single global set of added parameters, they have difficulty flexibly combining different computational modules for different downstream tasks.
To solve this problem, the paper proposes a new PEFT method called MoELoRA, which treats LoRA as a Mixture of Experts (MoE) and uses contrastive learning to encourage each expert to learn different features. During both training and inference, MoELoRA activates only the LoRA experts selected by the gating network, so only the experts relevant to a given input take part in gradient updates or the forward pass. For the contrastive objective, outputs of the same expert are treated as positive samples and outputs of different experts as negative samples, which pushes the experts toward distinct features.
Experimental results show that MoELoRA outperforms LoRA on mathematical reasoning and commonsense reasoning tasks and performs comparably to the 175B GPT-3.5 on some benchmarks.
In summary, the main contributions of the paper are:
1. Introducing a new PEFT method, MoELoRA, which treats LoRA as an MoE and dynamically combines multiple LoRA modules to adapt to downstream task requirements.
2. Applying contrastive learning to address random routing in the MoE architecture, encouraging experts to learn different features.
3. Reporting experimental results on 11 datasets in which MoELoRA outperforms LoRA on all tasks, supporting the effectiveness of contrastive learning in downstream tasks.
Future work may include exploring the redefinition of commonsense tasks as knowledge-editing tasks and training different LoRA modules for each expert.
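The routing and contrastive ideas summarized above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `LoRAExpert`, `moe_lora_forward`, and `expert_contrastive_loss` are hypothetical, the gate is assumed to be a linear layer with top-k selection, the loss is assumed to be an InfoNCE-style objective over cosine similarities, and the temperature `tau` and all dimensions are arbitrary. Both low-rank factors are randomly initialized here so that expert outputs are non-zero in the demo; standard LoRA initializes one factor to zero.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

class LoRAExpert:
    """One low-rank update B @ A (hypothetical helper class)."""
    def __init__(self, d_in, d_out, rank, rng):
        # A: rank x d_in, B: d_out x rank. Random init is an assumption
        # made for this demo only.
        self.A = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(d_out)]

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))

def moe_lora_forward(x, W0, experts, W_gate, top_k=2):
    """Frozen base projection plus a gated sum of the top-k LoRA experts."""
    logits = matvec(W_gate, x)
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] for i in top])  # renormalize over selected experts
    y = matvec(W0, x)
    outs = {}
    for w, i in zip(weights, top):
        outs[i] = experts[i](x)  # only the selected experts are evaluated
        y = [a + w * b for a, b in zip(y, outs[i])]
    return y, outs

def expert_contrastive_loss(samples, tau=0.5):
    """InfoNCE-style loss over (expert_id, output) pairs.

    Outputs from the same expert act as positives, outputs from
    different experts as negatives.
    """
    losses = []
    for i, (ei, qi) in enumerate(samples):
        pos = neg = 0.0
        for j, (ej, qj) in enumerate(samples):
            if i == j:
                continue
            s = math.exp(cosine(qi, qj) / tau)
            if ei == ej:
                pos += s
            else:
                neg += s
        if pos > 0:  # skip anchors that have no positive partner
            losses.append(-math.log(pos / (pos + neg)))
    return sum(losses) / len(losses) if losses else 0.0

# Small demo with arbitrary sizes.
rng = random.Random(0)
d_in, d_out, rank, n_experts = 4, 3, 2, 4
experts = [LoRAExpert(d_in, d_out, rank, rng) for _ in range(n_experts)]
W0 = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
W_gate = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(n_experts)]
tokens = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(6)]

samples = []
for x in tokens:
    y, outs = moe_lora_forward(x, W0, experts, W_gate, top_k=2)
    samples.extend(outs.items())  # collect (expert_id, expert_output) pairs

aux_loss = expert_contrastive_loss(samples)
```

In a training loop, a term like `aux_loss` would presumably be added to the task loss with some weighting coefficient; that weighting is an assumption here and is not stated in the summary above. The loss drops when outputs of the same expert cluster together and outputs of different experts point in different directions, which is the behavior the summary describes as encouraging experts to learn distinct features.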