MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

Dengchun Li, Yingzi Ma, Naizheng Wang, Zhengmao Ye, Zhiyuan Cheng, Yinghao Tang, Yan Zhang, Lei Duan, Jie Zuo, Cal Yang, Mingjie Tang
2024-07-20
Abstract: Fine-tuning Large Language Models (LLMs) is a common practice to adapt pre-trained models for specific applications. While methods like LoRA have effectively addressed GPU memory constraints during fine-tuning, their performance often falls short, especially in multi-task scenarios. In contrast, Mixture-of-Experts (MoE) models, such as Mixtral 8x7B, demonstrate remarkable performance in multi-task learning scenarios while maintaining a reduced parameter count. However, the resource requirements of these MoEs remain challenging, particularly for consumer-grade GPUs with less than 24GB memory. To tackle these challenges, we propose MixLoRA, an approach to construct a resource-efficient sparse MoE model based on LoRA. MixLoRA inserts multiple LoRA-based experts within the feed-forward network block of a frozen pre-trained dense model and employs a commonly used top-k router. Unlike other LoRA-based MoE methods, MixLoRA enhances model performance by utilizing independent attention-layer LoRA adapters. Additionally, an auxiliary load balance loss is employed to address the imbalance problem of the router. Our evaluations show that MixLoRA improves accuracy by about 9% compared to state-of-the-art PEFT methods in multi-task learning scenarios. We also propose a new high-throughput framework to alleviate the computation and memory bottlenecks during the training and inference of MoE models. This framework reduces GPU memory consumption by 40% and token computation latency by 30% during both training and inference.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to improve multi-task performance when fine-tuning large language models (LLMs) while reducing compute and memory consumption. It proposes MixLoRA, which combines the advantages of Low-Rank Adaptation (LoRA) and Mixture of Experts (MoE) to construct a resource-efficient sparse MoE model.

### Main Issues

1. **Insufficient Performance**: Existing LoRA methods effectively address GPU memory limitations, but their performance in multi-task scenarios is suboptimal.
2. **High Resource Demand**: MoE models perform well in multi-task learning with a reduced parameter count, but their resource demands remain high, especially on consumer-grade GPUs with less than 24GB of memory.
3. **Parameter Efficiency**: Computational complexity and memory consumption during training and inference need to be reduced further while maintaining high performance.

### Solution

The paper proposes a method called MixLoRA, with the following main features:

1. **Sparse MoE Structure**: Insert multiple LoRA-based experts into the feed-forward network block of a frozen pre-trained dense model, and use a commonly used top-k router to select which experts process each input.
2. **Independent Attention-Layer LoRA Adapters**: Unlike other LoRA-based MoE methods, MixLoRA enhances model performance by using independent attention-layer LoRA adapters.
3. **Load Balancing Loss**: Add an auxiliary load balance loss to address the router's imbalance problem, where some experts receive far more inputs than others.
4. **High-Throughput Framework**: Provide a new high-throughput framework that alleviates the computation and memory bottlenecks of MoE models during training and inference.

### Experimental Results

The paper reports that MixLoRA improves accuracy by approximately 9% over state-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods in multi-task learning scenarios. Additionally, the framework reduces GPU memory consumption by 40% and token computation latency by 30% during both training and inference.

### Summary

With MixLoRA, the paper aims to balance high performance and resource efficiency in multi-task learning, offering a new option for fine-tuning large language models.
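
### Illustrative Sketch

The solution described above (LoRA experts on a frozen layer, a top-k router, and an auxiliary load balance loss) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it collapses the feed-forward block into a single linear layer, and the names `LoRAExpert`, `MixLoRALinear`, and `load_balance_loss` are hypothetical. The loss uses a Switch-Transformer-style form (number of experts times the sum of dispatch fraction times mean router probability), which is an assumption, since the summary does not state the formula.

```python
import math
import random

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(z):
    mx = max(z)
    e = [math.exp(t - mx) for t in z]
    s = sum(e)
    return [t / s for t in e]

def rand_matrix(rows, cols, rng, std=0.1):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

class LoRAExpert:
    """One low-rank update B @ A, scaled by alpha / rank (standard LoRA)."""
    def __init__(self, d_in, d_out, rank, rng, alpha=16.0):
        self.A = rand_matrix(rank, d_in, rng)
        # B starts at zero, so the expert initially leaves the base output unchanged.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def delta(self, x):
        return [self.scale * t for t in matvec(self.B, matvec(self.A, x))]

class MixLoRALinear:
    """Frozen base weight W plus top-k routed LoRA experts (simplified to one layer)."""
    def __init__(self, d_in, d_out, n_experts=4, top_k=2, rank=2, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)            # frozen, never updated
        self.router = rand_matrix(n_experts, d_in, rng)   # trainable gate
        self.experts = [LoRAExpert(d_in, d_out, rank, rng) for _ in range(n_experts)]
        self.top_k = top_k

    def route(self, x):
        probs = softmax(matvec(self.router, x))
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[: self.top_k]
        norm = sum(probs[i] for i in top)
        # Renormalize the selected probabilities so the mixing weights sum to 1.
        return probs, [(i, probs[i] / norm) for i in top]

    def forward(self, x):
        out = matvec(self.W, x)  # shared base computation
        probs, chosen = self.route(x)
        for i, w in chosen:
            d = self.experts[i].delta(x)
            out = [o + w * di for o, di in zip(out, d)]
        return out, probs, [i for i, _ in chosen]

def load_balance_loss(all_probs, all_chosen, n_experts, top_k):
    """Assumed auxiliary loss: n_experts * sum_i f_i * P_i.

    f_i is the fraction of token-to-expert assignments sent to expert i,
    P_i is the mean router probability of expert i over the batch.
    A perfectly uniform router gives a value of 1.
    """
    tokens = len(all_probs)
    counts = [0] * n_experts
    for chosen in all_chosen:
        for i in chosen:
            counts[i] += 1
    f = [c / (tokens * top_k) for c in counts]
    p = [sum(probs[i] for probs in all_probs) / tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

if __name__ == "__main__":
    layer = MixLoRALinear(d_in=4, d_out=3)
    rng = random.Random(1)
    batch = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
    results = [layer.forward(x) for x in batch]
    aux = load_balance_loss([r[1] for r in results], [r[2] for r in results], 4, 2)
    print("auxiliary load balance loss:", aux)
```

In training, only the router and the expert matrices `A` and `B` would receive gradients, and the auxiliary loss would be added to the task loss with a small coefficient. Both points are part of this sketch's assumptions and are not details taken from the summary.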