Abstract: Fine-tuning Large Language Models (LLMs) is a common practice to adapt pre-trained models for specific applications. While methods like LoRA have effectively addressed GPU memory constraints during fine-tuning, their performance often falls short, especially in multi-task scenarios. In contrast, Mixture-of-Experts (MoE) models, such as Mixtral 8x7B, demonstrate remarkable performance in multi-task learning scenarios while maintaining a reduced parameter count. However, the resource requirements of these MoEs remain challenging, particularly for consumer-grade GPUs with less than 24GB memory. To tackle these challenges, we propose MixLoRA, an approach to construct a resource-efficient sparse MoE model based on LoRA. MixLoRA inserts multiple LoRA-based experts within the feed-forward network block of a frozen pre-trained dense model and employs a commonly used top-k router. Unlike other LoRA-based MoE methods, MixLoRA enhances model performance by utilizing independent attention-layer LoRA adapters. Additionally, an auxiliary load balance loss is employed to address the imbalance problem of the router. Our evaluations show that MixLoRA improves accuracy by about 9% compared to state-of-the-art PEFT methods in multi-task learning scenarios. We also propose a new high-throughput framework to alleviate the computation and memory bottlenecks during the training and inference of MoE models. This framework reduces GPU memory consumption by 40% and token computation latency by 30% during both training and inference.
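The routing scheme the abstract describes (several LoRA experts sharing one frozen weight, selected per token by a top-k router) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it applies the idea to a single linear layer rather than a full feed-forward block, the names `LoRAExpert` and `mixlora_linear` are hypothetical, and renormalizing the gate weights over the selected experts is an assumed design choice that the abstract does not specify.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian-initialized matrix."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class LoRAExpert:
    """Low-rank update B @ A that is added on top of a shared frozen weight."""
    def __init__(self, d_in, d_out, rank, rng):
        self.A = rand_matrix(rank, d_in, rng)          # trainable, random init
        self.B = [[0.0] * rank for _ in range(d_out)]  # trainable, zero init (standard LoRA)

    def delta(self, x):
        return matvec(self.B, matvec(self.A, x))

def mixlora_linear(x, W_frozen, experts, W_router, k=2):
    """Route one token x to its top-k LoRA experts (hypothetical helper)."""
    probs = softmax(matvec(W_router, x))
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # assumption: renormalize over chosen experts
    out = matvec(W_frozen, x)          # frozen weight is shared by every expert
    for i in top:
        gate = probs[i] / norm
        out = [o + gate * d for o, d in zip(out, experts[i].delta(x))]
    return out, top

# Example: 8 experts of rank 2 on a 4-dimensional token, top-2 routing.
rng = random.Random(0)
d, n_experts = 4, 8
W = rand_matrix(d, d, rng)
experts = [LoRAExpert(d, d, rank=2, rng=rng) for _ in range(n_experts)]
W_router = rand_matrix(n_experts, d, rng)
x = [1.0, -0.5, 0.25, 2.0]
y, chosen = mixlora_linear(x, W, experts, W_router, k=2)
```

Because every `B` starts at zero, the layer initially reproduces the frozen model's output exactly, and only the small `A`, `B`, and router matrices would need gradients during fine-tuning.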

What problem does this paper attempt to address?

The paper addresses how to improve multi-task performance when fine-tuning large language models (LLMs) while reducing compute and memory consumption. Specifically, it proposes MixLoRA, which combines Low-Rank Adaptation (LoRA) with Mixture of Experts (MoE) to construct a resource-efficient sparse MoE model.

### Main Issues

1. **Insufficient Performance**: Existing LoRA methods effectively address GPU memory limitations, but their performance in multi-task scenarios is suboptimal.
2. **High Resource Demand**: MoE models such as Mixtral 8x7B perform well in multi-task learning with a reduced parameter count, but their resource demands remain high, especially on consumer-grade GPUs with less than 24GB of memory.
3. **Parameter Efficiency**: The open question is how to further reduce computation and memory consumption during training and inference while maintaining high performance.

### Solution

The paper proposes MixLoRA, with the following main features:

1. **Sparse MoE Structure**: Multiple LoRA-based experts are inserted into the feed-forward network block of a frozen pre-trained dense model, and a commonly used top-k router selects which experts process each input.
2. **Independent Attention-Layer LoRA Adapters**: Unlike other LoRA-based MoE methods, MixLoRA enhances model performance by using independent attention-layer LoRA adapters.
3. **Load Balancing Loss**: An auxiliary load balancing loss addresses the router's tendency to distribute work unevenly across experts.
4. **High-Throughput Framework**: A new high-throughput framework alleviates the computation and memory bottlenecks of MoE models during training and inference.

### Experimental Results

MixLoRA improves accuracy by approximately 9% over state-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods in multi-task learning scenarios. Additionally, the framework reduces GPU memory consumption by 40% and token computation latency by 30% during both training and inference.

### Summary

MixLoRA balances high performance and resource efficiency in multi-task learning, offering a new option for fine-tuning large language models.
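The auxiliary load balancing loss mentioned under Solution is not spelled out in this summary. The sketch below uses one common formulation, the Switch Transformer style loss `N * sum_i f_i * P_i`, purely as an assumption about what such a loss can look like; the function name `load_balance_loss` and its arguments are hypothetical.

```python
from collections import Counter

def load_balance_loss(router_probs, top_k_indices, n_experts):
    """Auxiliary loss N * sum_i f_i * P_i (assumed Switch-Transformer-style form).

    router_probs:  per-token softmax outputs, each a list of n_experts floats
    top_k_indices: per-token list of the expert ids the router selected
    """
    n_tokens = len(router_probs)
    counts = Counter(i for chosen in top_k_indices for i in chosen)
    total_assignments = sum(counts.values())
    loss = 0.0
    for i in range(n_experts):
        f_i = counts[i] / total_assignments               # share of assignments sent to expert i
        p_i = sum(p[i] for p in router_probs) / n_tokens  # mean router probability for expert i
        loss += f_i * p_i
    return n_experts * loss

# Perfectly balanced routing: 4 tokens, 4 experts, one distinct expert each.
balanced = load_balance_loss([[0.25] * 4] * 4, [[0], [1], [2], [3]], n_experts=4)

# Collapsed routing: every token goes to expert 0 with full confidence.
collapsed = load_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [[0]] * 4, n_experts=4)
```

Under this formulation the loss takes its minimum value of 1 when routing is uniform and grows toward the number of experts as the router collapses onto a single expert. In training it would typically be multiplied by a small coefficient and added to the task loss.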

MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

MiLoRA: Efficient Mixture of Low-Rank Adaptation for Large Language Models Fine-tuning

MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models

MALoRA: Mixture of Asymmetric Low-Rank Adaptation for Enhanced Multi-Task Learning

Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models

LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs

Mixture of LoRA Experts

mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs

LoRAMoE: Alleviating World Knowledge Forgetting in Large Language Models via MoE-Style Plugin

LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin

Retrieval-Augmented Mixture of LoRA Experts for Uploadable Machine Learning

GraphLoRA: Empowering LLMs Fine-Tuning via Graph Collaboration of MoE

AlphaLoRA: Assigning LoRA Experts Based on Layer Training Quality

Higher Layers Need More LoRA Experts

MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models

MoDULA: Mixture of Domain-Specific and Universal LoRA for Multi-Task Learning

MoR: Mixture of Ranks for Low-Rank Adaptation Tuning

SLIM: Let LLM Learn More and Forget Less with Soft LoRA and Identity Mixture

MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing

AT-MoE: Adaptive Task-planning Mixture of Experts via LoRA Approach