LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs

Shaoxiang Chen, Zequn Jie, Lin Ma
2024-01-30
Abstract: Instruction finetuning on a variety of image-text instruction data is the key to obtaining a versatile Multimodal Large Language Model (MLLM), and different configurations of the instruction data can lead to finetuned models with different capabilities. However, we have discovered that data conflicts are inevitable when mixing instruction data from distinct domains, which can result in performance drops for tasks of a specific domain. To address this issue, we propose to apply an efficient Mixture of Experts (MoE) design, which is a sparse Mixture of LoRA Experts (MoLE) for instruction finetuning MLLMs. Within the Transformer layers, we extend the popular Low-Rank Adaptation (LoRA) method by creating a set of LoRA experts specifically for the MLP layer, and route each token to the top-1 expert based on a routing function, allowing adaptive choices for tokens from different domains. Since the LoRA experts are sparsely activated, the training and inference costs are kept roughly constant compared to the original LoRA method. By replacing the plain-LoRA of LLaVA-1.5 with our MoE design, our final model is named LLaVA-MoLE. Extensive experiments show that LLaVA-MoLE effectively mitigates the data conflict issue when mixing multiple distinct instruction datasets with various configurations, and achieves consistent performance gains over the strong plain-LoRA baselines. Most importantly, on the mixed datasets, LLaVA-MoLE can even outperform the plain-LoRA baseline trained with twice the samples.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the data conflict that arises when instruction data from different domains are mixed to finetune Multimodal Large Language Models (MLLMs). Specifically, mixing instruction data from distinct domains can lower performance on tasks of a specific domain. For example, mixing document understanding and biomedical data with general multi-task data can significantly degrade the model's performance on general multi-task benchmarks. To solve this problem, the authors propose a sparse Mixture of LoRA Experts (MoLE), which introduces multiple LoRA experts in the Transformer layers and activates the most appropriate expert for each input token. This mitigates the data conflict and maintains or improves the model's performance across multiple benchmarks.

### Main Contributions:

1. **Identifying the Data Conflict Issue**: Using an advanced MLLM and large-scale datasets, the authors show that data conflicts arise when instruction data from different domains are mixed.
2. **Proposing the LLaVA-MoLE Model**: The sparse Mixture of LoRA Experts design mitigates the data conflict without significantly increasing training computation or memory overhead. It also allows the sampling ratio of each dataset in the mixture to be adjusted to achieve higher performance on specific tasks without affecting other tasks.
3. **Experimental Validation**: Extensive experiments show that LLaVA-MoLE consistently improves performance across multiple benchmarks under various data configurations, compared with plain-LoRA finetuning.

### Method Overview:

- **Sparse Mixture of LoRA Experts**: A set of LoRA experts is created for the MLP layer within the Transformer layers, and a routing function selects the most appropriate expert for each token. Each token activates only one expert (top-1 routing), which keeps the computational cost comparable to the original LoRA method.
- **Load Balancing**: A load balancing loss is introduced to distribute tokens more evenly among experts, avoiding situations where some experts are overloaded while others are idle.

### Experimental Results:

- **Performance Improvement**: LLaVA-MoLE consistently outperforms plain-LoRA finetuning on mixed datasets, and can even outperform a plain-LoRA baseline trained with twice the samples.
- **Data Conflict Mitigation**: LLaVA-MoLE mitigates the data conflict under various dataset mixing configurations, including adjusted dataset sampling ratios, maintaining or improving the model's performance across benchmarks.

In summary, this paper proposes an effective method to address the data conflict issue in multimodal large language models when mixing data from different domains, providing new insights for building more robust and general multimodal models.
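
The top-1 routing and load balancing ideas described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `MoLELinear`, the function `load_balance_loss`, the linear router, the hyperparameters, and the Switch-Transformer-style form of the auxiliary loss are all assumptions made for illustration, since the summary does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class MoLELinear:
    """Hypothetical sketch: a frozen linear layer plus several LoRA experts,
    where each token is routed to exactly one expert (top-1)."""

    def __init__(self, d_in, d_out, num_experts=3, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.02, (d_in, d_out))             # frozen base weight
        self.router = rng.normal(0.0, 0.02, (d_in, num_experts))  # assumed linear router
        self.A = rng.normal(0.0, 0.02, (num_experts, d_in, rank)) # LoRA down-projections
        self.B = np.zeros((num_experts, rank, d_out))             # LoRA up-projections (zero init)
        self.scale = alpha / rank
        self.num_experts = num_experts

    def forward(self, x):
        # x has shape (num_tokens, d_in).
        probs = softmax(x @ self.router)   # routing distribution per token
        choice = probs.argmax(axis=-1)     # index of the top-1 expert per token
        out = x @ self.W                   # base layer output for every token
        for k in range(self.num_experts):
            idx = np.nonzero(choice == k)[0]
            if idx.size == 0:
                continue
            # Only the tokens routed to expert k pay for its low-rank update.
            # Assumption: the update is not scaled by the routing probability.
            out[idx] += self.scale * (x[idx] @ self.A[k] @ self.B[k])
        return out, probs, choice


def load_balance_loss(probs, choice):
    """Assumed Switch-Transformer-style auxiliary loss: K * sum_k f_k * p_k,
    where f_k is the fraction of tokens routed to expert k and p_k is the
    mean routing probability assigned to expert k."""
    num_experts = probs.shape[1]
    f = np.bincount(choice, minlength=num_experts) / choice.shape[0]
    p = probs.mean(axis=0)
    return num_experts * float(np.sum(f * p))
```

Under these assumptions, each token passes through the base weight and a single rank-`rank` update, so the per-token cost matches plain LoRA regardless of the number of experts. The auxiliary loss equals 1.0 when tokens and routing probabilities are spread uniformly and grows toward the number of experts as routing collapses onto one expert.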