MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

Leyang Shen, Gongwei Chen, Rui Shao, Weili Guan, Liqiang Nie
2024-07-18
Abstract: Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, a generalist MLLM typically underperforms compared with a specialist MLLM on most VL tasks, which can be attributed to task interference. In this paper, we propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM. Our MoME is composed of two key components, a mixture of vision experts (MoVE) and a mixture of language experts (MoLE). MoVE can adaptively modulate the features transformed from various vision encoders, and has a strong compatibility in transformation architecture. MoLE incorporates sparsely gated experts into LLMs to achieve painless improvements with roughly unchanged inference costs. In response to task interference, our MoME specializes in both vision and language modality to adapt to task discrepancies. Extensive experiments show that MoME significantly improves the performance of generalist MLLMs across various VL tasks. The source code is released at https://github.com/JiuTian-VL/MoME
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses task interference in multimodal large language models (MLLMs) that handle diverse vision-language (VL) tasks. A generalist MLLM performs worse on most VL tasks than MLLMs trained for specific task groups, and the paper attributes this gap to interference between tasks. To reduce it, the paper proposes "Mixture of Multimodal Experts" (MoME), which consists of two key components:

1. **Mixture of Vision Experts (MoVE)**: This component handles visual information by adaptively modulating features from different vision encoders. It includes an Adaptive Deformable Transformation (ADT) module, which transforms the features output by the different vision encoders so that their sequence lengths are unified, and an instruction-based soft router, which dynamically weights and aggregates the transformed visual features according to the given instruction.
2. **Mixture of Language Experts (MoLE)**: This component adds sparsely gated experts to the language model to improve performance at almost unchanged inference cost. The experts are designed as parameter-efficient adapters and are selected by an instance-level, sparsely activated router to meet the requirements of different tasks.

In this way, MoME adapts to task differences in the vision and language modalities separately, which reduces task interference and significantly improves the performance of generalist MLLMs across VL tasks. The experimental results show that MoME improves the model's average performance on different types of VL tasks, with particularly large gains on some task groups such as document understanding. Visualizations of the routing results further show that both MoVE and MoLE select different experts depending on the task, which supports the effectiveness of the method.
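The two routing ideas described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the linear routers, the top-1 expert choice, and the ReLU bottleneck adapter are all assumptions made for illustration, and the ADT step is assumed to have already produced features of equal sequence length for every encoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def move_soft_route(instruction_emb, encoder_feats, router_weights):
    """Soft routing in the spirit of MoVE (hypothetical helper).

    encoder_feats[k][t] is the feature vector of token t from encoder k;
    all encoders are assumed to share the same length and width.
    The instruction embedding yields one weight per encoder, and the
    fused output is the weighted sum of all encoders' features.
    """
    weights = softmax(matvec(router_weights, instruction_emb))
    num_tokens = len(encoder_feats[0])
    dim = len(encoder_feats[0][0])
    fused = [
        [sum(weights[k] * encoder_feats[k][t][d]
             for k in range(len(encoder_feats)))
         for d in range(dim)]
        for t in range(num_tokens)
    ]
    return fused, weights


def mole_sparse_route(instance_emb, hidden, adapters, router_weights):
    """Sparse routing in the spirit of MoLE (hypothetical helper).

    One adapter is chosen per instance (top-1 is an assumption), so only
    that adapter's parameters are used. Each adapter is a (down, up)
    pair of matrices applied as a residual bottleneck with ReLU.
    """
    probs = softmax(matvec(router_weights, instance_emb))
    chosen = max(range(len(probs)), key=probs.__getitem__)
    down, up = adapters[chosen]
    bottleneck = [max(0.0, v) for v in matvec(down, hidden)]
    delta = matvec(up, bottleneck)
    return [h + d for h, d in zip(hidden, delta)], chosen


if __name__ == "__main__":
    # Two encoders, two tokens each, feature width 2 (toy values).
    feats = [[[1.0, 2.0], [3.0, 4.0]], [[3.0, 6.0], [5.0, 8.0]]]
    fused, weights = move_soft_route([1.0, 0.0], feats, [[2.0, 0.0], [0.0, 0.0]])
    print("encoder weights:", weights)
    print("fused features:", fused)
```

Because only the selected adapter runs for a given instance, the per-instance compute in this sketch stays close to that of a single adapter regardless of how many experts exist, which is the intuition behind the "roughly unchanged inference costs" claim.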