Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

Jinqiang Long, Yanqi Dai, Guoxing Yang, Hongpeng Lin, Nanyi Fei, Yizhao Gao, Zhiwu Lu
2024-11-16
Abstract: As research on Multimodal Large Language Models (MLLMs) becomes popular, an advanced MLLM is typically required to handle various textual and visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously for real-world applications. However, due to the significant differences in representation and distribution among data from various tasks, simply mixing data of all tasks together leads to the well-known "multi-task conflict" issue, resulting in performance degradation across various tasks. To address this issue, we propose Awaker2.5-VL, a Mixture of Experts (MoE) architecture suitable for MLLMs, which acquires multi-task capabilities through multiple sparsely activated experts. To speed up the training and inference of Awaker2.5-VL, each expert in our model is devised as a low-rank adaptation (LoRA) structure. Extensive experiments on multiple recent benchmarks demonstrate the effectiveness of Awaker2.5-VL. The code and model weights are released on our project page (https://github.com/MetabrainAGI/Awaker).
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to solve the "multi-task conflict" problem that multimodal large language models (MLLMs) encounter when handling a variety of textual and visual tasks. Specifically, because data from different tasks differ significantly in representation and distribution, simply mixing the data of all tasks together for training leads to performance degradation across tasks. To solve this problem, the authors propose Awaker2.5-VL, which adopts a parameter-efficient Mixture of Experts (MoE) architecture and obtains multi-task capabilities through multiple sparsely activated experts. Each expert is designed as a Low-Rank Adaptation (LoRA) structure to accelerate training and inference. Experimental results on several recent benchmarks demonstrate the effectiveness of Awaker2.5-VL.
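
To make the idea of sparsely activated LoRA experts concrete, here is a minimal, dependency-free sketch. It is not the authors' implementation: the class name `LoRAMoELinear`, the top-k softmax router, the zero initialization of the `B` matrices (a common LoRA convention), and all sizes are assumptions chosen for illustration. The abstract only states that the experts are sparsely activated and that each one is a LoRA structure.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian-initialized matrix, stored as a list of rows."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LoRAMoELinear:
    """Hypothetical layer: a frozen linear map plus sparsely routed LoRA experts."""

    def __init__(self, d_in, d_out, num_experts=4, rank=2, top_k=1, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.base = rand_matrix(d_out, d_in, rng)          # frozen base weight W
        self.router = rand_matrix(num_experts, d_in, rng)  # gating weights (assumed form)
        # Each expert is a low-rank pair (A, B). B starts at zero (common LoRA
        # convention), so the layer initially equals the frozen base layer.
        self.experts = [
            (rand_matrix(rank, d_in, rng), [[0.0] * rank for _ in range(d_out)])
            for _ in range(num_experts)
        ]

    def route(self, x):
        """Return {expert_index: gate_weight} for the top-k scoring experts."""
        logits = matvec(self.router, x)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = ranked[: self.top_k]
        # Softmax over the selected logits only, so the gates sum to 1.
        peak = max(logits[i] for i in chosen)
        exps = {i: math.exp(logits[i] - peak) for i in chosen}
        total = sum(exps.values())
        return {i: e / total for i, e in exps.items()}

    def forward(self, x):
        """Compute W x + sum_i g_i * B_i (A_i x) over the selected experts."""
        y = matvec(self.base, x)
        for i, gate in self.route(x).items():
            a, b = self.experts[i]
            delta = matvec(b, matvec(a, x))  # low-rank update B (A x)
            y = [yi + gate * di for yi, di in zip(y, delta)]
        return y


layer = LoRAMoELinear(d_in=4, d_out=3, num_experts=4, rank=2, top_k=2)
x = [1.0, -0.5, 0.25, 2.0]
gates = layer.route(x)
output = layer.forward(x)
```

Only the `top_k` selected experts contribute to each forward pass, and each expert adds `rank * (d_in + d_out)` parameters instead of a full `d_in * d_out` weight matrix. That combination is the kind of saving the "parameter-efficient" framing refers to.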