Q-MoE: Connector for MLLMs with Text-Driven Routing

Hanzi Wang, Jiamin Ren, Yifeng Ding, Lei Ren, Huixing Jiang, Wei Chen, Fangxiang Feng, Xiaojie Wang
DOI: https://doi.org/10.1145/3664647.3681369
2024-01-01
Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable advances in handling various vision-language tasks. These models typically consist of a Large Language Model (LLM), a vision encoder, and a connector that bridges the modality gap between vision and language. It is challenging for the connector to filter the right visual information for the LLM according to the task at hand. Most previous connectors, such as lightweight projection and Q-Former, treat visual information uniformly across diverse tasks and therefore lack task-specific visual information extraction capabilities. To address this issue, this paper proposes Q-MoE, a query-based connector with Mixture-of-Experts (MoE) that extracts task-specific information with text-driven routing. Furthermore, an optimal-path-based training strategy is proposed to find an optimal expert combination. Extensive experiments on two popular open-source LLMs and several different visual-language tasks demonstrate the effectiveness of the Q-MoE connector.