Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE

Xun Zhu, Ying Hu, Fanbin Mo, Miao Li, Ji Wu
2024-11-01
Abstract: Multi-modal large language models (MLLMs) have shown impressive capabilities as a general-purpose interface for various visual and linguistic tasks. However, building a unified MLLM for multi-task learning in the medical field remains a thorny challenge. To mitigate the tug-of-war problem of multi-modal multi-task optimization in MLLMs, recent advances primarily focus on improving the LLM components, while neglecting the connector that bridges the gap between modalities. In this paper, we introduce Uni-Med, a novel medical generalist foundation model which consists of a universal visual feature extraction module, a connector mixture-of-experts (CMoE) module, and an LLM. Benefiting from the proposed CMoE, which leverages a well-designed router with a mixture of projection experts at the connector, Uni-Med achieves an efficient solution to the tug-of-war problem and can perform six different medical tasks: question answering, visual question answering, report generation, referring expression comprehension, referring expression generation, and image classification. To the best of our knowledge, Uni-Med is the first effort to tackle multi-task interference at the connector in MLLMs. Extensive ablation experiments validate the effectiveness of introducing CMoE under any configuration, with average performance gains of up to 8%. We further provide interpretation analysis of the tug-of-war problem from the perspective of gradient optimization and parameter statistics. Compared to previous state-of-the-art medical MLLMs, Uni-Med achieves competitive or superior evaluation metrics on diverse tasks. Code and resources are available at https://github.com/tsinghua-msiip/Uni-Med.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper attempts to address the "tug-of-war" problem encountered when building a unified medical foundation model in multimodal, multitask learning. Specifically:

1. **"Tug-of-war" problem in multitask optimization**: In multimodal, multitask learning, different tasks may emphasize different types of features, so a single shared connector cannot accommodate the diverse modal features each task requires. This can degrade performance, especially in the highly specialized medical field.
2. **Limitations of existing methods**: Current research mainly focuses on improving the language model components while neglecting the role of the connector. Existing multitask learning methods do not address the "tug-of-war" problem at the connector level and lack detailed interpretability analysis.

To alleviate these issues, the paper proposes the **Uni-Med** model, a novel medical generalist foundation model that includes a universal visual feature extraction module, a connector mixture-of-experts (CMoE) module, and a large language model (LLM). By introducing CMoE, the model aligns the visual and language embedding spaces more effectively, mitigating the "tug-of-war" problem while performing six different medical tasks: question answering, visual question answering, report generation, referring expression comprehension, referring expression generation, and image classification.

### Main Contributions:

1. **Proposing Uni-Med**: An open-source medical generalist foundation model with a unified interface and shared parameters, capable of performing six different medical tasks.
2. **Designing CMoE**: A connector component that combines a router with a mixture of projection experts. Ablations show it is effective under any configuration, with average performance gains of up to 8%.
3. **Interpreting the "tug-of-war" problem**: A detailed interpretability analysis from the perspectives of gradient optimization and parameter statistics.
4. **Competitive or superior performance on various tasks and datasets**: Compared to existing open-source, state-of-the-art medical multimodal large language models, Uni-Med achieves competitive or superior evaluation metrics across diverse tasks.

### Experimental Results:

- **Ablation studies**: Validated the effectiveness of CMoE under different configurations and across tasks and datasets.
- **Impact of compression rate and number of projection experts**: Explored how different visual feature compression rates and numbers of projection experts affect model performance, finding that appropriate visual feature compression can improve training efficiency without loss of performance.
- **Effectiveness of LoRA and LoRA-MoE**: Compared Low-Rank Adaptation (LoRA) and LoRA-MoE for LLM fine-tuning, further validating the advantages of CMoE.

In summary, the paper mitigates the "tug-of-war" problem in multitask learning by introducing the CMoE module and reports competitive or superior performance across multiple medical tasks.
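
The idea of a router combined with a mixture of projection experts at the connector can be illustrated with a small sketch. This is not the authors' implementation: it assumes a token-level soft router (a softmax over expert scores) and single linear projections as experts, and the class name `ConnectorMoE`, its parameters, and the toy dimensions are all hypothetical. It uses only the Python standard library so that it runs as written.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ConnectorMoE:
    """Hypothetical connector: a soft router mixing linear projection experts."""

    def __init__(self, visual_dim, llm_dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Router maps a visual token to one score per expert (assumed design).
        self.router = random_matrix(num_experts, visual_dim, rng)
        # Each expert projects from the visual space to the LLM embedding space.
        self.experts = [
            random_matrix(llm_dim, visual_dim, rng) for _ in range(num_experts)
        ]

    def route(self, token):
        """Return per-expert mixing weights for one visual token."""
        return softmax(matvec(self.router, token))

    def forward(self, tokens):
        """Project each visual token as a router-weighted sum of expert outputs."""
        outputs = []
        for token in tokens:
            weights = self.route(token)
            projected = [matvec(expert, token) for expert in self.experts]
            mixed = [
                sum(w * p[i] for w, p in zip(weights, projected))
                for i in range(len(projected[0]))
            ]
            outputs.append(mixed)
        return outputs


# Toy usage: two visual tokens of width 4 mapped to LLM embeddings of width 6.
connector = ConnectorMoE(visual_dim=4, llm_dim=6, num_experts=3)
visual_tokens = [[0.5, -1.0, 0.25, 2.0], [1.0, 0.0, -0.5, 0.1]]
aligned = connector.forward(visual_tokens)
```

Because the mixing weights depend on the input token, tokens from different tasks can lean on different projection experts instead of competing for one shared projection, which is the intuition behind easing the tug-of-war at the connector.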