Abstract: Multimodal Large Language Models (MLLMs) have gained significant attention due to their impressive capabilities in multimodal understanding. However, existing methods rely heavily on extensive modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. In this paper, we propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities that enables MLLMs to continually EVolve on modalities for $\mathbb{X}$-modal reasoning. We leverage the concept of Continual Learning and develop an incremental training strategy atop pre-trained MLLMs, enabling their expansion to new modalities using uni-modal data, without executing joint-modal pretraining. In detail, a novel Adapter-in-Adapter (AnA) framework is introduced, in which uni-modal and cross-modal adapters are seamlessly integrated to facilitate efficient modality alignment and collaboration. Additionally, an MoE-based gating module is applied between the two types of adapters to further enhance multimodal interaction. To investigate the proposed method, we establish a challenging benchmark called Continual Learning of Modality (MCL), which consists of high-quality QA data from five distinct modalities: image, video, audio, depth and point cloud. Extensive experiments demonstrate the effectiveness of the proposed AnA framework on learning plasticity and memory stability during continual learning. Furthermore, PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%. Our code is available at https://github.com/JiazuoYu/PathWeave

What problem does this paper attempt to address?

This paper addresses the excessive computational burden that existing multimodal large language models (MLLMs) incur when expanding to new modalities. Existing methods rely on large amounts of modality-specific pre-training and joint-modality fine-tuning, which consumes significant computational resources. To add a new modality, these methods must revisit all historical data and repeat the entire training process, which limits the ability of MLLMs to expand continually.

To solve this problem, the authors propose PathWeave, a flexible and scalable framework with modality-path switching and expansion abilities that enables MLLMs to evolve continually for cross-modality reasoning (X-modal reasoning). PathWeave applies the concept of Continual Learning (CL) and develops an incremental training strategy on top of pre-trained MLLMs, so that a model can expand to new modalities using unimodal data, without joint-modality pre-training.

### Specific Problem Description

1. **Excessive Computational Burden**: Existing methods rely on extensive modality-specific pre-training and joint-modality tuning, resulting in a significant computational burden.
2. **Difficulty in Expanding to New Modalities**: When expanding to new modalities, existing models need to revisit all historical data and repeat the entire training process, limiting their ability to expand continually.
3. **Lack of Flexibility**: Existing methods lack flexibility when dealing with different modalities and have difficulty efficiently adapting to and integrating new ones.

### Main Contributions of PathWeave

1. **Proposing an Efficient and Scalable Framework, PathWeave**, enabling MLLMs to expand gradually to multiple modalities without the need for joint-modality pre-training.
2. **Introducing a Novel Adapter-in-Adapter Framework**, which integrates unimodal and cross-modality adapters to enhance modality alignment and interaction, especially during the incremental learning process.
3. **Establishing a Challenging MCL Benchmark**, defining clear evaluation metrics. Extensive experimental results show that PathWeave is effective in terms of learning plasticity and memory stability, and that its performance is comparable to that of state-of-the-art MLLMs while reducing the parameter training burden by 98.73%.

Through these contributions, PathWeave addresses the computational burden and flexibility problems that existing MLLMs face when expanding to new modalities, and offers a new approach to the continual development of multimodal large language models.
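The incremental strategy summarized above (train an adapter for the new modality, keep earlier adapters frozen, and mix adapter outputs through a gate) can be illustrated with a small toy sketch. This is not the authors' implementation: the class names `Adapter` and `ModalityPaths`, the bottleneck layout, the softmax gate, and the random initialization are all assumptions made for illustration, and the in-adapters of the actual Adapter-in-Adapter design are omitted. It only shows why routing each modality through its own path leaves earlier modalities unaffected when a new one is added.

```python
import math
import random


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; stands in for a learned initialization."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class Adapter:
    """Hypothetical bottleneck adapter: delta(x) = up(relu(down(x)))."""

    def __init__(self, dim, bottleneck, rng):
        self.down = random_matrix(bottleneck, dim, rng)
        self.up = random_matrix(dim, bottleneck, rng)
        self.frozen = False  # would be excluded from gradient updates when True

    def delta(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]
        return matvec(self.up, hidden)


class ModalityPaths:
    """Toy modality-path switching: one adapter and one gate per modality."""

    def __init__(self, dim, bottleneck, seed=0):
        self.dim = dim
        self.bottleneck = bottleneck
        self.rng = random.Random(seed)
        self.order = []     # modalities in the order they were learned
        self.adapters = {}  # modality name -> Adapter
        self.gates = {}     # modality name -> gate matrix over visible adapters

    def add_modality(self, name):
        # Freeze everything learned so far, then add a fresh trainable adapter.
        for adapter in self.adapters.values():
            adapter.frozen = True
        self.order.append(name)
        self.adapters[name] = Adapter(self.dim, self.bottleneck, self.rng)
        # The gate scores the new adapter plus every earlier (frozen) adapter.
        self.gates[name] = random_matrix(len(self.order), self.dim, self.rng)

    def forward(self, name, x):
        # A modality only sees adapters that existed when it was learned.
        visible = self.order[: self.order.index(name) + 1]
        mix = softmax(matvec(self.gates[name], x))
        out = list(x)  # residual connection
        for weight, modality in zip(mix, visible):
            update = self.adapters[modality].delta(x)
            out = [o + weight * u for o, u in zip(out, update)]
        return out

    def trainable(self):
        return [m for m, a in self.adapters.items() if not a.frozen]


paths = ModalityPaths(dim=4, bottleneck=2)
paths.add_modality("image")
x = [0.5, -1.0, 0.25, 2.0]
image_before = paths.forward("image", x)

paths.add_modality("video")
print(paths.trainable())                          # ['video']
print(paths.forward("image", x) == image_before)  # True
```

In this sketch, adding `"video"` leaves the `"image"` path bit-for-bit unchanged, which is the memory-stability property, while the `"video"` path can still draw on the frozen `"image"` adapter through its gate. A real system would train the new adapter and gate with gradient descent rather than leave them at random values.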

LLMs Can Evolve Continually on Modality for X-Modal Reasoning

ModaVerse: Efficiently Transforming Modalities with LLMs

Modality Plug-and-Play: Elastic Modality Adaptation in Multimodal LLMs for Embodied AI

MM-LLMs: Recent Advances in MultiModal Large Language Models

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE

Improving Multimodal Large Language Models Using Continual Learning

LLMs Meet Multimodal Generation and Editing: A Survey

OneLLM: One Framework to Align All Modalities with Language

CaMML: Context-Aware Multimodal Learner for Large Models

Model Composition for Multimodal Large Language Models

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

NoteLLM-2: Multimodal Large Representation Models for Recommendation

A Survey on Multimodal Large Language Models

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Multimodal Representation Learning by Alternating Unimodal Adaptation

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning