Abstract: Efficient transfer learning methods such as adapter-based methods have shown great success in unimodal models and vision-language models. However, existing methods have two main challenges in fine-tuning multimodal models. Firstly, they are designed for vision-language tasks and fail to extend to situations where there are more than two modalities. Secondly, they exhibit limited exploitation of interactions between modalities and lack efficiency. To address these issues, in this paper, we propose the loW-rank sequence multimodal adapter (Wander). We first use the outer product to fuse the information from different modalities in an element-wise way effectively. For efficiency, we use CP decomposition to factorize tensors into rank-one components and achieve substantial parameter reduction. Furthermore, we implement a token-level low-rank decomposition to extract more fine-grained features and sequence relationships between modalities. With these designs, Wander enables token-level interactions between sequences of different modalities in a parameter-efficient way. We conduct extensive experiments on datasets with different numbers of modalities, where Wander outperforms state-of-the-art efficient transfer learning methods consistently. The results fully demonstrate the effectiveness, efficiency and universality of Wander.
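To make the "substantial parameter reduction" claim concrete, the following is a rough counting sketch, not the authors' code. It assumes a dense fusion layer that maps the outer product of M modality vectors to `d_out` outputs, and a CP-factorized version that stores one factor vector per modality, per rank-one component, per output. The feature sizes, output size, and rank are hypothetical values chosen only for illustration, and biases are ignored.

```python
from math import prod


def full_fusion_params(dims, d_out):
    """Weight count of a dense layer applied to the outer product of all modality vectors."""
    return prod(dims) * d_out


def cp_fusion_params(dims, d_out, rank):
    """Weight count when each output's weight tensor is a sum of `rank` rank-one terms.

    Every rank-one term stores one vector of length d_m per modality.
    """
    return rank * sum(dims) * d_out


dims = [768, 768, 768]  # hypothetical feature sizes for three modalities
d_out, rank = 64, 4     # hypothetical output size and CP rank

full = full_fusion_params(dims, d_out)
low_rank = cp_fusion_params(dims, d_out, rank)
print(f"dense: {full:,}  CP: {low_rank:,}  ratio: {full // low_rank:,}x")
```

The dense count grows with the product of the modality sizes, while the factorized count grows with their sum, which is why the gap widens as more modalities are added.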

What problem does this paper attempt to address?

This paper attempts to solve two main problems of multimodal models in transfer learning:

1. **Existing methods are only applicable to vision-language tasks and cannot be extended to cases with more modalities**: Existing multimodal transfer learning techniques (such as Sung, Cho, and Bansal 2022; Lu et al. 2024) are mainly limited to fine-tuning vision-language models. They focus only on the interaction between these two modalities and cannot be applied to multimodal models containing more modalities.
2. **Limited and inefficient use of inter-modal interactions**: When existing multimodal transfer learning strategies are applied to multimodal models, their use of inter-modal interactions is limited and inefficient. Specifically, these methods fuse a single vector representation per modality rather than sequences of vector representations, and so ignore the interactions between modalities along the temporal dimension.

To solve these problems, the authors propose the Low-rank Sequence Multimodal Adapter (Wander). Wander improves multimodal transfer learning in the following ways:

- **Element-wise information fusion**: Use the outer product to fuse information from different modalities in an element-wise manner.
- **Tensor decomposition**: Use CP decomposition to factorize tensors into rank-one components, significantly reducing the number of parameters.
- **Fine-grained feature extraction**: Extract more fine-grained features and inter-modal sequence relationships through low-rank decomposition at the token level.

These designs enable Wander to achieve fine-grained token-level interactions between sequences of different modalities in a parameter-efficient way. Experimental results show that Wander consistently outperforms existing efficient transfer learning methods on multiple datasets, demonstrating its effectiveness, efficiency, and generality.

### Formula Summary

1. **Adapter Module**:
   \[ \text{Adapter}(x) = x + \text{Up}(\text{Nonlinear}(\text{Down}(x))) \]
2. **CP Decomposition**:
   \[ X = \sum_{r = 1}^{R} \bigotimes_{n = 1}^{N} a_r^n \]
   where \( R \) is the rank, \( a_r^n \in \mathbb{R}^{d_n} \), and \( \bigotimes_{n = 1}^{N} \) represents the tensor outer product operation.
3. **Outer-Product Multimodal Fusion**:
   \[ H = \bigotimes_{m = 1}^{M} h_m \]
   \[ \tilde{H} = W \cdot H + b = W \cdot \left( \bigotimes_{m = 1}^{M} h_m \right) + b \]
   where \( W \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_M \times d_h} \), and \( b \) is the bias term of the linear layer.
4. **Low-Rank Single-Vector Fusion**:
   \[ W_k^h = \sum_{r = 1}^{R} \bigotimes_{m = 1}^{M} w_r^{h,m,k} \]
   \[ W_h = \sum_{r = 1}^{R} \bigotimes_{m = 1}^{M} w_r^{h,m} \]
5. **Low-Rank Sequential Fusion**:
   \[ H_t = H \cdot W_h = \left( \bigotimes_{m = 1}^{M} h_m \right) \cdot W_h \]
   \[ \tilde{H}_t = W_t \cdot H_t = W_t \cdot H \cdot W_h \]
   \[ W_t = \sum_{r = 1}^{R} \bigotimes_{m = 1}^{M} w_r^{t,m} \]
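The practical point of the low-rank single-vector fusion is that the outer product never has to be built: when each output's weight tensor is a sum of rank-one terms, contracting it with the outer product of the modality vectors reduces to a sum over components of products of ordinary dot products. The sketch below checks this numerically in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `factors[r][m][k]` layout (standing in for \( w_r^{h,m,k} \)), and the toy sizes are hypothetical; the bias is omitted; and the sequence-level factor \( W_t \) is not covered.

```python
import itertools
import random
from math import prod


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def low_rank_fusion(hs, factors):
    """Fuse without forming the outer product.

    hs[m] is the vector of modality m; factors[r][m][k] is a length-d_m vector.
    Output k is sum over r of prod over m of <factors[r][m][k], hs[m]>.
    """
    d_out = len(factors[0][0])
    return [
        sum(prod(dot(comp[m][k], h) for m, h in enumerate(hs)) for comp in factors)
        for k in range(d_out)
    ]


def dense_fusion(hs, factors):
    """Reference path: rebuild each weight entry and contract with the outer product."""
    d_out = len(factors[0][0])
    dims = [len(h) for h in hs]
    out = []
    for k in range(d_out):
        total = 0.0
        for idx in itertools.product(*(range(d) for d in dims)):
            # Entry of the weight tensor for output k, summed over rank-one terms.
            w_entry = sum(prod(comp[m][k][i] for m, i in enumerate(idx)) for comp in factors)
            # Matching entry of the outer product of the modality vectors.
            h_entry = prod(h[i] for h, i in zip(hs, idx))
            total += w_entry * h_entry
        out.append(total)
    return out


random.seed(0)
dims, d_out, rank = [3, 4, 2], 5, 2  # toy sizes, three modalities
hs = [[random.uniform(-1, 1) for _ in range(d)] for d in dims]
factors = [
    [[[random.uniform(-1, 1) for _ in range(d)] for _ in range(d_out)] for d in dims]
    for _ in range(rank)
]

dense = dense_fusion(hs, factors)
fast = low_rank_fusion(hs, factors)
max_gap = max(abs(a - b) for a, b in zip(dense, fast))
print("largest difference between the two paths:", max_gap)
```

The two paths agree up to floating-point rounding, but the dense path loops over every entry of the outer product while the low-rank path only computes one dot product per modality, component, and output.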

A Wander Through the Multimodal Landscape: Efficient Transfer Learning via Low-rank Sequence Multimodal Adapter
