A Wander Through the Multimodal Landscape: Efficient Transfer Learning via Low-rank Sequence Multimodal Adapter

Zirun Guo, Xize Cheng, Yangyang Wu, Tao Jin
2024-12-12
Abstract: Efficient transfer learning methods such as adapter-based methods have shown great success in unimodal models and vision-language models. However, existing methods have two main challenges in fine-tuning multimodal models. Firstly, they are designed for vision-language tasks and fail to extend to situations where there are more than two modalities. Secondly, they exhibit limited exploitation of interactions between modalities and lack efficiency. To address these issues, in this paper, we propose the loW-rank sequence multimodal adapter (Wander). We first use the outer product to fuse the information from different modalities in an element-wise way effectively. For efficiency, we use CP decomposition to factorize tensors into rank-one components and achieve substantial parameter reduction. Furthermore, we implement a token-level low-rank decomposition to extract more fine-grained features and sequence relationships between modalities. With these designs, Wander enables token-level interactions between sequences of different modalities in a parameter-efficient way. We conduct extensive experiments on datasets with different numbers of modalities, where Wander outperforms state-of-the-art efficient transfer learning methods consistently. The results fully demonstrate the effectiveness, efficiency and universality of Wander.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two main problems in transfer learning for multimodal models:

1. **Existing methods apply only to vision-language tasks and do not extend to more modalities**: Existing multimodal transfer learning techniques (such as Sung, Cho, and Bansal 2022; Lu et al. 2024) are mainly limited to fine-tuning vision-language models. They focus only on the interaction between these two modalities and cannot be applied to multimodal models with more than two modalities.
2. **Limited and inefficient use of inter-modal interactions**: When existing multimodal transfer learning strategies are applied to multimodal models, their use of inter-modal interactions is limited and inefficient. Specifically, these methods fuse a single vector representation from each modality rather than sequences of vector representations, so they ignore the interactions between modalities along the temporal (sequence) dimension.

To solve these problems, the authors propose the low-rank sequence multimodal adapter (Wander). Wander improves multimodal transfer learning in the following ways:

- **Element-level information fusion**: Use the outer product to fuse information from different modalities in an element-wise manner.
- **Tensor decomposition**: Use CP decomposition to factorize tensors into rank-one components, which substantially reduces the number of parameters.
- **Fine-grained feature extraction**: Apply a token-level low-rank decomposition to extract more fine-grained features and the sequence relationships between modalities.

Together, these designs let Wander achieve fine-grained, token-level interactions between sequences of different modalities in a parameter-efficient way. Experimental results show that Wander consistently outperforms existing efficient transfer learning methods on datasets with different numbers of modalities, demonstrating its effectiveness, efficiency, and generality.

### Formula Summary

1. **Adapter Module**:
   \[ \text{Adapter}(x) = x + \text{Up}(\text{Nonlinear}(\text{Down}(x))) \]
2. **CP Decomposition**:
   \[ X = \sum_{r = 1}^{R} \bigotimes_{n = 1}^{N} a_r^n \]
   where \( R \) is the rank, \( a_r^n \in \mathbb{R}^{d_n} \), and \( \bigotimes_{n = 1}^{N} \) denotes the tensor outer product.
3. **Outer-Product Multimodal Fusion**:
   \[ H = \bigotimes_{m = 1}^{M} h_m \]
   \[ \tilde{H} = W \cdot H + b = W \cdot \left( \bigotimes_{m = 1}^{M} h_m \right) + b \]
   where \( W \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_M \times d_h} \), and \( b \) is the bias term of the linear layer.
4. **Low-Rank Single-Vector Fusion**:
   \[ W_k^h = \sum_{r = 1}^{R} \bigotimes_{m = 1}^{M} w_r^{h,m,k} \]
   \[ W_h = \sum_{r = 1}^{R} \bigotimes_{m = 1}^{M} w_r^{h,m} \]
5. **Low-Rank Sequential Fusion**:
   \[ H_t = H \cdot W_h = \left( \bigotimes_{m = 1}^{M} h_m \right) \cdot W_h \]
   \[ \tilde{H}_t = W_t \cdot H_t = W_t \cdot H \cdot W_h \]
   \[ W_t = \sum_{r = 1}^{R} \bigotimes_{m = 1}^{M} w_r^{t,m} \]
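To make the efficiency argument behind the outer-product fusion and its low-rank form concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It checks numerically that contracting the full outer product \( H \) with a CP-structured weight tensor gives the same result as projecting each modality separately and multiplying the projections element-wise, which never materializes \( H \) or \( W \). The function names, the toy dimensions, the nested-list layout `factors[r][m][i][k]`, and the omission of the bias \( b \) are all assumptions made for illustration.

```python
import itertools
import math
import random


def full_fusion(hs, factors):
    """Naive path: visit every entry of H = h_1 (x) ... (x) h_M and contract it
    with the CP-reconstructed weight slice W_k = sum_r (x)_m w_r^{m,k}."""
    dims = [len(h) for h in hs]
    rank = len(factors)
    d_out = len(factors[0][0][0])  # factors[r][m] is a d_m x d_out matrix
    out = [0.0] * d_out
    for idx in itertools.product(*(range(d) for d in dims)):
        h_entry = math.prod(h[i] for h, i in zip(hs, idx))  # H[i_1, ..., i_M]
        for k in range(d_out):
            w_entry = sum(
                math.prod(factors[r][m][i][k] for m, i in enumerate(idx))
                for r in range(rank)
            )
            out[k] += w_entry * h_entry
    return out


def low_rank_fusion(hs, factors):
    """Factorized path: project each modality on its own, multiply the
    projections across modalities, then sum over the rank index."""
    rank = len(factors)
    d_out = len(factors[0][0][0])
    out = [0.0] * d_out
    for r in range(rank):
        for k in range(d_out):
            out[k] += math.prod(
                sum(h[i] * factors[r][m][i][k] for i in range(len(h)))
                for m, h in enumerate(hs)
            )
    return out


random.seed(0)
dims, d_out, rank = [8, 8, 8], 4, 2  # three toy modalities (assumed sizes)
hs = [[random.uniform(-1, 1) for _ in range(d)] for d in dims]
factors = [
    [[[random.uniform(-1, 1) for _ in range(d_out)] for _ in range(d)] for d in dims]
    for _ in range(rank)
]

naive = full_fusion(hs, factors)
fast = low_rank_fusion(hs, factors)
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(naive, fast))

full_params = math.prod(dims) * d_out       # entries of the full tensor W
low_rank_params = rank * sum(dims) * d_out  # entries of all CP factors
print(full_params, low_rank_params)  # prints: 2048 192
```

In this toy setting the full weight tensor grows with the product of the modality dimensions, while the factorized form grows with their sum, which is why the low-rank form stays tractable as modalities are added.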