Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models

Juncheng Yang, Zuchao Li, Shuai Xie, Weiping Zhu, Wei Yu, Shijun Li

2024-04-19

Abstract: Adapter-based parameter-efficient transfer learning has achieved exciting results in vision-language models. Traditional adapter methods often require training or fine-tuning, facing challenges such as insufficient samples or resource limitations. While some methods overcome the need for training by leveraging image modality cache and retrieval, they overlook the text modality's importance and cross-modal cues for the efficient adaptation of parameters in visual-language models. This work introduces a cross-modal parameter-efficient approach named XMAdapter. XMAdapter establishes cache models for both text and image modalities. It then leverages retrieval through visual-language bimodal information to gather clues for inference. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion, decoupling different modal similarities to assess their respective contributions. Additionally, it explores hard samples based on differences in cross-modal affinity and enhances model performance through adaptive adjustment of sample learning intensity. Extensive experimental results on benchmark datasets demonstrate that XMAdapter outperforms previous adapter-based methods significantly regarding accuracy, generalization, and efficiency.

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

This paper addresses the challenges of parameter-efficient transfer learning for vision-language models (VLMs). Traditional transfer learning usually fine-tunes all of a model's parameters, which costs more time and memory as models grow and can lead to overfitting when few samples are available. To deal with these problems, the paper proposes XMAdapter, a cross-modal parameter-efficient transfer learning method. Its main contributions are as follows:

1. **Cross-modal cache model**: Separate cache models are built for images and for text, and clues for inference are gathered by retrieving over both visual and language information, which fuses information across the two modalities.
2. **Dynamic fusion ratio**: The ratio between image and text affinities is adjusted dynamically, decoupling the similarity measures of the two modalities so that each one's contribution can be assessed and hard-to-classify samples are handled better.
3. **Hard sample mining**: Based on the differences in affinity between modalities, the learning intensity for hard samples is adjusted adaptively to further improve model performance.

The reported experiments show that XMAdapter outperforms existing adapter-based methods on multiple benchmark datasets in accuracy, generalization, and efficiency.
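The three ideas summarized above can be illustrated with a small sketch. It is not the paper's implementation. It assumes that each cached sample has one image key and one text key, and that affinity follows the `exp(-beta * (1 - cosine))` form common in cache-based adapters. The function names, the fixed `ratio`, and the `hardness_weight` formula are hypothetical choices made only for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def affinities(query, keys, beta=5.0):
    # Assumed affinity form: exp(-beta * (1 - cosine similarity)).
    q = normalize(query)
    return [math.exp(-beta * (1.0 - dot(q, normalize(k)))) for k in keys]

def fused_cache_logits(query, image_keys, text_keys, labels, num_classes, ratio=0.5):
    # Retrieve from both caches, blend the two affinities with `ratio`,
    # and accumulate the blended score on each cached sample's class.
    a_img = affinities(query, image_keys)
    a_txt = affinities(query, text_keys)
    logits = [0.0] * num_classes
    for ai, at, y in zip(a_img, a_txt, labels):
        logits[y] += ratio * ai + (1.0 - ratio) * at
    return logits, a_img, a_txt

def hardness_weight(a_img, a_txt, gamma=1.0):
    # Hypothetical rule: the more the two modalities disagree about a
    # sample, the harder it is assumed to be and the larger its weight.
    gap = sum(abs(i - t) for i, t in zip(a_img, a_txt)) / len(a_img)
    return 1.0 + gamma * gap

# Toy two-class cache with 2-D features (made-up values).
image_keys = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[0.9, 0.1], [0.1, 0.9]]
labels = [0, 1]
query = [0.8, 0.2]

logits, a_img, a_txt = fused_cache_logits(
    query, image_keys, text_keys, labels, num_classes=2
)
predicted = max(range(len(logits)), key=lambda c: logits[c])
weight = hardness_weight(a_img, a_txt)
```

In this sketch `ratio` is a fixed constant. The summary describes the fusion ratio as adjusted dynamically, so a closer version would make it learnable or sample-dependent and would use `weight` to scale the per-sample training loss.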

UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks

Cross-Modal Adapter for Text-Video Retrieval

Multi-Modal Adapter for Vision-Language Models

Parameter-Efficient Transfer Learning for Audio-Visual-Language Tasks

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval

A Wander Through the Multimodal Landscape: Efficient Transfer Learning via Low-rank Sequence Multimodal Adapter

Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision

Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval

Efficient Transfer Learning for Video-language Foundation Models

UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval

Multiway-Adapter: Adapting Multimodal Large Language Models for Scalable Image-Text Retrieval

Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models

p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models

GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph

HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter

Exploiting Adapters for Cross-Lingual Low-Resource Speech Recognition

Towards Efficient Visual Adaption via Structural Re-parameterization