Abstract: Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained vision-language foundation models for REC yields impressive performance but becomes increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters. However, directly applying PETL to REC faces two challenges: (1) insufficient multi-modal interaction between pre-trained vision-language foundation models, and (2) high GPU memory usage due to gradients passing through the heavy vision-language foundation models. To this end, we present M$^2$IST: Multi-Modal Interactive Side-Tuning with M$^3$ISAs: Mixture of Multi-Modal Interactive Side-Adapters. During fine-tuning, we keep the pre-trained uni-modal encoders fixed, updating M$^3$ISAs on side networks to progressively connect them, enabling more comprehensive vision-language alignment and efficient tuning for REC. Empirical results reveal that M$^2$IST achieves an optimal balance between performance and efficiency compared to most full fine-tuning and other PETL methods. With M$^2$IST, standard transformer-based REC methods present competitive or even superior performance compared to full fine-tuning, while utilizing only 2.11\% of the tunable parameters, 39.61\% of the GPU memory, and 63.46\% of the fine-tuning time required for full fine-tuning.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses two key problems that arise when parameter-efficient transfer learning (PETL) is applied to referring expression comprehension (REC):

1. **Insufficient multi-modal interaction**: Existing PETL methods applied to REC do not achieve sufficient interaction between the vision and language models. The pre-trained vision and language encoders are trained separately, each with a different architecture and different training data, so directly inserting ordinary adapters can leave cross-modal interaction weak, especially in the shallow layers. This hurts prediction accuracy on expressions with complex semantics, such as human actions and spatial relationships.
2. **High GPU memory consumption**: When fine-tuning for REC, existing PETL methods still pass gradients through the large pre-trained vision and language models. Because gradients must be back-propagated through these heavy backbones, GPU memory usage during fine-tuning remains high.

To solve these problems, the authors propose M$^2$IST (Multi-Modal Interactive Side-Tuning), a new multi-modal interactive side-tuning method. M$^2$IST strengthens the interaction between the vision and language models and reduces GPU memory usage by introducing M$^3$ISAs (Mixture of Multi-Modal Interactive Side-Adapters). Specifically, M$^2$IST keeps the pre-trained uni-modal encoders fixed and updates only the M$^3$ISAs on the side networks, which yields more comprehensive vision-language alignment and efficient fine-tuning.

As a result, M$^2$IST is efficient in parameters, memory, and time, and it achieves performance comparable to or better than full fine-tuning on REC while using only 2.11% of the tunable parameters, 39.61% of the GPU memory, and 63.46% of the fine-tuning time.

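
To make the "tunable parameters" comparison concrete, the sketch below counts the parameters of a generic bottleneck adapter (a down-projection followed by an up-projection) against those of a standard transformer layer. It is an illustration only: the adapter is not the M$^3$ISA design, and the dimensions (`d_model=768`, `d_ff=3072`, 12 layers, bottleneck width 64) are assumed values chosen for the example, not figures from the paper, so the resulting ratio is not expected to match the reported 2.11%.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameters of a dense layer: weight matrix plus optional bias."""
    return d_in * d_out + (d_out if bias else 0)


def bottleneck_adapter_params(d_model, d_bottleneck):
    """Generic adapter (hypothetical stand-in): down- and up-projection."""
    return (linear_params(d_model, d_bottleneck)
            + linear_params(d_bottleneck, d_model))


def transformer_layer_params(d_model, d_ff):
    """Standard layer: Q/K/V/output projections, 2-layer FFN, 2 LayerNorms."""
    attention = 4 * linear_params(d_model, d_model)
    ffn = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * (2 * d_model)  # scale and shift for each LayerNorm
    return attention + ffn + layer_norms


def tunable_ratio(num_layers, d_model, d_ff, d_bottleneck):
    """Adapter parameters relative to what full fine-tuning would update."""
    backbone = num_layers * transformer_layer_params(d_model, d_ff)
    adapters = num_layers * bottleneck_adapter_params(d_model, d_bottleneck)
    return adapters / backbone


if __name__ == "__main__":
    # Assumed, illustrative dimensions (not taken from the paper).
    ratio = tunable_ratio(num_layers=12, d_model=768, d_ff=3072, d_bottleneck=64)
    print(f"adapter / backbone parameters: {ratio:.2%}")
```

Under these assumed dimensions the adapters amount to a small single-digit percentage of the backbone, which is the regime PETL methods operate in. Parameter count alone does not capture the memory savings: those come from keeping gradients out of the frozen encoders, which this sketch does not model.
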
### Summary

The main contributions of this paper are:

1. Proposing M$^2$IST, a new multi-modal interactive side-tuning method that addresses both insufficient multi-modal interaction and high GPU memory consumption in REC.
2. Designing M$^3$ISA, which connects the pre-trained vision and language encoders and enables parameter-, memory-, and time-efficient fine-tuning.
3. Verifying the effectiveness and efficiency of M$^2$IST through extensive experiments on three widely used benchmark datasets, showing that it achieves an optimal balance between performance and efficiency.

M$^2$IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension

MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension

When Parameter-efficient Tuning Meets General-purpose Vision-language Models

M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning

Hierarchical Side-Tuning for Vision Transformers

A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping

Multiple-Exit Tuning: Towards Inference-Efficient Adaptation for Vision Transformer

Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval

Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference

Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning

Sparse Structure Search for Parameter-Efficient Tuning

Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation

VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control

LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning

Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach

$π$-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation

Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension

E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning

CPET: Effective Parameter-Efficient Tuning for Compressed Large Language Models