MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension

Ting Liu, Zunnan Xu, Yue Hu, Liangtao Shi, Zhiqiang Wang, Quanjun Yin

2024-10-06

Abstract: Referring Expression Comprehension (REC), which aims to ground a local visual region via natural language, is a task that heavily relies on multimodal alignment. Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning. However, full fine-tuning the entire backbone not only breaks the rich prior knowledge embedded in the pre-training, but also incurs significant computational costs. Motivated by the recent emergence of Parameter-Efficient Transfer Learning (PETL) methods, we aim to solve the REC task in an effective and efficient manner. Directly applying these PETL methods to the REC task is inappropriate, as they lack the specific-domain abilities for precise local visual perception and visual-language alignment. Therefore, we propose a novel framework of Multimodal Prior-guided Parameter Efficient Tuning, namely MaPPER. Specifically, MaPPER comprises Dynamic Prior Adapters guided by an aligned prior, and Local Convolution Adapters to extract precise local semantics for better visual perception. Moreover, the Prior-Guided Text module is proposed to further utilize the prior for facilitating the cross-modal alignment. Experimental results on three widely-used benchmarks demonstrate that MaPPER achieves the best accuracy compared to the full fine-tuning and other PETL methods with only 1.41% tunable backbone parameters. Our code is available at https://github.com/liuting20/MaPPER.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

The paper addresses two main issues in the Referring Expression Comprehension (REC) task:

1. **Drawbacks of Full Fine-tuning**: Fully fine-tuning the backbone disrupts the rich prior knowledge in pre-trained models and incurs significant computational cost, especially for large-scale foundation models. This makes such models hard to adapt for researchers with limited hardware resources.
2. **Limitations of Existing Parameter-Efficient Transfer Learning (PETL) Methods**: Directly applying existing PETL methods to REC is not suitable, because they lack precise local visual perception and effective visual-language alignment.

To address these issues, the authors propose MaPPER (Multimodal Prior-guided Parameter Efficient Tuning). The framework combines Dynamic Prior Adapters (DyPA), which are guided by an aligned prior, with Local Convolution Adapters (LoCA), which extract precise local semantics for better visual perception. It further promotes cross-modal alignment through a Prior-Guided Text module (PGT). Experimental results on three widely used benchmarks show that MaPPER achieves the best accuracy compared with full fine-tuning and other PETL methods while tuning only 1.41% of the backbone parameters.
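To give a rough sense of what tuning only a small share of backbone parameters means, the sketch below counts the parameters of a generic bottleneck adapter (down-projection, nonlinearity, up-projection) added to every layer of a frozen backbone. It is not MaPPER's DyPA or LoCA design. The function names, the dimensions, the layer count, and the backbone size are all hypothetical, and defining the ratio as tunable parameters divided by frozen backbone parameters is an assumption rather than the paper's stated accounting.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer mapping d_in -> d_out."""
    return d_in * d_out + (d_out if bias else 0)


def bottleneck_adapter_params(hidden_dim: int, bottleneck_dim: int) -> int:
    """Parameters of a generic bottleneck adapter (hypothetical helper).

    The adapter is a down-projection followed by an up-projection;
    the nonlinearity in between has no parameters.
    """
    down = linear_params(hidden_dim, bottleneck_dim)
    up = linear_params(bottleneck_dim, hidden_dim)
    return down + up


def tunable_fraction(frozen_backbone_params: int,
                     adapter_params_per_layer: int,
                     num_layers: int) -> float:
    """Tunable adapter parameters relative to the frozen backbone size.

    Assumption: one adapter per layer, backbone weights fully frozen.
    """
    tunable = adapter_params_per_layer * num_layers
    return tunable / frozen_backbone_params


# Illustrative numbers only; they are NOT the configuration used in the paper.
per_layer = bottleneck_adapter_params(hidden_dim=768, bottleneck_dim=64)
fraction = tunable_fraction(
    frozen_backbone_params=100_000_000,
    adapter_params_per_layer=per_layer,
    num_layers=12,
)
print(f"adapter parameters per layer: {per_layer}")
print(f"tunable fraction of backbone: {fraction:.2%}")
```

Because the adapter cost grows as roughly `2 * hidden_dim * bottleneck_dim` per layer, shrinking the bottleneck width is the main lever for keeping the tunable fraction small in this kind of setup.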

M$^2$IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension

When Parameter-efficient Tuning Meets General-purpose Vision-language Models

VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control

Towards a Unified View on Visual Parameter-Efficient Transfer Learning

CPET: Effective Parameter-Efficient Tuning for Compressed Large Language Models

Mixture of Physical Priors Adapter for Parameter-Efficient Fine-Tuning

Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation

PVP: Pre-trained Visual Parameter-Efficient Tuning

A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping

BarLeRIa: An Efficient Tuning Framework for Referring Image Segmentation

Gradient Projection For Continual Parameter-Efficient Tuning

An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models

PARA: Parameter-Efficient Fine-tuning with Prompt-Aware Representation Adjustment

VioLET: Vision-Language Efficient Tuning with Collaborative Multi-modal Gradients

ConPET: Continual Parameter-Efficient Tuning for Large Language Models

Rethinking Efficient Tuning Methods from a Unified Perspective

Dynamic Visual Prompt Tuning for Parameter Efficient Transfer Learning

ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts

A Unified Continual Learning Framework with General Parameter-Efficient Tuning

Towards Efficient Visual Adaption via Structural Re-parameterization