Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding

Fengyuan Shi, Ruopeng Gao, Weilin Huang, Limin Wang

2023-10-26

Abstract: The multimodal transformer exhibits high capacity and flexibility to align image and text for visual grounding. However, the existing encoder-only grounding framework (e.g., TransVG) suffers from heavy computation due to the self-attention operation with quadratic time complexity. To address this issue, we present a new multimodal transformer architecture, coined Dynamic Multimodal DETR (Dynamic MDETR), by decoupling the whole grounding process into encoding and decoding phases. The key observation is that there exists high spatial redundancy in images. Thus, we devise a new dynamic multimodal transformer decoder that exploits this sparsity prior to speed up the visual grounding process. Specifically, our dynamic decoder is composed of a 2D adaptive sampling module and a text-guided decoding module. The sampling module selects informative patches by predicting offsets with respect to a reference point, while the decoding module extracts the grounded object information by performing cross attention between image features and text features. These two modules are stacked alternately to gradually bridge the modality gap and iteratively refine the reference point of the grounded object, eventually realizing the objective of visual grounding. Extensive experiments on five benchmarks demonstrate that our proposed Dynamic MDETR achieves competitive trade-offs between computation and accuracy. Notably, using only 9% of the feature points in the decoder, we can reduce the GFLOPs of the multimodal transformer by ~44% while still achieving higher accuracy than the encoder-only counterpart. In addition, to verify its generalization ability and scale up our Dynamic MDETR, we build the first one-stage CLIP-empowered visual grounding framework and achieve state-of-the-art performance on these benchmarks.
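The sampling step described in the abstract can be illustrated with a minimal sketch. It assumes the offsets are already given (in the paper they are predicted by the model), that features live on a small 2D grid, and that off-grid locations are read with bilinear interpolation. The names `bilinear_sample` and `sample_around_reference` are hypothetical and do not come from the paper's code.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate a grid of feature vectors at a continuous (x, y)."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp so that offsets pointing outside the image read the border value.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    channels = len(feature_map[0][0])
    return [
        (1 - wy) * ((1 - wx) * feature_map[y0][x0][c] + wx * feature_map[y0][x1][c])
        + wy * ((1 - wx) * feature_map[y1][x0][c] + wx * feature_map[y1][x1][c])
        for c in range(channels)
    ]

def sample_around_reference(feature_map, reference_point, offsets):
    """Gather one feature vector per offset, relative to the reference point.

    Assumption: `offsets` stands in for the output of a learned offset predictor.
    """
    ref_x, ref_y = reference_point
    return [bilinear_sample(feature_map, ref_x + dx, ref_y + dy) for dx, dy in offsets]

# Toy 4x4 feature map with one channel; the value at (row, col) is row * 4 + col.
feature_map = [[[float(row * 4 + col)] for col in range(4)] for row in range(4)]
offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, 1.0)]  # hypothetical predicted offsets
sampled = sample_around_reference(feature_map, (1.0, 1.0), offsets)
print(sampled)
```

Only three of the sixteen grid locations are read here, which is the source of the savings: the decoder attends over a small sampled set rather than every image token.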

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the computational cost of visual grounding. Specifically:

1. **Problems with Existing Methods**:
   - Existing encoder-only frameworks (such as TransVG) are computationally expensive because self-attention has quadratic time complexity.
   - Images contain many spatially redundant regions that contribute little to the final prediction but still participate in the computation.

2. **Proposed Method**:
   - A new multimodal transformer architecture, called "Dynamic Multimodal DETR (Dynamic MDETR)," improves computational efficiency by decoupling the grounding process into encoding and decoding stages.
   - The dynamic decoder exploits the sparsity prior in images to accelerate visual grounding through a 2D adaptive sampling module and a text-guided decoding module.

3. **Specific Improvements**:
   - The 2D adaptive sampling module selects a small number of informative image patches by predicting offsets relative to a reference point, which significantly reduces computational cost.
   - The text-guided decoding module extracts information about the target object through cross attention between image features and text features, and the two modules are stacked alternately to refine the reference point iteratively.

4. **Experimental Results**:
   - Using only 9% of the feature points in the decoder, Dynamic MDETR reduces the GFLOPs of the multimodal transformer by approximately 44% while achieving higher accuracy than the encoder-only counterpart.
   - With the same number of encoding layers, Dynamic MDETR (based on ResNet-50) outperforms TransVG (based on ResNet-101) with lower additional computational costs.

In summary, this paper tackles the computational inefficiency of existing visual grounding methods by introducing a new dynamic multimodal decoder, and it validates the decoder's effectiveness and generalization ability on multiple benchmark datasets.
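The text-guided decoding step can be sketched as single-head scaled dot-product cross attention. This is a simplified illustration, not the paper's implementation: it assumes a single query vector attends over the concatenation of sampled visual tokens and text tokens, and it omits learned projections, multiple heads, and layer normalization. The function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * value[c] for w, value in zip(weights, values)) for c in range(dim)]
    return output, weights

# Assumption: the decoder query attends over sampled visual tokens plus text tokens.
sampled_visual_tokens = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for sampled patches
text_tokens = [[1.0, 0.0]]                        # toy stand-in for a text embedding
tokens = sampled_visual_tokens + text_tokens

query = [1.0, 0.0]
output, weights = cross_attention(query, tokens, tokens)
print(weights)
print(output)
```

The attention cost grows with the number of tokens in `tokens`, so shrinking the visual set through sampling directly shrinks the work done in this step.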

An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

Dynamic Inference with Grounding Based Vision and Language Models

SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

TransVG: End-to-End Visual Grounding with Transformers

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

Transformer-based Visual Grounding with Cross-modality Interaction

DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

Improving visual grounding with multi-scale discrepancy information and centralized-transformer

MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer

CTFCD: Channel Transformer Based on Full Convolutional Decoder for Single Image Deraining

Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding

MV-DETR: Multi-modality indoor object detection by Multi-View DEtecton TRansformers

EDFIDepth: enriched multi-path vision transformer feature interaction networks for monocular depth estimation

Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation

Bridging Modality Gap for Visual Grounding with Effecitve Cross-modal Distillation

Context Disentangling and Prototype Inheriting for Robust Visual Grounding

Dynamic multi-headed self-attention and multiscale enhancement vision transformer for object detection

Visual Grounding With Joint Multimodal Representation and Interaction

GSDC Transformer: An Efficient and Effective Cue Fusion for Monocular Multi-Frame Depth Estimation