Abstract: Multimodal transformers exhibit high capacity and flexibility in aligning image and text for visual grounding. However, the existing encoder-only grounding framework (e.g., TransVG) suffers from heavy computation because self-attention has quadratic time complexity. To address this issue, we present a new multimodal transformer architecture, termed Dynamic Multimodal DETR (Dynamic MDETR), which decouples the whole grounding process into encoding and decoding phases. The key observation is that images contain high spatial redundancy. We therefore devise a new dynamic multimodal transformer decoder that exploits this sparsity prior to speed up visual grounding. Specifically, our dynamic decoder is composed of a 2D adaptive sampling module and a text-guided decoding module. The sampling module selects the informative patches by predicting offsets with respect to a reference point, while the decoding module extracts the grounded object information by performing cross-attention between image features and text features. These two modules are stacked alternately to gradually bridge the modality gap and iteratively refine the reference point of the grounded object, eventually achieving visual grounding. Extensive experiments on five benchmarks demonstrate that our proposed Dynamic MDETR achieves competitive trade-offs between computation and accuracy. Notably, using only 9% of the feature points in the decoder, we reduce the GFLOPs of the multimodal transformer by ~44% while still achieving higher accuracy than the encoder-only counterpart. In addition, to verify its generalization ability and scale up Dynamic MDETR, we build the first one-stage CLIP-empowered visual grounding framework and achieve state-of-the-art performance on these benchmarks.
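The two decoder components described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single feature map, offsets that are given rather than predicted by a learned layer, and a single-head attention step with one query vector standing in for the text-conditioned query. All function names (`bilinear_sample`, `sample_around`, `cross_attention`) and the toy values are hypothetical.

```python
import math

def bilinear_sample(fmap, x, y):
    """Interpolate an H x W x C feature map at normalized (x, y) in [0, 1]."""
    h, w = len(fmap), len(fmap[0])
    # Clamp to the image, then map normalized coordinates to grid indices.
    px = min(max(x, 0.0), 1.0) * (w - 1)
    py = min(max(y, 0.0), 1.0) * (h - 1)
    x0, y0 = int(math.floor(px)), int(math.floor(py))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = px - x0, py - y0
    channels = len(fmap[0][0])
    return [
        (1 - ay) * ((1 - ax) * fmap[y0][x0][c] + ax * fmap[y0][x1][c])
        + ay * ((1 - ax) * fmap[y1][x0][c] + ax * fmap[y1][x1][c])
        for c in range(channels)
    ]

def sample_around(fmap, ref, offsets):
    """Sample features at ref + offset for each offset (assumed given here;
    in a learned model the offsets would be predicted)."""
    rx, ry = ref
    return [bilinear_sample(fmap, rx + dx, ry + dy) for dx, dy in offsets]

def cross_attention(query, feats):
    """Single-head scaled dot-product attention of one query over sampled features."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, f)) / scale for f in feats]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(wt * f[c] for wt, f in zip(weights, feats))
           for c in range(len(feats[0]))]
    return out, weights

# Toy 4 x 4 map with 2 channels: the feature at (row, col) is [col, row].
fmap = [[[float(c), float(r)] for c in range(4)] for r in range(4)]
ref = (0.5, 0.5)                                     # hypothetical reference point
offsets = [(0.0, 0.0), (0.5, 0.5), (-0.5, -0.5)]     # hypothetical offsets
sampled = sample_around(fmap, ref, offsets)
query = [1.0, 0.0]                                   # stand-in for a text-conditioned query
out, weights = cross_attention(query, sampled)
```

Only the sampled points enter the attention step, so its cost grows with the number of samples rather than with the full feature map size. In a stacked decoder, the attended output would in turn be used to update the reference point for the next layer; that refinement step is omitted here.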

Phrase Grounding Algorithm Based on Transformer Multilevel Feature Fusion

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding

Phrase Grounding by Soft-Label Chain Conditional Random Field

Visual Grounding With Joint Multimodal Representation and Interaction

DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

Disentangled Motif-aware Graph Learning for Phrase Grounding

An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction for Visual Grounding

Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding

End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning

Transformer-based Visual Grounding with Cross-modality Interaction

Feature Fusion Based on Transformer for Cross-modal Retrieval

Cross-Modal Omni Interaction Modeling for Phrase Grounding

Multilevel Transformer For Multimodal Emotion Recognition

MFSC: A Multimodal Aspect-Level Sentiment Classification Framework with Multi-Image Gate and Fusion Networks

Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications

MFF-Trans: Multi-level Feature Fusion Transformer for Fine-Grained Visual Classification

Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models

Improving visual grounding with multi-scale discrepancy information and centralized-transformer

Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding