Language Adaptive Weight Generation for Multi-task Visual Grounding

Wei Su,Peihan Miao,Huanzhang Dou,Gaoang Wang,Liang Qiao,Zheyang Li,Xi Li

DOI: https://doi.org/10.48550/arXiv.2306.04652

2023-06-06

Computer Vision and Pattern Recognition

Abstract: Despite their impressive performance in visual grounding, prevailing approaches usually exploit the visual backbone in a passive way, i.e., the visual backbone extracts features with fixed weights and without expression-related hints. This passive perception may lead to mismatches (e.g., redundant or missing features), limiting further performance improvement. Ideally, the visual backbone should actively extract visual features, since the expressions already provide the blueprint of the desired visual features. Active perception can take expressions as priors to extract relevant visual features, which can effectively alleviate the mismatches. Inspired by this, we propose an active perception Visual Grounding framework based on Language Adaptive Weights, called VG-LAW. The visual backbone serves as an expression-specific feature extractor through dynamic weights generated for various expressions. Benefiting from the specific and relevant visual features extracted by the language-aware visual backbone, VG-LAW does not require additional modules for cross-modal interaction. Along with a neat multi-task head, VG-LAW can handle referring expression comprehension and segmentation jointly. Extensive experiments on four representative datasets, i.e., RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, validate the effectiveness of the proposed framework and demonstrate state-of-the-art performance.

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

The paper aims to address a key issue in the task of visual grounding: existing methods typically use the visual backbone in a passive manner, i.e., they extract features with fixed weights and do not take the information in the expression into account. This passive perception may lead to a mismatch between the extracted features and the features the expression calls for, thereby limiting further performance improvement.

**Specific Issues:**

- **Passivity in Feature Extraction**: Current methods use fixed weights when extracting visual features and cannot adjust to the specific natural language description, which may result in redundant or missing features.
- **Complexity in Cross-Modal Interaction Module Design**: Many existing methods rely on complex cross-modal interaction modules to compensate for the shortcomings in feature extraction, which increases the complexity of the network structure.

To address these issues, the paper proposes an active perception visual grounding framework based on Language Adaptive Weights (VG-LAW). This framework dynamically generates weights that adapt to specific expressions, enabling the visual backbone to actively extract relevant visual features without the need for additional cross-modal interaction modules. Moreover, the framework includes a concise multi-task prediction head that can handle both Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES) simultaneously. Experimental results show that this method achieves state-of-the-art performance on multiple benchmark datasets.
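To make the idea of expression-dependent weights concrete, the following is a minimal hypernetwork-style sketch in plain Python. It is not the paper's implementation: the class name `LanguageAdaptiveLinear`, the "static base plus generated offset" design, the dimensions, and the random initialization are all assumptions chosen for illustration. It only shows that the same visual tokens can be projected differently depending on a sentence embedding.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LanguageAdaptiveLinear:
    """Hypothetical layer: the projection weights are a static base matrix
    plus an offset generated from a sentence embedding (an assumption for
    illustration, not the scheme described in the paper)."""

    def __init__(self, in_dim, out_dim, lang_dim, seed=0):
        rng = random.Random(seed)
        self.in_dim, self.out_dim = in_dim, out_dim
        # Static weights shared by all expressions.
        self.base = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                     for _ in range(out_dim)]
        # Generator: maps the language embedding to out_dim * in_dim offsets.
        self.generator = [[rng.uniform(-0.1, 0.1) for _ in range(lang_dim)]
                          for _ in range(out_dim * in_dim)]

    def weights_for(self, lang_embedding):
        """Build the expression-specific weight matrix."""
        flat = matvec(self.generator, lang_embedding)
        return [[self.base[o][i] + flat[o * self.in_dim + i]
                 for i in range(self.in_dim)]
                for o in range(self.out_dim)]

    def forward(self, visual_tokens, lang_embedding):
        """Project every visual token with the generated weights."""
        weights = self.weights_for(lang_embedding)
        return [matvec(weights, token) for token in visual_tokens]


# Two toy "expressions" applied to the same visual tokens.
layer = LanguageAdaptiveLinear(in_dim=4, out_dim=3, lang_dim=2)
tokens = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.1, 0.0, 0.3]]
out_a = layer.forward(tokens, [1.0, 0.0])
out_b = layer.forward(tokens, [0.0, 1.0])
out_zero = layer.forward(tokens, [0.0, 0.0])  # falls back to the base weights
```

In this sketch, a zero embedding reproduces the fixed-weight ("passive") projection, while different embeddings yield different features from identical inputs, which is the behavior the summary above attributes to active perception.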

GVGNet: Gaze-Directed Visual Grounding for Learning Under-Specified Object Referring Intention

An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

Multi-Task Domain Adaptation for Language Grounding with 3D Objects

Learning Visual Grounding from Generative Vision and Language Model

HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding

Bridging Modality Gap for Visual Grounding with Effecitve Cross-modal Distillation

Enhancing Visual Grounding and Generalization: A Multi-Task Cycle Training Approach for Vision-Language Models

SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

Joint Visual Grounding with Language Scene Graphs

Transformer-based Visual Grounding with Cross-modality Interaction

Visual Grounding with Attention-Driven Constraint Balancing

LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding

Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding

Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision

Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

Cycle-Consistent Weakly Supervised Visual Grounding With Individual and Contextual Representations