Language Adaptive Weight Generation for Multi-task Visual Grounding

Wei Su, Peihan Miao, Huanzhang Dou, Gaoang Wang, Liang Qiao, Zheyang Li, Xi Li
DOI: https://doi.org/10.48550/arXiv.2306.04652
2023-06-06
Computer Vision and Pattern Recognition
Abstract: Despite their impressive performance in visual grounding, prevailing approaches usually exploit the visual backbone passively, i.e., the visual backbone extracts features with fixed weights and without expression-related hints. This passive perception may lead to mismatches (e.g., redundant or missing features), limiting further performance improvement. Ideally, the visual backbone should actively extract visual features, since the expressions already provide the blueprint of the desired visual features. Active perception can take expressions as priors to extract relevant visual features, which can effectively alleviate the mismatches. Inspired by this, we propose an active perception Visual Grounding framework based on Language Adaptive Weights, called VG-LAW. The visual backbone serves as an expression-specific feature extractor through dynamic weights generated for various expressions. Benefiting from the specific and relevant visual features extracted by the language-aware visual backbone, VG-LAW does not require additional modules for cross-modal interaction. Together with a neat multi-task head, VG-LAW handles referring expression comprehension and segmentation jointly. Extensive experiments on four representative datasets, i.e., RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, validate the effectiveness of the proposed framework and demonstrate state-of-the-art performance.
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

The paper addresses a key issue in visual grounding: existing methods typically use the visual backbone passively, i.e., they extract features with fixed weights and ignore the information carried by the expression. This passive perception may lead to a mismatch between the extracted features and those the expression calls for, limiting further performance improvement.

**Specific Issues:**

- **Passivity in Feature Extraction**: Current methods extract visual features with fixed weights and cannot adjust to the specific natural language description, which may yield redundant or missing features.
- **Complexity in Cross-Modal Interaction Module Design**: Many existing methods rely on complex cross-modal interaction modules to compensate for the shortcomings of feature extraction, which complicates the network structure.

To address these issues, the paper proposes an active perception visual grounding framework based on Language Adaptive Weights (VG-LAW). The framework dynamically generates weights adapted to each expression, enabling the visual backbone to actively extract relevant visual features without additional cross-modal interaction modules. The framework also includes a concise multi-task prediction head that handles both Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES) simultaneously. Experiments on four benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame) show that the method achieves state-of-the-art performance.
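
The contrast between passive and active perception described above can be illustrated with a toy example. The following is a minimal sketch, not the VG-LAW implementation: it assumes the dynamic weights come from a single linear map applied to an expression embedding, and the names `generate_weights`, `passive_layer`, and `active_layer`, as well as the dimensions and inputs, are hypothetical choices made for illustration only.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def generate_weights(text_embedding, generator, out_dim, in_dim):
    """Hypothetical weight generator: a linear map from the expression
    embedding to a flattened (out_dim x in_dim) projection matrix."""
    flat = matvec(generator, text_embedding)
    return [flat[r * in_dim:(r + 1) * in_dim] for r in range(out_dim)]


def passive_layer(visual_tokens, fixed_weights):
    """Passive perception: the same weights are used for every expression."""
    return [matvec(fixed_weights, token) for token in visual_tokens]


def active_layer(visual_tokens, text_embedding, generator, out_dim):
    """Active perception: the weights are regenerated for each expression."""
    in_dim = len(visual_tokens[0])
    weights = generate_weights(text_embedding, generator, out_dim, in_dim)
    return [matvec(weights, token) for token in visual_tokens]


rng = random.Random(0)
IN_DIM, OUT_DIM, TEXT_DIM = 4, 2, 3

# Toy inputs (assumed): three visual tokens and two expression embeddings.
visual_tokens = [[rng.gauss(0, 1) for _ in range(IN_DIM)] for _ in range(3)]
expr_a = [1.0, 0.0, 0.0]
expr_b = [0.0, 1.0, 0.0]

# Generator parameters: one row per entry of the generated weight matrix.
generator = [[rng.gauss(0, 1) for _ in range(TEXT_DIM)]
             for _ in range(OUT_DIM * IN_DIM)]

# The same image tokens yield different features for different expressions.
features_a = active_layer(visual_tokens, expr_a, generator, OUT_DIM)
features_b = active_layer(visual_tokens, expr_b, generator, OUT_DIM)
print(features_a != features_b)
```

In this sketch the expression changes the projection itself, so no separate fusion step is applied afterwards; `passive_layer` is included only to show that a fixed-weight layer has no way to take the expression into account.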