Abstract: Referring expression comprehension aims to align natural language queries with visual scenes, which requires establishing fine-grained correspondence between vision and language. This has important applications in multi-modal reasoning systems. Existing methods typically use text-agnostic visual backbones that extract features independently, without considering the specific text input. We argue that the extracted visual features can therefore be inconsistent with the referring expression, which hurts multi-modal understanding. To address this, we first propose the Query-modulated Refinement Network (QRNet), which uses language guidance to steer visual feature extraction. However, QRNet focuses only on the grounding task, which provides only coarse-grained annotations in the form of bounding box coordinates. The guidance for the visual backbone is indirect, and the inconsistency issue still exists. To this end, we further propose UniQRNet, a multi-task framework built on QRNet that learns referring expression grounding and segmentation jointly. The framework introduces a multi-task head that leverages fine-grained pixel-level supervision from the segmentation task to directly guide the intermediate layers of QRNet to learn text-consistent visual features. UniQRNet also includes a loss balance strategy that allows the two types of supervision signals to cooperate and optimize the model together. We conduct the most comprehensive comparison experiment, covering four major datasets, ten evaluation sets, and three evaluation metrics used in previous work. UniQRNet outperforms previous state-of-the-art methods by a large margin on both referring expression grounding (1.8% to 5.09%) and segmentation (0.57% to 5.56%). Ablation and analysis reveal that UniQRNet improves the consistency of visual features with the text input and brings significant performance improvement.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the issue of inconsistency between visual features and text queries, particularly in the task of Referring Expression Comprehension (REC). Specifically:

1. **Problems with Existing Methods**:
   - Existing methods typically use a text-agnostic visual backbone network to extract features independently, without considering specific text input.
   - This approach can lead to visual features that are inconsistent with the referring expressions, thereby affecting the effectiveness of multimodal understanding.
2. **Proposed New Method**:
   - The authors first propose a Query-modulated Refinement Network (QRNet) that uses language guidance to adjust visual feature extraction.
   - However, relying solely on the bounding box coordinates provided by the referring expression localization task cannot directly guide the visual backbone network to extract features consistent with the query.
3. **UniQRNet Framework**:
   - To further address this issue, the authors propose a multitask framework named UniQRNet, which extends QRNet by simultaneously learning referring expression localization and segmentation tasks.
   - By introducing pixel-level supervision signals, the framework directly guides the learning of intermediate layers, thereby improving the consistency between visual features and text.
   - Additionally, UniQRNet includes a loss balancing strategy that allows the two types of supervision signals to collaboratively optimize the model.
4. **Experimental Results**:
   - Comprehensive experiments were conducted on four mainstream datasets, where UniQRNet achieved an absolute improvement of 1.8% to 5.09% in the referring expression localization task and 0.57% to 5.56% in the segmentation task.

### Main Contributions

1. **Identifying and Solving the Inconsistency Issue**: By collaboratively training QRNet with multi-granularity supervision signals, the paper addresses the inconsistency between the visual backbone network and text queries.
2. **Proposing a New Image Representation Backbone**: The introduction of the Query-modulated Refinement Network, and the adjustment of features through the Query-aware Dynamic Attention module.
3. **Multitask Head Design**: Utilizing pixel-level supervision signals to directly guide QRNet in learning visual features consistent with the text, and introducing a loss balancing strategy.
4. **Extensive Experimental Validation**: Comprehensive experimental validation on four mainstream datasets, demonstrating the effectiveness of the method, which can serve as a benchmark for future research.
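
### Illustrative Sketch: Balancing Two Supervision Signals

The summary says that a box-level grounding loss and a pixel-level segmentation loss are optimized together under a loss balancing strategy, but it does not say how that strategy works. The sketch below is therefore only an assumption-laden illustration, not the authors' method: it swaps in a generic uncertainty-style weighting, where each loss is scaled by `exp(-s)` and regularized by `+s` for a log-variance `s` that would normally be learned. The function names (`box_l1_loss`, `dice_loss`, `balanced_loss`), the choice of L1 and Dice losses, and the toy inputs are all hypothetical.

```python
import math


def box_l1_loss(pred_box, gt_box):
    """Mean absolute error between two boxes given as (x1, y1, x2, y2)."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / len(gt_box)


def dice_loss(pred_mask, gt_mask, eps=1e-6):
    """Soft Dice loss over flattened masks.

    pred_mask holds probabilities in [0, 1]; gt_mask holds 0/1 labels.
    """
    inter = sum(p * g for p, g in zip(pred_mask, gt_mask))
    total = sum(pred_mask) + sum(gt_mask)
    return 1.0 - (2.0 * inter + eps) / (total + eps)


def balanced_loss(ground_loss, seg_loss, log_var_ground=0.0, log_var_seg=0.0):
    """Combine the two task losses with uncertainty-style weights.

    Assumption: this stands in for the unspecified balancing strategy.
    Each loss is multiplied by exp(-s) and the log-variance s is added
    as a regularizer so the weights cannot collapse to zero for free.
    """
    return (
        math.exp(-log_var_ground) * ground_loss + log_var_ground
        + math.exp(-log_var_seg) * seg_loss + log_var_seg
    )


if __name__ == "__main__":
    # Toy prediction: a slightly shifted box and a partly wrong 4-pixel mask.
    g = box_l1_loss((0.1, 0.2, 0.5, 0.6), (0.1, 0.2, 0.6, 0.6))
    s = dice_loss([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0])
    print("grounding loss:", g)
    print("segmentation loss:", s)
    print("balanced total:", balanced_loss(g, s))
```

With both log-variances at zero the result is a plain sum of the two losses; raising one log-variance down-weights that task, which is one way the coarse and fine supervision signals could be kept from dominating each other.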

UniQRNet: Unifying Referring Expression Grounding and Segmentation with QRNet
