Abstract: Referring Image Segmentation (RIS) aims to segment the target object in an image referred to by a given natural language expression. Diverse, flexible expressions and complex visual content place high demands on RIS models to learn fine-grained matching between the words in an expression and the objects in an image. However, such matching is hard to learn when the visual cues of the referent (i.e., the referred object) are insufficient: referents with weak visual cues are easily confused with cluttered background at their boundaries, or even overwhelmed by salient objects in the image. The cross-modal fusion mechanisms used in previous work cannot handle this problem of insufficient visual cues. In this paper, we tackle the problem from a novel perspective, enhancing the visual information of the referent with a Two-stage Visual cues enhancement Network (TV-Net), which comprises a novel Retrieval and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature Fusion (AMF) module. Specifically, RES retrieves the most relevant image from an external data pool according to both visual and textual similarity, and then enriches the visual information of the referent with the retrieved image for better multimodal feature learning. AMF further enhances visual detail by incorporating high-resolution feature maps of the image from lower convolutional layers. Through this two-stage enhancement, TV-Net learns fine-grained matching between the natural language expression and the image more effectively, especially when the visual information of the referent is inadequate, and thus produces better segmentation results. Extensive experiments validate the effectiveness of the proposed method on the RIS task, with TV-Net surpassing state-of-the-art approaches on four benchmark datasets.
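The retrieval step of RES, as described in the abstract, selects the pool image most relevant under both visual and textual similarity. The abstract does not say how the two similarities are computed or combined, so the following is only a minimal sketch under stated assumptions: embeddings are plain vectors, similarity is cosine similarity, and the two scores are mixed by a hypothetical weight `alpha`. The names `cosine`, `retrieve`, and `alpha` are illustrative and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_img, query_txt, pool, alpha=0.5):
    """Return the index of the pool entry with the highest combined similarity.

    pool:  list of (image_embedding, text_embedding) pairs.
    alpha: assumed weight on visual similarity; (1 - alpha) goes to textual
           similarity. This weighting is an illustration, not the paper's rule.
    """
    def score(entry):
        img, txt = entry
        return alpha * cosine(query_img, img) + (1 - alpha) * cosine(query_txt, txt)

    return max(range(len(pool)), key=lambda i: score(pool[i]))


# Toy external pool: each entry pairs an image embedding with a text embedding.
pool = [
    ([1.0, 0.0], [0.0, 1.0]),  # visually identical to the query, textually unrelated
    ([0.9, 0.1], [1.0, 0.0]),  # close on both modalities
    ([0.0, 1.0], [1.0, 0.0]),  # textually identical, visually unrelated
]
best = retrieve([1.0, 0.0], [1.0, 0.0], pool)
```

In this toy pool the second entry is selected, because it is close to the query in both modalities, whereas the other two match in only one. With `alpha=1.0` the choice would rest on visual similarity alone.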

Adaptive Selection Based Referring Image Segmentation

Fully Aligned Network for Referring Image Segmentation

Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence

A Hybrid Framework for Referring Image Segmentation: Dual-Decoder Model with SAM Complementation

Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation

RISAM: Referring Image Segmentation via Mutual-Aware Attention Features

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

RRSIS: Referring Remote Sensing Image Segmentation

CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation

Towards Robust Referring Image Segmentation

Distillation and Supplementation of Features for Referring Image Segmentation

EAVL: Explicitly Align Vision and Language for Referring Image Segmentation

Toward Robust Referring Image Segmentation

Spatial Semantic Recurrent Mining for Referring Image Segmentation

Rethinking the Implicit Optimization Paradigm with Dual Alignments for Referring Remote Sensing Image Segmentation

Two-stage Visual Cues Enhancement Network for Referring Image Segmentation

Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation

Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation

Instance-Specific Feature Propagation for Referring Segmentation

Boosting Weakly-Supervised Referring Image Segmentation via Progressive Comprehension

Text-Vision Relationship Alignment for Referring Image Segmentation