Abstract: Referring Image Segmentation (RIS) aims to segment the target object that a given natural language expression refers to in an image. The diverse, flexible expressions and the complex visual content of images place high demands on RIS models to capture fine-grained matching between the words of an expression and the objects in the image. However, such matching is hard to learn when the visual cues of the referents (i.e., the referred objects) are insufficient: referents with weak visual cues are easily confused with cluttered background at their boundaries, or even overwhelmed by salient objects in the image. The cross-modal fusion mechanisms used in previous work cannot resolve this lack of visual cues. In this paper, we tackle the problem from a novel perspective, enhancing the visual information of the referents with a Two-stage Visual cues enhancement Network (TV-Net), which comprises a novel Retrieval and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature Fusion (AMF) module. Specifically, RES retrieves the most relevant image from an external data pool according to both visual and textual similarity, and then enriches the visual information of the referent with the retrieved image for better multimodal feature learning. AMF further enhances visual detail by incorporating the high-resolution feature maps from the lower convolution layers of the image. Through this two-stage enhancement, TV-Net learns the fine-grained matching between the natural language expression and the image more effectively, especially when the visual information of the referent is inadequate, and thus produces better segmentation results. Extensive experiments validate the effectiveness of the proposed method on the RIS task, with TV-Net surpassing state-of-the-art approaches on four benchmark datasets.
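
The retrieval step of RES described in the abstract can be illustrated with a minimal sketch. It is not the TV-Net implementation: the abstract does not say how the visual and textual similarities are computed or combined, so the cosine similarity, the weighted sum, the weight `alpha`, and the function names below are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_most_relevant(query_visual, query_text, pool, alpha=0.5):
    """Return the index of the pool entry with the highest combined similarity.

    pool is a list of (visual_embedding, text_embedding) pairs.
    alpha is a hypothetical weight balancing visual against textual similarity;
    the weighted sum itself is an assumption, not the paper's stated rule.
    """
    best_index, best_score = -1, float("-inf")
    for i, (pool_visual, pool_text) in enumerate(pool):
        score = (alpha * cosine(query_visual, pool_visual)
                 + (1 - alpha) * cosine(query_text, pool_text))
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Toy external data pool with 2-D embeddings (made-up values).
pool = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.6, 0.8], [1.0, 0.0]),
    ([0.0, 1.0], [0.6, 0.8]),
]
best = retrieve_most_relevant([0.6, 0.8], [1.0, 0.0], pool)
```

In this toy pool the second entry matches the query in both modalities, so it is the one selected. In a real system the embeddings would come from image and text encoders, and the retrieved image would then be used to enrich the referent's visual features.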

Referring Image Segmentation Without Text Annotations

Referring Image Segmentation Using Text Supervision

Toward Robust Referring Image Segmentation

Text Augmented Spatial-aware Zero-shot Referring Image Segmentation

Towards Robust Referring Image Segmentation

Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence

Cross-modal Transformer with Language Query for Referring Image Segmentation

Boosting Weakly-Supervised Referring Image Segmentation via Progressive Comprehension

SATR: Semantics-Aware Triadic Refinement Network for Referring Image Segmentation

Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation

Beyond One-to-One: Rethinking the Referring Image Segmentation

CRIS: CLIP-Driven Referring Image Segmentation

Zero-shot Referring Image Segmentation with Global-Local Context Features

RRSIS: Referring Remote Sensing Image Segmentation

Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation

RISAM: Referring Image Segmentation via Mutual-Aware Attention Features

Towards Complex-query Referring Image Segmentation: A Novel Benchmark

Adaptive Selection Based Referring Image Segmentation

PTQ4RIS: Post-Training Quantization for Referring Image Segmentation

Two-stage Visual Cues Enhancement Network for Referring Image Segmentation