Abstract: Referring Image Segmentation (RIS) aims to segment the target object in an image referred to by a given natural language expression. Diverse, flexible expressions and complex visual content place high demands on RIS models to learn fine-grained matching between the words of an expression and the objects in an image. However, such matching is hard to learn when the visual cues of the referents (i.e., the referred objects) are insufficient: referents with weak visual cues are easily confused with cluttered background at their boundaries, or even overwhelmed by salient objects in the image. The cross-modal fusion mechanisms used in previous work cannot address this insufficiency of visual cues. In this paper, we tackle the problem from a novel perspective, enhancing the visual information of the referents with a Two-stage Visual cues enhancement Network (TV-Net), which comprises a novel Retrieval and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature Fusion (AMF) module. Specifically, RES retrieves the most relevant image from an external data pool according to both visual and textual similarity, and then enriches the visual information of the referent with the retrieved image for better multimodal feature learning. AMF further enhances visual detail by incorporating high-resolution feature maps of the image from lower convolution layers. Through this two-stage enhancement, TV-Net learns fine-grained matching between the natural language expression and the image more effectively, especially when the visual information of the referent is inadequate, and thus produces better segmentation results.
Extensive experiments validate the effectiveness of the proposed method on the RIS task, with TV-Net surpassing state-of-the-art approaches on four benchmark datasets.
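
The retrieval step of RES described above can be illustrated with a minimal sketch. The abstract only states that retrieval considers both visual and textual similarity; the use of cosine similarity, the weighted sum, the weight `alpha`, and the names `cosine` and `retrieve_most_relevant` are assumptions made here for illustration, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_most_relevant(query_visual, query_text, pool, alpha=0.5):
    """Return (index, scores) of the pool entry closest to the query.

    pool is a list of (visual_embedding, text_embedding) pairs.
    alpha is a hypothetical weight: 1.0 uses only visual similarity,
    0.0 uses only textual similarity.
    """
    if not pool:
        raise ValueError("the external data pool is empty")
    scores = [
        alpha * cosine(query_visual, visual) + (1.0 - alpha) * cosine(query_text, text)
        for visual, text in pool
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy pool of (visual, text) embeddings; real embeddings would come from
# image and language encoders, which are outside the scope of this sketch.
pool = [
    ([1.0, 0.0], [0.0, 1.0]),  # visually close, textually far
    ([0.0, 1.0], [1.0, 0.0]),  # visually far, textually close
    ([1.0, 1.0], [1.0, 1.0]),  # moderately close in both modalities
]
best, scores = retrieve_most_relevant([1.0, 0.0], [1.0, 0.0], pool)
```

With equal weighting, the entry that is moderately similar in both modalities wins over entries that match in only one, which is the behavior a joint visual and textual criterion is meant to encourage. How the retrieved image is then used to enrich the referent's features is not specified in the abstract and is left out of the sketch.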

Distillation and Supplementation of Features for Referring Image Segmentation

CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation

Toward Robust Referring Image Segmentation

RISAM: Referring Image Segmentation via Mutual-Aware Attention Features

A Hybrid Framework for Referring Image Segmentation: Dual-Decoder Model with SAM Complementation

Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation

Mask Grounding for Referring Image Segmentation

MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation

Spatial Semantic Recurrent Mining for Referring Image Segmentation

Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence

RRSIS: Referring Remote Sensing Image Segmentation

Two-stage Visual Cues Enhancement Network for Referring Image Segmentation

A Simple Baseline with Single-encoder for Referring Image Segmentation

Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation

Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation

Fully Aligned Network for Referring Image Segmentation

Remote sensing image instance segmentation network with transformer and multi-scale feature representation

Causality-guided Step-wise Intervention and Reweighting for Remote Sensing Image Semantic Segmentation

Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation

Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation