Abstract: Referring Image Segmentation (RIS) aims to segment the target object in an image referred to by a given natural language expression. The diversity and flexibility of expressions and the complexity of visual content place high demands on RIS models to capture fine-grained matching between words in the expression and objects in the image. However, such matching is hard to learn when the visual cues of referents (i.e., referred objects) are insufficient, because referents with weak visual cues are easily confused with cluttered background at their boundaries or even overwhelmed by salient objects in the image. The cross-modal fusion mechanisms used in previous work cannot handle this insufficient-visual-cues issue. In this paper, we tackle the problem from a novel perspective: enhancing the visual information of the referents with a Two-stage Visual cues enhancement Network (TV-Net), for which we propose a novel Retrieval and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature Fusion (AMF) module. Specifically, RES retrieves the most relevant image from an external data pool according to both visual and textual similarity, and then enriches the visual information of the referent with the retrieved image for better multimodal feature learning. AMF further enhances visual detail by incorporating high-resolution feature maps from lower convolutional layers. Through this two-stage enhancement, TV-Net learns fine-grained matching between the natural language expression and the image more effectively, especially when the visual information of the referent is inadequate, and thus produces better segmentation results.
Extensive experiments validate the effectiveness of the proposed method on the RIS task, with TV-Net surpassing state-of-the-art approaches on four benchmark datasets.
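
To make the retrieval step of RES more concrete, the following is a minimal sketch, not the authors' implementation. The abstract states only that the most relevant image is retrieved with regard to both visual and textual similarity; the weighted sum of cosine similarities, the weight `alpha`, the function names `cosine` and `retrieve`, and the toy embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_visual, query_text, pool, alpha=0.5):
    """Return (index, score) of the pool entry most similar to the query.

    Assumption: the score is a weighted sum of visual and textual cosine
    similarity; the abstract does not say how the two are combined.
    `pool` is a list of (visual_embedding, text_embedding) pairs.
    """
    best_index, best_score = -1, float("-inf")
    for i, (visual, text) in enumerate(pool):
        score = alpha * cosine(query_visual, visual) + (1 - alpha) * cosine(query_text, text)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


# Hypothetical 2-D embeddings standing in for real image and expression features.
pool = [
    ([1.0, 0.0], [0.0, 1.0]),  # visually similar, textually unrelated
    ([0.9, 0.1], [1.0, 0.0]),  # similar in both modalities
    ([0.0, 1.0], [1.0, 0.0]),  # textually similar, visually unrelated
]
best_index, best_score = retrieve([1.0, 0.0], [1.0, 0.0], pool)
```

In this toy pool the second entry is selected because it is close to the query in both modalities, whereas the other two match in only one. In a real system the embeddings would come from learned image and language encoders, and the retrieved image would then be used to enrich the referent's visual features.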
