Abstract: Given a natural language expression and an image/video, the goal of referring segmentation is to produce pixel-level masks of the entities described by the subject of the expression. Previous approaches tackle this problem by implicit feature interaction and fusion between the visual and linguistic modalities in a one-stage manner. However, humans tend to solve the referring problem progressively, based on informative words in the expression, i.e., first roughly locating candidate entities and then distinguishing the target one. In this paper, we propose a cross-modal progressive comprehension (CMPC) scheme to mimic this human behavior, and implement it as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models. For image data, our CMPC-I module first employs entity and attribute words to perceive all the related entities that might be considered by the expression. Then, relational words are adopted to highlight the target entity and suppress irrelevant ones by spatial graph reasoning. For video data, our CMPC-V module builds on CMPC-I and further exploits action words to highlight the correct entity matched with the action cues by temporal graph reasoning. In addition to CMPC, we introduce a simple yet effective Text-Guided Feature Exchange (TGFE) module to integrate the reasoned multimodal features corresponding to different levels of the visual backbone under the guidance of textual information. In this way, multi-level features can communicate with each other and be mutually refined based on the textual context. Combining CMPC-I or CMPC-V with TGFE forms the image or video version of our referring segmentation framework, and these frameworks achieve new state-of-the-art performance on four referring image segmentation benchmarks and three referring video segmentation benchmarks, respectively. Our code is available at https://github.com/spyflying/CMPC-Refseg.

A Cross-modality and Progressive Person Search System

TIPCB: A simple but effective part-based convolutional baseline for text-based person search

Text-based person search via cross-modal alignment learning

Hybrid Attention Network for Language-Based Person Search

A cross-view intelligent person search method based on multi-feature constraints

Attentive Multi-Granularity Perception Network for Person Search

SCMM: Calibrating Cross-modal Representations for Text-Based Person Search

Text-based Person Search in Full Images via Semantic-Driven Proposal Generation

Multi-granularity Matching Transformer for Text-based Person Search

Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search

Neural Person Search Machines

Text-Based Person Search with Limited Data

Exploring Visual Context for Weakly Supervised Person Search

Multi-path Exploration and Feedback Adjustment for Text-to-Image Person Retrieval

Prompting Continual Person Search

Improving Inconspicuous Attributes Modeling for Person Search by Language

Scene-Adaptive Person Search via Bilateral Modulations

Cross-Modal Progressive Comprehension for Referring Segmentation

Beyond the Parts: Learning Coarse-to-Fine Adaptive Alignment Representation for Person Search

CPCL: Cross-Modal Prototypical Contrastive Learning for Weakly Supervised Text-based Person Re-Identification

Prototype-Guided Text-based Person Search based on Rich Chinese Descriptions