Abstract: Recent works on Generalized Referring Expression Segmentation (GRES) struggle with handling complex expressions referring to multiple distinct objects. This is because these methods typically employ an end-to-end foreground-background segmentation and lack a mechanism to explicitly differentiate and associate different object instances to the text query. To this end, we propose InstAlign, a method that incorporates object-level reasoning into the segmentation process. Our model leverages both text and image inputs to extract a set of object-level tokens that capture both the semantic information in the input prompt and the objects within the image. By modeling the text-object alignment via instance-level supervision, each token uniquely represents an object segment in the image, while also aligning with relevant semantic information from the text. Extensive experiments on the gRefCOCO and Ref-ZOM benchmarks demonstrate that our method significantly advances state-of-the-art performance, setting a new standard for precise and flexible GRES.

What problem does this paper attempt to address?

This paper addresses the handling of complex expressions in **Generalized Referring Expression Segmentation (GRES)**, especially expressions that refer to multiple distinct objects. Specifically:

1. **Limitations of existing methods**:
   - Existing GRES methods usually adopt end-to-end foreground-background segmentation and lack a mechanism to explicitly differentiate object instances and associate them with the text query.
   - These methods perform poorly on complex expressions (for example, "the girl who holds a small dog and the dog on the right") and often fail to segment each described object accurately.
2. **Objectives**:
   - Propose a method that explicitly distinguishes object instances and associates them with the text query, so that multi-object segmentation becomes more accurate.
   - Address the weaknesses of existing methods on complex multi-object expressions.

### Solution

To solve these problems, the paper proposes **InstAlign**, a GRES model that introduces instance-level reasoning. Its main contributions are:

1. **Instance-aware segmentation framework**:
   - InstAlign extracts a set of object-level tokens from the combined text and image inputs. These tokens capture the semantic information of the input prompt and the objects in the image.
   - Each token uniquely represents one object segment in the image and is aligned with the relevant semantic information in the text.
2. **Phrase-Object Alignment mechanism**:
   - A Phrase-Object Alignment loss ensures that each segmented object is aligned with a specific semantic part of the input text.
   - With this alignment, the model not only identifies and segments the correct objects but also captures the fine-grained relationships between text phrases and visual instances.
3. **Adaptive Instance Aggregation module**:
   - An Adaptive Instance Aggregation (AIA) module dynamically merges the segmented object instances according to their relevance scores.
   - This improves the robustness and segmentation accuracy of the model in complex scenes.
4. **No-target Predictor**:
   - A No-target Predictor predicts whether the input expression refers to any target in the image at all.

With these components, InstAlign improves the handling of complex referring expressions and achieves state-of-the-art performance on standard GRES benchmarks.
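The aggregation and no-target steps described above can be illustrated with a small sketch. The summary only states that instances are merged "according to" their scores, so the concrete rule used here (a pixel is kept if any instance's relevance-weighted mask probability exceeds a threshold) is an assumption for illustration, not the paper's confirmed formulation. The names `aggregate_instances`, `relevance_logits`, `no_target_logit`, and `threshold` are likewise hypothetical.

```python
import math
from typing import List

Mask = List[List[float]]  # per-pixel logits for one object instance


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def aggregate_instances(
    mask_logits: List[Mask],
    relevance_logits: List[float],
    no_target_logit: float,
    threshold: float = 0.5,
) -> List[List[int]]:
    """Merge per-instance masks into one binary mask.

    Assumed rule (not taken from the paper): a pixel is foreground if,
    for at least one instance, relevance * mask probability > threshold.
    """
    height, width = len(mask_logits[0]), len(mask_logits[0][0])
    merged = [[0] * width for _ in range(height)]

    # If the no-target branch fires, return an empty mask.
    if sigmoid(no_target_logit) > threshold:
        return merged

    for mask, rel in zip(mask_logits, relevance_logits):
        weight = sigmoid(rel)  # how strongly this instance matches the text
        for y in range(height):
            for x in range(width):
                if weight * sigmoid(mask[y][x]) > threshold:
                    merged[y][x] = 1
    return merged


# Three candidate instances on a 1x3 image; only the first two match the text.
masks = [
    [[5.0, -5.0, -5.0]],
    [[-5.0, 5.0, -5.0]],
    [[-5.0, -5.0, 5.0]],
]
relevance = [4.0, 4.0, -4.0]

print(aggregate_instances(masks, relevance, no_target_logit=-4.0))  # [[1, 1, 0]]
print(aggregate_instances(masks, relevance, no_target_logit=4.0))   # [[0, 0, 0]]
```

Weighting each instance by its relevance before merging lets an irrelevant but confidently segmented object (the third instance here) drop out of the final mask, which is the behavior the summary attributes to instance-level reasoning.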

Instance-Aware Generalized Referring Expression Segmentation
