Instance-Aware Generalized Referring Expression Segmentation

E-Ro Nguyen, Hieu Le, Dimitris Samaras, Michael Ryoo
2024-11-23
Abstract: Recent works on Generalized Referring Expression Segmentation (GRES) struggle with handling complex expressions referring to multiple distinct objects. This is because these methods typically employ an end-to-end foreground-background segmentation and lack a mechanism to explicitly differentiate and associate different object instances to the text query. To this end, we propose InstAlign, a method that incorporates object-level reasoning into the segmentation process. Our model leverages both text and image inputs to extract a set of object-level tokens that capture both the semantic information in the input prompt and the objects within the image. By modeling the text-object alignment via instance-level supervision, each token uniquely represents an object segment in the image, while also aligning with relevant semantic information from the text. Extensive experiments on the gRefCOCO and Ref-ZOM benchmarks demonstrate that our method significantly advances state-of-the-art performance, setting a new standard for precise and flexible GRES.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the handling of complex expressions in **Generalized Referring Expression Segmentation (GRES)**, especially expressions that refer to multiple distinct objects. Specifically:

1. **Limitations of existing methods**:
   - Existing GRES methods usually adopt end-to-end foreground-background segmentation and lack a mechanism to explicitly differentiate object instances and associate them with the text query.
   - These methods perform poorly on complex expressions (for example, "the girl who holds a small dog and the dog on the right") and often fail to accurately segment each described object.
2. **Objectives**:
   - Propose a method that explicitly differentiates object instances and associates them with the text query, so that multi-object segmentation becomes more accurate.
   - Address the weaknesses of existing methods on complex multi-object expressions and improve model performance in complex scenes.

### Solution

To solve these problems, the paper proposes **InstAlign**, a GRES model that introduces instance-level reasoning. Its main innovations are:

1. **Instance-aware segmentation framework**:
   - InstAlign combines text and image inputs to extract a set of object-level tokens. These tokens capture both the semantic information in the input prompt and the objects in the image.
   - Each token uniquely represents one object segment in the image and is aligned with the relevant semantic information in the text.
2. **Phrase-Object Alignment mechanism**:
   - A Phrase-Object Alignment loss encourages each segmented object to align with a specific semantic part of the input text.
   - With this alignment, the model not only identifies and segments the correct objects but also captures fine-grained relationships between text phrases and visual instances.
3. **Adaptive Instance Aggregation module**:
   - An Adaptive Instance Aggregation (AIA) module dynamically merges the segmented object instances according to their correlation scores, which improves overall segmentation performance.
   - This improves the model's robustness and segmentation accuracy in complex scenes.
4. **No-target Predictor**:
   - A No-target Predictor estimates whether the input expression refers to any target in the image at all.

With these components, InstAlign handles complex referring expressions substantially better and achieves state-of-the-art performance on the gRefCOCO and Ref-ZOM benchmarks.
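The idea of merging per-instance masks according to their scores, gated by a no-target prediction, can be illustrated with a small Python sketch. This is not the paper's AIA module: the function name `aggregate_instances`, the max-over-weighted-masks rule, the 0.5 threshold, and the toy masks are all assumptions chosen only to show how score-weighted aggregation of instance masks might behave.

```python
from typing import List

Mask = List[List[float]]  # H x W per-pixel foreground probabilities


def aggregate_instances(
    masks: List[Mask],
    scores: List[float],
    no_target_prob: float = 0.0,
    threshold: float = 0.5,
) -> List[List[int]]:
    """Merge per-instance masks into one binary mask (illustrative only).

    `scores` stand in for the correlation scores mentioned in the summary;
    the exact aggregation rule used by InstAlign is not given there.
    """
    if len(masks) != len(scores):
        raise ValueError("need exactly one score per instance mask")
    if not masks:
        return []
    height, width = len(masks[0]), len(masks[0][0])

    # Hypothetical no-target gate: return an empty mask when the expression
    # is judged to refer to nothing in the image.
    if no_target_prob > threshold:
        return [[0] * width for _ in range(height)]

    merged = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Weight each instance's pixel probability by its score and
            # keep the strongest response across instances.
            best = max(s * m[y][x] for m, s in zip(masks, scores))
            merged[y][x] = 1 if best > threshold else 0
    return merged


# Toy 2x3 image with three candidate instances (made-up values).
girl = [[0.9, 0.1, 0.0], [0.9, 0.1, 0.0]]
dog = [[0.0, 0.1, 0.9], [0.0, 0.1, 0.9]]
bench = [[0.0, 0.9, 0.0], [0.0, 0.9, 0.0]]

merged = aggregate_instances([girl, dog, bench], scores=[0.95, 0.9, 0.05])
print(merged)  # [[1, 0, 1], [1, 0, 1]]
```

In this toy case the two highly scored instances (girl and dog) survive in the merged mask, while the low-scored bench is suppressed even though its own mask is confident. A plain foreground-background model has no per-instance score with which to make that distinction.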