Referring Image Segmentation Without Text Annotations

Jing Liu, Huajie Jiang, Yandong Bi, Yongli Hu, Baocai Yin
DOI: https://doi.org/10.1007/978-981-97-5615-5_23
2024-01-01
Abstract: Referring Image Segmentation (RIS) is an essential topic in visual language understanding that aims to segment the target instance in an image referred to by a language description. Conventional RIS methods rely on expensive manual annotations in the form of image-text-mask triplets, of which the text annotations are the hardest to acquire. To eliminate this heavy dependence on human annotation, we propose a novel RIS method, Referring Image Segmentation without Text Annotations (WoTA), which replaces textual annotations with pseudo-queries generated from visual information. Specifically, we design a novel training-testing scheme. In the training phase, a Pseudo-Query Generation Scheme (PQGS) draws on the pre-trained cross-modal knowledge in CLIP to generate a pseudo-query related to both global and local visual information. In the testing phase, the CLIP text encoder is applied directly to the test statements to produce real query language features. Extensive experiments on several benchmark datasets demonstrate the advantage of the proposed WoTA over several zero-shot baselines for the task, and even over a weakly supervised referring image segmentation method.
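The training-testing asymmetry described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how PQGS combines global and local information, so the weighted blend below, the function names (`make_pseudo_query`, `get_query`), the `alpha` weight, and the random vectors standing in for CLIP embeddings are all assumptions made for illustration. The only idea taken from the abstract is that the query comes from visual features during training and from the text encoder during testing, which is plausible because CLIP maps both modalities into a shared embedding space.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (CLIP embeddings are usually compared this way)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def make_pseudo_query(global_feat, local_feat, alpha=0.5):
    """Hypothetical fusion: blend global and local visual embeddings.

    The real PQGS may combine them differently; this weighted sum is
    an assumption made only to show the data flow.
    """
    g = l2_normalize(global_feat)
    loc = l2_normalize(local_feat)
    blended = [alpha * a + (1.0 - alpha) * b for a, b in zip(g, loc)]
    return l2_normalize(blended)


def get_query(mode, global_feat=None, local_feat=None, text_feat=None):
    """Select the query source according to the phase."""
    if mode == "train":
        # No text annotation available: build the query from visual features.
        return make_pseudo_query(global_feat, local_feat)
    if mode == "test":
        # A real referring expression is available: use its text embedding.
        return l2_normalize(text_feat)
    raise ValueError(f"unknown mode: {mode!r}")


# Random stand-ins for embeddings in a shared 512-dimensional space.
rng = random.Random(0)
dim = 512
global_feat = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # whole-image embedding
local_feat = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # target-region embedding
text_feat = [rng.gauss(0.0, 1.0) for _ in range(dim)]    # test-time text embedding

train_query = get_query("train", global_feat=global_feat, local_feat=local_feat)
test_query = get_query("test", text_feat=text_feat)
```

In both phases the downstream segmentation model receives a unit-length query of the same dimensionality, which is what would let a model trained on pseudo-queries accept real text features at test time.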