Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation

Seonghoon Yu, Paul Hongsuck Seo, Jeany Son
2024-07-17
Abstract: We propose a new framework that automatically generates high-quality segmentation masks with their referring expressions as pseudo supervisions for referring image segmentation (RIS). These pseudo supervisions allow the training of any supervised RIS methods without the cost of manual labeling. To achieve this, we incorporate existing segmentation and image captioning foundation models, leveraging their broad generalization capabilities. However, the naive incorporation of these models may generate non-distinctive expressions that do not distinctively refer to the target masks. To address this challenge, we propose two-fold strategies that generate distinctive captions: 1) 'distinctive caption sampling', a new decoding method for the captioning model, to generate multiple expression candidates with detailed words focusing on the target. 2) 'distinctiveness-based text filtering' to further validate the candidates and filter out those with a low level of distinctiveness. These two strategies ensure that the generated text supervisions can distinguish the target from other objects, making them appropriate for the RIS annotations. Our method significantly outperforms both weakly and zero-shot SoTA methods on the RIS benchmark datasets. It also surpasses fully supervised methods in unseen domains, proving its capability to tackle the open-world challenge within RIS. Furthermore, integrating our method with human annotations yields further improvements, highlighting its potential in semi-supervised learning applications.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to reduce the cost of manual annotation while improving the performance and generalization ability of Referring Image Segmentation (RIS). Specifically:

1. **Reduce the cost of manual annotation**: Traditional RIS methods rely heavily on expensive, time-consuming manually annotated datasets, which include pixel-level instance masks and their corresponding referring descriptions. This dependence not only increases the training cost but also limits the generalization ability of the model in new domains.
2. **Improve performance and generalization ability**: Existing weakly supervised and zero-shot RIS methods try to reduce the dependence on manual annotation, but they still fall short in precision and generalization. For example, weakly supervised methods train on image-text pairs but lack spatial location information; zero-shot methods rely on large-scale pre-trained models, but the ambiguity of scene-level captions makes it difficult to locate specific objects accurately.

To solve these problems, the authors propose a new framework, Pseudo-RIS, which automatically generates high-quality segmentation masks and their corresponding referring expressions as pseudo-supervised data. By combining multiple foundation models (such as SAM [36] and CoCa [106]), Pseudo-RIS generates accurate pseudo-supervised data and also addresses the limited generalization of supervised RIS.

### Specific solutions

1. **Accurate mask extraction**: Use a segmentation foundation model (such as SAM) to extract high-quality instance-level segmentation masks from unannotated images.
2. **Generate distinctive referring expressions**:
   - **Distinctive caption sampling**: A new decoding strategy for the captioning model that generates multiple candidate captions containing detailed words focused on the target region.
   - **Distinctiveness-based text filtering**: Further validate the generated candidates and filter out captions with low distinctiveness, i.e. those that are ambiguous or inaccurate.

Through these steps, Pseudo-RIS generates high-quality pseudo-supervised data suitable for RIS training, achieving state-of-the-art performance in various RIS settings and showing strong generalization in open-world challenges.

### Experimental results

Experiments show that Pseudo-RIS significantly outperforms existing zero-shot and weakly supervised methods on multiple benchmark datasets (such as RefCOCO, RefCOCO+, RefCOCOg and PhraseCut), and also performs well in cross-domain and open-world settings. In particular, experiments on the PhraseCut dataset show that Pseudo-RIS maintains stable performance on unseen object categories and even outperforms fully supervised methods.

In summary, this paper aims to reduce the cost of manual annotation and improve the performance and generalization ability of the RIS task by automatically generating high-quality pseudo-supervised data.
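
To make the distinctiveness-based text filtering idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary above does not say how distinctiveness is scored, so the sketch assumes a precomputed caption-by-mask similarity matrix (for example, from some image-text model) and a hypothetical `margin` threshold. The function names `distinctiveness` and `filter_captions` and all the numbers are illustrative only.

```python
from typing import List, Sequence


def distinctiveness(scores: Sequence[float], target: int) -> float:
    """Margin between the target mask's score and the best competing mask.

    `scores[j]` is an assumed similarity between one caption and mask j.
    """
    others = [s for j, s in enumerate(scores) if j != target]
    if not others:
        # Only one mask in the image: nothing to confuse the caption with.
        return float("inf")
    return scores[target] - max(others)


def filter_captions(
    captions: Sequence[str],
    sim: Sequence[Sequence[float]],
    target: int,
    margin: float = 0.05,  # hypothetical threshold, not taken from the paper
) -> List[str]:
    """Keep captions that match the target mask clearly better than any other mask."""
    return [
        caption
        for caption, row in zip(captions, sim)
        if distinctiveness(row, target) >= margin
    ]


# Toy example: three candidate captions for mask 0 in an image with three masks.
captions = ["a dog", "a brown dog on the left", "a dog lying on grass"]
sim = [
    [0.30, 0.29, 0.10],  # "a dog" fits masks 0 and 1 almost equally well
    [0.35, 0.20, 0.08],  # clearly prefers mask 0
    [0.28, 0.31, 0.12],  # actually fits mask 1 better
]
print(filter_captions(captions, sim, target=0))
# ['a brown dog on the left']
```

A margin over the strongest competing mask is only one possible criterion; it is used here because it directly encodes the requirement that a referring expression distinguish the target from the other objects in the image.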