Abstract: We propose a new framework that automatically generates high-quality segmentation masks with their referring expressions as pseudo supervisions for referring image segmentation (RIS). These pseudo supervisions allow the training of any supervised RIS methods without the cost of manual labeling. To achieve this, we incorporate existing segmentation and image captioning foundation models, leveraging their broad generalization capabilities. However, the naive incorporation of these models may generate non-distinctive expressions that do not distinctively refer to the target masks. To address this challenge, we propose two-fold strategies that generate distinctive captions: 1) 'distinctive caption sampling', a new decoding method for the captioning model, to generate multiple expression candidates with detailed words focusing on the target. 2) 'distinctiveness-based text filtering' to further validate the candidates and filter out those with a low level of distinctiveness. These two strategies ensure that the generated text supervisions can distinguish the target from other objects, making them appropriate for the RIS annotations. Our method significantly outperforms both weakly and zero-shot SoTA methods on the RIS benchmark datasets. It also surpasses fully supervised methods in unseen domains, proving its capability to tackle the open-world challenge within RIS. Furthermore, integrating our method with human annotations yields further improvements, highlighting its potential in semi-supervised learning applications.

What problem does this paper attempt to address?

This paper aims to reduce the cost of manual annotation for Referring Image Segmentation (RIS) while improving the performance and generalization ability of RIS models. Specifically:

1. **Reduce the cost of manual annotation**: Traditional RIS methods rely heavily on expensive, time-consuming manually annotated datasets that pair pixel-level instance masks with their referring expressions. This dependence raises the training cost and limits how well a model generalizes to new domains.
2. **Improve performance and generalization ability**: Existing weakly-supervised and zero-shot RIS methods reduce the dependence on manual annotation, but they still fall short in precision and generalization. Weakly-supervised methods train on image-text pairs that lack spatial location information. Zero-shot methods use large-scale pre-trained models, but scene-level captions are ambiguous, which makes it hard to localize a specific object.

To solve these problems, the authors propose a new framework, Pseudo-RIS, which automatically generates high-quality segmentation masks and their corresponding referring expressions as pseudo-supervision. By combining foundation models (such as SAM [36] and CoCa [106]), Pseudo-RIS generates accurate pseudo-supervision and addresses the generalization problem of supervised RIS.

### Specific solutions

1. **Accurate mask extraction**: Use a segmentation foundation model (such as SAM) to extract high-quality instance-level segmentation masks from unannotated images.
2. **Generate distinctive referring expressions**:
   - **Distinctive caption sampling**: A new decoding strategy for the captioning model that generates multiple candidate captions containing detailed words focused on the target region.
   - **Distinctiveness-based text filtering**: Further validate the candidate captions and filter out those with a low level of distinctiveness, that is, captions that do not clearly separate the target from other objects.

Through these steps, Pseudo-RIS generates high-quality pseudo-supervision suitable for RIS training, achieving state-of-the-art performance in various RIS settings and showing strong generalization in open-world challenges.

### Experimental results

Experiments show that Pseudo-RIS significantly outperforms existing zero-shot and weakly-supervised methods on multiple benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, and PhraseCut), and also performs well in cross-domain and open-world settings. In particular, experiments on the PhraseCut dataset show that Pseudo-RIS maintains stable performance on unseen object categories and even outperforms fully-supervised methods.

In summary, this paper aims to reduce the cost of manual annotation and improve the performance and generalization ability of RIS by automatically generating high-quality pseudo-supervision.
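
The distinctiveness-based text filtering step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that some vision-language model has already produced a matching score between each candidate caption and each mask in the image, and it uses a simple margin rule as the distinctiveness criterion. The function names, the `margin` parameter, and the example scores are all hypothetical.

```python
from typing import List, Sequence


def distinctiveness(scores: Sequence[float], target: int) -> float:
    """Margin between the target mask's score and the best competing mask."""
    others = [s for i, s in enumerate(scores) if i != target]
    if not others:
        # Only one mask in the image: nothing for the caption to be confused with.
        return float("inf")
    return scores[target] - max(others)


def filter_captions(
    captions: Sequence[str],
    similarity: Sequence[Sequence[float]],
    target: int,
    margin: float = 0.05,
) -> List[str]:
    """Keep captions that match the target mask better than every other mask.

    similarity[i][j] is an ASSUMED caption-to-mask matching score for
    caption i and mask j (higher means a better match). The margin rule is
    an illustrative stand-in for the paper's distinctiveness measure.
    """
    if len(captions) != len(similarity):
        raise ValueError("need one row of scores per caption")
    return [
        caption
        for caption, row in zip(captions, similarity)
        if distinctiveness(row, target) >= margin
    ]


# Hypothetical scores for three candidate captions against three masks;
# mask 0 is the target.
captions = ["a dog", "a brown dog lying on the sofa", "an animal"]
similarity = [
    [0.30, 0.29, 0.10],  # matches masks 0 and 1 almost equally: ambiguous
    [0.34, 0.18, 0.08],  # clearly prefers the target
    [0.22, 0.24, 0.12],  # prefers a different mask
]
kept = filter_captions(captions, similarity, target=0)
print(kept)
```

In this toy example only the detailed caption survives, which mirrors the motivation in the text: generic captions such as "a dog" can match several objects in the same image and are therefore unsuitable as referring expressions.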

Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation

Towards Robust Referring Image Segmentation

Distillation and Supplementation of Features for Referring Image Segmentation

MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation

In Defense Of Multi-Source Omni-Supervised Efficient Convnet For Robust Semantic Segmentation In Heterogeneous Unseen Domains

Segment, Select, Correct: A Framework for Weakly-Supervised Referring Segmentation

Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence

Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation

RISAM: Referring Image Segmentation via Mutual-Aware Attention Features

A Simple Baseline with Single-encoder for Referring Image Segmentation

CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation

Mask Grounding for Referring Image Segmentation

Boosting Weakly-Supervised Referring Image Segmentation via Progressive Comprehension

RRSIS: Referring Remote Sensing Image Segmentation

A Hybrid Framework for Referring Image Segmentation: Dual-Decoder Model with SAM Complementation

HARIS: Human-Like Attention for Reference Image Segmentation

Synthetic Instance Segmentation from Semantic Image Segmentation Masks

Fully Data-Driven Pseudo Label Estimation for Pointly-Supervised Panoptic Segmentation

Extending CLIP's Image-Text Alignment to Referring Image Segmentation

Two-stage Visual Cues Enhancement Network for Referring Image Segmentation