Abstract: Recently, a few open-vocabulary methods have been proposed that employ a unified architecture to tackle generic segmentation and detection tasks. However, their performance still lags behind task-specific models due to the conflict between different tasks, and their open-vocabulary capability is limited by inadequate use of CLIP. To address these challenges, we present a universal transformer-based framework, abbreviated as OpenSD, which uses the same architecture and network parameters to handle open-vocabulary segmentation and detection tasks. First, we introduce a decoder decoupled learning strategy to alleviate the semantic conflict between thing and stuff categories so that each individual task can be learned more effectively under the same framework. Second, to better leverage CLIP for end-to-end segmentation and detection, we propose dual classifiers to handle the in-vocabulary domain and out-of-vocabulary domain, respectively. The text encoder is further trained to be region-aware for both thing and stuff categories through decoupled prompt learning, enabling them to filter out duplicated and low-quality predictions, which is important to end-to-end segmentation and detection. Extensive experiments are conducted on multiple datasets under various circumstances. The results demonstrate that OpenSD outperforms state-of-the-art open-vocabulary segmentation and detection methods in both closed- and open-vocabulary settings. Code is available at https://github.com/strongwolf/OpenSD

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address several key issues in open-vocabulary image segmentation and detection tasks:

1. **Task Conflict**: Existing open-vocabulary methods face semantic conflicts when handling different tasks, resulting in performance that is inferior to task-specific models.
2. **Insufficient Utilization of CLIP**: Current methods fail to fully utilize CLIP (a pre-trained multimodal model), especially when dealing with open-vocabulary tasks.
3. **Lack of Generality**: Most existing methods lack flexibility and cannot generalize across different tasks, requiring retraining of the model for each new task.

To overcome these issues, the authors propose OpenSD, a transformer-based general framework that handles open-vocabulary segmentation and detection tasks with the same architecture and network parameters. Specifically, OpenSD relies on the following two key techniques:

1. **Decoupled Learning Strategy**: Introduces a decoder decoupled learning strategy to reduce the semantic conflict between thing and stuff categories, so that each task is learned more effectively within the shared framework.
2. **Dual Classifiers**: Proposes dual classifiers to handle the in-vocabulary and out-of-vocabulary domains separately, making better use of CLIP. Through decoupled prompt learning, the text encoder becomes region-aware for both thing and stuff categories, which helps filter out low-quality and duplicate predictions.

### Main Contributions

- **Unified Framework**: OpenSD provides a unified framework capable of handling multiple segmentation and detection tasks simultaneously, including panoptic segmentation, instance segmentation, semantic segmentation, and object detection.
- **High Performance**: Experimental results show that OpenSD outperforms existing open-vocabulary segmentation and detection methods across multiple datasets in both closed-vocabulary and open-vocabulary settings.
- **Flexibility**: OpenSD generalizes across different tasks without the need to retrain the model.

### Experimental Results

- **Closed-Vocabulary Setting**: Experimental results on the COCO dataset show that OpenSD achieves excellent performance in tasks such as panoptic segmentation, instance segmentation, and object detection.
- **Open-Vocabulary Setting**: In transfer experiments from COCO to the ADE20K and Cityscapes datasets, OpenSD demonstrates strong generalization capabilities, particularly in handling unseen categories.

In summary, the paper addresses key issues in open-vocabulary image segmentation and detection by proposing the OpenSD framework, and it suggests new directions for future research.
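
The dual-classifier idea can be illustrated with a toy sketch. The abstract does not say how the two classifiers are built or combined, so everything below is an assumption made for illustration and not the authors' implementation: the function names, the use of cosine similarity, and the rule of scoring in-vocabulary classes with a learned query embedding and out-of-vocabulary classes with a frozen CLIP region embedding are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dual_classify(query_emb, clip_region_emb, text_embs, class_names, train_vocab):
    """Score every class with one of two classifiers (hypothetical routing rule).

    query_emb:       embedding from the trained decoder (assumed in-vocabulary branch)
    clip_region_emb: embedding from a frozen CLIP image encoder (assumed
                     out-of-vocabulary branch)
    text_embs:       one text embedding per class name
    train_vocab:     set of class names seen during training
    """
    scores = {}
    for name, text_emb in zip(class_names, text_embs):
        if name in train_vocab:
            # In-vocabulary class: trust the embedding learned on training data.
            scores[name] = cosine(query_emb, text_emb)
        else:
            # Out-of-vocabulary class: fall back to the frozen CLIP embedding.
            scores[name] = cosine(clip_region_emb, text_emb)
    best = max(scores, key=scores.get)
    return best, scores


# Toy 2-D embeddings, chosen only to make the routing visible.
class_names = ["cat", "zebra"]
text_embs = [[1.0, 0.0], [0.0, 1.0]]
train_vocab = {"cat"}

label, scores = dual_classify(
    query_emb=[0.6, 0.8],
    clip_region_emb=[0.0, 1.0],
    text_embs=text_embs,
    class_names=class_names,
    train_vocab=train_vocab,
)
print(label, scores)
```

In this toy case the unseen class "zebra" is scored by the CLIP branch and wins, while "cat" is scored by the learned branch. A real system would work with high-dimensional embeddings and would also need the duplicate and low-quality filtering that the abstract attributes to decoupled prompt learning, which this sketch does not attempt.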

OpenSD: Unified Open-Vocabulary Segmentation and Detection
