Abstract: We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets. To bridge the gap in vocabulary and annotation granularity, we first introduce a pre-trained text encoder to encode all the visual concepts in the two tasks and learn a common semantic space for them. This gives us reasonably good results compared with counterparts trained on the segmentation task only. To further reconcile the two tasks, we identify two discrepancies: i) task discrepancy: segmentation requires extracting masks for both foreground objects and background stuff, while detection cares only about the former; ii) data discrepancy: box and mask annotations have different spatial granularity, and thus are not directly interchangeable. To address these issues, we propose decoupled decoding to reduce the interference between foreground and background, and conditioned mask decoding to assist in generating masks for given boxes. We then develop a simple encoder-decoder model encompassing all three techniques and train it jointly on COCO and Objects365. After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection. Specifically, OpenSeeD beats the state-of-the-art method for open-vocabulary instance and panoptic segmentation across 5 datasets, and outperforms previous work for open-vocabulary detection on LVIS and ODinW under similar settings. When transferred to specific tasks, our model achieves new SoTA for panoptic segmentation on COCO and ADE20K, and for instance segmentation on ADE20K and Cityscapes (the bottom row in Fig. 1 compares the performance of OpenSeeD with previous SoTA methods). Finally, we note that OpenSeeD is the first to explore the potential of joint training on segmentation and detection, and we hope it can serve as a strong baseline for developing a single model for both tasks in the open world.
Code will be released at https://github.com/IDEA-Research/OpenSeeD.
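
The common semantic space described in the abstract can be illustrated with a minimal sketch. It is not the OpenSeeD implementation: the random vectors stand in for the output of a pre-trained text encoder, and the function names, the example vocabularies, and the temperature value are all assumptions made for illustration. The idea shown is only that concept names from a segmentation vocabulary and a detection vocabulary are merged into one set of text embeddings, and each decoder query is classified by its similarity to those embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def classify_queries(query_emb, text_emb, temperature=0.07):
    """Score each query embedding against every concept embedding.

    Returns a (num_queries, num_concepts) matrix of softmax probabilities.
    The temperature value is an illustrative assumption.
    """
    q = l2_normalize(query_emb)
    t = l2_normalize(text_emb)
    logits = q @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical vocabularies: segmentation includes background "stuff",
# detection covers foreground objects only.
seg_vocab = ["person", "dog", "sky", "grass"]
det_vocab = ["person", "dog", "bicycle"]
vocab = sorted(set(seg_vocab) | set(det_vocab))  # one shared concept list

# Stand-in for a pre-trained text encoder: one random vector per concept name.
text_emb = rng.normal(size=(len(vocab), 16))

# Stand-in for decoder queries: noisy copies of the "dog" and "sky" embeddings.
targets = ["dog", "sky"]
query_emb = np.stack([text_emb[vocab.index(name)] for name in targets])
query_emb = query_emb + 0.05 * rng.normal(size=query_emb.shape)

probs = classify_queries(query_emb, text_emb)
pred = [vocab[i] for i in probs.argmax(axis=1)]
print(pred)
```

Because both datasets share one embedding table, a concept that appears only in the detection vocabulary (here "bicycle") can still be scored for a query that comes from the segmentation branch, which is the property that makes joint training across vocabularies possible in this kind of design.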

OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation

Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation

Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

OpenSD: Unified Open-Vocabulary Segmentation and Detection

OpenMask3D: Open-Vocabulary 3D Instance Segmentation

Learning Open-vocabulary Semantic Segmentation Models from Natural Language Supervision

Overcoming Domain Limitations in Open-vocabulary Segmentation

Open-Vocabulary Camouflaged Object Segmentation

Open-Vocabulary Audio-Visual Semantic Segmentation

A Simple Framework for Open-Vocabulary Segmentation and Detection

Auto-Vocabulary Semantic Segmentation

OV-DAR: Open-Vocabulary Object Detection and Attributes Recognition

Towards Open-Vocabulary Video Semantic Segmentation

Diffusion Models for Open-Vocabulary Segmentation

XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation

Global Knowledge Calibration for Fast Open-Vocabulary Segmentation

Open-Vocabulary Segmentation with Semantic-Assisted Calibration

Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

Cross-Class Domain Adaptive Semantic Segmentation with Visual Language Models