Abstract: We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets. To bridge the gap in vocabulary and annotation granularity, we first introduce a pre-trained text encoder to encode all the visual concepts in the two tasks and learn a common semantic space for them. This gives reasonably good results compared with counterparts trained on the segmentation task only. To further reconcile the two tasks, we identify two discrepancies: i) task discrepancy: segmentation requires extracting masks for both foreground objects and background stuff, while detection cares only about the former; ii) data discrepancy: box and mask annotations have different spatial granularity and are thus not directly interchangeable. To address these issues, we propose decoupled decoding to reduce the interference between foreground and background, and conditioned mask decoding to assist in generating masks for given boxes. We then develop a simple encoder-decoder model encompassing all three techniques and train it jointly on COCO and Objects365. After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection. Specifically, OpenSeeD beats the state-of-the-art method for open-vocabulary instance and panoptic segmentation across 5 datasets, and outperforms previous work for open-vocabulary detection on LVIS and ODinW under similar settings. When transferred to specific tasks, our model achieves new SoTA for panoptic segmentation on COCO and ADE20K, and for instance segmentation on ADE20K and Cityscapes (the bottom row of Fig. 1 compares the performance of OpenSeeD with previous SoTA methods). Finally, we note that OpenSeeD is the first to explore the potential of joint training on segmentation and detection, and we hope it can serve as a strong baseline for developing a single model for both tasks in the open world.
Code will be released at https://github.com/IDEA-Research/OpenSeeD.
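
As a rough illustration of the common semantic space described above, the following minimal sketch scores a visual embedding against text embeddings of concept names drawn from a merged segmentation and detection vocabulary. It is not the OpenSeeD implementation: the hand-written vectors stand in for the outputs of a pre-trained text encoder, and the names `l2_normalize`, `classify_by_text_similarity`, `segmentation_vocab`, `detection_vocab`, and `shared_space` are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def classify_by_text_similarity(query_embedding, concept_embeddings):
    """Pick the concept whose text embedding has the highest cosine similarity.

    query_embedding: visual feature for one predicted region (assumed input).
    concept_embeddings: dict mapping concept name -> text embedding.
    """
    query = l2_normalize(query_embedding)
    scores = {}
    for name, embedding in concept_embeddings.items():
        text = l2_normalize(embedding)
        scores[name] = sum(q * t for q, t in zip(query, text))
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors standing in for text-encoder outputs (assumption, not real data).
segmentation_vocab = {"person": [1.0, 0.0, 0.0], "sky": [0.0, 1.0, 0.0]}
detection_vocab = {"person": [1.0, 0.0, 0.0], "giraffe": [0.0, 0.0, 1.0]}

# Both vocabularies live in one embedding space, so they can simply be merged.
shared_space = {**segmentation_vocab, **detection_vocab}

best, scores = classify_by_text_similarity([0.1, 0.0, 0.9], shared_space)
print(best, scores)
```

Because class labels are replaced by embeddings of their names, a concept that appears in only one dataset (here "sky" or "giraffe") can still be scored for any region, which is the property that lets the two annotation sources share one classifier in this sketch.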

Semi-Open Set Object Detection Algorithm Leveraged by Multi-Modal Large Language Models

A Lightweight SE-YOLOv3 Network for Multi-Scale Object Detection in Remote Sensing Imagery

Multiclass objects detection algorithm using DarkNet-53 and DenseNet for intelligent vehicles

Object Detectors in the Open Environment: Challenges, Solutions, and Outlook

OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Open-Set 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

On the Potential of Open-Vocabulary Models for Object Detection in Unusual Street Scenes

An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects

Towards Open-set Camera 3D Object Detection

A Simple Framework for Open-Vocabulary Segmentation and Detection

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Open-Set Object Detection Using Classification-free Object Proposal and Instance-level Contrastive Learning

YOLOOC: YOLO-based Open-Class Incremental Object Detection with Novel Class Discovery

CF-YOLOX: An Autonomous Driving Detection Model for Multi-Scale Object Detection

Multi-scene small object detection with modified YOLOv4

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Towards Evidential and Class Separable Open Set Object Detection

Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection

More Pictures Say More: Visual Intersection Network for Open Set Object Detection