Abstract: Open-vocabulary panoptic segmentation aims to segment and classify everything in diverse scenes across an unbounded vocabulary. Existing methods typically employ either a two-stage or a single-stage framework. The two-stage framework crops the image multiple times using masks produced by a mask generator and then extracts features from each crop, while the single-stage framework relies on a heavyweight mask decoder to make up for the lack of spatial position information through self-attention and cross-attention in multiple stacked Transformer blocks. Both approaches incur substantial computational overhead, which hinders the efficiency of model inference. To fill this efficiency gap, we propose EOV-Seg, a novel single-stage, shared, efficient, and spatial-aware framework designed for open-vocabulary panoptic segmentation. Specifically, EOV-Seg innovates in two aspects. First, a Vocabulary-Aware Selection (VAS) module is proposed to improve the semantic comprehension of visual aggregated features and alleviate the feature interaction burden on the mask decoder. Second, we introduce Two-way Dynamic Embedding Experts (TDEE), which efficiently utilizes the spatial awareness capabilities of the ViT-based CLIP backbone. To the best of our knowledge, EOV-Seg is the first open-vocabulary panoptic segmentation framework designed for efficiency; it runs faster than state-of-the-art methods while achieving competitive performance. With COCO training only, EOV-Seg achieves 24.2 PQ, 31.6 mIoU, and 12.7 FPS on the ADE20K dataset for panoptic and semantic segmentation tasks, and its inference is 4-21 times faster than state-of-the-art methods. In particular, equipped with a ResNet-50 backbone, EOV-Seg runs at 25 FPS with only 71M parameters on a single RTX 3090 GPU. Code is available at https://github.com/nhw649/EOV-Seg.

OVOSE: Open-Vocabulary Semantic Segmentation in Event-Based Cameras

Semi-Supervised Learning for Visual Bird's Eye View Semantic Segmentation

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

A Multi-Scale Recurrent Framework for Motion Segmentation With Event Camera

Un-EvMoSeg: Unsupervised Event-based Independent Motion Segmentation

Event-Free Moving Object Segmentation from Moving Ego Vehicle

OVO-SLAM: Open-Vocabulary Online Simultaneous Localization and Mapping

Towards Open-Vocabulary Video Semantic Segmentation

EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation

Open-Vocabulary Remote Sensing Image Semantic Segmentation

Learning Open-vocabulary Semantic Segmentation Models from Natural Language Supervision

Event-guided Low-light Video Semantic Segmentation

Event-assisted Low-Light Video Object Segmentation

Open-Vocabulary Camouflaged Object Segmentation

In Defense Of Multi-Source Omni-Supervised Efficient Convnet For Robust Semantic Segmentation In Heterogeneous Unseen Domains

PL-EVIO: Robust Monocular Event-based Visual Inertial Odometry with Point and Line Features

Video Instance Segmentation in an Open-World

OCC-VO: Dense Mapping via 3D Occupancy-Based Visual Odometry for Autonomous Driving

Out of the Room: Generalizing Event-Based Dynamic Motion Segmentation for Complex Scenes

OE-BevSeg: An Object Informed and Environment Aware Multimodal Framework for Bird's-eye-view Vehicle Semantic Segmentation