Abstract: Open-vocabulary panoptic segmentation aims to segment and classify everything in diverse scenes across an unbounded vocabulary. Existing methods typically employ either a two-stage or a single-stage framework. The two-stage framework crops the image multiple times using masks produced by a mask generator and then extracts features from each crop, while the single-stage framework relies on a heavyweight mask decoder to make up for the lack of spatial position information through self-attention and cross-attention in multiple stacked Transformer blocks. Both approaches incur substantial computational overhead, which hinders the efficiency of model inference. To fill this efficiency gap, we propose EOV-Seg, a novel single-stage, shared, efficient, and spatial-aware framework designed for open-vocabulary panoptic segmentation. Specifically, EOV-Seg innovates in two aspects. First, a Vocabulary-Aware Selection (VAS) module is proposed to improve the semantic comprehension of visual aggregated features and alleviate the feature interaction burden on the mask decoder. Second, we introduce Two-way Dynamic Embedding Experts (TDEE), which efficiently utilize the spatial awareness capabilities of the ViT-based CLIP backbone. To the best of our knowledge, EOV-Seg is the first open-vocabulary panoptic segmentation framework oriented toward efficiency; it runs faster than state-of-the-art methods while achieving competitive performance. With COCO training only, EOV-Seg achieves 24.2 PQ, 31.6 mIoU, and 12.7 FPS on the ADE20K dataset for panoptic and semantic segmentation tasks, and its inference is 4-21 times faster than state-of-the-art methods. Notably, equipped with a ResNet-50 backbone, EOV-Seg runs at 25 FPS with only 71M parameters on a single RTX 3090 GPU. Code is available at https://github.com/nhw649/EOV-Seg.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **how to achieve efficient open-vocabulary panoptic segmentation with spatial awareness**. Specifically, existing methods for open-vocabulary panoptic segmentation have the following problems:

1. **Low computational efficiency**: The two-stage framework incurs large computational overhead by cropping the image multiple times with generated masks and extracting features from each crop. The single-stage framework relies on a heavyweight mask decoder that uses self-attention and cross-attention in multiple stacked Transformer blocks to make up for the lack of spatial location information, which also brings a high computational burden.
2. **Lack of spatial awareness**: Existing methods fail to fully utilize the spatial awareness of the vision-language model (VLM) during feature extraction, resulting in poor performance in instance recognition and semantic understanding.

To solve these problems, the authors propose **EOV-Seg**, a novel single-stage, shared, efficient, and spatially aware framework that aims to improve both the efficiency and the performance of open-vocabulary panoptic segmentation. The main innovations of EOV-Seg are:

- **Vocabulary-Aware Selection (VAS) module**: Guides the aggregated visual features to select the features most relevant to the text. This reduces the feature interaction burden on the mask decoder, so a lightweight decoder can be used, which lowers the computational requirements and speeds up inference.
- **Two-way Dynamic Embedding Experts (TDEE)**: Exploit the spatial awareness of the ViT-based CLIP backbone. A weight-allocation router dynamically evaluates the importance of the embedding experts to generate instance embeddings that are both semantically and spatially aware, which improves mask recognition.
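The two ideas in the list above can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the sigmoid gate, the cosine-similarity scoring, and the softmax router are all assumptions chosen only to make "select text-relevant features" and "weight experts with a router" concrete on plain Python lists.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def vocabulary_aware_gate(visual_feats, text_embeds):
    """Hypothetical VAS-like step: scale each visual feature by a gate
    derived from its best match among the vocabulary text embeddings."""
    gated = []
    for feat in visual_feats:
        score = max(cosine(feat, t) for t in text_embeds)
        gate = 1.0 / (1.0 + math.exp(-score))  # sigmoid keeps the gate in (0, 1)
        gated.append([gate * x for x in feat])
    return gated


def route_experts(expert_outputs, router_logits):
    """Hypothetical TDEE-like step: a router's softmax weights blend the
    embeddings produced by several experts into one instance embedding."""
    weights = softmax(router_logits)
    dim = len(expert_outputs[0])
    return [sum(w * e[i] for w, e in zip(weights, expert_outputs))
            for i in range(dim)]


# Toy usage: two vocabulary entries, two visual features, two experts.
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
gated = vocabulary_aware_gate([[2.0, 0.0], [-1.0, -1.0]], text_embeds)
instance_embed = route_experts([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In this toy setup the first visual feature aligns with a vocabulary entry and keeps more of its magnitude than the second, and equal router logits blend the two expert embeddings evenly. In a real model the gate and router would be learned layers operating on tensors rather than fixed functions on lists.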
The experimental results show that, trained on COCO only, EOV-Seg achieves 24.2 PQ, 31.6 mIoU, and 12.7 FPS on the ADE20K dataset. Compared with existing state-of-the-art methods, its inference is 4 to 21 times faster, it has hundreds of millions fewer parameters, and its computational complexity is significantly lower. EOV-Seg also performs well on the semantic segmentation task, which indicates its versatility. In summary, the paper addresses the low computational efficiency and limited spatial awareness of existing open-vocabulary panoptic segmentation methods by proposing the EOV-Seg framework, which delivers faster inference with competitive accuracy.
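To put the throughput figures in more familiar terms, the short sketch below converts frames per second into per-image latency. The helper names are hypothetical. The second helper shows what a 4x or 21x speedup would imply for a slower method, assuming the ratio is measured against the 12.7 FPS configuration; the source does not say which configuration or baselines the ratio refers to.

```python
def latency_ms(fps):
    """Per-image latency in milliseconds for a throughput given in FPS."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def implied_baseline_fps(fps, speedup):
    """Throughput of a method that is `speedup` times slower than `fps`.
    Assumption: the speedup is a simple ratio of throughputs."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return fps / speedup


# Figures quoted in the text: 12.7 FPS on ADE20K, 25 FPS with ResNet-50.
for label, fps in [("ADE20K setting", 12.7), ("ResNet-50 variant", 25.0)]:
    print(f"{label}: {fps} FPS = {latency_ms(fps):.1f} ms per image")

# Hypothetical reading of the 4x to 21x range against the 12.7 FPS figure.
for speedup in (4, 21):
    print(f"{speedup}x slower than 12.7 FPS = "
          f"{implied_baseline_fps(12.7, speedup):.2f} FPS")
```

Under these assumptions, 12.7 FPS corresponds to roughly 79 ms per image and 25 FPS to 40 ms per image.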

EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation
