Abstract: Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories, lacking the generalization ability to handle novel categories in real-world videos. To address this limitation, we make the following three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS), that contains well-annotated objects from 1,196 diverse categories, significantly surpassing the category size of existing datasets by more than one order of magnitude. Third, we propose an efficient Memory-Induced Transformer architecture, OV2Seg, to first achieve Open-Vocabulary VIS in an end-to-end manner with near real-time inference speed. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of OV2Seg on novel categories. The dataset and code are released at https://github.com/haochenheheda/LVVIS.

What problem does this paper attempt to address?

The paper addresses the inability of existing Video Instance Segmentation (VIS) methods to handle objects from categories unseen during training. Traditional VIS methods can only segment and classify the fixed set of categories present in the training set, so they do not generalize to novel categories in real-world videos. To address this limitation, the paper proposes the Open-Vocabulary Video Instance Segmentation (Open-Vocabulary VIS) task, which aims to simultaneously segment, track, and classify objects from open-set categories in videos, including novel categories not seen during training. The main contributions are:

1. **Introducing the Open-Vocabulary VIS task**: The model must handle the categories seen in training and also segment, track, and classify novel categories at test time.
2. **Constructing the LV-VIS dataset**: To benchmark Open-Vocabulary VIS, the paper collects the Large-Vocabulary Video Instance Segmentation dataset (LV-VIS), which contains objects from 1,196 categories, exceeding the category count of existing datasets by more than one order of magnitude.
3. **Proposing the OV2Seg model**: An end-to-end model built on a Memory-Induced Transformer architecture that performs Open-Vocabulary VIS at near real-time inference speed. OV2Seg consists of three modules:
   - **Universal Object Proposal module**: Proposes and segments objects of all categories.
   - **Memory-Induced Tracking module**: Dynamically aggregates object features through Memory Queries to achieve long-term tracking.
   - **Open-Vocabulary Classification module**: Classifies the tracked objects using text embeddings generated by a pre-trained text encoder.

Through these contributions, the paper aims to make VIS more practical and better able to generalize in real-world settings, particularly when handling objects from novel categories.
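
To make the tracking and classification steps more concrete, here is a minimal sketch of how a memory query might aggregate an object's per-frame features and then be matched against text embeddings. It is not the OV2Seg implementation. The exponential-moving-average update, the cosine-similarity softmax, the `momentum` and `temperature` values, the function names, and the toy embeddings are all assumptions made for illustration; in practice the text embeddings would come from a pre-trained text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def update_memory(memory, frame_embedding, momentum=0.8):
    """Blend the current frame's object embedding into the memory query.

    Assumption: a simple exponential moving average stands in for the
    dynamic aggregation described in the summary.
    """
    return [momentum * m + (1.0 - momentum) * f
            for m, f in zip(memory, frame_embedding)]


def classify_open_vocab(object_embedding, text_embeddings, temperature=0.07):
    """Softmax over cosine similarities between one object and each class text."""
    obj = l2_normalize(object_embedding)
    logits = []
    for text in text_embeddings:
        txt = l2_normalize(text)
        logits.append(sum(o * t for o, t in zip(obj, txt)) / temperature)
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and toy text embeddings (stand-ins for encoder output).
class_names = ["cat", "dog", "skateboard"]
text_embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

# Toy per-frame embeddings of one tracked object, all close to the "dog" direction.
frames = [
    [0.10, 1.00, -0.10, 0.05],
    [-0.05, 1.10, 0.00, 0.00],
    [0.00, 0.90, 0.10, 0.00],
]

memory = list(frames[0])
for embedding in frames[1:]:
    memory = update_memory(memory, embedding)

probs = classify_open_vocab(memory, text_embeddings)
predicted = class_names[probs.index(max(probs))]
print(predicted, [round(p, 4) for p in probs])
```

Because the class list is just a set of text embeddings, new categories can be added at test time by appending entries to `class_names` and `text_embeddings` without retraining the rest of the sketch, which is the property that makes this style of classifier open-vocabulary.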

Towards Open-Vocabulary Video Instance Segmentation
