Abstract: Recent DETR-based methods have advanced the development of Video Instance Segmentation (VIS) through transformers' efficiency and capability in modeling spatial and temporal information. Despite remarkable progress, existing works follow asynchronous designs, which model video sequences either through video-level queries only or by adopting query-sensitive cascade structures, resulting in difficulties when handling complex and challenging video scenarios. In this work, we analyze the cause of this phenomenon and the limitations of the current solutions, and propose to conduct synchronized modeling via a new framework named SyncVIS. Specifically, SyncVIS explicitly introduces video-level query embeddings and designs two key modules to synchronize video-level query embeddings with frame-level query embeddings: a synchronized video-frame modeling paradigm and a synchronized embedding optimization strategy. The former promotes the mutual learning of frame- and video-level embeddings, and the latter divides large video sequences into small clips for easier optimization. Extensive experimental evaluations are conducted on the challenging YouTube-VIS 2019, 2021, and 2022 and OVIS benchmarks, where SyncVIS achieves state-of-the-art results, demonstrating the effectiveness and generality of the proposed approach. The code is available at https://github.com/rkzheng99/SyncVIS.

What problem does this paper attempt to address?

This paper addresses several limitations of existing video instance segmentation (VIS) methods when they handle complex and challenging video scenes. Specifically:

1. **Problems with the asynchronous design of existing methods**:
   - Most current VIS methods adopt an asynchronous design: they model video sequences either through video-level queries only or through query-sensitive cascade structures built on frame-level queries. These methods struggle to handle complex multi-frame inputs. For example, when the number of input frames is increased during the training of Mask2Former-VIS, its performance declines, which is counterintuitive because more frames should provide more motion information.
   - The asynchronous structure makes video-level queries rely heavily on the learning quality of frame-level queries, so some motion information is easily lost, and the bipartite matching optimization across multiple frames is highly complex.
2. **Complexity of trajectory modeling in long videos**:
   - In long videos, video-level queries have difficulty tracking instances because the trajectory complexity grows polynomially with the number of frames. Existing methods therefore usually decompose the trajectory into spatial and temporal dimensions, modeled by frame-level and video-level queries respectively, but this still does not handle multi-frame inputs well.
3. **High optimization complexity**:
   - In long videos, the optimization complexity of video-level bipartite matching is very high, and the optimization difficulty grows exponentially as the number of frames increases. As a result, existing methods perform poorly on complex videos.
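The video-level bipartite matching mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: the scalar per-frame "masks", the absolute-difference cost, and the function names `video_level_cost` and `match_queries` are assumptions made for illustration, and the brute-force search stands in for the Hungarian algorithm that real systems typically use (for example `scipy.optimize.linear_sum_assignment`). The point it shows is that each query's cost is accumulated over every frame of the clip, so the cost of a single assignment depends on the whole trajectory.

```python
from itertools import permutations


def video_level_cost(pred_seq, gt_seq):
    """Sum of per-frame costs between one predicted and one ground-truth track.

    Each track holds one scalar per frame (a stand-in for a mask); the
    per-frame cost is an absolute difference, chosen only for illustration.
    """
    return sum(abs(p - g) for p, g in zip(pred_seq, gt_seq))


def match_queries(pred_tracks, gt_tracks):
    """Brute-force bipartite matching (hypothetical helper).

    Tries every one-to-one assignment of ground-truth tracks to queries and
    returns the cheapest one as (query index per ground truth, total cost).
    """
    best_cost, best_assign = float("inf"), None
    for perm in permutations(range(len(pred_tracks)), len(gt_tracks)):
        cost = sum(
            video_level_cost(pred_tracks[q], gt_tracks[g])
            for g, q in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best_assign = cost, perm
    return best_assign, best_cost


# Three queries, two ground-truth instances, four frames each (toy values).
pred = [[0.9, 0.8, 0.9, 0.7], [0.1, 0.2, 0.1, 0.3], [0.5, 0.5, 0.5, 0.5]]
gt = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
assignment, total = match_queries(pred, gt)
print(assignment, round(total, 2))
```

Because every candidate pairing is scored over all frames at once, a longer clip makes each cost entry depend on more predictions, which is one way to see why matching over long sequences is harder to optimize than matching over short clips.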
To solve these problems, the authors propose a new framework named SyncVIS, which improves video instance segmentation by modeling video-level and frame-level embeddings synchronously. SyncVIS introduces two key modules:

- **Synchronized video-frame modeling paradigm**: It lets video-level and frame-level embeddings interact in each decoder layer, avoiding the error accumulation of the cascaded structure.
- **Synchronized embedding optimization strategy**: It divides large video sequences into small clips for optimization, reducing the optimization complexity while keeping video-level and frame-level embeddings synchronized.

With these two modules, SyncVIS can better represent instance trajectories in complex and challenging video scenes, and it achieves state-of-the-art results on multiple benchmark datasets.

### Summary

The main goal of this paper is to overcome the limitations of existing video instance segmentation methods in complex video scenes by synchronously modeling video-level and frame-level embeddings, thereby improving the performance and robustness of the model.
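The two ideas described above, splitting a long sequence into small clips and letting video-level and frame-level embeddings update each other within the same layer, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the SyncVIS code: the names `split_into_clips` and `sync_layer`, the mixing weight `alpha`, and the simple averaging rule are all hypothetical, and embeddings are plain Python lists rather than tensors.

```python
def split_into_clips(num_frames, clip_len):
    """Return lists of frame indices, each at most clip_len long."""
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    return [
        list(range(start, min(start + clip_len, num_frames)))
        for start in range(0, num_frames, clip_len)
    ]


def sync_layer(video_emb, frame_embs, alpha=0.5):
    """One toy 'synchronized' layer (assumed mixing rule, for illustration).

    Both updates read the same layer inputs: the video-level embedding is
    mixed with the mean of the frame-level embeddings, and every frame-level
    embedding is mixed with the incoming video-level embedding. Neither
    update waits for the other to finish, unlike a cascade.
    """
    dim = len(video_emb)
    frame_mean = [
        sum(f[d] for f in frame_embs) / len(frame_embs) for d in range(dim)
    ]
    new_video = [
        (1 - alpha) * v + alpha * m for v, m in zip(video_emb, frame_mean)
    ]
    new_frames = [
        [(1 - alpha) * x + alpha * v for x, v in zip(f, video_emb)]
        for f in frame_embs
    ]
    return new_video, new_frames


# A 7-frame video split into clips of at most 3 frames.
clips = split_into_clips(7, 3)

# One video-level embedding and two frame-level embeddings of dimension 2.
video, frames = sync_layer([0.0, 0.0], [[2.0, 4.0], [4.0, 8.0]], alpha=0.5)
print(clips)
print(video, frames)
```

In a cascaded design the video-level update would only see frame-level embeddings after they were fully processed; here both levels exchange information at every layer, which is the property the summary attributes to the synchronized paradigm.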

SyncVIS: Synchronized Video Instance Segmentation

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

TransVOS: Video Object Segmentation with Transformers

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation

DVIS: Decoupled Video Instance Segmentation Framework

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

Online Video Instance Segmentation via Robust Context Fusion

End-to-End Video Instance Segmentation with Transformers

TCOVIS: Temporally Consistent Online Video Instance Segmentation

Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation

A Generalized Framework for Video Instance Segmentation

CTVIS: Consistent Training for Online Video Instance Segmentation

In Defense of Online Models for Video Instance Segmentation

Towards Real-Time Open-Vocabulary Video Instance Segmentation

UVIS: Unsupervised Video Instance Segmentation

DVIS++: Improved Decoupled Framework for Universal Video Segmentation

VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation

Improving Video Instance Segmentation via Temporal Pyramid Routing

Eigen-Cluster VIS: Improving Weakly-supervised Video Instance Segmentation by Leveraging Spatio-temporal Consistency

Towards Open-Vocabulary Video Instance Segmentation

STFormer: Spatial-Temporal-Aware Transformer for Video Instance Segmentation