SyncVIS: Synchronized Video Instance Segmentation

Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao
2024-12-02
Abstract: Recent DETR-based methods have advanced the development of Video Instance Segmentation (VIS) through transformers' efficiency and capability in modeling spatial and temporal information. Despite harvesting remarkable progress, existing works follow asynchronous designs, which model video sequences via either video-level queries only or adopting query-sensitive cascade structures, resulting in difficulties when handling complex and challenging video scenarios. In this work, we analyze the cause of this phenomenon and the limitations of the current solutions, and propose to conduct synchronized modeling via a new framework named SyncVIS. Specifically, SyncVIS explicitly introduces video-level query embeddings and designs two key modules to synchronize video-level query with frame-level query embeddings: a synchronized video-frame modeling paradigm and a synchronized embedding optimization strategy. The former attempts to promote the mutual learning of frame- and video-level embeddings with each other and the latter divides large video sequences into small clips for easier optimization. Extensive experimental evaluations are conducted on the challenging YouTube-VIS 2019 & 2021 & 2022, and OVIS benchmarks and SyncVIS achieves state-of-the-art results, which demonstrates the effectiveness and generality of the proposed approach. The code is available at https://github.com/rkzheng99/SyncVIS.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several limitations of existing video instance segmentation (VIS) methods when they are applied to complex and challenging video scenes. Specifically:

1. **Asynchronous design of existing methods**:
   - Most current VIS methods adopt an asynchronous design: they model video sequences either through video-level queries alone or through cascaded structures in which video-level queries are built on top of frame-level queries. These designs struggle with complex multi-frame inputs. For example, when the number of input frames is increased during the training of Mask2Former-VIS, its performance declines, which is counterintuitive because more frames should provide more motion information.
   - The asynchronous structure makes video-level queries depend heavily on how well the frame-level queries are learned, so motion information is easily lost. The bipartite matching problem across multiple frames is also highly complex to optimize.

2. **Complexity of trajectory modeling in long videos**:
   - In long videos, video-level queries have difficulty tracking instances because the trajectory complexity grows polynomially with the number of frames. Existing methods therefore usually decompose the trajectory into spatial and temporal dimensions, modeled by frame-level and video-level queries respectively, but this still does not handle multi-frame inputs well.

3. **High optimization complexity**:
   - In long videos, the optimization complexity of video-level bipartite matching is very high, and the optimization difficulty grows exponentially as the number of frames increases. As a result, existing methods perform poorly on complex videos.
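To make the video-level bipartite matching mentioned above more concrete, here is a minimal sketch, not the paper's implementation. It assumes a hypothetical cost tensor `frame_costs[t][q][g]` (the cost of assigning query `q` to ground-truth instance `g` in frame `t`) and finds the best assignment by brute force; DETR-style code typically uses the Hungarian algorithm instead, and the numbers below are made up for illustration.

```python
from itertools import permutations


def video_level_match(frame_costs):
    """Match queries to ground-truth instances using costs summed over all frames.

    frame_costs[t][q][g] is a hypothetical per-frame assignment cost.
    Assumes at least as many queries as ground-truth instances.
    """
    num_queries = len(frame_costs[0])
    num_gt = len(frame_costs[0][0])

    # Video-level cost: one query has to explain the same instance in every frame,
    # so the per-frame costs are accumulated before matching.
    total = [
        [sum(frame[q][g] for frame in frame_costs) for g in range(num_gt)]
        for q in range(num_queries)
    ]

    best_cost, best_assign = float("inf"), None
    # Brute force over one-to-one assignments (only viable for toy sizes).
    for queries in permutations(range(num_queries), num_gt):
        cost = sum(total[q][g] for g, q in enumerate(queries))
        if cost < best_cost:
            best_cost, best_assign = cost, queries
    return best_assign, best_cost


# Toy example: 2 frames, 3 queries, 2 ground-truth instances.
frame_costs = [
    [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]],
    [[0.2, 0.7], [0.9, 0.1], [0.5, 0.5]],
]
assignment, cost = video_level_match(frame_costs)
print(assignment)  # assignment[g] is the query matched to instance g
```

Because the cost is accumulated over every frame before matching, each additional frame adds another term that a single query must fit simultaneously, which is one way to read the optimization difficulty described above.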
To solve these problems, the authors propose a new framework named SyncVIS, which improves video instance segmentation by modeling video-level and frame-level embeddings synchronously. SyncVIS introduces two key modules:

- **Synchronized video-frame modeling paradigm**: Video-level and frame-level embeddings interact in each decoder layer, which avoids the error accumulation of cascaded structures.
- **Synchronized embedding optimization strategy**: Large video sequences are divided into small clips for optimization, which reduces the optimization complexity while keeping video-level and frame-level embeddings synchronized.

With these two modules, SyncVIS can better represent instance trajectories in complex and challenging video scenes, and it achieves state-of-the-art results on multiple benchmark datasets.

### Summary

The main goal of this paper is to overcome the limitations of existing video instance segmentation methods on complex video scenes by modeling video-level and frame-level embeddings synchronously, thereby improving the performance and robustness of the model.
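The idea of dividing a long sequence into small clips before matching can be illustrated with a short sketch. This is an assumption-laden toy, not the SyncVIS code: the helper names `split_into_clips`, `match`, and `clipwise_match`, the cost layout `frame_costs[t][q][g]`, and the brute-force matcher are all hypothetical, and the sketch does not show how the framework keeps assignments consistent across clips.

```python
from itertools import permutations


def split_into_clips(num_frames, clip_len):
    """Return lists of frame indices, each covering at most clip_len frames."""
    return [
        list(range(start, min(start + clip_len, num_frames)))
        for start in range(0, num_frames, clip_len)
    ]


def match(frame_costs, frames):
    """Best one-to-one query/instance assignment using only the given frames."""
    num_queries = len(frame_costs[0])
    num_gt = len(frame_costs[0][0])

    def cost_of(queries):
        # queries[g] is the query assigned to ground-truth instance g.
        return sum(
            frame_costs[t][q][g] for t in frames for g, q in enumerate(queries)
        )

    # Brute force for clarity; real code would use the Hungarian algorithm.
    return min(permutations(range(num_queries), num_gt), key=cost_of)


def clipwise_match(frame_costs, clip_len):
    """Run the matching separately on each clip instead of on the whole video."""
    clips = split_into_clips(len(frame_costs), clip_len)
    return [(clip, match(frame_costs, clip)) for clip in clips]


# Toy example: 3 frames, 2 queries, 2 ground-truth instances, clips of 2 frames.
frame_costs = [
    [[0.1, 0.9], [0.9, 0.1]],
    [[0.2, 0.8], [0.8, 0.2]],
    [[0.3, 0.7], [0.7, 0.3]],
]
for clip, assignment in clipwise_match(frame_costs, clip_len=2):
    print(clip, assignment)
```

Each per-clip matching only has to fit a handful of frames at once, which is the intuition behind the reduced optimization complexity described above.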