Abstract: Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing. Existing methods often underperform on complex, long real-world videos, primarily for two reasons. First, offline methods are limited by a tightly coupled modeling paradigm that treats all frames equally and disregards the interdependencies between adjacent frames, which introduces excessive noise during long-term temporal alignment. Second, online methods make inadequate use of temporal information. To tackle these challenges, we propose a decoupling strategy for VIS that divides it into three independent sub-tasks: segmentation, tracking, and refinement. The efficacy of the decoupling strategy relies on two crucial elements: 1) attaining precise long-term alignment via frame-by-frame association during tracking, and 2) effectively utilizing temporal information during refinement, building on those accurate alignment results. We introduce a novel referring tracker and temporal refiner to construct the Decoupled VIS framework (DVIS). DVIS achieves new state-of-the-art (SOTA) performance in both VIS and video panoptic segmentation (VPS), surpassing the previous SOTA methods by 7.3 AP and 9.6 VPQ on the OVIS and VIPSeg datasets, respectively, which are the most challenging and realistic benchmarks. Moreover, thanks to the decoupling strategy, the referring tracker and temporal refiner are extremely lightweight (only 1.69% of the segmenter FLOPs), allowing for efficient training and inference on a single GPU with 11G of memory. The code is available at https://github.com/zhang-tao-whu/DVIS.
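
To make the decoupled control flow concrete, the following is a minimal toy sketch and not the DVIS implementation. It assumes the segmenter has already produced one embedding vector per instance per frame, and that every frame contains the same number of instances. The tracking stage uses greedy cosine-similarity matching as a stand-in for the referring tracker, and the refinement stage uses a moving average along each aligned track as a stand-in for the temporal refiner. All function names (`associate`, `track`, `refine`) are hypothetical.

```python
import math
from typing import List, Sequence

Embedding = Sequence[float]


def cosine(a: Embedding, b: Embedding) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def associate(prev: List[Embedding], curr: List[Embedding]) -> List[int]:
    """Greedy one-to-one matching; perm[i] is the index in curr matched to prev[i].

    Assumption: len(prev) == len(curr), so every slot receives a match.
    """
    pairs = sorted(
        ((cosine(p, c), i, j) for i, p in enumerate(prev) for j, c in enumerate(curr)),
        reverse=True,
    )
    perm = [-1] * len(prev)
    used = set()
    for _, i, j in pairs:
        if perm[i] == -1 and j not in used:
            perm[i] = j
            used.add(j)
    return perm


def track(frames: List[List[Embedding]]) -> List[List[Embedding]]:
    """Tracking stand-in: reorder each frame so slot i follows the same instance.

    Association is frame by frame: each frame is matched only to the previous
    aligned frame, not to all frames at once.
    """
    aligned = [list(frames[0])]
    for curr in frames[1:]:
        perm = associate(aligned[-1], curr)
        aligned.append([curr[j] for j in perm])
    return aligned


def refine(aligned: List[List[Embedding]], window: int = 1) -> List[List[List[float]]]:
    """Refinement stand-in: moving average of each track over +/- window frames."""
    num_frames = len(aligned)
    refined = []
    for t in range(num_frames):
        lo, hi = max(0, t - window), min(num_frames, t + window + 1)
        frame = []
        for i in range(len(aligned[t])):
            dim = len(aligned[t][i])
            frame.append(
                [sum(aligned[s][i][d] for s in range(lo, hi)) / (hi - lo) for d in range(dim)]
            )
        refined.append(frame)
    return refined


# Toy per-frame "segmenter outputs": two instances whose order flips in frame 2.
frames = [[[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]]]
aligned = track(frames)
refined = refine(aligned, window=1)
```

The point of the sketch is the ordering of the stages: refinement only averages along a track after tracking has put the same instance in the same slot in every frame, which mirrors the dependence of refinement on accurate alignment described in the abstract.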

PM-VIS+: High-Performance Video Instance Segmentation without Video Annotation

PM-VIS: High-Performance Box-Supervised Video Instance Segmentation

Multi Frame Obscene Video Detection with ViT

BoxVIS: Video Instance Segmentation with Box Annotations

DVIS: Decoupled Video Instance Segmentation Framework

DVIS++: Improved Decoupled Framework for Universal Video Segmentation

MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training

Human Instance Segmentation and Tracking via Data Association and Single-stage Detector

UVIS: Unsupervised Video Instance Segmentation

Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation

SipMaskv2: Enhanced Fast Image and Video Instance Segmentation

Video Instance Matting

MSN: Efficient Online Mask Selection Network for Video Instance Segmentation

Augmenting Efficient Real-time Surgical Instrument Segmentation in Video with Point Tracking and Segment Anything

Real-time Human-Centric Segmentation for Complex Video Scenes

Occluded Video Instance Segmentation: A Benchmark

DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries

Video Instance Segmentation with a Propose-Reduce Paradigm

UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model

SyncVIS: Synchronized Video Instance Segmentation

Fast Online Object Tracking and Segmentation: A Unifying Approach