Abstract: Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing. Existing methods often underperform on complex and long real-world videos, primarily for two reasons. First, offline methods are limited by a tightly coupled modeling paradigm, which treats all frames equally and disregards the interdependencies between adjacent frames. This introduces excessive noise during long-term temporal alignment. Second, online methods make inadequate use of temporal information. To tackle these challenges, we propose a decoupling strategy for VIS that divides it into three independent sub-tasks: segmentation, tracking, and refinement. The efficacy of the decoupling strategy relies on two crucial elements: 1) attaining precise long-term alignment via frame-by-frame association during tracking, and 2) using temporal information effectively during refinement, building on that accurate alignment. We introduce a novel referring tracker and temporal refiner to construct the Decoupled VIS framework (DVIS). DVIS achieves new SOTA performance in both VIS and video panoptic segmentation (VPS), surpassing the current SOTA methods by 7.3 AP and 9.6 VPQ on the OVIS and VIPSeg datasets, which are the most challenging and realistic benchmarks. Moreover, thanks to the decoupling strategy, the referring tracker and temporal refiner are extremely lightweight (only 1.69% of the segmenter FLOPs), allowing for efficient training and inference on a single GPU with 11 GB of memory. The code is available at https://github.com/zhang-tao-whu/DVIS.
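
The decoupled data flow described in the abstract (per-frame segmentation, then frame-by-frame association, then temporal refinement over the aligned result) can be illustrated with a toy sketch. This is not the DVIS implementation: the greedy cosine-similarity matching and the temporal averaging below are simple stand-ins for the paper's referring tracker and temporal refiner, the function names `track` and `refine` are hypothetical, and the segmenter is assumed to have already produced one embedding per instance per frame, with the same number of instances in every frame.

```python
from typing import List

Embedding = List[float]
Frame = List[Embedding]  # one embedding per instance, as output by an assumed segmenter


def cosine(a: Embedding, b: Embedding) -> float:
    """Cosine similarity between two embeddings (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def track(frames: List[Frame]) -> List[Frame]:
    """Tracking stand-in: reorder each frame's instances to follow the
    previous aligned frame, using greedy cosine matching (an assumption,
    not the paper's referring tracker)."""
    aligned = [list(frames[0])]
    for current in frames[1:]:
        remaining = list(current)
        ordered = []
        for reference in aligned[-1]:
            # Pick the unmatched instance most similar to the reference.
            best = max(remaining, key=lambda emb: cosine(reference, emb))
            remaining.remove(best)
            ordered.append(best)
        aligned.append(ordered)
    return aligned


def refine(aligned: List[Frame]) -> Frame:
    """Refinement stand-in: average each tracked instance's embedding over
    all frames (an assumption, not the paper's temporal refiner)."""
    num_frames = len(aligned)
    refined = []
    for i in range(len(aligned[0])):
        dim = len(aligned[0][i])
        refined.append(
            [sum(aligned[t][i][d] for t in range(num_frames)) / num_frames
             for d in range(dim)]
        )
    return refined


# Toy input: two instances whose order is swapped in the later frames.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.9], [0.9, 0.1]],
    [[0.0, 1.0], [1.0, 0.0]],
]
aligned = track(frames)    # instance order now consistent across frames
refined = refine(aligned)  # one temporally averaged embedding per instance
```

The point of the sketch is the ordering of the stages: refinement only aggregates over time after tracking has aligned the instances, which is the dependency the abstract emphasizes. An optimal assignment method could replace the greedy matching without changing that structure.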

fVDB: A Deep-Learning Framework for Sparse, Large-Scale, and High-Performance Spatial Intelligence

VDBblox: Accurate and Efficient Distance Fields for Path Planning and Mesh Reconstruction

Depth-Box VDB: Accelerate Sparse Volume Rendering with Depth Maps Through Voxel Database

NeuralVDB: High-resolution Sparse Volume Representation using Hierarchical Neural Networks

A Framework for the Volumetric Integration of Depth Images

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets

FG-Net: A Fast and Accurate Framework for Large-Scale LiDAR Point Cloud Understanding

VDBFusion: Flexible and Efficient TSDF Integration of Range Sensor Data

VDB-GPDF: Online Gaussian Process Distance Field with VDB Structure

DVFENet: Dual-branch voxel feature extraction network for 3D object detection

Hierarchical, Dense and Dynamic 3D Reconstruction Based on VDB Data Structure for Robotic Manipulation Tasks

SDVRF: Sparse-to-Dense Voxel Region Fusion for Multi-modal 3D Object Detection

FSD V2: Improving Fully Sparse 3D Object Detection with Virtual Voxels

Deep Direct Volume Rendering: Learning Visual Feature Mappings From Exemplary Images

Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction

Point-Voxel CNN for Efficient 3D Deep Learning

DVIS: Decoupled Video Instance Segmentation Framework

VPFNet: Improving 3D Object Detection with Virtual Point based LiDAR and Stereo Data Fusion