Abstract: Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still have two problems. First, they rely on a complicated and redundant two-extractor pipeline to draw cues from more reference frames, which increases the model’s parameters and complexity. Second, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they lack adequate modeling of spatial-temporal relationships. In this paper, to address both problems, we propose a unified transformer-based framework for VOS: a compact single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with a dynamic input adapter that encodes the two key inputs, i.e., the reference sets (historical frames with predicted masks) and the query frame (current frame). Because the relationships among different frames and inside every frame are crucial for this task, we introduce a vision transformer to model the temporal and spatial relationships simultaneously. By cascading the proposed dynamic feature extractor, the transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both the DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼ 50 [...] (J & F) drop in segmentation quality. Code is available at https://github.com/sallymmx/TransVOS.git .
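
To make concrete the idea of modeling temporal and spatial relationships simultaneously, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that per-frame feature maps from the reference frames and the query frame are flattened into one token sequence, so that a single self-attention step lets every location attend to locations in the same frame (spatial) and in other frames (temporal). All names (`to_tokens`, `joint_self_attention`), the tensor sizes, and the random features and projection matrices are hypothetical placeholders chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def to_tokens(feature_maps):
    """Flatten (T, H, W, C) feature maps into a (T*H*W, C) token sequence."""
    t, h, w, c = feature_maps.shape
    return feature_maps.reshape(t * h * w, c)


def joint_self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens.

    Tokens from every frame share one sequence, so each attention row
    mixes same-frame (spatial) and cross-frame (temporal) locations.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Hypothetical sizes: 2 reference frames, 1 query frame, 4x4 maps, 8 channels.
T_REF, H, W, C = 2, 4, 4, 8
rng = np.random.default_rng(0)

# Stand-ins for extractor outputs (assumption: random features, no real backbone).
reference_feats = rng.standard_normal((T_REF, H, W, C))
query_feats = rng.standard_normal((1, H, W, C))

# Reference frames first, query frame last, all in one sequence.
tokens = to_tokens(np.concatenate([reference_feats, query_feats], axis=0))

# Random projection matrices stand in for learned weights.
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

attended, weights = joint_self_attention(tokens, w_q, w_k, w_v)

# The last H*W tokens belong to the query frame; reshape them back to a map.
query_out = attended[-H * W:].reshape(H, W, C)
```

In this sketch the attention matrix has one row per location across all frames, which is why its cost grows quadratically with the number of reference frames; a real model would add positional encodings, multiple heads, and learned weights.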

STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos

Spatio-Temporal Segmentation with Depth-Inferred Videos of Static Scenes

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Spatio-Temporal Video Segmentation of Static Scenes and Its Applications

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation.

Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework With Spatio-Temporal Collaboration

Spatial Feature Calibration and Temporal Fusion for Effective One-stage Video Instance Segmentation

Learning Spatial-Semantic Features for Robust Video Object Segmentation

Coarse-to-Fine Video Instance Segmentation With Factorized Conditional Appearance Flows

A spatio-temporal network for video semantic segmentation in surgical videos

Video Object Segmentation using Space-Time Memory Networks

Temporo-Spatial Parallel Sparse Memory Networks for Efficient Video Object Segmentation

Spatiotemporal segmentation for stereoscopic video

Capturing the Spatio-Temporal Continuity for Video Semantic Segmentation.

Aggregating Spatio-temporal Context for Video Object Segmentation.

Spatiotemporal CNN for Video Object Segmentation

Unsupervised Spatio-temporal Latent Feature Clustering for Multiple-object Tracking and Segmentation

Temporal Feature Augmented Network for Video Instance Segmentation.

Video Object Segmentation by Learning Location-Sensitive Embeddings

Dual Temporal Memory Network for Efficient Video Object Segmentation

STFCN: Spatio-Temporal FCN for Semantic Video Segmentation