Abstract: Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still have two problems. First, they apply a complicated and redundant two-extractor pipeline to use more reference frames for cues, which increases the models’ parameters and complexity. Second, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they lack adequate modeling of spatial-temporal relationships. In this paper, to address these two problems, we propose a unified transformer-based framework for VOS: a compact single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with a dynamic input adapter that encodes the two key inputs, i.e., the reference sets (historical frames with predicted masks) and the query frame (current frame). Moreover, because the relationships among different frames and inside every frame are crucial for this task, we introduce a vision transformer to model the temporal and spatial relationships simultaneously. By cascading the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both the DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼ 50 J & F ) drop in segmentation quality. Code is available at https://github.com/sallymmx/TransVOS.git.

Video Object Segmentation Based on Supervoxel for Multimedia Corpus Construction

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Detection and Segmentation of Moving Objects Using Temporal and Spatial Cues

Multiresolution Segmentation of Video Objects in the Compression Domain

Spatio-Temporal Video Segmentation of Static Scenes and Its Applications

Real-time spatiotemporal segmentation of video objects in the H.264 compressed domain

GPU-Based Supervoxel Generation with a Novel Anisotropic Metric.

Video Object Segmentation with Dynamic Query Modulation

An Efficient Compressed Domain Moving Object Segmentation Algorithm Based on Motion Vector Field

Weak Supervision Learning for Object Co-Segmentation

Robust moving object segmentation in the compressed domain for H.264/AVC video stream

A Study of Actor and Action Semantic Retention in Video Supervoxel Segmentation

Learning Spatial-Semantic Features for Robust Video Object Segmentation

Video Object Discovery and Co-segmentation with Extremely Weak Supervision.

Video Object Segmentation with Shape Cue Based on Spatiotemporal Superpixel Neighbourhood

Unsupervised Spatio-Temporal Segmentation for Extracting Moving Objects in Video Sequences

Video Object Co-Segmentation via Subspace Clustering and Quadratic Pseudo-Boolean Optimization in an MRF Framework

Video Object Segmentation with 3D Convolution Network

A Stereo Video Object Segmentation Algorithm Based on Motion Detection and Disparity

Supervised Video Object Segmentation Using a Small Number of Interactions.

Video-object segmentation and 3D-trajectory estimation for monocular video sequences