Abstract: Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still have two problems. First, they rely on a complicated and redundant two-extractor pipeline to draw cues from more reference frames, which increases the model's parameters and complexity. Second, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they lack adequate modeling of spatial-temporal relationships. In this paper, to address both problems, we propose a unified transformer-based framework for VOS: a compact single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim down the commonly used two-extractor pipeline while keeping the model's effectiveness and flexibility, we design a single dynamic feature extractor with an ingenious dynamic input adapter that encodes the two key inputs, namely the reference sets (historical frames with predicted masks) and the query frame (current frame). Moreover, because the relationships among different frames and inside every frame are crucial for this task, we introduce a vision transformer to model the temporal and spatial relationships simultaneously. Through the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both the DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model's efficiency. On DAVIS17 validation, we achieve ∼ 50 J & F ) drop in segmentation quality. Code is available at https://github.com/sallymmx/TransVOS.git .
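
The single-extractor idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that reference frames carry four channels (RGB plus a predicted mask) and the query frame carries three (RGB), that a per-input-type linear projection stands in for the dynamic input adapter, and that a plain dot-product self-attention without learned weights stands in for the vision transformer. All names (`ADAPTERS`, `adapt_frame`, `build_sequence`, `EMBED_DIM`) are hypothetical.

```python
import math
import random

EMBED_DIM = 4  # hypothetical shared feature width after the adapter


def make_projection(in_channels, out_dim, seed):
    """Random linear projection; a stand-in for learned adapter weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(in_channels)]
            for _ in range(out_dim)]


def project(pixel, weight):
    """Apply a linear projection to one pixel vector."""
    return [sum(w * x for w, x in zip(row, pixel)) for row in weight]


# Assumption: one projection per input type, so a single downstream
# extractor can consume both query frames and reference frames.
ADAPTERS = {
    3: make_projection(3, EMBED_DIM, seed=0),  # query frame: RGB
    4: make_projection(4, EMBED_DIM, seed=1),  # reference frame: RGB + mask
}


def adapt_frame(frame):
    """Map every pixel of a frame to EMBED_DIM, choosing the adapter by channel count."""
    weight = ADAPTERS[len(frame[0])]
    return [project(pixel, weight) for pixel in frame]


def build_sequence(reference_frames, query_frame):
    """Concatenate tokens of all frames into one sequence.

    Attention over this sequence mixes tokens inside a frame (spatial)
    and across frames (temporal) at the same time.
    """
    tokens = []
    for frame in reference_frames:
        tokens.extend(adapt_frame(frame))
    tokens.extend(adapt_frame(query_frame))
    return tokens


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity Q/K/V."""
    scale = math.sqrt(len(tokens[0]))
    attended = []
    for query in tokens:
        weights = softmax(
            [sum(a * b for a, b in zip(query, key)) / scale for key in tokens]
        )
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, tokens))
             for d in range(len(query))]
        )
    return attended


# Toy data: two reference frames and one query frame, each with 2x2 = 4 pixels.
reference_frames = [
    [[0.1, 0.2, 0.3, 1.0] for _ in range(4)],
    [[0.2, 0.1, 0.4, 0.0] for _ in range(4)],
]
query_frame = [[0.3, 0.3, 0.3] for _ in range(4)]

tokens = build_sequence(reference_frames, query_frame)
attended = self_attention(tokens)
print(len(tokens), len(attended[0]))
```

In this sketch the sequence length grows with the number of reference frames (here 3 frames × 4 pixels = 12 tokens), which is one reason the efficiency of such a design depends on how the sequence is organized.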

TapLab: A Fast Framework for Semantic Video Segmentation Tapping into Compressed-Domain Knowledge.

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Real-time Semantic Segmentation with Weighted Factorized-Depthwise Convolution

Real Time Video Object Segmentation in Compressed Domain

Efficient Semantic Video Segmentation with Per-Frame Inference

Spatio-Temporal Video Segmentation of Static Scenes and Its Applications

Tamed Warping Network for High-Resolution Semantic Video Segmentation

Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos

Real-Time Semantic Segmentation With Fast Attention

How to Train Your Dragon: Tamed Warping Network for Semantic Video Segmentation

Efficient Video Segmentation Models with Per-frame Inference

FBSNet: A Fast Bilateral Symmetrical Network for Real-Time Semantic Segmentation

DARSegNet: A Real-Time Semantic Segmentation Method Based on Dual Attention Fusion Module and Encoder-Decoder Network

Real-Time Semantic Segmentation via Multiply Spatial Fusion Network

Closing the Calibration Gap: A Real-Time Multi-Modal Fusion Framework for 3D Semantic Segmentation

Mask Propagation for Efficient Video Semantic Segmentation

Fast-SegNet: fast semantic segmentation network for small objects

LiDAR-Based Real-Time Panoptic Segmentation via Spatiotemporal Sequential Data Fusion

Tokenizing Features for Fast Video Object Segmentation

Real-Time Semantic Segmentation via Spatial-Detail Guided Context Propagation