Abstract:Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still show two problems. Firstly, they apply a complicated and redundant two-extractor pipeline to use more reference frames for cues, increasing the models’ parameters and complexity. Secondly, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they need adequate modeling of spatial-temporal relationships. In this paper, to address the two problems, we propose a unified transformer-based framework for VOS, a compact and unified single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the common-used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with an ingenious dynamic input adapter to encode two significant inputs, i.e., reference sets (historical frames with predicted masks) and query frame (current frame), respectively. Moreover, the relationships among different frames and inside every frame are crucial for this task. We introduce a vision transformer to exploit and model both the temporal and spatial relationships simultaneously. By the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼ 50 J & F ) drop in segmentation quality. Codes are available at https://github.com/sallymmx/TransVOS.git .

VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild

Temporal Pixel-Level Semantic Understanding Through the VSPW Dataset

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Video Scene Parsing: an Overview of Deep Learning Methods and Datasets

Global and Compact Video Context Embedding for Video Semantic Segmentation

Large-scale Video Panoptic Segmentation in the Wild: A Benchmark

Towards Open-Vocabulary Video Semantic Segmentation

Construction Scene Parsing (CSP): Structured Annotations of Image Segmentation for Construction Semantic Understanding

Open Vocabulary Scene Parsing.

Semantic Segmentation on VSPW Dataset through Contrastive Loss and Multi-dataset Training Approach

Traffic Scene Parsing through the TSP6K Dataset

Semantic Segmentation on VSPW Dataset through Masked Video Consistency

Surveillance Video Parsing with Single Frame Supervision

Scene Parsing through ADE20K Dataset

Towards Open-Vocabulary Video Instance Segmentation

Towards Spatio-Temporal Video Scene Text Detection via Temporal Clustering

FSVVD: A Dataset of Full Scene Volumetric Video

A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer

VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions

DVIS++: Improved Decoupled Framework for Universal Video Segmentation

A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension