Abstract: Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still have two problems. First, they rely on a complicated and redundant two-extractor pipeline to draw cues from more reference frames, which increases the model's parameter count and complexity. Second, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they lack adequate modeling of spatial-temporal relationships. In this paper, to address both problems, we propose a unified transformer-based framework for VOS: a compact single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model's effectiveness and flexibility, we design a single dynamic feature extractor with a dynamic input adapter that encodes the two key inputs, i.e., the reference sets (historical frames with predicted masks) and the query frame (current frame). Because the relationships among different frames and inside every frame are crucial for this task, we introduce a vision transformer to model the temporal and spatial relationships simultaneously. By cascading the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both the DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model's efficiency. On DAVIS17 validation, we achieve ∼ 50 J & F ) drop in segmentation quality. Code is available at https://github.com/sallymmx/TransVOS.git .
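
To make the single-extractor idea concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes, purely for illustration, that the input adapter gives every frame the same four-channel layout (RGB plus a mask channel, zero-filled for the query frame) and that all frames are then flattened into one token sequence for the transformer. The names `adapt_input`, `build_token_sequence`, and `count_attention_pairs` are hypothetical; the abstract does not specify the adapter's actual design.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[float, float, float]                 # RGB values
AdaptedPixel = Tuple[float, float, float, float]   # RGB + mask channel


def make_frame(h: int, w: int, value: float) -> List[List[Pixel]]:
    """Build a toy h x w RGB frame filled with a constant value."""
    return [[(value, value, value) for _ in range(w)] for _ in range(h)]


def adapt_input(
    frame: List[List[Pixel]],
    mask: Optional[List[List[float]]] = None,
) -> List[List[AdaptedPixel]]:
    """Hypothetical input adapter (an assumption, not the paper's design).

    Reference frames arrive with a predicted mask; the query frame has none,
    so it gets an all-zero mask channel. Both end up with the same layout and
    can be fed to one shared feature extractor.
    """
    h, w = len(frame), len(frame[0])
    if mask is None:
        mask = [[0.0] * w for _ in range(h)]
    return [[(*frame[y][x], mask[y][x]) for x in range(w)] for y in range(h)]


def build_token_sequence(
    frames: List[List[List[AdaptedPixel]]],
) -> List[Tuple[int, int, int]]:
    """Flatten all frames into one sequence of (frame, row, col) token ids."""
    return [
        (t, y, x)
        for t, frame in enumerate(frames)
        for y, row in enumerate(frame)
        for x, _ in enumerate(row)
    ]


def count_attention_pairs(tokens: List[Tuple[int, int, int]]) -> Tuple[int, int]:
    """Count token pairs that full self-attention would connect.

    Returns (spatial, temporal): pairs inside one frame vs. across frames.
    """
    spatial = sum(1 for a in tokens for b in tokens if a[0] == b[0])
    temporal = len(tokens) ** 2 - spatial
    return spatial, temporal


if __name__ == "__main__":
    reference = adapt_input(make_frame(2, 2, 0.5), [[1.0, 0.0], [0.0, 1.0]])
    query = adapt_input(make_frame(2, 2, 0.2))
    tokens = build_token_sequence([reference, reference, query])
    print(len(tokens), count_attention_pairs(tokens))
```

Under these assumptions, a single attention pass over the joint sequence covers both within-frame (spatial) and cross-frame (temporal) token pairs, which is the property the abstract attributes to the transformer-based relationship module.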

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation
