Abstract: Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still show two problems. Firstly, they apply a complicated and redundant two-extractor pipeline to use more reference frames for cues, increasing the models’ parameters and complexity. Secondly, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they lack adequate modeling of spatial-temporal relationships. In this paper, to address the two problems, we propose a unified transformer-based framework for VOS, a compact single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with a dynamic input adapter that encodes the two key inputs, i.e., the reference sets (historical frames with predicted masks) and the query frame (current frame). Moreover, because the relationships among different frames and inside every frame are crucial for this task, we introduce a vision transformer to model the temporal and spatial relationships simultaneously. Through the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both the DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼ 50 J & F ) drop in segmentation quality. Code is available at https://github.com/sallymmx/TransVOS.git .

Learning a Contextual Multi-Thread Model for Movie/TV Scene Segmentation

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Learning Video Context as Interleaved Multimodal Sequences

OS-MSL: One Stage Multimodal Sequential Link Framework for Scene Segmentation and Classification

From Trailers to Storylines: An Efficient Way to Learn from Movies

Temporal Scene Montage for Self-Supervised Video Scene Boundary Detection

Optimized Video Scene Segmentation.

Learning a Scene Contextual Model for Tracking and Abnormality Detection

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation.

Contrasting Multi-Modal Similarity Framework for Video Scene Segmentation

MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning

Scene Segmentation Based on Video Structure and Spectral Methods

Video Shot Grouping Using Best-First Model Merging

Target Adaptive Context Aggregation for Video Scene Graph Generation

Movies2Scenes: Using Movie Metadata to Learn Scene Representation

Video Scene Detection Using Slide Windows Method Based on Temporal Constrain Shot Similarity

Visual Storylines: Semantic Visualization of Movie Sequence.

Discovery of Shared Semantic Spaces for Multi-Scene Video Query and Summarization

Exploring the Design Space of Visual Context Representation in Video MLLMs

Learning Local and Global Temporal Contexts for Video Semantic Segmentation