Abstract: Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still have two problems. First, they rely on a complicated and redundant two-extractor pipeline to draw cues from more reference frames, which increases the models’ parameters and complexity. Second, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they lack adequate modeling of spatial-temporal relationships. In this paper, to address both problems, we propose a unified transformer-based framework for VOS: a compact single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with a dynamic input adapter that encodes the two key inputs, i.e., the reference sets (historical frames with predicted masks) and the query frame (current frame). Moreover, because the relationships among different frames and inside every frame are crucial for this task, we introduce a vision transformer to model the temporal and spatial relationships simultaneously. By cascading the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both the DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼ 50 J & F ) drop in segmentation quality. Code is available at https://github.com/sallymmx/TransVOS.git .
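
The idea of modeling spatial and temporal relationships simultaneously can be illustrated with a small sketch. It is not the released TransVOS code: it only assumes that features from the reference frames and the query frame are flattened into a single token sequence, so that one self-attention pass relates every location to every other location, both within a frame and across frames. The function names (`flatten_frames`, `joint_self_attention`) and the toy feature values are hypothetical, and the attention uses identity projections instead of learned query/key/value weights.

```python
import math


def flatten_frames(frames):
    """Flatten a list of H x W feature maps into one token list.

    Each frame is a list of rows; each row is a list of C-dim feature vectors.
    Returns (tokens, positions), where positions[i] = (frame_index, y, x).
    """
    tokens, positions = [], []
    for t, frame in enumerate(frames):
        for y, row in enumerate(frame):
            for x, feat in enumerate(row):
                tokens.append(list(feat))
                positions.append((t, y, x))
    return tokens, positions


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(tokens):
    """Single-head self-attention over all tokens.

    Assumption: identity projections (no learned Q/K/V), for illustration only.
    Returns (outputs, weights); weights[i][j] is how much token i attends to j.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w_j * v[c] for w_j, v in zip(w, tokens)) for c in range(dim)]
        )
    return outputs, weights


# Toy data: two reference frames and one query frame, each 2 x 2 with C = 2.
reference_frames = [
    [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]]],
    [[[0.5, 0.5], [1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]]],
]
query_frame = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]]]

tokens, positions = flatten_frames(reference_frames + [query_frame])
outputs, weights = joint_self_attention(tokens)
```

In this sketch every query-frame token receives a non-zero attention weight for tokens in its own frame (spatial) and in the reference frames (temporal). The sequence length grows as (T + 1) × H × W, so attention cost grows quadratically with the number of reference frames; limiting how many frames enter the sequence is one way efficiency could be traded against quality.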

STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding

TransVOS: Video Object Segmentation with Transformers

Human-centric Spatio-Temporal Video Grounding with Visual Transformers

STVGBert - A Visual-linguistic Transformer Based Framework for Spatio-temporal Video Grounding

Matching and Localizing: A Simple yet Effective Framework for Human-Centric Spatio-Temporal Video Grounding

Context-Guided Spatio-Temporal Video Grounding

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation

Learning Feature Semantic Matching for Spatio-Temporal Video Grounding

Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

Human-centric Spatio-Temporal Video Grounding Via the Combination of Mutual Matching Network and TubeDETR

Weakly-Supervised Spatio-Temporal Video Grounding with Variational Cross-Modal Alignment

Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences

STFormer: Spatial-Temporal-Aware Transformer for Video Instance Segmentation

Video-Language Alignment via Spatio-Temporal Graph Transformer

Efficient Spatio-Temporal Video Grounding with Semantic-Guided Feature Decomposition

FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding

Rethinking Video Sentence Grounding from a Tracking Perspective with Memory Network and Masked Attention

Augmented 2D-TAN: A Two-stage Approach for Human-centric Spatio-Temporal Video Grounding