Abstract: Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still suffer from two problems. First, they rely on a complicated and redundant two-extractor pipeline to draw cues from more reference frames, which increases the models’ parameters and complexity. Second, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they lack adequate modeling of spatial-temporal relationships. In this paper, to address both problems, we propose a unified transformer-based framework for VOS: a compact single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with a dynamic input adapter that encodes the two significant inputs, i.e., the reference sets (historical frames with predicted masks) and the query frame (current frame). Because the relationships among different frames and inside every frame are crucial for this task, we introduce a vision transformer to exploit and model the temporal and spatial relationships simultaneously. By cascading the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both the DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼ 50 J & F ) drop in segmentation quality. Code is available at https://github.com/sallymmx/TransVOS.git.
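The single-extractor idea described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the adapter weights `W_QUERY` and `W_REF`, the ReLU stand-in `shared_extractor`, and the identity-projection `self_attention` are all hypothetical choices made only to show how a 3-channel query frame and a 4-channel reference frame (RGB plus mask) might pass through one shared extractor and then attend to each other across space and time.

```python
import math

def linear(vec, weight):
    """Multiply a vector by a weight matrix given as a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

# Hypothetical input adapter: one projection per input type, same output width.
W_QUERY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]            # 3 channels (RGB) -> 2
W_REF = [[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 0.5]]    # 4 channels (RGB + mask) -> 2

def adapt(pixel):
    """Route a pixel to the projection that matches its channel count."""
    if len(pixel) == 3:
        return linear(pixel, W_QUERY)
    if len(pixel) == 4:
        return linear(pixel, W_REF)
    raise ValueError("expected 3 (query) or 4 (reference) channels")

def shared_extractor(vec):
    """Stand-in for the single shared backbone: a ReLU applied to every input."""
    return [max(0.0, v) for v in vec]

def encode(frames):
    """Flatten T frames of H x W pixels into one token list with (t, y, x) positions."""
    positions, tokens = [], []
    for t, frame in enumerate(frames):
        for y, row in enumerate(frame):
            for x, pixel in enumerate(row):
                positions.append((t, y, x))
                tokens.append(shared_extractor(adapt(pixel)))
    return positions, tokens

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        outputs.append([sum(w * v[j] for w, v in zip(row, tokens)) for j in range(dim)])
    return outputs, weights

# One reference frame (RGB + mask) and one query frame (RGB), each 1 x 2 pixels.
reference = [[[0.9, 0.1, 0.0, 1.0], [0.2, 0.8, 0.0, 0.0]]]
query = [[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]]]
positions, tokens = encode([reference, query])
outputs, weights = self_attention(tokens)
```

Because all tokens from both frames sit in one sequence, every attention row covers positions in the same frame (spatial) and in the other frame (temporal). This is the property the abstract attributes to the transformer-based relationship module.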
