Abstract: Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still have two problems. First, they rely on a complicated and redundant two-extractor pipeline to draw cues from more reference frames, which increases the models' parameter count and complexity. Second, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they lack adequate modeling of spatial-temporal relationships. In this paper, to address both problems, we propose a unified transformer-based framework for VOS: a compact single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model's effectiveness and flexibility, we design a single dynamic feature extractor with a dynamic input adapter that encodes the two key inputs: the reference sets (historical frames with predicted masks) and the query frame (current frame). Moreover, because the relationships among different frames and inside every frame are crucial for this task, we introduce a vision transformer to model the temporal and spatial relationships simultaneously. By cascading the proposed dynamic feature extractor, the transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both the DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model's efficiency. On DAVIS17 validation, we achieve ∼ 50 J & F ) drop in segmentation quality. Code is available at https://github.com/sallymmx/TransVOS.git.
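To make the single-extractor idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an input adapter can bring both input types to a common channel layout (RGB plus one mask channel, with an all-zero mask for the query frame), and that frames are then flattened into one token sequence so that attention can span positions within a frame and across frames. The names `adapt_input` and `to_token_sequence` are hypothetical, and the real adapter presumably uses learned layers rather than simple channel stacking.

```python
from typing import List, Optional

Frame = List[List[List[float]]]  # channels x height x width
Mask = List[List[float]]         # height x width


def adapt_input(frame: Frame, mask: Optional[Mask] = None) -> Frame:
    """Hypothetical input adapter: return a 4-channel input (RGB + mask).

    Reference frames carry their predicted mask. The query frame has none,
    so an all-zero mask channel is appended (an assumption of this sketch).
    """
    height, width = len(frame[0]), len(frame[0][0])
    if mask is None:
        mask = [[0.0] * width for _ in range(height)]
    return list(frame) + [mask]


def to_token_sequence(frames: List[Frame]) -> List[List[float]]:
    """Flatten T frames of shape C x H x W into T*H*W tokens of size C.

    Attention over this sequence covers positions inside one frame (spatial)
    and positions in different frames (temporal) at the same time.
    """
    tokens = []
    for frame in frames:
        height, width = len(frame[0]), len(frame[0][0])
        for y in range(height):
            for x in range(width):
                tokens.append([channel[y][x] for channel in frame])
    return tokens


# Toy 2x2 RGB frame, used both as a reference frame and as the query frame.
rgb = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(3)]
reference = adapt_input(rgb, mask=[[1.0, 0.0], [0.0, 0.0]])
query = adapt_input(rgb)  # no mask available for the current frame
tokens = to_token_sequence([reference, query])
```

Because both inputs end up with the same layout, one extractor can process them, and the joint token sequence is what a transformer-style relationship module would consume in this simplified picture.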

LiDAR Video Object Segmentation with Dynamic Kernel Refinement

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Object tracking with 3D LIDAR via multi-task sparse learning

Fast Real-Time Video Object Segmentation with a Tangled Memory Network

Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking Via Memory Networks

LiVOS: Light Video Object Segmentation with Gated Linear Matching

Dual Temporal Memory Network for Efficient Video Object Segmentation

Learning Spatial-Semantic Features for Robust Video Object Segmentation

Efficient Spatial-Temporal Information Fusion for LiDAR-Based 3D Moving Object Segmentation

Real-time Moving Object Segmentation with Tracking and Tracklet Belief

Moving Object Segmentation in 3D LiDAR Data: A Learning-based Approach Exploiting Sequential Data

3D-SeqMOS: A Novel Sequential 3D Moving Object Segmentation in Autonomous Driving

Receding Moving Object Segmentation in 3D LiDAR Data Using Sparse 4D Convolutions

Dual temporal memory network with high-order spatio-temporal graph learning for video object segmentation

Segmented Curved-Voxel Occupancy Descriptor for Dynamic-Aware LiDAR Odometry and Mapping

Motion-Guided Spatial Time Attention for Video Object Segmentation

Learning Spatial and Temporal Variations for 4D Point Cloud Segmentation

Region Aware Video Object Segmentation With Deep Motion Modeling

StreamMOS: Streaming Moving Object Segmentation with Multi-View Perception and Dual-Span Memory

MotionBEV: Attention-Aware Online LiDAR Moving Object Segmentation with Bird's Eye View based Appearance and Motion Features

Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation