Abstract: Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still have two problems. First, they apply a complicated and redundant two-extractor pipeline to use more reference frames for cues, which increases the models' parameters and complexity. Second, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they lack adequate modeling of spatial-temporal relationships. In this paper, to address the two problems, we propose a unified transformer-based framework for VOS, a compact single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model's effectiveness and flexibility, we design a single dynamic feature extractor with an ingenious dynamic input adapter to encode the two significant inputs, i.e., the reference sets (historical frames with predicted masks) and the query frame (current frame). Moreover, because the relationships among different frames and inside every frame are crucial for this task, we introduce a vision transformer to exploit and model the temporal and spatial relationships simultaneously. Through the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both the DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model's efficiency. On DAVIS17 validation, we achieve ∼ 50 J & F ) drop in segmentation quality. Code is available at https://github.com/sallymmx/TransVOS.git.
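
As a rough illustration of the single-extractor idea described in the abstract, the following minimal sketch (not the authors' implementation) passes reference frames (with masks) and the query frame (without a mask) through one shared adapter, concatenates the resulting tokens into a single sequence, and applies plain self-attention over it so that every token can attend both within its own frame (spatial) and across frames (temporal). The names `adapt_input`, `build_sequence`, and `self_attention`, the five-channel token layout, and the identity attention projections are all assumptions made for illustration.

```python
import math


def adapt_input(frame, mask=None):
    """Hypothetical input adapter shared by reference and query frames.

    frame: H x W nested list of (r, g, b) tuples.
    mask:  optional H x W nested list of 0/1 values (reference frames only).
    Returns H*W tokens of the form [r, g, b, mask_value, is_reference].
    """
    is_ref = 0.0 if mask is None else 1.0
    tokens = []
    for y, row in enumerate(frame):
        for x, (r, g, b) in enumerate(row):
            m = 0.0 if mask is None else float(mask[y][x])
            tokens.append([r, g, b, m, is_ref])
    return tokens


def build_sequence(reference_set, query_frame):
    """Concatenate tokens of all reference frames and the query frame."""
    seq = []
    for frame, mask in reference_set:
        seq.extend(adapt_input(frame, mask))
    seq.extend(adapt_input(query_frame))
    return seq


def self_attention(tokens):
    """Single-head dot-product self-attention with identity projections.

    Because the sequence mixes tokens from every frame, each output token
    is a weighted average over same-frame and other-frame tokens alike.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * v[j] for w, v in zip(weights, tokens))
            for j in range(d)
        ])
    return out


# Toy usage: two 2x2 reference frames with masks, one 2x2 query frame.
frame = [[(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)],
         [(0.7, 0.8, 0.9), (0.0, 0.1, 0.2)]]
mask = [[1, 0], [0, 1]]
sequence = build_sequence([(frame, mask), (frame, mask)], frame)
attended = self_attention(sequence)
```

A real model would use learned convolutional features and learned query, key, and value projections; the sketch only shows how a single adapter can serve both input types and how one joint sequence exposes spatial and temporal relationships to the same attention operation.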

Point Spatio-Temporal Transformer Networks for Point Cloud Video Modeling

Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos

Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos

PSTNet: Point Spatio-Temporal Convolution on Point Cloud Sequences

TransVOS: Video Object Segmentation with Transformers

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Deep Hierarchical Representation of Point Cloud Videos via Spatio-Temporal Decomposition

On Exploring PDE Modeling for Point Cloud Video Representation Learning

Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding

Adaptive Channel Encoding Transformer for Point Cloud Analysis

Spatial Transformer for 3D Point Clouds

Efficient Point Cloud Video Recognition via Spatio-Temporal Pruning for MEC Based Consumer Applications

PVT: Point-Voxel Transformer for Point Cloud Learning

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

TTPOINT: A Tensorized Point Cloud Network for Lightweight Action Recognition with Event Cameras

PCT: Point cloud transformer

TAPTR: Tracking Any Point with Transformers as Detection

PointMT: Efficient Point Cloud Analysis with Hybrid MLP-Transformer Architecture

3DPCT: 3D Point Cloud Transformer with Dual Self-attention

Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds

PointMTL: Multi-Transform Learning for Effective 3D Point Cloud Representations