Abstract: Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still have two problems. First, they apply a complicated and redundant two-extractor pipeline to use more reference frames for cues, which increases the models’ parameters and complexity. Second, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they lack adequate modeling of spatial-temporal relationships. In this paper, to address these two problems, we propose a unified transformer-based framework for VOS: a compact single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with a dynamic input adapter to encode the two key inputs, i.e., the reference sets (historical frames with predicted masks) and the query frame (current frame). Because the relationships among different frames and inside every frame are crucial for this task, we introduce a vision transformer to model the temporal and spatial relationships simultaneously. By cascading the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both the DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve ∼ 50 J & F ) drop in segmentation quality. Code is available at https://github.com/sallymmx/TransVOS.git.
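To illustrate how a single attention pass can cover both kinds of relationship, the following is a minimal sketch, not the released TransVOS code. It assumes that per-frame feature maps from the reference frames and the query frame are flattened into one joint token sequence, and it uses plain single-head attention with identity projections instead of learned weights. All names here (`flatten_frames`, `self_attention`, `T_REF`, the toy sizes) are hypothetical and chosen only for illustration.

```python
import math
import random


def flatten_frames(frames):
    """Flatten a list of feature maps (each H x W x C nested lists) into one
    token sequence, plus a parallel list of (frame, row, col) positions."""
    tokens, positions = [], []
    for t, fmap in enumerate(frames):
        for y, row in enumerate(fmap):
            for x, feat in enumerate(row):
                tokens.append(feat)
                positions.append((t, y, x))
    return tokens, positions


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.
    Assumption: identity query/key/value projections (no learned weights)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w_j * v[c] for w_j, v in zip(w, tokens)) for c in range(dim)]
        )
    return outputs, weights


def random_feature_map(height, width, channels, rng):
    """Stand-in for extractor output; real features would come from a backbone."""
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(channels)] for _ in range(width)]
        for _ in range(height)
    ]


rng = random.Random(0)
T_REF, H, W, C = 2, 2, 3, 4  # toy sizes: 2 reference frames, 2x3 maps, 4 channels
frames = [random_feature_map(H, W, C, rng) for _ in range(T_REF + 1)]  # last = query
tokens, positions = flatten_frames(frames)
outputs, weights = self_attention(tokens)

# Attention row of the last token, which belongs to the query frame.
row = weights[-1]
# Weight placed on tokens of the same frame (spatial relationships) ...
spatial_mass = sum(w for w, (t, _, _) in zip(row, positions) if t == T_REF)
# ... and on tokens of the reference frames (temporal relationships).
temporal_mass = sum(w for w, (t, _, _) in zip(row, positions) if t != T_REF)
```

Because every token attends over the whole joint sequence, each query-frame token distributes its attention across positions in its own frame and in the reference frames at once, which is the sense in which spatial and temporal relationships are modeled simultaneously. The sequence length grows as (number of frames) × H × W, so the choice of which frames enter the sequence directly affects cost.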

VCSOD: A Video Conversion Scheme Based on Salient Object Detection Algorithm

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Horizontal-to-Vertical Video Conversion

An automatic 2D to 3D conversion algorithm using multi-depth cues

Stereoscopic video conversion based on depth tracking

H2V4Sports: Real-Time Horizontal-to-Vertical Video Converter for Sports Lives Via Fast Object Detection and Tracking

A New Approach of 2D-to-3D Video Conversion and Its Implementation on Embedded System

Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks

Intelligent Analysis Oriented Surveillance Video Coding

Video Object Segmentation with 3D Convolution Network

Video Salient Object Detection Via Cross-Frame Cellular Automata

A Novel Video Salient Object Detection Method via Semi-supervised Motion Quality Perception

A Novel Method for Automatic 2D-to-3D Video Conversion

Scalable Video Object Segmentation with Simplified Framework

A Salience & Motion State Based Quality Adjustment for 360-Degree Video Transmission

A Novel 2D-to-3D Video Conversion Method Using Time-Coherent Depth Maps

VCSL: Video Compressive Sensing with Low-complexity ROI Detection in Compressed Domain

Video-object segmentation and 3D-trajectory estimation for monocular video sequences

An efficient method for automatic stereoscopic conversion

A Distributed 2D-to-3D Video Conversion System