Abstract: While supervised learning approaches are effective for video object segmentation, most of them require large amounts of annotation, which is expensive and time-consuming to obtain. Recently, self-supervised learning has attracted great attention because it can learn from unlabeled video sequences. However, current patch-based self-supervised video object segmentation methods only discriminate the patch from the entire image, without distinguishing the object of interest from meaningless background or occlusion. These disturbances degrade the extracted features and reduce tracking robustness on real-world video sequences. In this paper, we propose a novel model named Tracker With Integration-Augmented Attention (TWIAA) that achieves strong performance without labels. Specifically, we integrate the spatial and channel dimensions by introducing a feature spatial enhancement module and a two-stream channel module. By combining the two modules, the network can focus on the discriminative object and suppress irrelevant regions, which improves tracking robustness. Moreover, unlike other methods that calculate features separately on the search branch and the template branch, the two modules are coupled with the Siamese network and compute the features of the search branch and the template branch jointly, which strengthens the interdependence of the two branches. This interdependence is injected into both the spatial and channel dimensions, so that our approach establishes richer and more discriminative associations and identifies the object more accurately. In addition, our method takes full advantage of cycle-consistency information in consecutive frames, using coherence as the learning signal to acquire object-oriented relationships. Extensive experiments and ablation studies are conducted on large VOS benchmarks, including DAVIS-2017, YouTube-VOS-2018, and YouTube-VOS-2019. The results verify that our proposed framework has both strong feature representation and competitive performance compared with supervised and self-supervised models.
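The abstract does not spell out how cycle consistency becomes a learning signal, so the following is only a minimal NumPy sketch of one common formulation, not the TWIAA implementation. It assumes patch features from two frames are compared by cosine similarity, turned into row-stochastic transition matrices with a softmax, and that a patch tracked forward and then backward should land on itself. The function names, the `temperature` value, and the negative-log-diagonal loss are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def affinity(feat_a, feat_b, temperature=0.07):
    """Row-stochastic transition matrix from patches of frame A to frame B.

    feat_a, feat_b: arrays of shape (num_patches, feature_dim).
    temperature is an assumed hyperparameter, not taken from the paper.
    """
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    return softmax(a @ b.T / temperature, axis=1)


def cycle_consistency_loss(feat_a, feat_b, temperature=0.07):
    """Negative log-probability that each patch returns to itself after A -> B -> A."""
    forward = affinity(feat_a, feat_b, temperature)
    backward = affinity(feat_b, feat_a, temperature)
    cycle = forward @ backward  # (N, N); each row still sums to 1
    diag = np.clip(np.diag(cycle), 1e-12, None)  # avoid log(0)
    return float(-np.log(diag).mean())


if __name__ == "__main__":
    frame_a = np.eye(4)          # four perfectly distinguishable patches
    uninformative = np.ones((4, 4))  # every patch looks the same
    print(cycle_consistency_loss(frame_a, frame_a))
    print(cycle_consistency_loss(frame_a, uninformative))
```

Under these assumptions, identical and well-separated features give a loss close to zero, while features that carry no information about patch identity give a uniform cycle matrix and a loss of log N (here log 4). Minimizing such a loss on unlabeled consecutive frames is one way coherence can act as supervision.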

Linking vision and motion for self-supervised object-centric perception

Motion Inspired Unsupervised Perception and Prediction in Autonomous Driving

Self-supervised Visual Reinforcement Learning with Object-centric Representations

Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving

Self-supervised Video Object Segmentation Using Integration-Augmented Attention

Deep Object-Centric Policies for Autonomous Driving

Online Object Representations with Contrastive Learning

Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation

A self-supervised monocular odometry with visual-inertial and depth representations

Self-Supervised Multi-Object Tracking For Autonomous Driving From Consistency Across Timescales

Self-Supervised 3D Reconstruction and Ego-Motion Estimation Via On-Board Monocular Video

Self-Improving Visual Odometry

CarFormer: Self-Driving with Learned Object-Centric Representations

Self-supervised Amodal Video Object Segmentation

View-to-Label: Multi-View Consistency for Self-Supervised 3D Object Detection

Self-Supervised monocular visual odometry based on cross-correlation

Unsupervised Learning of Depth, Optical Flow and Pose With Occlusion From 3D Geometry

Salient Sparse Visual Odometry With Pose-Only Supervision

VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving

Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos