Abstract: Supervised learning approaches are effective for video object segmentation, but most of them require large amounts of annotation, which is expensive and time-consuming to produce. Recently, self-supervised learning has attracted great attention because it can learn from unlabeled video sequences. However, current patch-based self-supervised video object segmentation methods only discriminate the patch from the entire image, without distinguishing the object of interest from meaningless background or occlusion. These disturbances degrade the extracted features and hinder the robustness of tracking on real-world video sequences. In this paper, we propose a novel model named Tracker With Integration-Augmented Attention (TWIAA) that is label-free yet achieves strong performance. Specifically, we integrate the spatial and channel dimensions by introducing a feature spatial enhancement module and a two-stream channel module. With the two modules combined, the network can focus on the discriminative object and suppress irrelevant regions, improving tracking robustness. Moreover, unlike other methods that calculate features separately on the search branch and the template branch, the two designed modules, coupled with the Siamese network, compute the features of the search branch and the template branch jointly to strengthen the interdependence of the two branches. This interdependence is injected into both the spatial and channel dimensions, so that our approach establishes richer and more discriminative associations and identifies the object more accurately. In addition, our method takes full advantage of cycle-consistency information in consecutive frames, using coherence as the learning signal to acquire object-oriented relationships. Extensive experiments and ablation studies are conducted on large VOS benchmarks, including DAVIS-2017, YouTube-VOS-2018, and YouTube-VOS-2019. The results verify that our proposed framework has both strong feature representation and competitive performance compared with supervised and self-supervised models.
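
The abstract names cycle-consistency across consecutive frames as the learning signal but does not spell out a formula. As a rough illustration only, the sketch below shows one common way such a signal can be computed: patches are softly matched forward through the frames and then backward, and the loss is low when each patch returns to where it started. This is a generic sketch and not the TWIAA implementation; the function names, the dot-product affinity, and the temperature value are all assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def affinity(src, dst, temperature=0.1):
    """Row i is a soft assignment of patch src[i] over the patches in dst.

    The dot-product similarity and the temperature are illustrative choices.
    """
    return [softmax([dot(s, d) / temperature for d in dst]) for s in src]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def cycle_consistency_loss(frames, temperature=0.1):
    """Track patches forward through `frames` and back to the first frame.

    `frames` is a list of frames; each frame is a list of patch feature
    vectors, with the same number of patches per frame. The loss is the mean
    negative log-probability that a patch returns to its starting index.
    """
    # Forward then backward path, e.g. [f0, f1, f2] -> [f0, f1, f2, f1, f0].
    path = frames + frames[-2::-1]
    n = len(frames[0])
    # Start from the identity: every patch is at its own index.
    walk = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for src, dst in zip(path, path[1:]):
        walk = matmul(walk, affinity(src, dst, temperature))
    return -sum(math.log(walk[i][i] + 1e-12) for i in range(n)) / n

# Two patches with distinct features that stay distinct in the next frame.
stable = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
# Same first frame, but the next frame's patches are indistinguishable.
ambiguous = [[[1.0, 0.0], [0.0, 1.0]], [[0.7, 0.7], [0.7, 0.7]]]

print(cycle_consistency_loss(stable))
print(cycle_consistency_loss(ambiguous))
```

In this toy setup the stable sequence yields a loss near zero, while the ambiguous one yields about ln 2, since each patch is equally likely to return to either starting position. Features that keep the object distinguishable from its surroundings are therefore rewarded without any labels, which is the intuition behind using coherence as supervision.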

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Spatial-Temporal Feature Aggregation Network for Video Object Detection

Temporal-adaptive sparse feature aggregation for video object detection

Practical Video Object Detection via Feature Selection and Aggregation

Adaptive Feature Aggregation for Video Object Detection

Spatial-Temporal Multi-level Association for Video Object Segmentation

Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection

Multi-view Aggregation for Real-Time Accurate Object Detection of a Moving Camera

Video object matching across multiple non-overlapping camera views based on multi-feature fusion and incremental learning

Video object detection via space–time feature aggregation and result reuse

Self-supervised Video Object Segmentation Using Integration-Augmented Attention

Adaptive Scale and Spatial Aggregation for Real-Time Object Detection

Learning Spatial-Semantic Features for Robust Video Object Segmentation

MAMBA: Multi-level Aggregation via Memory Bank for Video Object Detection

Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation

Voxelized 3D Feature Aggregation for Multiview Detection

Real-Time and Accurate Object Detection in Compressed Video by Long Short-term Feature Aggregation

DFA: Dynamic Feature Aggregation for Efficient Video Object Detection

Multi-scale Spatial-temporal Interaction Network for Video Anomaly Detection

Fianet: Video Object Detection Via Joint Feature-Level and Instance-Level Aggregation