Abstract: While supervised learning approaches are highly effective for video object segmentation, most of them require large amounts of annotations, which are expensive and time-consuming to obtain. Recently, self-supervised learning has attracted great attention because it can learn from unlabeled video sequences. However, current patch-based self-supervised video object segmentation methods only discriminate a patch from the entire image, without distinguishing the object of interest from uninformative background or occlusions. These disturbances degrade the extracted features and reduce tracking robustness on real-world video sequences. In this paper, we propose a novel model named Tracker With Integration-Augmented Attention (TWIAA) that achieves strong performance without labels. Specifically, we integrate the spatial and channel dimensions by introducing a feature spatial enhancement module and a two-stream channel module. With the combination of the two modules, the network can focus on the discriminative object and suppress irrelevant regions, which improves tracking robustness. Moreover, unlike other methods that calculate features separately on the search branch and the template branch, the two modules, coupled with the Siamese network, compute the features of the two branches jointly to strengthen their interdependence. This interdependence is injected into both the spatial and channel dimensions, so that our approach establishes richer and more discriminative associations and identifies the object more accurately. In addition, our method takes full advantage of cycle-consistency information in consecutive frames, using coherence as the learning signal to acquire object-oriented relationships. Extensive experiments and ablation studies are conducted on large VOS benchmarks, including DAVIS-2017, YouTube-VOS-2018, and YouTube-VOS-2019. The results verify that the proposed framework has both strong feature representation and competitive performance compared with supervised and self-supervised models.
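The cycle-consistency signal mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss used by TWIAA, so everything below is an assumption made for illustration: patch features are plain vectors, correspondences are softmax affinities over cosine similarity, and the loss rewards each patch for returning to itself after tracking forward to the next frame and back. The function names and the `temperature` parameter are hypothetical.

```python
import math

def normalize(vec):
    """Scale a feature vector to unit length (guarding against zero vectors)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def softmax(row, temperature):
    """Numerically stable softmax with a temperature (hypothetical hyperparameter)."""
    m = max(row)
    exps = [math.exp((v - m) / temperature) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def affinity(src, dst, temperature=0.1):
    """Row i is a soft assignment of patch src[i] to the patches in dst."""
    src = [normalize(v) for v in src]
    dst = [normalize(v) for v in dst]
    return [
        softmax([sum(a * b for a, b in zip(s, d)) for d in dst], temperature)
        for s in src
    ]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def cycle_consistency_loss(frame_t, frame_t1, temperature=0.1):
    """Mean negative log-probability that a patch returns to itself
    after a forward (t -> t+1) and backward (t+1 -> t) soft tracking step."""
    forward = affinity(frame_t, frame_t1, temperature)
    backward = affinity(frame_t1, frame_t, temperature)
    round_trip = matmul(forward, backward)
    n = len(round_trip)
    return -sum(math.log(round_trip[i][i] + 1e-12) for i in range(n)) / n

# Two patches that swap positions between frames: the cycle closes cleanly.
consistent = cycle_consistency_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[0.0, 1.0], [1.0, 0.0]])
# Two indistinguishable patches in the next frame: the cycle is ambiguous.
ambiguous = cycle_consistency_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 1.0], [1.0, 1.0]])
print(consistent < ambiguous)
```

In this toy setting the loss is low when features make the forward and backward correspondences unambiguous, and higher when patches cannot be told apart. This is the sense in which coherence between consecutive frames can act as a label-free learning signal.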

Multidimensional Exploration of Segment Anything Model for Weakly Supervised Video Salient Object Detection

Spatial Likelihood Voting with Self-Knowledge Distillation for Weakly Supervised Object Detection

Multi-Scale and Detail-Enhanced Segment Anything Model for Salient Object Detection

Weakly supervised salient object detection via bounding-box annotation and SAM model

WeakSAM: Segment Anything Meets Weakly-supervised Instance-level Recognition

Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance

Self-supervised Video Object Segmentation Using Integration-Augmented Attention

Endow SAM with Keen Eyes: Temporal-spatial Prompt Learning for Video Camouflaged Object Detection

Weakly Supervised Video Salient Object Detection via Point Supervision

A Visual Representation-guided Framework with Global Affinity for Weakly Supervised Salient Object Detection

Joint Multisource Saliency and Exemplar Mechanism for Weakly Supervised Video Object Segmentation

Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track

SSFam: Scribble Supervised Salient Object Detection Family

Boosting Segment Anything Model Towards Open-Vocabulary Learning

RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation

Weakly-Supervised Concealed Object Segmentation with SAM-based Pseudo Labeling and Multi-scale Feature Grouping

MeSAM: Multiscale Enhanced Segment Anything Model for Optical Remote Sensing Images

SAM-PM: Enhancing Video Camouflaged Object Detection using Spatio-Temporal Attention

UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model

A Novel Video Salient Object Detection Method Via Semisupervised Motion Quality Perception

Evaluating SAM2's Role in Camouflaged Object Detection: From SAM to SAM2