Multidimensional Exploration of Segment Anything Model for Weakly Supervised Video Salient Object Detection

Binwei Xu, Qiuping Jiang, Xing Zhao, Chenyang Lu, Haoran Liang, Ronghua Liang
DOI: https://doi.org/10.1109/tcsvt.2024.3368053
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Fully supervised video salient object detection (VSOD) has made considerable breakthroughs, but it relies on costly and time-consuming pixel-wise annotations. Recently, scribble-based VSOD has attracted increasing attention as a trade-off between annotation burden and model performance. However, learning the complete object structure and precise boundary details from sparse scribble annotations remains challenging. In this paper, we propose a series of strategies that explore valid information from the recently proposed segmentation foundation model “Segment Anything Model (SAM)” from various perspectives to address these challenges. Specifically, because SAM performs poorly on videos, we propose a SAM-guided label enhancement method rather than using the results of SAM directly; this introduces edge information while reducing interference from erroneous information. Moreover, we propose a SAM-driven spatiotemporal network guided by general semantic features from the SAM encoder to help the model be aware of global connections. Additionally, we propose a SAM-based global-aware loss, which further considers the affinity constraint between the predicted results and the foreground or background labels from a global perspective, guiding the model to perceive complete salient objects. Experimental results demonstrate that our method outperforms state-of-the-art weakly supervised VSOD methods and is comparable to fully supervised VSOD methods.