Semi-AVS: Segmenting the Sounding Objects Via Semi-supervised Learning

Chengcheng Li, Zhengyi Liu, Wei Wu
DOI: https://doi.org/10.1145/3674225.3674394
2024-01-01
Abstract: Audio-visual segmentation (AVS) is a complex task that requires accurately segmenting the sounding objects in visual frames. Existing methods introduce audio semantics and a regularization loss to guide visual segmentation. However, they overlook the one-shot annotation, full-shot prediction setting of the single-source dataset (i.e., only the ground truth of the first sampled frame of each video is given). In this work, we propose a semi-supervised audio-visual segmentation framework, called Semi-AVS, that propagates the mask of the first frame to the later frames, since all frames in a video share the same sound source. Furthermore, an audio-visual interaction module is designed both to locate the object in the visual frame via audio and to make the audio perceive the visual context. With semi-supervised learning, our method outperforms state-of-the-art models on the single-source AVS task. The audio-visual interaction module is also verified on the fully supervised multi-source and semantic AVS tasks.