Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu
2024-07-16
Abstract: Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. Such expressions are articulated in natural language forms but are enriched with multimodal cues, including audio and visual descriptions. To facilitate this research, we construct the first Ref-AVS benchmark, which provides pixel-level annotations for objects described in corresponding multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance. Finally, we conduct quantitative and qualitative experiments on three test subsets to compare our approach with existing methods from related tasks. The results demonstrate the effectiveness of our method, highlighting its capability to precisely segment objects using multimodal-cue expressions. Dataset is available at https://gewu-lab.github.io/Ref-AVS.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the problem of locating and segmenting target objects in natural, dynamic audio-visual scenes using multimodal cues (including audio and visual descriptions). Specifically:

1. **Reference Audio-Visual Segmentation Task (Ref-AVS)**:
   - Traditional reference segmentation tasks mainly focus on silent visual scenes, neglecting the multimodal perception and interaction in human experiences.
   - The paper introduces a new task, Reference Audio-Visual Segmentation (Ref-AVS), which aims to segment objects in the visual domain based on expressions containing multimodal cues.
   - These expressions are written in natural language and are enriched with multimodal cues, including audio and visual descriptions.
2. **Dataset Construction**:
   - To facilitate research, the authors constructed the first Ref-AVS benchmark dataset, which provides pixel-level annotations for the objects described in the corresponding multimodal-cue expressions.
   - The dataset contains approximately 4,000 video clips with audio, over 60% of which contain multi-source sounds, and includes over 20,000 expert-verified reference expressions.
3. **Method Proposal**:
   - To tackle the Ref-AVS task, the authors proposed a new method called EEMC (Expression Enhancing with Multimodal Cues), which uses multimodal cues to provide precise segmentation guidance.
   - The method includes a Temporal Bi-modal Transformer and a Prompting with Multimodal Cues module.
4. **Experimental Results**:
   - Quantitative and qualitative experiments were conducted on three test subsets, comparing the proposed method with existing methods from related tasks.
   - The experimental results demonstrate that the proposed method can effectively use multimodal cues for precise target segmentation.
In summary, this paper introduces the Ref-AVS task and its benchmark dataset, and proposes the EEMC method to address it. Experiments on three test subsets indicate that the method segments the referred objects effectively in natural, dynamic audio-visual scenes.
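
To make the task's inputs and outputs concrete, the following is a minimal sketch of what one benchmark item and a mask-overlap score could look like. Everything here is an assumption for illustration: the `RefAVSSample` container and its field names are hypothetical and do not reflect the actual dataset format, and the summary above does not say which metrics the paper reports. Intersection over union (the Jaccard index) is used only because it is a common way to compare a predicted segmentation mask with a pixel-level annotation.

```python
from dataclasses import dataclass
from typing import List

# Binary mask as rows of 0/1 values: 1 marks a pixel of the referred object.
Mask = List[List[int]]


@dataclass
class RefAVSSample:
    """Hypothetical container for one item; field names are assumptions."""
    video_id: str
    expression: str         # natural-language expression with audio/visual cues
    audio_path: str         # audio track of the clip
    frame_paths: List[str]  # video frames of the clip
    gt_masks: List[Mask]    # one pixel-level annotation per annotated frame


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: List[Mask], gts: List[Mask]) -> float:
    """Average per-frame IoU over a clip (assumes at least one frame)."""
    scores = [mask_iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)
```

A model for this task would take the frames, the audio, and the expression of a sample and return one predicted mask per frame; `mean_iou(predicted_masks, sample.gt_masks)` would then score that prediction against the annotation.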