Abstract: Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. Such expressions are articulated in natural language forms but are enriched with multimodal cues, including audio and visual descriptions. To facilitate this research, we construct the first Ref-AVS benchmark, which provides pixel-level annotations for objects described in corresponding multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance. Finally, we conduct quantitative and qualitative experiments on three test subsets to compare our approach with existing methods from related tasks. The results demonstrate the effectiveness of our method, highlighting its capability to precisely segment objects using multimodal-cue expressions. The dataset is available at https://gewu-lab.github.io/Ref-AVS.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the problem of locating and segmenting target objects in natural, dynamic audio-visual scenes using multimodal cues (including audio and visual descriptions). Specifically:

1. **Reference Audio-Visual Segmentation Task (Ref-AVS)**:
   - Traditional reference segmentation tasks mainly focus on silent visual scenes, neglecting the multimodal perception and interaction in human experiences.
   - This paper introduces a new task, Reference Audio-Visual Segmentation (Ref-AVS), which aims to segment objects in the visual domain based on expressions containing multimodal cues.
   - These expressions are written in natural language and are enriched with multimodal cues, including audio and visual descriptions.
2. **Dataset Construction**:
   - To facilitate research, the authors constructed the first Ref-AVS benchmark dataset, which provides pixel-level annotations for the objects described in the corresponding multimodal-cue expressions.
   - The dataset contains approximately 4,000 video clips with audio, over 60% of which contain multi-source sounds, together with more than 20,000 expert-verified reference expressions.
3. **Method Proposal**:
   - To tackle the Ref-AVS task, the authors proposed a new method called EEMC (Expression Enhancing with Multimodal Cues), which makes full use of multimodal cues to provide precise segmentation guidance.
   - The method includes a Temporal Bi-modal Transformer and a Prompting with Multimodal Cues module.
4. **Experimental Results**:
   - Quantitative and qualitative experiments were conducted on three test subsets, comparing the proposed method with existing methods from related tasks.
   - The experimental results demonstrate that the proposed method can effectively use multimodal cues for precise target segmentation.

In summary, this paper introduces the Ref-AVS task and the corresponding benchmark dataset, and proposes a new method for this challenging task, demonstrating its effectiveness in natural, dynamic audio-visual scenes.
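
To make the task's inputs and outputs concrete, the following is a minimal sketch of what one Ref-AVS example might look like in code, assuming a per-frame binary mask as the pixel-level annotation. The class `RefAVSSample`, its field names, and the helper `mask_iou` are hypothetical and are not taken from the paper or its released dataset; intersection over union is shown only as a common overlap measure for masks and is not necessarily the metric the authors report.

```python
from dataclasses import dataclass
from typing import List

# Binary mask for one frame: 1 marks a target pixel, 0 marks background.
Mask = List[List[int]]


@dataclass
class RefAVSSample:
    """Hypothetical container for one example (field names are illustrative)."""
    frames: List[str]   # identifiers or paths of the video frames
    audio: str          # identifier or path of the accompanying audio track
    expression: str     # natural-language expression with audio and/or visual cues
    masks: List[Mask]   # one pixel-level ground-truth mask per frame


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks with the same shape."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    union = sum(p | t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    # Two empty masks agree perfectly, so treat that case as full overlap.
    return 1.0 if union == 0 else inter / union


# Illustrative usage with a made-up 2x2 frame and expression.
sample = RefAVSSample(
    frames=["frame_0"],
    audio="clip.wav",
    expression="the object making the loudest sound",
    masks=[[[1, 0], [0, 0]]],
)
predicted = [[1, 1], [0, 0]]
score = mask_iou(predicted, sample.masks[0])
```

Here the prediction covers the one target pixel plus one background pixel, so the overlap is one pixel out of a two-pixel union.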

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Audio-Visual Instance Segmentation

Audio-Visual Segmentation

Audio-Visual Segmentation with Semantics

Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation

Annotation-free Audio-Visual Segmentation

Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation

Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge

Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues

Open-Vocabulary Audio-Visual Semantic Segmentation

3D Audio-Visual Segmentation

Cross-modal Cognitive Consensus guided Audio-Visual Segmentation

Segment Beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation

EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation

QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition

Unsupervised Audio-Visual Segmentation with Modality Alignment