Abstract: Recently, an audio-visual segmentation (AVS) task has been introduced, aiming to group pixels with sounding objects within a given video. This task necessitates a first-ever audio-driven pixel-level understanding of the scene, posing significant challenges. In this paper, we propose an innovative audio-visual transformer framework, termed COMBO, an acronym for COoperation of Multi-order Bilateral relatiOns. For the first time, our framework explores three types of bilateral entanglements within AVS: pixel entanglement, modality entanglement, and temporal entanglement. Regarding pixel entanglement, we employ a Siam-Encoder Module (SEM) that leverages prior knowledge to generate more precise visual features from the foundational model. For modality entanglement, we design a Bilateral-Fusion Module (BFM), enabling COMBO to align corresponding visual and auditory signals bi-directionally. As for temporal entanglement, we introduce an innovative adaptive inter-frame consistency loss based on the inherent temporal consistency of the task. Comprehensive experiments and ablation studies on the AVSBench-object (84.7 mIoU on S4, 59.2 mIoU on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that COMBO surpasses previous state-of-the-art methods. Code and more results will be publicly available at https://yannqi.github.io/AVS-COMBO/.
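
The mIoU figures quoted in the abstract are mean intersection-over-union scores. The exact AVSBench evaluation protocol is not described here, so the following is only a rough sketch of how such a score can be computed for binary masks. The function names, the nested-list mask format, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
def binary_iou(pred, target):
    """IoU between two binary masks given as nested lists of 0/1 values."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(pred_masks, target_masks):
    """Average the per-frame IoU over all frames."""
    scores = [binary_iou(p, t) for p, t in zip(pred_masks, target_masks)]
    return sum(scores) / len(scores)


# Two toy 2x2 frames: the first overlaps on one of two foreground pixels,
# the second is empty in both prediction and ground truth.
preds = [[[1, 1], [0, 0]], [[0, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[0, 0], [0, 0]]]
miou = mean_iou(preds, targets)
```

A reported value such as 84.7 mIoU would correspond to this kind of score expressed as a percentage and averaged over the test set.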

What problem does this paper attempt to address?

The problem this paper addresses is how to combine audio and visual information more effectively in the Audio-Visual Segmentation (AVS) task in order to achieve pixel-level segmentation of sounding objects in videos. Specifically, the paper proposes a framework named COMBO that explores three bilateral relationships in AVS: pixel entanglement, modality entanglement, and temporal entanglement. Each of the three targets a distinct problem:

1. **Pixel Entanglement**: Addresses inaccurate image-to-mask prediction caused by background noise. The Siam-Encoder Module (SEM) uses prior knowledge generated by the foundation model to produce more precise visual features.
2. **Modality Entanglement**: Addresses the alignment between audio and visual signals. The Bilateral-Fusion Module (BFM) fuses audio and visual signals in both directions, improving the efficiency of cross-modal matching.
3. **Temporal Entanglement**: Addresses the transfer of temporal information between frames. The Adaptive Inter-frame Consistency Loss exploits the inherent temporal characteristics of the audio-visual task to make the outputs more consistent across frames.

The main contributions of the paper include:

- Proposing the Siam-Encoder Module (SEM) for mining potential pixel entanglement.
- Designing the Bilateral-Fusion Module (BFM) to fully utilize the potential of the audio and visual modalities and explore modality entanglement.
- Introducing the Adaptive Inter-frame Consistency Loss, based on inherent temporal consistency, to enhance temporal entanglement.
- Showing experimentally that COMBO outperforms existing state-of-the-art methods on the challenging AVSBench-object and AVSBench-semantic datasets.

Through these innovations, COMBO improves performance on the audio-visual segmentation task, reaching 84.7 mIoU in the single-sound-source (S4) setting and 59.2 mIoU in the multi-sound-source (MS3) setting of AVSBench-object, and 42.1 mIoU on AVSBench-semantic (AVSS).
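
The summary above does not give the formula of the Adaptive Inter-frame Consistency Loss, so the following is not the paper's definition. It is only a minimal sketch of the general idea of an inter-frame consistency penalty: predictions for adjacent frames are pushed to agree, and the penalty is scaled by an adaptive weight. Here the weight is taken, purely as an assumption, to be the cosine similarity of hypothetical per-frame audio feature vectors; all names and the input format are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.0 if norm == 0 else dot / norm


def inter_frame_consistency_loss(mask_probs, audio_feats):
    """Illustrative consistency penalty over adjacent frames.

    mask_probs:  list of T flattened per-frame mask probability maps.
    audio_feats: list of T per-frame audio feature vectors (hypothetical).
    """
    num_pairs = len(mask_probs) - 1
    if num_pairs <= 0:
        return 0.0
    total = 0.0
    for t in range(num_pairs):
        # Assumed adaptive weight: similar audio in adjacent frames means
        # the masks are expected to agree, so disagreement costs more.
        weight = max(0.0, cosine_similarity(audio_feats[t], audio_feats[t + 1]))
        # Mean squared difference between adjacent mask predictions.
        diff = sum((a - b) ** 2 for a, b in zip(mask_probs[t], mask_probs[t + 1]))
        total += weight * diff / len(mask_probs[t])
    return total / num_pairs


# Three toy frames with two pixels each. The audio is identical for the first
# two frames and orthogonal for the last, so only the first pair is penalized.
masks = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
audio = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
loss = inter_frame_consistency_loss(masks, audio)
```

In this toy formulation, a change in the mask is penalized only when the audio stays similar across the two frames; when the audio changes, the mask is free to change as well. Whether COMBO weights frames this way is not stated in the text above.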

Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

Audio-Visual Segmentation

AVSegFormer: Audio-Visual Segmentation with Transformer

Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues

QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition

Audio-Visual Segmentation with Semantics

Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

Unsupervised Audio-Visual Segmentation with Modality Alignment

BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge

TransAVS: End-To-End Audio-Visual Segmentation With Transformer

Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation

Multimodal Variational Auto-encoder based Audio-Visual Segmentation

Cross-modal Cognitive Consensus guided Audio-Visual Segmentation

Leveraging Foundation models for Unsupervised Audio-Visual Segmentation

Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation