Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation

Qi Yang, Xing Nie, Tong Li, Pengfei Gao, Ying Guo, Cheng Zhen, Pengfei Yan, Shiming Xiang
2024-04-07
Abstract: Recently, an audio-visual segmentation (AVS) task has been introduced, aiming to group pixels with sounding objects within a given video. This task necessitates a first-ever audio-driven pixel-level understanding of the scene, posing significant challenges. In this paper, we propose an innovative audio-visual transformer framework, termed COMBO, an acronym for COoperation of Multi-order Bilateral relatiOns. For the first time, our framework explores three types of bilateral entanglements within AVS: pixel entanglement, modality entanglement, and temporal entanglement. Regarding pixel entanglement, we employ a Siam-Encoder Module (SEM) that leverages prior knowledge to generate more precise visual features from the foundational model. For modality entanglement, we design a Bilateral-Fusion Module (BFM), enabling COMBO to align corresponding visual and auditory signals bi-directionally. As for temporal entanglement, we introduce an innovative adaptive inter-frame consistency loss based on inherent temporal consistency. Comprehensive experiments and ablation studies on AVSBench-object (84.7 mIoU on S4, 59.2 mIoU on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that COMBO surpasses previous state-of-the-art methods. Code and more results will be publicly available at https://yannqi.github.io/AVS-COMBO/.
Computer Vision and Pattern Recognition, Artificial Intelligence, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses how to combine audio and visual information more effectively in the audio-visual segmentation (AVS) task, so that sounding objects in a video can be segmented at the pixel level. It proposes a framework named COMBO that explores three bilateral relations in AVS: pixel entanglement, modality entanglement, and temporal entanglement. Each entanglement targets a specific problem:

1. **Pixel Entanglement**: Addresses inaccurate image-to-mask prediction caused by background noise. The Siam-Encoder Module (SEM) uses prior knowledge generated by the foundational model to produce more precise visual features.
2. **Modality Entanglement**: Addresses the alignment between audio and visual signals. The Bilateral-Fusion Module (BFM) fuses the two signals in both directions, improving the efficiency of cross-modal matching.
3. **Temporal Entanglement**: Addresses the transfer of temporal information between frames. The adaptive inter-frame consistency loss exploits the inherent temporal characteristics of the audio-visual task to make the outputs more consistent across frames.

The main contributions of the paper include:

- Proposing the Siam-Encoder Module (SEM) for mining potential pixel entanglement.
- Designing the Bilateral-Fusion Module (BFM) to fully utilize the potential of the audio and visual modalities and explore modality entanglement.
- Introducing the adaptive inter-frame consistency loss, based on inherent temporal consistency, to enhance temporal entanglement.
- Showing experimentally that COMBO outperforms existing state-of-the-art methods on the challenging AVSBench-object and AVSBench-semantic datasets.
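The idea of bi-directional alignment behind modality entanglement can be illustrated with a toy example. The sketch below is not the paper's BFM; it assumes that "bilateral" fusion can be approximated by plain dot-product cross-attention applied in both directions with a residual connection, and the names `cross_attend` and `bilateral_fuse` are hypothetical. It uses only the standard library and operates on small lists of feature vectors.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Replace each query by a softmax-weighted average of keys_values."""
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(dim)])
    return out


def bilateral_fuse(visual, audio):
    """Hypothetical two-way fusion (an assumption, not the paper's module).

    Visual tokens attend to audio tokens and audio tokens attend to
    visual tokens; each result is added back to its own input.
    """
    v_from_a = cross_attend(visual, audio)
    a_from_v = cross_attend(audio, visual)
    visual_out = [[x + y for x, y in zip(v, u)]
                  for v, u in zip(visual, v_from_a)]
    audio_out = [[x + y for x, y in zip(a, u)]
                 for a, u in zip(audio, a_from_v)]
    return visual_out, audio_out


visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy pixel features
audio_tokens = [[1.0, 0.0]]                            # one toy audio feature
fused_visual, fused_audio = bilateral_fuse(visual_tokens, audio_tokens)
```

Both outputs keep the shape of their inputs, so either stream could be passed on to a later stage. A real module would add learned projections, multiple heads, and normalization, none of which are specified in this summary.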
Through these innovations, COMBO improves on previous state-of-the-art methods in audio-visual segmentation, reporting 84.7 mIoU on the single-sound-source (S4) subset and 59.2 mIoU on the multi-sound-source (MS3) subset of AVSBench-object.
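For readers unfamiliar with the metric, mIoU (mean intersection over union) can be sketched as follows for binary masks. This is a minimal illustration, not the benchmark's evaluation script: the function names are hypothetical, masks are assumed to be nested lists of 0/1 values, and two empty masks are assumed to count as a perfect match. The semantic (AVSS) setting averages over classes as well, which this sketch does not cover.

```python
def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: if both masks are empty, treat the frame as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds, targets):
    """Average the per-frame IoU over all frames."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Two toy 2x2 frames: the first overlaps on 1 of 2 pixels, the second matches.
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
score = mean_iou(preds, targets)  # (0.5 + 1.0) / 2 = 0.75
```

Benchmark results are usually reported as this value multiplied by 100, so a score of 0.75 would be quoted as 75.0 mIoU.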