A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition

Shentong Mo, Pedro Morgado
2023-05-31
Abstract: The ability to accurately recognize, localize and separate sound sources is fundamental to any audio-visual perception task. Historically, these abilities were tackled separately, with several methods developed independently for each task. However, given the interconnected nature of source localization, separation, and recognition, independent models are likely to yield suboptimal performance as they fail to capture the interdependence between these tasks. To address this problem, we propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition. OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives. The first objective aligns audio and visual representations through a localized audio-visual correspondence loss. The second tackles visual source separation using a traditional mix-and-separate framework. Finally, the third objective reinforces visual feature separation and localization by mixing images in pixel space and aligning their representations with those of all corresponding sound sources. Extensive experiments on MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM for all three tasks, audio-visual source localization, separation, and nearest neighbor recognition, and empirically demonstrate a strong positive transfer between them.
Sound, Computer Vision and Pattern Recognition, Machine Learning, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses how to integrate audio and visual cues so that a single model can localize, separate, and recognize sound sources. Historically, these capabilities have been handled independently, with separate methods for each task. Because localization, separation, and recognition are inherently connected, independent models are likely to perform suboptimally: they cannot capture the interdependencies among the tasks. To solve this problem, the authors propose a unified audio-visual learning framework (called OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition. Specifically, OneAVM contains a shared audio-visual encoder and task-specific decoders, and is trained with three objectives:

1. **Localized audio-visual correspondence loss**: Aligns audio and visual representations.
2. **Mix-and-separate framework**: Handles visual source separation using the traditional mix-and-separate approach.
3. **Mixed image alignment**: Mixes images in pixel space and aligns their representations with those of all corresponding sound sources, which reinforces visual feature separation and localization.

Trained jointly, these objectives allow positive transfer across tasks and make the resulting audio-visual model more broadly applicable. Experiments on the MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets show that OneAVM outperforms existing baseline models and achieves significant improvements in sound source localization, separation, and nearest-neighbor recognition.
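The three training objectives described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: every function name here is hypothetical, and the specific choices (cosine similarity per spatial location, max-pooling the similarity map, an InfoNCE-style contrastive loss, ratio masks as separation targets, and a mixing coefficient `lam`) are assumptions made only for illustration, since the summary does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def localization_map(audio_emb, visual_feats):
    """Similarity of one audio embedding to every spatial visual feature.

    visual_feats is an H x W grid (list of rows) of D-dimensional vectors.
    """
    return [[cosine(audio_emb, f) for f in row] for row in visual_feats]


def correspondence_score(audio_emb, visual_feats):
    """Assumption: max-pool the map into one audio-visual matching score."""
    return max(max(row) for row in localization_map(audio_emb, visual_feats))


def contrastive_loss(audio_embs, visual_batch, temperature=0.1):
    """Objective 1 (sketch): InfoNCE-style loss over a batch.

    Each audio clip should score higher with its own frame than with others.
    """
    n = len(audio_embs)
    total = 0.0
    for i in range(n):
        logits = [
            correspondence_score(audio_embs[i], visual_batch[j]) / temperature
            for j in range(n)
        ]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n


def ratio_masks(spec_a, spec_b, eps=1e-8):
    """Objective 2 (sketch): build a mix-and-separate training example.

    Assumption: magnitudes are summed and ratio masks are the targets.
    """
    mixture = [a + b for a, b in zip(spec_a, spec_b)]
    mask_a = [a / (m + eps) for a, m in zip(spec_a, mixture)]
    mask_b = [b / (m + eps) for b, m in zip(spec_b, mixture)]
    return mixture, mask_a, mask_b


def mix_images(img_a, img_b, lam=0.5):
    """Objective 3 (sketch): blend two images in pixel space."""
    return [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


def mixed_image_score(audio_a, audio_b, mixed_feats):
    """Objective 3 (sketch): the mixed image should match both sound sources."""
    return 0.5 * (
        correspondence_score(audio_a, mixed_feats)
        + correspondence_score(audio_b, mixed_feats)
    )
```

In this sketch the same `localization_map` serves both training (through `correspondence_score`) and inference, where its peak would indicate the sounding region. That reuse is one simple way to picture how a shared representation could support several tasks at once.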