A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition

Shentong Mo, Pedro Morgado

2023-05-31

Abstract: The ability to accurately recognize, localize and separate sound sources is fundamental to any audio-visual perception task. Historically, these abilities were tackled separately, with several methods developed independently for each task. However, given the interconnected nature of source localization, separation, and recognition, independent models are likely to yield suboptimal performance as they fail to capture the interdependence between these tasks. To address this problem, we propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition. OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives. The first objective aligns audio and visual representations through a localized audio-visual correspondence loss. The second tackles visual source separation using a traditional mix-and-separate framework. Finally, the third objective reinforces visual feature separation and localization by mixing images in pixel space and aligning their representations with those of all corresponding sound sources. Extensive experiments on the MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM for all three tasks (audio-visual source localization, separation, and nearest neighbor recognition) and empirically demonstrate a strong positive transfer between them.

Sound, Computer Vision and Pattern Recognition, Machine Learning, Multimedia, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses how to integrate audio and visual cues in audio-visual perception so that sound sources can be localized, separated, and recognized by a single model. Historically, each of these capabilities has been handled by its own independently developed method. Because localization, separation, and recognition are closely related, independent models are likely to perform suboptimally: they cannot capture the interdependencies among the tasks. To solve this problem, the authors propose a unified audio-visual learning framework (called OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition. Specifically, OneAVM contains a shared audio-visual encoder and task-specific decoders, and is trained with three objectives:

1. **Localized audio-visual correspondence loss**: Align audio and visual representations.
2. **Mix-and-separate framework**: Handle visual source separation, that is, separating sound sources with the help of visual cues, using the traditional mix-and-separate setup.
3. **Mixed image alignment**: Mix images in pixel space and align their representations with those of all corresponding sound sources, which reinforces visual feature separation and localization.

Trained with these objectives, OneAVM achieves positive transfer across tasks and makes a single audio-visual model applicable to all three. Experiments on the MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets show that OneAVM outperforms existing baselines in sound source localization, separation, and nearest-neighbor recognition.
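The three objectives can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the max-pooled cosine score, the InfoNCE-style contrastive form, the ratio-mask targets, the additive mixing of magnitude spectrograms, and the 0.5 blending weight are all assumptions chosen for illustration, and real systems would operate on learned encoder features rather than the toy vectors used here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def localized_score(audio_emb, visual_map):
    """Assumed localized score: the best cosine match between one audio
    embedding and any spatial location of a visual feature map, where the
    map is given as a list of per-location feature vectors."""
    return max(cosine(audio_emb, loc) for loc in visual_map)

def correspondence_loss(audio_embs, visual_maps, temperature=0.1):
    """Assumed InfoNCE-style loss: audio clip i should score higher with
    its own image (index i) than with the other images in the batch."""
    total = 0.0
    for i, audio in enumerate(audio_embs):
        logits = [localized_score(audio, vm) / temperature for vm in visual_maps]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(audio_embs)

def mix_and_separate_targets(spec_a, spec_b, eps=1e-8):
    """Mix two (flattened) magnitude spectrograms and return the mixture
    plus one ratio mask per source. Treating magnitudes as additive is a
    simplification that ignores phase."""
    mixture = [a + b for a, b in zip(spec_a, spec_b)]
    mask_a = [a / (m + eps) for a, m in zip(spec_a, mixture)]
    mask_b = [b / (m + eps) for b, m in zip(spec_b, mixture)]
    return mixture, mask_a, mask_b

def mix_images(img_a, img_b, alpha=0.5):
    """Blend two (flattened) images in pixel space with weight alpha."""
    return [alpha * p + (1 - alpha) * q for p, q in zip(img_a, img_b)]

# Toy usage: two audio clips, two images with two spatial locations each.
audio_embs = [[1.0, 0.0], [0.0, 1.0]]
visual_maps = [
    [[1.0, 0.0], [0.5, 0.5]],  # image 0 contains the source of clip 0
    [[0.0, 1.0], [0.5, 0.5]],  # image 1 contains the source of clip 1
]
matched = correspondence_loss(audio_embs, visual_maps)
swapped = correspondence_loss(audio_embs, visual_maps[::-1])
mixture, mask_a, mask_b = mix_and_separate_targets([1.0, 0.0], [1.0, 2.0])
blended = mix_images([0.0, 1.0], [1.0, 1.0])
```

In this toy setup the loss is lower when each clip is paired with its own image than when the pairing is swapped, and multiplying a ratio mask by the mixture recovers the corresponding source. Under the third objective, the features of a blended image would be aligned with the audio of both original sources rather than with only one of them.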

Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization

UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization

Audio-Visual Event Localization in Unconstrained Videos

Audio-Visual Grouping Network for Sound Localization from Mixtures

Multimodal Variational Auto-encoder based Audio-Visual Segmentation

Multiple Sound Sources Localization from Coarse to Fine

From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation

Class-aware Sounding Objects Localization via Audiovisual Correspondence

Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding

BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge

UAVM: Towards Unifying Audio and Visual Models

Versatile audio-visual learning for emotion recognition

Curriculum Audiovisual Learning

Unsupervised Audio-Visual Segmentation with Modality Alignment

Multi-scale Multi-instance Visual Sound Localization and Segmentation

Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering

Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition

Localizing Visual Sounds the Easy Way