Abstract: Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. Because the two tasks have different characteristics, each is usually handled by an independently designed architecture. As a result, the learned representations tend to be task-specific, which limits the generalization ability of features built on multi-modal modeling. Recent studies have shown that establishing cross-modal relationships between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Motivated by the goal of bridging multi-modal associations across audio-visual tasks, this study proposes a unified framework that performs target speaker detection and speech enhancement through joint learning of audio-visual modeling. Using the audio and visual channels of videos from challenging real-world scenarios, the proposed method exploits the inherent correlations in both audio and visual signals, and uses them to anticipate and model temporal audio-visual relationships across spatial-temporal space via a cross-modal conformer. In addition, a plug-and-play multi-modal layer normalization is introduced to alleviate the distribution misalignment of multi-modal features. Based on cross-modal circulant fusion, the proposed model is capable of learning all audio-visual representations in a holistic process. Extensive experiments demonstrate that the optimized model learns the correlations between different modalities and the associations among diverse tasks more effectively. Compared with other state-of-the-art works, the proposed method shows superior performance for active speaker detection and audio-visual speech enhancement on three benchmark datasets, along with favorable generalization across diverse challenges.
Code is available at: https://github.com/Overcautious/ADENet.

Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer

A robust super-resolution approach with sparsity constraint for near-field wideband acoustic imaging

Self-supervised Audio Spatialization with Correspondence Classifier

Exploring Self-Supervised Contrastive Learning of Spatial Sound Event Representation

Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment

Telling Left from Right: Learning Spatial Correspondence of Sight and Sound

Sound Localization by Self-Supervised Time Delay Estimation

Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Self-Supervised Learning for Multi-Channel Neural Transducer

SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation

Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

Learning Representations from Audio-Visual Spatial Alignment

Blind Estimation of Sub-band Acoustic Parameters from Ambisonics Recordings using Spectro-Spatial Covariance Features

SALADnet: Self-Attentive multisource Localization in the Ambisonics Domain

Spatial-then-Temporal Self-Supervised Learning for Video Correspondence

Self-supervised learning-based underwater acoustical signal classification via mask modeling

Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization

Unsupervised Acoustic Scene Mapping Based on Acoustic Features and Dimensionality Reduction

Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation

Improving Multimodal Speech Enhancement by Incorporating Self-Supervised and Curriculum Learning