Abstract: Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. Because the two tasks have different characteristics, independently designed architectures have been widely used for each one. This can make the learned representations task-specific and inevitably limits the generalization ability of features based on multi-modal modeling. More recent studies have shown that establishing cross-modal relationships between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Therefore, to bridge the multi-modal associations in audio-visual tasks, this study proposes a unified framework that achieves target speaker detection and speech enhancement through joint learning of audio-visual modeling. Using the audio and visual channels of videos in challenging real-world scenarios, the proposed method exploits inherent correlations in both audio and visual signals, which are used to further anticipate and model the temporal audio-visual relationships across spatial-temporal space via a cross-modal conformer. In addition, a plug-and-play multi-modal layer normalization is introduced to alleviate the distribution misalignment of multi-modal features. Based on cross-modal circulant fusion, the proposed model is capable of learning all audio-visual representations in a holistic process. Extensive experiments demonstrate that the optimized model learns the correlations between different modalities and the associations among diverse tasks more effectively. Compared with other state-of-the-art works, the proposed method shows superior performance for active speaker detection and audio-visual speech enhancement on three benchmark datasets, and it generalizes favorably across diverse challenges.
Code is available at: https://github.com/Overcautious/ADENet.

Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement
