Abstract: Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. Because the two tasks have different characteristics, each is usually handled by an independently designed architecture. As a result, the learned representations tend to be task-specific, which limits how well features built on multi-modal modeling generalize. Recent studies have shown that establishing cross-modal relationships between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Motivated by the goal of bridging the multi-modal associations in audio-visual tasks, this study proposes a unified framework that achieves target speaker detection and speech enhancement through joint learning of audio-visual modeling. Using the audio and visual channels of videos from challenging real-world scenarios, the proposed method exploits the inherent correlations in both audio and visual signals, and uses them to anticipate and model temporal audio-visual relationships across spatial-temporal space via a cross-modal conformer. In addition, a plug-and-play multi-modal layer normalization is introduced to alleviate the distribution misalignment of multi-modal features. Based on cross-modal circulant fusion, the proposed model is capable of learning all audio-visual representations in a holistic process. Extensive experiments demonstrate that the optimized model learns the correlations between different modalities and the associations among diverse tasks more effectively. Compared with other state-of-the-art works, the proposed method shows superior performance for active speaker detection and audio-visual speech enhancement on three benchmark datasets, and generalizes favorably across diverse challenges.
Code is available at: https://github.com/Overcautious/ADENet.

Hearing like Seeing: Improving Voice-Face Interactions and Associations via Adversarial Deep Semantic Matching Network

Learning Discriminative Joint Embeddings for Efficient Face and Voice Association

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

APB2FaceV2: Real-Time Audio-Guided Multi-Face Reenactment

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Looking and Hearing into Details: Dual-Enhanced Siamese Adversarial Network for Audio-Visual Matching

Taking a Part for the Whole: An Archetype-agnostic Framework for Voice-Face Association

Detach and Enhance: Learning Disentangled Cross-modal Latent Representation for Efficient Face-Voice Association and Matching

Voice-Face Cross-modal Matching and Retrieval: A Benchmark

An Efficient Momentum Framework for Face-Voice Association Learning

DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation

Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder

Exploring Robust Face-Voice Matching in Multilingual Environments

Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

EFT: Expert Fusion Transformer for Voice-Face Association Learning

Contrastive Learning-based Chaining-Cluster for Multilingual Voice-Face Association

Advanced Multimodal Deep Learning Architecture for Image-Text Matching

Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast

A Novel Distance Learning For Elastic Cross-Modal Audio-Visual Matching

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention