Abstract: Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. Because the two tasks have different characteristics, each is typically handled by an independently designed architecture. As a result, the learned representations tend to be task-specific, and features built on multi-modal modeling generalize poorly. Recent studies have shown that establishing cross-modal relationships between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Motivated by the goal of bridging multi-modal associations across audio-visual tasks, this study proposes a unified framework that performs target speaker detection and speech enhancement through joint learning of audio-visual modeling. Using the audio and visual channels of videos from challenging real-world scenarios, the proposed method exploits the inherent correlations in both signals and uses them to anticipate and model temporal audio-visual relationships across spatial-temporal space via a cross-modal conformer. In addition, a plug-and-play multi-modal layer normalization is introduced to alleviate the distribution misalignment of multi-modal features. Based on cross-modal circulant fusion, the proposed model is capable of learning all audio-visual representations in a holistic process. Extensive experiments demonstrate that the optimized model learns the correlations between modalities and the associations among tasks more effectively. Compared with other state-of-the-art methods, the proposed approach achieves superior performance for active speaker detection and audio-visual speech enhancement on three benchmark datasets, and generalizes favorably across diverse challenges.
Code is available at: https://github.com/Overcautious/ADENet.

Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition

Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization

Visually Supervised Speaker Detection and Localization via Microphone Array

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

Real-time Architecture for Audio-Visual Active Speaker Detection

Cross-modal Supervision for Learning Active Speaker Detection in Video

Best of Both Worlds: Multi-Task Audio-Visual Automatic Speech Recognition and Active Speaker Detection

Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

Target Active Speaker Detection with Audio-visual Cues

Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training

Audio-visual child-adult speaker classification in dyadic interactions

Audio-Visual Talker Localization in Video for Spatial Sound Reproduction

A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism

Active Speakers in Context

Automated detection of foreground speech with wearable sensing in everyday home environments: A transfer learning approach

Bio-Inspired Modality Fusion for Active Speaker Detection

Speaker Detection by the Individual Listener and the Crowd: Parametric Models Applicable to Bonafide and Deepfake Speech

How to Design a Three-Stage Architecture for Audio-Visual Active Speaker Detection in the Wild

Vision-Guided Robot Hearing

Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization

Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement