Abstract: Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. Because the two tasks have different characteristics, each is usually handled by an independently designed architecture. As a result, the learned representations tend to be task-specific, which limits how well features based on multi-modal modeling generalize. Recent studies have shown that establishing cross-modal relationships between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Motivated by the goal of bridging the multi-modal associations in audio-visual tasks, this study proposes a unified framework that achieves target speaker detection and speech enhancement through joint learning of audio-visual modeling. Using the audio and visual channels of videos from challenging real-world scenarios, the proposed method exploits the inherent correlations in both audio and visual signals, which are then used to anticipate and model the temporal audio-visual relationships across spatial-temporal space via a cross-modal conformer. In addition, a plug-and-play multi-modal layer normalization is introduced to alleviate the distribution misalignment of multi-modal features. Based on cross-modal circulant fusion, the proposed model is capable of learning all audio-visual representations in a holistic process. Extensive experiments demonstrate that the optimized model learns the correlations between different modalities and the associations among diverse tasks more effectively. Compared with other state-of-the-art methods, the proposed method achieves superior performance for active speaker detection and audio-visual speech enhancement on three benchmark datasets, and it generalizes favorably across diverse challenges.
Code is available at: https://github.com/Overcautious/ADENet.

Concurrent Speaker Detection: A multi-microphone Transformer-Based Approach

Audio-Visual Approach For Multimodal Concurrent Speaker Detection

Consformer: Consciousness Detection Using Transformer Networks With Correntropy-Based Measures

Monaural Multi-Speaker Speech Separation Using Efficient Transformer Model

MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder

Multi-microphone fusion for detection of speech and acoustic events in smart spaces

A Transformer Model for Segmentation, Classification, and Caller Identification of Marmoset Vocalization

Backchannel Detection and Agreement Estimation from Video with Transformer Networks

Improving Transformer-based Networks With Locality For Automatic Speaker Verification

PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement

Speaker Change Detection for Transformer Transducer ASR

Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition

Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

Binaural Speech Enhancement Using Deep Complex Convolutional Transformer Networks

Explore Relative and Context Information with Transformer for Joint Acoustic Echo Cancellation and Speech Enhancement

Speaker-conditioning Single-channel Target Speaker Extraction using Conformer-based Architectures

Multi-Modal Transformers Utterance-Level Code-Switching Detection

Continual Learning for On-Device Speech Recognition using Disentangled Conformers

EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers