Abstract: Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. Because the two tasks have different characteristics, each is usually handled by an independently designed architecture. The representations learned this way tend to be task-specific, which limits how well features built on multi-modal modeling generalize. More recent studies have shown that establishing cross-modal relationships between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Motivated by the goal of bridging the multi-modal associations in audio-visual tasks, this study proposes a unified framework that performs target speaker detection and speech enhancement through joint learning of audio-visual modeling. Using both the audio and visual channels of videos in challenging real-world scenarios, the proposed method exploits inherent correlations in the audio and visual signals, and uses them to further anticipate and model the temporal audio-visual relationships across spatial-temporal space via a cross-modal conformer. In addition, a plug-and-play multi-modal layer normalization is introduced to alleviate the distribution misalignment of multi-modal features. Based on cross-modal circulant fusion, the proposed model is capable of learning all audio-visual representations in a holistic process. Extensive experiments demonstrate that the optimized model learns the correlations between different modalities and the associations among diverse tasks more effectively. Compared with other state-of-the-art works, the proposed method achieves superior performance for active speaker detection and audio-visual speech enhancement on three benchmark datasets, and generalizes favorably across diverse challenges.
Code is available at: https://github.com/Overcautious/ADENet.
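
The abstract names two ideas that a small example can make concrete: normalizing features from each modality so their distributions are comparable, and letting one modality attend to the other. The sketch below is a generic NumPy illustration of those two ideas only. It is not the multi-modal layer normalization, cross-modal conformer, or circulant fusion of the proposed model, whose definitions are not given here. The function names `layer_norm` and `cross_modal_attention`, the feature sizes, and the random inputs are all assumptions made for illustration.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Standard layer normalization over the feature axis (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def cross_modal_attention(query_feats, context_feats):
    """Scaled dot-product attention: each query frame attends to all context frames.

    Hypothetical helper: projections are omitted, so queries, keys and values
    are the raw (normalized) features.
    """
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context_feats, weights


# Assumed toy inputs: 100 audio frames and 25 video frames, 64 features each,
# drawn from deliberately mismatched distributions.
rng = np.random.default_rng(0)
audio = rng.normal(loc=3.0, scale=5.0, size=(100, 64))
visual = rng.normal(loc=-1.0, scale=0.5, size=(25, 64))

# Normalize each modality so both have zero mean and unit variance per frame.
audio_n = layer_norm(audio)
visual_n = layer_norm(visual)

# Audio frames query the visual stream; the result has one fused vector per audio frame.
fused, weights = cross_modal_attention(audio_n, visual_n)
```

After normalization, both streams have matching per-frame statistics despite the mismatched inputs, which is the kind of distribution misalignment the abstract refers to. The released code at the repository above remains the reference for the actual method.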

Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement
