Abstract: The human auditory cortex contextually integrates audio-visual (AV) cues to better understand speech in a cocktail party situation. Recent studies have shown that AV speech enhancement (SE) models can significantly improve speech quality and intelligibility in low signal-to-noise ratio (SNR < −5 dB) environments compared with audio-only (A-only) SE models. However, despite substantial research on AV SE, developing real-time processing models that generalise across various types of visual and acoustic noise remains a formidable technical challenge. This paper introduces a novel framework for low-latency, speaker-independent AV SE. The proposed framework is designed to generalise to the visual and acoustic noise encountered in real-world settings. In particular, a generative adversarial network (GAN) is proposed to address visual speech noise, including poor lighting, in real noisy environments. In addition, a novel real-time AV SE model based on a deep neural network (DNN) is proposed. The model leverages the enhanced visual speech from the GAN to deliver robust SE. The effectiveness of the proposed framework is evaluated on synthetic AV datasets using objective speech quality and intelligibility metrics. Furthermore, subjective listening tests are conducted using real noisy AV corpora. The results demonstrate that the proposed real-time AV SE framework improves the mean opinion score by 20% compared with state-of-the-art SE approaches, including recent DNN-based AV SE models.

AeGAN: Time-Frequency Speech Denoising via Generative Adversarial Networks

DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling

SEGAN: Speech Enhancement Generative Adversarial Network

A Multi-Scale Generative Adversarial Network for Real-World Image Denoising

Robust Real-time Audio-Visual Speech Enhancement based on DNN and GAN

Environmental Noise Reduction based on Deep Denoising Autoencoder

Double Adversarial Network Based Monaural Speech Enhancement for Robust Speech Recognition

Denoising Speech Signals with Hifi-Coulomb-GANs

Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition

On the Use of Audio Fingerprinting Features for Speech Enhancement with Generative Adversarial Network

CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement

Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition

A Joint Framework of Denoising Autoencoder and Generative Vocoder for Monaural Speech Enhancement

Generative Adversarial Networks Based Data Augmentation for Noise Robust Speech Recognition

Multi-Metric Optimization using Generative Adversarial Networks for Near-End Speech Intelligibility Enhancement

Deep Generative Adversarial Networks for the Sparse Signal Denoising

Data Augmentation using Conditional Generative Adversarial Networks for Robust Speech Recognition

Boosting Noise Robustness of Acoustic Model via Deep Adversarial Training

Conditional Generative Adversarial Networks for Speech Enhancement and Noise-Robust Speaker Verification

Towards Generalized Speech Enhancement with Generative Adversarial Networks

SEFGAN: Harvesting the Power of Normalizing Flows and GANs for Efficient High-Quality Speech Enhancement