Abstract: Reverberation and background noise can degrade speech quality and intelligibility when speech is captured by a distant microphone. In recent years, researchers have developed several deep learning (DL)-based single-channel speech dereverberation systems that aim to minimize the distortions introduced into speech captured in naturalistic environments. Most of these DL-based systems enhance an unseen distorted speech signal by applying a predetermined set of weights to regions of the speech spectrogram, regardless of the degree of distortion within those regions. Such a system might not be an ideal solution for the dereverberation task. To address this, we present a DL-based end-to-end single-channel speech dereverberation system built on deformable convolutional networks (DCNs), which dynamically adjust their receptive fields based on the degree of distortion within an unseen speech signal. The proposed system includes the following components to simultaneously enhance the magnitude and phase responses of speech, which leads to improved perceptual quality: (i) a complex spectrum enhancement module that uses a multi-frame filtering technique to implicitly correct the phase response, (ii) a magnitude enhancement module that suppresses dominant reflections and recovers the formant structure using a deep filtering (DF) technique, and (iii) a speech activity detection (SAD) estimation module that predicts frame-wise speech activity to suppress residuals in non-speech regions. We assess the performance of the proposed system using objective speech quality metrics on both simulated and real speech recordings from the REVERB challenge corpus. The experimental results demonstrate the benefits of using DCNs and multi-frame filtering for the speech dereverberation task. We compare the performance of our proposed system against other signal processing (SP) and DL-based systems and observe that it consistently outperforms the other approaches across all speech quality metrics.
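To make the multi-frame filtering and SAD steps named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a causal filter in which each enhanced time-frequency bin is a complex-weighted sum of the current and previous frames at the same frequency, and it assumes the frame-wise SAD output is applied as a multiplicative gate; the abstract specifies neither detail. The function names `multi_frame_filter` and `apply_sad_gate` are hypothetical, and the filter taps and speech probabilities, which a network would predict, are hand-set here. Plain Python lists keep the sketch dependency-free; a real system would use an array library.

```python
def multi_frame_filter(spec, taps):
    """Apply causal multi-frame complex filtering to a spectrogram.

    spec: list of T frames, each a list of F complex STFT bins.
    taps: taps[t][f] is a list of N complex coefficients; coefficient i
          weights the bin observed i frames in the past (assumed layout).
    Returns a spectrogram with the same shape as spec.
    """
    n_frames = len(spec)
    n_bins = len(spec[0]) if spec else 0
    out = []
    for t in range(n_frames):
        frame = []
        for f in range(n_bins):
            acc = 0j
            for i, h in enumerate(taps[t][f]):
                if t - i >= 0:  # frames before the start count as zero
                    acc += h * spec[t - i][f]
            frame.append(acc)
        out.append(frame)
    return out


def apply_sad_gate(spec, speech_prob):
    """Scale every bin of frame t by its speech probability in [0, 1].

    Multiplicative gating is an assumption made for this sketch.
    """
    return [[p * x for x in frame] for frame, p in zip(spec, speech_prob)]


# Toy input: 3 frames, 1 frequency bin.
spec = [[1 + 1j], [2 + 0j], [0 + 1j]]
# Two-tap averaging filter for every bin (hand-set, not learned).
taps = [[[0.5, 0.5]] for _ in spec]
filtered = multi_frame_filter(spec, taps)
# Treat the last frame as non-speech so that its residual is suppressed.
gated = apply_sad_gate(filtered, [1.0, 1.0, 0.0])
```

Because the coefficients are complex, the weighted sum changes both the magnitude and the phase of each bin, which is one way to read the abstract's statement that phase is corrected implicitly rather than estimated directly.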

Monaural Speech Dereverberation using Deformable Convolutional Networks