Abstract: In this paper, we propose an environment-dependent denoising autoencoder (DAE) and automatic environment identification based on a deep neural network (DNN) with blind reverberation estimation for robust distant-talking speech recognition. Recently, DAEs have been shown to be effective in many noise reduction and reverberation suppression applications because higher-level representations and increased flexibility of the feature mapping function can be learned. However, a DAE is not adequate when the training and test environments are mismatched. In a conventional DAE, parameters are trained using pairs of reverberant speech and clean speech under various acoustic conditions (that is, an environment-independent DAE). To address this mismatch problem, we propose two environment-dependent DAEs that reduce the influence of mismatches between training and test environments. In the first approach, we train a separate DAE for each of several acoustic environments, and the DAE for the condition that best matches the test condition is automatically selected (that is, a two-step environment-dependent DAE). To improve environment identification performance, we propose a DNN that uses both reverberant speech and estimated reverberation. In the second approach, we add estimated reverberation features to the input of the DAE (that is, a one-step environment-dependent DAE or a reverberation-aware DAE). The proposed method is evaluated using speech in simulated and real reverberant environments. Experimental results show that the environment-dependent DAE outperforms the environment-independent one in both simulated and real reverberant environments. For the two-step environment-dependent DAE, environment identification based on the proposed DNN approach also performs better than the conventional DNN approach, in which only reverberant speech is used and reverberation is not blindly estimated. Moreover, the one-step environment-dependent DAE significantly outperforms the two-step environment-dependent DAE.
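
The difference between the two approaches can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the frame-wise concatenation of features, and the averaging of frame posteriors to reach an utterance-level decision are all assumptions made for illustration, and the environment identifier and the DAEs are passed in as plain callables rather than trained networks.

```python
import numpy as np

def one_step_input(reverberant, reverb_estimate):
    """Build the input of a reverberation-aware (one-step) DAE.

    Both arguments are (frames, dims) feature matrices. Frame-wise
    concatenation is an assumption; the abstract only says that estimated
    reverberation features are added to the DAE input.
    """
    if reverberant.shape[0] != reverb_estimate.shape[0]:
        raise ValueError("feature matrices must have the same number of frames")
    return np.concatenate([reverberant, reverb_estimate], axis=1)

def two_step_enhance(reverberant, reverb_estimate, identify_env, daes):
    """Select an environment-specific DAE, then apply it (two-step scheme).

    identify_env: hypothetical callable mapping (frames, dims) features to
                  (frames, n_env) environment posteriors.
    daes:         hypothetical list of callables, one DAE per environment.
    """
    # The identifier sees reverberant speech together with the estimated
    # reverberation, as the abstract describes for the proposed DNN.
    posteriors = identify_env(one_step_input(reverberant, reverb_estimate))
    # Assumption: average the frame posteriors and pick the best environment.
    env = int(np.argmax(posteriors.mean(axis=0)))
    return env, daes[env](reverberant)
```

In the two-step scheme an identification error sends the utterance to the wrong DAE, whereas the one-step scheme feeds the reverberation estimate to a single network and needs no hard selection. Whether this accounts for the reported gap between the two is not stated in the abstract.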

Single-channel Dereverberation for Distant-Talking Speech Recognition by Combining Denoising Autoencoder and Temporal Structure Normalization

Deep Neural Network-Based Bottleneck Feature and Denoising Autoencoder-Based Dereverberation for Distant-Talking Speaker Identification

Environment-dependent Denoising Autoencoder for Distant-Talking Speech Recognition

Combination of Bottleneck Feature Extraction and Dereverberation for Distant-Talking Speech Recognition

Joint sparse representation based cepstral-domain dereverberation for distant-talking speech recognition

Speech Recognition by Denoising and Dereverberation Based on Spectral Subtraction in a Real Noisy Reverberant Environment

Supervised Single-Channel Speech Dereverberation And Denoising Using A Two-Stage Processing

Joint Training of DNNs by Incorporating an Explicit Dereverberation Structure for Distant Speech Recognition

Distant-Talking Speech Recognition Based On Spectral Subtraction By Multi-Channel LMS Algorithm

Dereverberation Based on Generalized Spectral Subtraction for Distant-Talking Speaker Recognition

Supervised Single-Channel Speech Dereverberation and Denoising Using a Two-Stage Model Based Sparse Representation

Dereverberation Based on Spectral Subtraction by Multi-channel LMS Algorithm for Hands-free Speech Recognition

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

A Fast Convolutional Self-Attention Based Speech Dereverberation Method For Robust Speech Recognition

Monaural Speech Dereverberation using Deformable Convolutional Networks

Multi-Channel Speech Denoising for Machine Ears

Speech Selection and Environmental Adaptation for Asynchronous Speech Recognition

Simultaneous Denoising and Dereverberation Using Deep Embedding Features

Joint Training for Simultaneous Speech Denoising and Dereverberation with Deep Embedding Representations

Speech-enhanced and Noise-aware Networks for Robust Speech Recognition

End-to-End Dereverberation, Beamforming, and Speech Recognition in a Cocktail Party