Abstract: We propose an integrated end-to-end automatic speech recognition (ASR) paradigm based on joint learning of front-end speech signal processing and back-end acoustic modeling. We believe that only good signal processing can lead to top ASR performance in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished with two techniques: (i) a reverberation-time-aware DNN-based speech dereverberation architecture that handles a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration and exploits data acquired and processed with multichannel microphone arrays to improve ASR performance. The final end-to-end system is established by joint optimization of the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating the proposed framework. We first report objective measures of enhanced speech on the simulated data test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. Combining joint training with more discriminative ASR features and improved neural-network-based language models yields a low single-system WER of 4.46%. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we report preliminary yet promising experiments on the REVERB real test data.
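
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words and expressed in percent. The function name `word_error_rate` and the example sentences are illustrative assumptions; this is not the scoring tool used in the REVERB Challenge.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: 100 * (S + D + I) / N over whitespace-split words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one reference word is missing in the hypothesis.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, a WER computed this way can exceed 100% when the hypothesis is much longer than the reference.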

CAT-DUnet: Enhancing Speech Dereverberation via Feature Fusion and Structural Similarity Loss

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Uformer: A Unet Based Dilated Complex & Real Dual-Path Conformer Network for Simultaneous Speech Enhancement and Dereverberation

Speech Enhancement Using U-Net with Compressed Sensing

A Nested U-Net with Efficient Channel Attention and D3Net for Speech Enhancement

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

Speech enhancement from fused features based on deep neural network and gated recurrent unit network

CATNet: Cross-modal fusion for audio-visual speech recognition

A Feature Integration Network for Multi-Channel Speech Enhancement

Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses

Efficient Monaural Speech Enhancement using Spectrum Attention Fusion

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement

Utterance Weighted Multi-Dilation Temporal Convolutional Networks for Monaural Speech Dereverberation

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

Multi-Channel Automatic Speech Recognition Using Deep Complex Unet

A Multi-scale Subconvolutional U-Net with Time-Frequency Attention Mechanism for Single Channel Speech Enhancement

FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks

A Unified DNN Approach to Speaker-Dependent Simultaneous Speech Enhancement and Speech Separation in Low SNR Environments