Abstract: We propose an integrated end-to-end automatic speech recognition (ASR) paradigm that jointly learns front-end speech signal processing and back-end acoustic modeling. We believe that only good signal processing can lead to top ASR performance in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that achieves both high-quality enhanced speech and high-accuracy ASR simultaneously. We accomplish this with two techniques: (i) a reverberation-time-aware, DNN-based speech dereverberation architecture that handles a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration and exploits data acquired and processed with multichannel microphone arrays to improve ASR performance. The final end-to-end system is established by jointly optimizing the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task serves as the test bed for evaluating the proposed framework. We first report objective measures of enhanced speech on the simulated data test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. By leveraging joint training with more discriminative ASR features and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we report preliminary yet promising experiments with the REVERB real test data.
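
The WER figures quoted in the abstract follow the usual definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The snippet below is a minimal illustrative sketch of that definition, not the scoring tool used in the REVERB Challenge; the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for one utterance (illustrative helper)."""
    # Assumption: words are separated by whitespace and compared exactly.
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

Calling `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words. Corpus-level WER is normally obtained by summing errors and reference words over all utterances before dividing, rather than averaging per-utterance values.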


An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition
