Abstract: Although great progress has been made in automatic speech recognition, performance still degrades significantly in noisy environments. Our previous work demonstrated the superior noise robustness of very deep convolutional neural networks (VDCNNs). Building on that work, this paper proposes a more advanced model, the very deep convolutional residual network (VDCRN). The new model incorporates batch normalization and residual learning and is more robust than the previous VDCNNs. Then, to alleviate the mismatch between training and testing conditions, model adaptation and adaptive training are developed and compared for the new VDCRN. This paper focuses on factor aware training (FAT) and cluster adaptive training (CAT). For FAT, a unified framework is explored. For CAT, two schemes are first explored to construct the bases in the canonical model; furthermore, a factorized version of CAT is designed to address multiple nonspeech variabilities in one model. Finally, a complete multipass system is proposed to achieve the best system performance in noisy scenarios. The proposed approaches are evaluated on three different tasks: Aurora4 (simulated data with additive noise and channel distortion), CHiME4 (both simulated and real data with additive noise and reverberation), and the AMI meeting transcription task (real data with significant reverberation). The evaluation thus covers different noisy conditions as well as both simulated and real noisy data. The experiments show that the new VDCRN is more robust, and that adaptation on this model further reduces the word error rate (WER) significantly. The best proposed architecture obtains consistent and very large improvements on all tasks compared to the VDCNN or long short-term memory (LSTM) baselines. In particular, on Aurora4 a new milestone of 5.67% WER is achieved by improving acoustic modeling alone.
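
The results above are reported as word error rate (WER). As a reminder of how this metric is conventionally defined, WER is the word-level edit distance between the reference and the recognized transcript (substitutions, deletions, and insertions), divided by the number of reference words. The following is only a minimal sketch under stated assumptions: whitespace tokenization, no text normalization, and a hypothetical function name `word_error_rate` that does not come from the paper. It is not the scoring tool used in the reported evaluations.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: word-level Levenshtein distance / reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation handling).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions count as errors, a value computed this way can exceed 1.0 when the hypothesis is much longer than the reference.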

Multi-Frame Cross-Entropy Training for Convolutional Neural Networks in Speech Recognition

Self-attention Based Speaker Recognition Using Cluster-Range Loss

Multi-scale Feature Based Convolutional Neural Networks for Large Vocabulary Speech Recognition

CNN-based MultiChannel End-to-End Speech Recognition for everyday home environments

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

Residual Convolutional CTC Networks for Automatic Speech Recognition

On Modular Training of Neural Acoustics-to-Word Model for LVCSR

Audio-Visual Efficient Conformer for Robust Speech Recognition

An efficient joint training model for monaural noisy-reverberant speech recognition

Monaural Speech Enhancement Using a Multi-Branch Temporal Convolutional Network

Neural Network Based Multi-Factor Aware Joint Training for Robust Speech Recognition

Cascaded CNN-resBiLSTM-CTC: an End-to-End Acoustic Model for Speech Recognition

Multi-Loss Convolutional Network with Time-Frequency Attention for Speech Enhancement

Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training

Compact Feedforward Sequential Memory Networks for Large Vocabulary Continuous Speech Recognition

Multi-task Joint-Learning of Deep Neural Networks for Robust Speech Recognition

Adaptive Very Deep Convolutional Residual Network for Noise Robust Speech Recognition

Overlapped speech recognition from a jointly learned multi-channel neural speech extraction and representation

Cross-Speaker Encoding Network for Multi-Talker Speech Recognition

3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

Long Short-Term Memory based Convolutional Recurrent Neural Networks for Large Vocabulary Speech Recognition