Abstract: Deep learning has been widely applied to sequence modeling problems. In automatic speech recognition (ASR), performance has improved significantly with larger speech corpora and deeper neural networks. In particular, recurrent neural networks and deep convolutional neural networks have been applied to ASR successfully. To address the resulting problem of training speed, we build a novel deep recurrent convolutional network for acoustic modeling and then apply deep residual learning to it. Our experiments show that it achieves not only faster convergence but also better recognition accuracy than a traditional deep convolutional recurrent network. In the experiments, we compare the convergence speed of our novel deep recurrent convolutional networks with that of traditional deep convolutional recurrent networks. While converging faster, our novel deep recurrent convolutional networks reach comparable performance. We further show that applying deep residual learning boosts the convergence speed of our novel deep recurrent convolutional networks. Finally, we evaluate all our experimental networks by phoneme error rate (PER) with our proposed bidirectional statistical n-gram language model. Our evaluation results show that our newly proposed deep recurrent convolutional network with deep residual learning reaches the best PER of 17.33% with the fastest convergence speed on the TIMIT database. The outstanding performance of our novel deep recurrent convolutional neural network with deep residual learning indicates that it could potentially be adopted for other sequential problems.
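The abstract reports results as phoneme error rate (PER) but does not spell out the scoring procedure. As a minimal sketch, assuming the common definition of PER as the Levenshtein (edit) distance between reference and hypothesis phoneme sequences divided by the total number of reference phonemes, the metric could be computed as follows. The function names and the toy phoneme sequences are illustrative assumptions, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match / substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """PER over a corpus: total edit errors / total reference phonemes.
    Assumes refs and hyps are equal-length lists of phoneme lists."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Hypothetical example: one substitution (ae -> ah) and one deletion (sil)
refs = [["sil", "k", "ae", "t", "sil"]]
hyps = [["sil", "k", "ah", "t"]]
per = phoneme_error_rate(refs, hyps)
```

In this toy example the hypothesis differs from the five-phoneme reference by two edits, so `per` is 0.4. Scoring details such as phoneme-set folding are not described in the abstract and are left out of this sketch.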

Self-Convolution for Automatic Speech Recognition.

Self-Attention Networks for Text-Independent Speaker Verification

Residual Convolutional CTC Networks for Automatic Speech Recognition.

Self-Attention for Audio Super-Resolution

PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement

CACnet: Cube Attentional CNN for Automatic Speech Recognition

On the Integration of Self-Attention and Convolution

Self-attention Based Speaker Recognition Using Cluster-Range Loss

A Convenient and Extensible Offline Chinese Speech Recognition System Based on Convolutional CTC Networks

End-to-End Speech Recognition Model Based on Dilated Sparse Aware Network

Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition

Self-Attention Transducers for End-to-End Speech Recognition

Self-consistent context aware conformer transducer for speech recognition

Complex-Valued Time-Frequency Self-Attention for Speech Dereverberation

Efficient infusion of self-supervised representations in Automatic Speech Recognition

Voice and accompaniment separation in music using self-attention convolutional neural network

Learning Contextual Representation with Convolution Bank and Multi-head Self-attention for Speech Emphasis Detection.

Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models

Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR

SICRN: Advancing Speech Enhancement through State Space Model and Inplace Convolution Techniques

Deep Recurrent Convolutional Neural Network: Improving Performance For Speech Recognition