Abstract: Long short-term memory (LSTM) has proven effective in modeling sequential data and plays a central role in speech enhancement, where it captures temporal dependencies in speech signals. However, it can struggle to capture long-term temporal dependencies accurately. This paper introduces a variable-neurons-based LSTM designed to capture long-term temporal dependencies by reducing the neuron representation in layers without loss of data. Skip connections between nonadjacent layers are added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, which makes it well suited to real-time processing because it does not rely on future information. The models are trained on combined acoustic feature sets for improved performance and estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Evaluation with perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) showed that the proposed LSTM architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiated these results. On the TIMIT database, the proposed model improved STOI by 16.21% and PESQ by 0.69 over noisy mixtures; on the LibriSpeech database, the corresponding improvements were 16.41% and 0.71. The proposed LSTM architecture outperforms deep neural networks (DNNs) under different stationary and nonstationary background noise conditions. An automatic speech recognition (ASR) system was trained on the enhanced speech with the Kaldi toolkit to evaluate word error rate (WER). Used as a front end, the proposed LSTM notably reduced WERs, reaching a WER of 15.13% across different noisy backgrounds.
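The abstract names the two training targets, the IRM and the IBM, without defining them. The following is a minimal sketch of how these masks are commonly computed from the power of clean speech and noise in each time–frequency bin. It assumes the widely used textbook definitions (a square-root compressed power ratio for the IRM, and a local SNR threshold for the IBM); the exponent `beta`, the threshold `lc_db`, and all function names are illustrative assumptions, not details taken from the paper.

```python
import math


def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """IRM for one time-frequency bin: (S / (S + N)) ** beta.

    beta=0.5 is a common choice in the literature (assumed here).
    """
    total = speech_power + noise_power
    if total == 0.0:
        return 0.0  # silent bin: nothing to keep
    return (speech_power / total) ** beta


def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """IBM for one bin: 1 if the local SNR in dB exceeds lc_db, else 0.

    lc_db is the local criterion; 0 dB is an assumed default.
    """
    if speech_power == 0.0:
        return 0
    if noise_power == 0.0:
        return 1
    local_snr_db = 10.0 * math.log10(speech_power / noise_power)
    return 1 if local_snr_db > lc_db else 0


def masks_for_spectrogram(speech, noise, beta=0.5, lc_db=0.0):
    """Compute both masks for power spectrograms given as [frame][bin] lists."""
    irm = [
        [ideal_ratio_mask(s, n, beta) for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech, noise)
    ]
    ibm = [
        [ideal_binary_mask(s, n, lc_db) for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech, noise)
    ]
    return irm, ibm


if __name__ == "__main__":
    # Toy 2-frame, 2-bin power spectrograms (made-up values).
    speech = [[4.0, 1.0], [0.0, 9.0]]
    noise = [[1.0, 4.0], [1.0, 1.0]]
    irm, ibm = masks_for_spectrogram(speech, noise)
    print(irm)
    print(ibm)
```

In an enhancement system of this kind, a network would typically be trained to predict such a mask from noisy features, and the estimated mask would then be applied to the noisy spectrogram. The IRM gives a soft gain between 0 and 1, while the IBM makes a hard keep-or-discard decision per bin.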

Research on Acceleration Method of Speech Recognition Training

Deep Recurrent Convolutional Neural Network: Improving Performance For Speech Recognition

Acceleration Strategies for Speech Recognition Based on Deep Neural Networks

Accelerator-Aware Training for Transducer-Based Speech Recognition

Improving Accented Mandarin Speech Recognition by Using Recurrent Neural Network Based Language Model Adaptation

Advanced Recurrent Network-Based Hybrid Acoustic Models for Low Resource Speech Recognition

Automatic Model Redundancy Reduction for Fast Back-Propagation for Deep Neural Networks in Speech Recognition

Speech Enhancement Method Based on LSTM Neural Network for Speech Recognition

Output-Gate Projected Gated Recurrent Unit for Speech Recognition

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

Accelerating RNN-T Training and Inference Using CTC Guidance

Non-Autoregressive Speech Recognition with Error Correction Module

Alternating update layers for DBN-DNN fast training method

Speech recognition with deep recurrent neural networks

Inference skipping for more efficient real-time speech enhancement with parallel RNNs

Improving the Fusion of Acoustic and Text Representations in RNN-T

Performance Evaluation of Deep Neural Networks Applied to Speech Recognition: RNN, LSTM and GRU

Efficient Training of Neural Transducer for Speech Recognition

Frame Stacking and Retaining for Recurrent Neural Network Acoustic Model

State-Clustering Based Multiple Deep Neural Networks Modeling Approach for Speech Recognition

Exploiting Symmetric Temporally Sparse BPTT for Efficient RNN Training