Abstract: Long short-term memory (LSTM) has proven effective in modeling sequential data and plays a central role in speech enhancement, where it captures temporal dependencies in speech signals. However, it can struggle to capture long-term temporal dependencies accurately. This paper introduces a variable-neurons-based LSTM designed to capture long-term temporal dependencies by reducing the neuron representation in layers with no loss of data. Skip connections between nonadjacent layers are added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, making it well suited to real-time processing because it does not rely on future information. The models are trained on combined acoustic feature sets for improved performance and estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Evaluation using perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) showed that the proposed LSTM architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiated these results. The proposed model showed a 16.21% improvement in STOI and a 0.69 improvement in PESQ on the TIMIT database. Similarly, on the LibriSpeech database, STOI and PESQ improved by 16.41% and 0.71 over noisy mixtures. The proposed LSTM architecture outperforms deep neural networks (DNNs) under different stationary and nonstationary background noise conditions. The Kaldi toolkit is used to train an automatic speech recognition (ASR) system on the enhanced speech and to evaluate word error rate (WER). Used as a front end, the proposed LSTM notably reduced WERs, achieving 15.13% WER across different noisy backgrounds.
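
The abstract names the two training targets, the IRM and the IBM, without defining them. As a minimal sketch, the commonly used textbook definitions can be written as below. The exponent `beta = 0.5`, the local criterion `lc_db = 0.0`, and the function names are assumptions made for illustration; they are not taken from the paper, which may use different settings.

```python
import math

def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Textbook IRM per time-frequency unit: (S / (S + N)) ** beta.

    speech_power and noise_power are 2-D lists (frames x frequency bins)
    of non-negative power values. beta = 0.5 is an assumed, common choice.
    """
    mask = []
    for s_row, n_row in zip(speech_power, noise_power):
        row = []
        for s, n in zip(s_row, n_row):
            total = s + n
            # An empty unit (no speech, no noise) gets a mask value of 0.
            row.append((s / total) ** beta if total > 0 else 0.0)
        mask.append(row)
    return mask

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Textbook IBM: 1 where the local SNR in dB exceeds lc_db, else 0.

    lc_db (the local criterion) is an assumed threshold of 0 dB.
    """
    mask = []
    for s_row, n_row in zip(speech_power, noise_power):
        row = []
        for s, n in zip(s_row, n_row):
            if s <= 0:
                row.append(0)          # no speech energy in this unit
            elif n <= 0:
                row.append(1)          # speech with no noise: infinite SNR
            else:
                snr_db = 10.0 * math.log10(s / n)
                row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask

# Toy example: 2 frames x 2 frequency bins of (hypothetical) power values.
speech = [[4.0, 1.0], [0.0, 9.0]]
noise = [[4.0, 3.0], [1.0, 1.0]]
irm = ideal_ratio_mask(speech, noise)
ibm = ideal_binary_mask(speech, noise)
```

The IRM gives a soft gain between 0 and 1 for each time–frequency unit, while the IBM makes a hard keep-or-discard decision. In either case the estimated mask would be multiplied element-wise with the noisy spectrogram to obtain the enhanced one.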
