Abstract: Long short-term memory (LSTM) networks are effective at modeling sequential data and play a central role in speech enhancement, where they capture temporal dependencies in speech signals. However, they can struggle to capture long-term temporal dependencies accurately. This paper introduces a variable-neurons-based LSTM designed to capture long-term temporal dependencies by reducing the neuron representation across layers with no loss of data. A skip connection between nonadjacent layers is added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, which makes it well suited to real-time processing because it does not rely on future information. The models are trained on combined acoustic feature sets and estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Evaluation with perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) shows that the proposed LSTM architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further support these results. On the TIMIT database, the proposed model improved STOI by 16.21% and PESQ by 0.69; on the LibriSpeech database, STOI and PESQ similarly improved by 16.41% and 0.71 over noisy mixtures. The proposed LSTM architecture outperforms deep neural networks (DNNs) under various stationary and nonstationary background noise conditions. An automatic speech recognition (ASR) system is trained on the enhanced speech with the Kaldi toolkit and evaluated by word error rate (WER). Used as a front end, the proposed LSTM notably reduced WER, achieving 15.13% across different noisy backgrounds.
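
For readers unfamiliar with the two training targets named in the abstract, the following is a minimal sketch of how the IRM and IBM are commonly defined per time–frequency unit. It is not taken from the paper: the function names, the exponent `beta = 0.5`, and the 0 dB local criterion `lc_db` are assumptions drawn from commonly used definitions, and the inputs are assumed to be premixed speech and noise power values on the same time–frequency grid.

```python
import math


def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Soft mask per T-F unit: (S / (S + N)) ** beta, with S and N as powers.

    beta = 0.5 is an assumed, commonly used exponent.
    """
    mask = []
    for s_row, n_row in zip(speech_power, noise_power):
        row = []
        for s, n in zip(s_row, n_row):
            total = s + n
            # Units with no energy at all are masked out.
            row.append((s / total) ** beta if total > 0 else 0.0)
        mask.append(row)
    return mask


def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Hard mask per T-F unit: 1 if the local SNR (dB) exceeds lc_db, else 0.

    lc_db (the local criterion) is an assumed threshold of 0 dB.
    """
    mask = []
    for s_row, n_row in zip(speech_power, noise_power):
        row = []
        for s, n in zip(s_row, n_row):
            if s <= 0:
                row.append(0)          # no speech energy in this unit
            elif n <= 0:
                row.append(1)          # no noise energy in this unit
            else:
                snr_db = 10.0 * math.log10(s / n)
                row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask


# Toy 2x2 grid (rows = frames, columns = frequency bins); values are illustrative.
speech = [[4.0, 1.0], [0.0, 9.0]]
noise = [[4.0, 3.0], [1.0, 1.0]]
irm = ideal_ratio_mask(speech, noise)
ibm = ideal_binary_mask(speech, noise)
```

In this sketch the IRM takes values in [0, 1] and the IBM takes values in {0, 1}; an enhancement model would be trained to predict such masks from noisy features, then apply them to the noisy spectrum.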

Discriminative method for recurrent neural network language models

Integrating Lattice-Free MMI into End-to-End Speech Recognition

Discriminative Acoustic Word Embeddings: Recurrent Neural Network-Based Approaches

Recurrent Neural Network Based Language Model Adaptation for Accent Mandarin Speech

Improving Accented Mandarin Speech Recognition by Using Recurrent Neural Network Based Language Model Adaptation

A Comparative Study of RPCL and MCE Based Discriminative Training Methods for LVCSR

A Latent Variable Recurrent Neural Network for Discourse Relation Language Models

Recurrent Memory Networks for Language Modeling

Discriminative Speech Recognition Rescoring with Pre-trained Language Models

Learning to Write with Cooperative Discriminators

Discriminative training of GMM-HMM acoustic model by RPCL learning

Discriminative Boosting Regression Backend for Phonotactic Language Recognition

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

Variance regularization of RNNLM for speech recognition

Multiple-target Deep Learning for LSTM-RNN Based Speech Enhancement

Alignment Restricted Streaming Recurrent Neural Network Transducer

Improvement Comparison of Different Lattice-based Discriminative Training Methods in Chinese-monolingual and Chinese-English-bilingual Speech Recognition

Discriminative training of GMM-HMM acoustic model by RPCL type Bayesian Ying-Yang harmony learning

On the Relation between Internal Language Model and Sequence Discriminative Training for Neural Transducers

Recurrent Neural Network Language Model with Part-of-speech for Mandarin Speech Recognition

An Empirical Study of Language Model Integration for Transducer Based Speech Recognition