Abstract: Long short-term memory (LSTM) has proven effective in modeling sequential data and plays a central role in speech enhancement, where it captures temporal dependencies in speech signals. However, it can struggle to capture long-term temporal dependencies accurately. This paper introduces a variable-neurons-based LSTM designed to capture long-term temporal dependencies by reducing the neuron representation in layers with no loss of data. Skip connections between nonadjacent layers are added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, which makes it well suited to real-time processing because it does not rely on future information. The models are trained on combined acoustic feature sets for improved performance and estimate two time-frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Evaluation using perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) showed that the proposed LSTM architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiated the performance. On the TIMIT database, the proposed model improved STOI by 16.21% and PESQ by 0.69; similarly, on the LibriSpeech database, STOI and PESQ improved by 16.41% and 0.71 over noisy mixtures. The proposed LSTM architecture outperforms deep neural networks (DNNs) under different stationary and nonstationary background noise conditions. To evaluate word error rate (WER), an automatic speech recognition (ASR) system was trained on the enhanced speech using the Kaldi toolkit. With the proposed LSTM as the front end, WERs were notably reduced, reaching 15.13% across different noisy backgrounds.
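For readers unfamiliar with the two training targets named in the abstract, the following is a minimal sketch of how the IRM and IBM are commonly defined in the speech-separation literature. It is not taken from the paper: the exponent `beta=0.5`, the local criterion `lc_db=0.0`, the function names, and the toy spectrogram values are all assumptions chosen for illustration.

```python
import numpy as np

def ideal_ratio_mask(speech_power, noise_power, beta=0.5, eps=1e-12):
    """IRM per time-frequency unit: (S / (S + N)) ** beta.

    beta=0.5 is a commonly used value, assumed here; eps avoids division by zero.
    """
    return (speech_power / (speech_power + noise_power + eps)) ** beta

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0, eps=1e-12):
    """IBM per time-frequency unit: 1 where the local SNR (dB) exceeds lc_db, else 0.

    lc_db=0.0 is an assumed local criterion, not a value from the paper.
    """
    snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return (snr_db > lc_db).astype(np.float32)

# Toy power spectrograms (2 frames x 3 frequency bins), made up for illustration.
speech_power = np.array([[4.0, 1.0, 0.0], [9.0, 0.25, 1.0]])
noise_power = np.array([[1.0, 1.0, 1.0], [1.0, 1.0, 4.0]])

irm = ideal_ratio_mask(speech_power, noise_power)   # soft mask, values in [0, 1]
ibm = ideal_binary_mask(speech_power, noise_power)  # hard mask, values in {0, 1}
```

In a mask-based enhancement pipeline of this kind, the network would be trained to predict such a mask from noisy features, and the estimated mask would then be applied to the noisy spectrogram before resynthesis.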

Advanced LSTM: A Study about Better Time Dependency Modeling in Emotion Recognition

Dependency-based Siamese Long Short-Term Memory Network for Learning Sentence Representations

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

An Efficient LSTM Network for Emotion Recognition from Multichannel EEG Signals

Application of an Improved LSTM Model to Emotion Recognition

Speech Emotion Classification Using Attention-Based LSTM

Learning Long-Term Temporal Contexts Using Skip RNN for Continuous Emotion Recognition

Emotion Recognition Using Multimodal Residual LSTM Network

Improvement and Implementation of a Speech Emotion Recognition Model Based on Dual-Layer LSTM

Extreme-Long-short Term Memory for Time-series Prediction

Efficient Modeling of Long Temporal Contexts for Continuous Emotion Recognition

Action Recognition Algorithm of Spatio-Temporal Differential LSTM Based on Feature Enhancement

Long Short Term Memory Recurrent Neural Network Based Encoding Method for Emotion Recognition in Video

EA-LSTM: Evolutionary attention-based LSTM for time series prediction

Spontaneous Speech Emotion Recognition Using Multiscale Deep Convolutional LSTM

Emphasizing Essential Words for Sentiment Classification Based on Recurrent Neural Networks

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

Long Short Term Memory Recurrent Neural Network Based Multimodal Dimensional Emotion Recognition

Learning Expression Features via Deep Residual Attention Networks for Facial Expression Recognition From Video Sequences

An LSTM with Differential Structure and Its Application in Action Recognition

Video-EEG Based Collaborative Emotion Recognition Using LSTM and Information-Attention