Abstract: Long short-term memory (LSTM) networks have proven effective in modeling sequential data and play a central role in speech enhancement, where they capture temporal dependencies in speech signals. However, they can struggle to capture long-term temporal dependencies accurately. This paper introduces a variable-neurons-based LSTM designed to capture long-term temporal dependencies by reducing the neuron representation across layers with no loss of data. Skip connections between nonadjacent layers are added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, which makes it well suited to real-time processing because it does not rely on future information. The models are trained on combined acoustic feature sets for improved performance and estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Evaluation using perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) showed that the proposed architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiated these results. The proposed model showed a 16.21% improvement in STOI and a 0.69 improvement in PESQ on the TIMIT database. Similarly, on the LibriSpeech database, STOI and PESQ improved by 16.41% and 0.71 over noisy mixtures. The proposed LSTM architecture outperforms deep neural networks (DNNs) under different stationary and nonstationary background noise conditions. An automatic speech recognition (ASR) system is trained on the enhanced speech with the Kaldi toolkit and evaluated by word error rate (WER). Used as a front end, the proposed LSTM notably reduced WERs, achieving a 15.13% WER across different noisy backgrounds.
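To make the two training targets concrete, the sketch below computes an IRM and an IBM from toy speech and noise power spectrograms. It is a minimal illustration using the commonly cited textbook definitions, not the authors' implementation: the abstract does not give the formulas, so the exponent `beta`, the local SNR criterion `lc_db`, the function names, and the toy values are all assumptions.

```python
import numpy as np


def ideal_ratio_mask(speech_power, noise_power, beta=0.5, eps=1e-12):
    """Soft mask in [0, 1]: speech share of the total power per T-F unit.

    beta=0.5 is a common choice in the literature (an assumption here).
    """
    return (speech_power / (speech_power + noise_power + eps)) ** beta


def ideal_binary_mask(speech_power, noise_power, lc_db=0.0, eps=1e-12):
    """Hard mask: 1 where the local SNR exceeds the criterion lc_db, else 0.

    lc_db=0.0 is an assumed local criterion, not a value from the paper.
    """
    snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return (snr_db > lc_db).astype(np.float32)


# Toy power spectrograms (rows: frequency bins, columns: time frames).
speech_power = np.array([[4.0, 1.0],
                         [0.0, 9.0]])
noise_power = np.ones_like(speech_power)

irm = ideal_ratio_mask(speech_power, noise_power)
ibm = ideal_binary_mask(speech_power, noise_power)

# Either mask is applied by element-wise multiplication with the noisy
# spectrogram; here the noisy power is approximated as speech + noise.
noisy_power = speech_power + noise_power
enhanced_power_irm = irm * noisy_power
enhanced_power_ibm = ibm * noisy_power
```

In a mask-based enhancement system, a network would predict these masks from noisy acoustic features, since the clean speech and noise are not available separately at test time. The IRM keeps a graded share of each time–frequency unit, while the IBM keeps or discards each unit outright.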
