Abstract: Long short-term memory (LSTM) networks have proven effective in modeling sequential data and play a central role in speech enhancement by capturing temporal dependencies in speech signals. However, they may struggle to capture long-term temporal dependencies accurately. This paper introduces a variable-neurons-based LSTM designed to capture long-term temporal dependencies by reducing the neuron representation in layers with no loss of data. Skip connections between nonadjacent layers are added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, which makes it well suited to real-time processing because it does not rely on future information. The models are trained on combined acoustic feature sets for improved performance and estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Evaluation using perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) shows that the proposed LSTM architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiate the performance. On the TIMIT database, the proposed model improves STOI by 16.21% and PESQ by 0.69. Similarly, on the LibriSpeech database, STOI and PESQ improve by 16.41% and 0.71 over noisy mixtures. The proposed LSTM architecture outperforms deep neural networks (DNNs) under various stationary and nonstationary background noise conditions. The Kaldi toolkit is used to train an automatic speech recognition (ASR) system on the enhanced speech and to evaluate word error rate (WER). With the proposed LSTM at the front end, WERs are notably reduced, reaching 15.13% across different noisy backgrounds.
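
The abstract names the two training targets, the IRM and the IBM, without defining them. As a rough illustration only, the sketch below computes both masks from speech and noise power spectrograms using definitions that are common in the speech-enhancement literature. The exponent `beta = 0.5`, the 0 dB local criterion `lc_db`, and the function names are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def ideal_ratio_mask(speech_power, noise_power, beta=0.5, eps=1e-12):
    """Soft mask in [0, 1]: (S / (S + N)) ** beta per time-frequency unit.

    beta = 0.5 is a commonly used value and is assumed here, not sourced
    from the paper.
    """
    ratio = speech_power / (speech_power + noise_power + eps)
    return ratio ** beta


def ideal_binary_mask(speech_power, noise_power, lc_db=0.0, eps=1e-12):
    """Hard mask: 1 where the local SNR (in dB) exceeds lc_db, else 0.

    The 0 dB local criterion is an assumed default for illustration.
    """
    snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return (snr_db > lc_db).astype(np.float32)


# Toy 2x2 "spectrograms" (rows: frequency bins, columns: frames).
speech = np.array([[9.0, 1.0], [1.0, 0.0]])
noise = np.array([[1.0, 1.0], [9.0, 1.0]])

irm = ideal_ratio_mask(speech, noise)
ibm = ideal_binary_mask(speech, noise)
```

In a mask-based enhancement system of this kind, the estimated mask is typically multiplied element-wise with the noisy magnitude or power spectrogram before resynthesis; whether the paper follows exactly that procedure is not stated in the abstract.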

Gated Convolutional LSTM for Speech Commands Recognition

Attention Bidirectional LSTM Networks Based Mime Speech Recognition Using sEMG Data

Speech neuromuscular decoding based on spectrogram images using conformal predictors with Bi-LSTM.

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

A neural attention model for speech command recognition

Speech Command Recognition in Computationally Constrained Environments with a Quadratic Self-organized Operational Layer

Deep Learning Approaches for Understanding Simple Speech Commands

Gated Recurrent Units Based Hybrid Acoustic Models for Robust Speech Recognition

Advanced Recurrent Network-Based Hybrid Acoustic Models for Low Resource Speech Recognition

High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model

DLD: An Optimized Chinese Speech Recognition Model Based on Deep Learning

Improving Multi-Speaker Tacotron with Speaker Gating Mechanisms

Exploiting Hybrid Models of Tensor-Train Networks for Spoken Command Recognition

Long Short-Term Memory based Convolutional Recurrent Neural Networks for Large Vocabulary Speech Recognition

A High Accuracy Multiple-Command Speech Recognition ASIC Based on Configurable One-Dimension Convolutional Neural Network.

Deep LSTM for Large Vocabulary Continuous Speech Recognition

Towards High Performance LVCSR in Speech-to-Speech Translation System on Smart Phones.

Residual Convolutional CTC Networks for Automatic Speech Recognition.

Attention-Based Gated Scaling Adaptive Acoustic Model for CTC-Based Speech Recognition.

Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment

Conformer-Based Speech Recognition On Extreme Edge-Computing Devices