Abstract: Long short-term memory (LSTM) has proven effective in modeling sequential data and plays a central role in speech enhancement by capturing temporal dependencies in speech signals. However, it may struggle to capture long-term temporal dependencies accurately. This paper introduces a variable-neurons-based LSTM designed to capture long-term temporal dependencies by reducing the neuron representation in layers with no loss of data. Skip connections between nonadjacent layers are added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, making it well suited for real-time processing without relying on future information. The models are trained on combined acoustic feature sets for improved performance and estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Comprehensive evaluation using perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) showed that the proposed LSTM architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiated the performance. On the TIMIT database, the proposed model showed a 16.21% improvement in STOI and a 0.69 improvement in PESQ; similarly, on the LibriSpeech database, STOI and PESQ improved by 16.41% and 0.71 over noisy mixtures. The proposed LSTM architecture outperforms deep neural networks (DNNs) under different stationary and nonstationary background noise conditions. An automatic speech recognition (ASR) system is trained on the enhanced speech with the Kaldi toolkit to evaluate word error rate (WER). Used as a front end, the proposed LSTM notably reduced WERs, achieving a 15.13% WER across different noisy backgrounds.
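
As an illustration of the two training targets named in the abstract, the following minimal sketch computes an IRM and an IBM from speech and noise magnitude spectrograms. It uses commonly cited textbook definitions, which are an assumption here and may differ from the exact formulation in the paper: the function name `ideal_masks`, the exponent `beta = 0.5`, the local criterion `lc_db = 0` dB, and the toy input values are all hypothetical.

```python
import numpy as np


def ideal_masks(speech_mag, noise_mag, beta=0.5, lc_db=0.0, eps=1e-12):
    """Return (IRM, IBM) for magnitude spectrograms of equal shape.

    Assumed definitions (not taken from the paper):
      IRM = (S^2 / (S^2 + N^2)) ** beta
      IBM = 1 where the local SNR in dB exceeds lc_db, else 0
    """
    s_pow = speech_mag ** 2
    n_pow = noise_mag ** 2

    # Ratio mask: soft gain in [0, 1] for each time-frequency unit.
    irm = (s_pow / (s_pow + n_pow + eps)) ** beta

    # Binary mask: hard decision on the local SNR of each unit.
    snr_db = 10.0 * np.log10((s_pow + eps) / (n_pow + eps))
    ibm = (snr_db > lc_db).astype(np.float32)
    return irm, ibm


# Toy 2x2 "spectrograms" (rows: frequency bins, columns: frames).
speech = np.array([[1.0, 0.1], [2.0, 0.5]])
noise = np.array([[1.0, 1.0], [0.5, 0.5]])
irm, ibm = ideal_masks(speech, noise)
```

In a mask-based enhancement setup of this kind, a network would typically be trained to predict such a mask from noisy features, and the predicted mask would then be applied to the noisy spectrogram; the IRM gives a soft gain per time–frequency unit, while the IBM keeps or discards each unit outright.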

Blending LSTMs into CNNs

MLCNN: Cross-Layer Cooperative Optimization and Accelerator Architecture for Speeding Up Deep Learning Applications

Advances in Convolutional Neural Networks

Holistic CNN Compression Via Low-Rank Decomposition with Knowledge Transfer

Deep Neural Networks Language Model Based on CNN and LSTM Hybrid Architecture

Refining Architectures of Deep Convolutional Neural Networks

Residual Convolutional CTC Networks for Automatic Speech Recognition

Molding CNNs for text: non-linear, non-consecutive convolutions

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

Convolutional Neural Networks Exploiting Attributes of Biological Neurons

Optimizing Image Classification: Automated Deep Learning Architecture Crafting with Network and Learning Hyperparameter Tuning

Deep Multi-Kernel Convolutional LSTM Networks and an Attention-Based Mechanism for Videos

Towards Better Analysis of Deep Convolutional Neural Networks

CNN LEGO: Disassembling and Assembling Convolutional Neural Network

Tightly-coupled Convolutional Neural Network with Spatial-Temporal Memory for Text Classification

IC Networks: Remodeling the Basic Unit for Convolutional Neural Networks

When Face Recognition Meets with Deep Learning: An Evaluation of Convolutional Neural Networks for Face Recognition

Efficient and Accurate Approximations of Nonlinear Convolutional Networks

Stacked Broad Learning System: From Incremental Flatted Structure to Deep Model

Layer-Wise Training To Create Efficient Convolutional Neural Networks

Optimal Design of Convolutional Neural Network Architectures Using Teaching-Learning-Based Optimization for Image Classification