Abstract: Long short-term memory (LSTM) networks are effective at modeling sequential data and play a central role in speech enhancement, where they capture temporal dependencies in speech signals. However, they can struggle to capture long-term temporal dependencies accurately. This paper introduces a variable-neurons-based LSTM designed to capture long-term temporal dependencies by reducing the neuron representation in layers without loss of data. A skip connection between nonadjacent layers is added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, which makes it well suited to real-time processing because it does not rely on future information. The models are trained on combined acoustic feature sets for improved performance and estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Evaluation using perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) shows that the proposed LSTM architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiate these results. The proposed model improves STOI by 16.21% and PESQ by 0.69 on the TIMIT database. Similarly, on the LibriSpeech database, STOI and PESQ improve by 16.41% and 0.71 over noisy mixtures. The proposed LSTM architecture outperforms deep neural networks (DNNs) under various stationary and nonstationary background noise conditions. For automatic speech recognition (ASR), the Kaldi toolkit is used to train a system on the enhanced speech and to evaluate word error rate (WER). With the proposed LSTM as the front end, WER is notably reduced, reaching 15.13% across different noisy backgrounds.
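
The two training targets named in the abstract, the IRM and the IBM, can be illustrated with a minimal sketch. The abstract does not give the exact formulas used in the paper, so the definitions below are the commonly used textbook forms and are an assumption: the IRM as the ratio of speech power to total power raised to an exponent `beta` (0.5 here), and the IBM as a threshold on the local signal-to-noise ratio with a local criterion `lc_db` (0 dB here). The function names, parameter values, and toy spectrogram values are hypothetical.

```python
import math

def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """IRM for one time-frequency unit: (S / (S + N)) ** beta (assumed form)."""
    total = speech_power + noise_power
    if total == 0.0:
        return 0.0  # silent unit: nothing to keep
    return (speech_power / total) ** beta

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """IBM for one unit: 1 if the local SNR in dB exceeds lc_db, else 0 (assumed form)."""
    if speech_power == 0.0:
        return 0  # no speech energy: unit is noise dominated
    if noise_power == 0.0:
        return 1  # no noise energy: unit is speech dominated
    snr_db = 10.0 * math.log10(speech_power / noise_power)
    return 1 if snr_db > lc_db else 0

# Toy power spectrograms (2 frames x 3 frequency bins), purely illustrative.
speech = [[4.0, 1.0, 0.0], [9.0, 0.5, 2.0]]
noise = [[1.0, 1.0, 1.0], [1.0, 2.0, 2.0]]

# Apply each mask definition unit by unit.
irm = [[ideal_ratio_mask(s, n) for s, n in zip(s_row, n_row)]
       for s_row, n_row in zip(speech, noise)]
ibm = [[ideal_binary_mask(s, n) for s, n in zip(s_row, n_row)]
       for s_row, n_row in zip(speech, noise)]
```

In a mask-based enhancement pipeline, a network would be trained to predict such masks from noisy features, and the predicted mask would then be multiplied with the noisy spectrogram; the soft IRM retains partial energy in each unit, while the IBM makes a hard keep-or-discard decision.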

Improve Data Utilization with Two-stage Learning in CNN-LSTM-based Voice Activity Detection

A Novel and Efficient Voice Activity Detector Using Shape Features of Speech Wave

Performance Improvement of Speech Emotion Recognition Systems by Combining 1D CNN and LSTM with Data Augmentation

Voice Activity Detection Based on Time-Delay Neural Networks

Voice activity detection in the wild: A data-driven approach using teacher-student training

A Universal VAD Based on Jointly Trained Deep Neural Networks

DNN-based Voice Activity Detection for Speaker Recognition

Denoising Deep Neural Networks Based Voice Activity Detection

Voice activity detection using a local-global attention model

Incorporating VAD into ASR System by Multi-task Learning

Speech enhancement aided end-to-end multi-task learning for voice activity detection

Multimodal Voice Activity Detection

Deep Belief Networks Based Voice Activity Detection

Speech recognition method based on DNN-LSTM combined with Wiener filtering algorithm

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

Transfer Learning for Voice Activity Detection: A Denoising Deep Neural Network Perspective

Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning

Voice Disorder Detection Using Long Short Term Memory (LSTM) Model

Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

SVVAD: Personal Voice Activity Detection for Speaker Verification