Abstract: Long short-term memory (LSTM) has proven effective in modeling sequential data and plays a central role in speech enhancement, where it captures temporal dependencies in speech signals. However, it may struggle to capture long-term temporal dependencies accurately. This paper introduces a variable-neurons-based LSTM designed to capture long-term temporal dependencies by reducing the neuron representation across layers with no loss of data. A skip connection between nonadjacent layers is added to prevent gradient vanishing, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, making it well suited to real-time processing without relying on future information. The models are trained on combined acoustic feature sets for improved performance and estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Comprehensive evaluation using perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) showed that the proposed LSTM architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiated the performance. The proposed model showed a 16.21% improvement in STOI and a 0.69 improvement in PESQ on the TIMIT database. Similarly, with the LibriSpeech database, STOI and PESQ improved by 16.41% and 0.71 over noisy mixtures. The proposed LSTM architecture outperforms deep neural networks (DNNs) in different stationary and nonstationary background noise conditions. The Kaldi toolkit is used to train an automatic speech recognition (ASR) system on the enhanced speech and to evaluate its word error rate (WER). With the proposed LSTM at the front end, WERs were notably reduced, reaching 15.13% across different noisy backgrounds.
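
For readers unfamiliar with the two training targets named in the abstract, the following is a minimal sketch of how an IRM and an IBM are commonly computed from clean-speech and noise power spectrograms. It uses the standard textbook definitions rather than anything confirmed by this paper: the exponent `beta=0.5`, the local criterion `lc_db=0.0`, and the function names are assumptions chosen for illustration.

```python
import math

def ideal_ratio_mask(speech_pow, noise_pow, beta=0.5):
    """IRM per time-frequency unit: (S / (S + N)) ** beta, in [0, 1].

    beta=0.5 is a commonly used value, assumed here; the paper's
    setting is not stated in the abstract.
    """
    mask = []
    for s_row, n_row in zip(speech_pow, noise_pow):
        row = []
        for s, n in zip(s_row, n_row):
            total = s + n
            row.append((s / total) ** beta if total > 0 else 0.0)
        mask.append(row)
    return mask

def ideal_binary_mask(speech_pow, noise_pow, lc_db=0.0):
    """IBM per time-frequency unit: 1 if local SNR (dB) > lc_db, else 0.

    lc_db (the local criterion) is an assumed threshold.
    """
    mask = []
    for s_row, n_row in zip(speech_pow, noise_pow):
        row = []
        for s, n in zip(s_row, n_row):
            if s <= 0:
                row.append(0)          # no speech energy in this unit
            elif n <= 0:
                row.append(1)          # speech present, no noise
            else:
                snr_db = 10.0 * math.log10(s / n)
                row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask

# Toy 2x2 power spectrograms (rows: frames, columns: frequency bins).
speech = [[4.0, 1.0], [0.0, 9.0]]
noise = [[4.0, 3.0], [1.0, 1.0]]

irm = ideal_ratio_mask(speech, noise)
ibm = ideal_binary_mask(speech, noise)
```

The IRM is a soft mask that scales each time-frequency unit by its estimated speech proportion, while the IBM makes a hard keep-or-discard decision per unit.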

Audiovisual Speech Activity Detection With Advanced Long Short-Term Memory

End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models

Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading

End-to-End Audiovisual Speech Recognition System with Multitask Learning

Deep Temporal Architecture for Audiovisual Speech Recognition

Deep Audio-visual System for Closed-set Word-level Speech Recognition

Audio-Visual Speech Separation with Visual Features Enhanced by Adversarial Training

Gating Neural Network for Large Vocabulary Audiovisual Speech Recognition

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

Improving Boundary Estimation in Audiovisual Speech Activity Detection Using Bayesian Information Criterion

Speech Activity Detection Based on Multilingual Speech Recognition System

Audio Visual Speech Recognition with Multimodal Recurrent Neural Networks

How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

Real-time Architecture for Audio-Visual Active Speaker Detection

Temporarily-Aware Context Modelling using Generative Adversarial Networks for Speech Activity Detection

Robust end-to-end deep audiovisual speech recognition

Temporal Multimodal Learning in Audiovisual Speech Recognition

Audio Visual Speech Recognition using Deep Recurrent Neural Networks