Abstract: Long short-term memory (LSTM) networks have proven effective in modeling sequential data and play a central role in speech enhancement, where they capture temporal dependencies in speech signals. However, they can struggle to capture long-term temporal dependencies accurately. This paper introduces a variable-neurons-based LSTM designed to capture long-term temporal dependencies by reducing the neuron representation in layers with no loss of data. Skip connections between nonadjacent layers are added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, which makes it well suited to real-time processing because it does not rely on future information. The models are trained on combined acoustic feature sets for improved performance and estimate two time-frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). A comprehensive evaluation using perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) showed that the proposed LSTM architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiated these results. The proposed model showed a 16.21% improvement in STOI and a 0.69 improvement in PESQ on the TIMIT database. Similarly, on the LibriSpeech database, STOI and PESQ improved by 16.41% and 0.71, respectively, over noisy mixtures. The proposed LSTM architecture outperforms deep neural networks (DNNs) under different stationary and nonstationary background noise conditions. An automatic speech recognition (ASR) system is trained on the enhanced speech, and the Kaldi toolkit is used to evaluate word error rate (WER). Used as a front-end, the proposed LSTM notably reduced WER, achieving 15.13% across different noisy backgrounds.
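The two training targets named in the abstract, the IRM and the IBM, can be illustrated with a minimal NumPy sketch. It assumes the definitions commonly used in the speech enhancement literature (an IRM with exponent 0.5 and an IBM with a 0 dB local SNR criterion); the abstract does not state the exact formulation used in the paper, so the function names, the `beta` and `lc_db` parameters, and the toy magnitudes below are illustrative assumptions only.

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5, eps=1e-12):
    """IRM per time-frequency bin: (S^2 / (S^2 + N^2)) ** beta.

    beta=0.5 is a common choice in the literature, assumed here.
    """
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return (s2 / (s2 + n2 + eps)) ** beta

def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0, eps=1e-12):
    """IBM per time-frequency bin: 1 where local SNR exceeds lc_db, else 0.

    lc_db (the local criterion) is assumed to be 0 dB for illustration.
    """
    snr_db = 10.0 * np.log10((speech_mag ** 2 + eps) / (noise_mag ** 2 + eps))
    return (snr_db > lc_db).astype(np.float32)

# Toy magnitude spectrograms (2 frequency bins x 2 frames), made up for the example.
speech = np.array([[1.0, 0.1], [0.5, 2.0]])
noise = np.array([[1.0, 1.0], [0.1, 0.5]])

irm = ideal_ratio_mask(speech, noise)   # soft mask with values in [0, 1]
ibm = ideal_binary_mask(speech, noise)  # hard mask with values in {0, 1}
```

In a mask-based system of this kind, the network is trained to predict such a mask from noisy features, and the predicted mask is multiplied with the noisy magnitude spectrogram to obtain the enhanced speech.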
