Abstract: Long short-term memory (LSTM) has proven effective in modeling sequential data and plays a central role in speech enhancement by capturing temporal dependencies in speech signals. However, it may struggle to capture long-term temporal dependencies accurately. This paper introduces a variable-neurons-based LSTM designed to capture long-term temporal dependencies by reducing the neuron representation in its layers with no loss of data. A skip connection between nonadjacent layers is added to prevent gradient vanishing, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, making it well suited for real-time processing without relying on future information. Training uses combined acoustic feature sets for improved performance, and the models estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Comprehensive evaluation using perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) showed that the proposed LSTM architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiated this performance. Relative to the noisy mixtures, the proposed model improved STOI by 16.21% and PESQ by 0.69 on the TIMIT database, and by 16.41% and 0.71, respectively, on the LibriSpeech database. The proposed LSTM architecture outperforms deep neural networks (DNNs) under different stationary and nonstationary background noise conditions. For automatic speech recognition (ASR), the Kaldi toolkit is used to train a system on the enhanced speech and to evaluate word error rate (WER). With the proposed LSTM at the front end, WERs were notably reduced, reaching 15.13% across different noisy backgrounds.
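As an illustration of the two training targets named in the abstract, the following is a minimal sketch of how an IRM and an IBM are commonly computed from speech and noise power in each time–frequency unit. The abstract does not give the exact formulas, so the exponent `beta = 0.5`, the local criterion `lc_db = 0` dB, the function names, and the toy values are all assumptions made for this example rather than the paper's settings.

```python
import numpy as np

def ideal_ratio_mask(speech_power, noise_power, beta=0.5, eps=1e-12):
    """IRM: speech-to-total energy ratio per time-frequency unit, raised to beta.

    beta=0.5 is a commonly used value and an assumption here.
    """
    return (speech_power / (speech_power + noise_power + eps)) ** beta

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0, eps=1e-12):
    """IBM: 1 where the local SNR (in dB) exceeds the local criterion, else 0.

    lc_db=0.0 is an assumed threshold for illustration.
    """
    snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return (snr_db > lc_db).astype(np.float32)

# Toy power values (rows: time frames, columns: frequency bins); hypothetical data.
speech_power = np.array([[4.0, 1.0], [0.0, 9.0]])
noise_power = np.array([[1.0, 1.0], [2.0, 1.0]])

irm = ideal_ratio_mask(speech_power, noise_power)   # soft mask in [0, 1]
ibm = ideal_binary_mask(speech_power, noise_power)  # hard mask in {0, 1}
```

In a mask-based enhancement setup, a network would be trained to predict such a mask from noisy features, and the predicted mask would then be applied to the noisy spectrogram. The IRM keeps a graded amount of each unit, while the IBM keeps or discards it outright.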

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition