Abstract: Long short-term memory (LSTM) has proven effective in modeling sequential data and plays a central role in speech enhancement, where it models the temporal dependencies in speech signals. However, it can struggle to capture long-term temporal dependencies accurately. This paper introduces a variable-neurons-based LSTM designed to capture long-term temporal dependencies by reducing the neuron representation in its layers with no loss of data. A skip connection between nonadjacent layers is added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, which makes it well suited to real-time processing because it does not rely on future information. The models are trained on combined acoustic feature sets for improved performance and estimate two time-frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Comprehensive evaluation using perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) showed that the proposed LSTM architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiated this performance. The proposed model improved STOI by 16.21% and PESQ by 0.69 on the TIMIT database. Similarly, on the LibriSpeech database, STOI and PESQ improved by 16.41% and 0.71 over the noisy mixtures. The proposed LSTM architecture outperforms deep neural networks (DNNs) under different stationary and nonstationary background noise conditions. An automatic speech recognition (ASR) system was trained on the enhanced speech using the Kaldi toolkit and evaluated by word error rate (WER). With the proposed LSTM as the front-end, WERs dropped notably, reaching 15.13% across different noisy backgrounds.
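The two training targets named in the abstract, the IRM and the IBM, can be illustrated with a minimal sketch. The abstract does not give the exact formulas used in the paper, so the definitions below are the commonly used ones and should be read as assumptions: the IRM as a power ratio raised to an exponent `beta` (0.5 here), and the IBM as a threshold on the local SNR with a local criterion `lc_db` (0 dB here). The function names and the toy power values are hypothetical.

```python
import numpy as np

def ideal_ratio_mask(speech_power, noise_power, beta=0.5, eps=1e-12):
    """Soft mask in [0, 1]: fraction of energy in each T-F unit that is speech.

    Assumed definition: (S / (S + N)) ** beta, with beta = 0.5 by default.
    """
    return (speech_power / (speech_power + noise_power + eps)) ** beta

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0, eps=1e-12):
    """Hard mask: 1 where the local SNR exceeds the local criterion, else 0.

    Assumed definition: threshold the per-unit SNR (in dB) at lc_db.
    """
    snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return (snr_db > lc_db).astype(np.float32)

# Toy 2x2 "spectrogram" of per-unit powers (rows: frequency, columns: time).
speech_power = np.array([[4.0, 1.0], [0.0, 9.0]])
noise_power = np.array([[1.0, 1.0], [1.0, 1.0]])

irm = ideal_ratio_mask(speech_power, noise_power)
ibm = ideal_binary_mask(speech_power, noise_power)

# At enhancement time, an estimated mask would be multiplied element-wise
# with the noisy magnitude (or power) spectrogram before resynthesis.
print(irm)
print(ibm)
```

The IRM keeps a graded share of each time-frequency unit, whereas the IBM keeps or discards the unit outright. In this toy example, a unit with equal speech and noise power sits exactly at 0 dB and is therefore discarded by the IBM.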

Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion

DBLSTM-Based Multi-Task Learning for Pitch Transformation in Voice Conversion

Multi-task Learning of Structured Output Layer Bidirectional LSTMs for Speech Synthesis

Sequence-to-Sequence Acoustic Modeling for Voice Conversion

A Compact Framework for Voice Conversion Using WaveNet Conditioned on Phonetic Posteriorgrams

Voice Conversion Using Deep Neural Networks with Layer-Wise Generative Training

Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data

Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer.

Improving Recognition-Synthesis Based Any-to-one Voice Conversion with Cyclic Training

An Improved Spectral and Prosodic Transformation Method in STRAIGHT-Based Voice Conversion

Spectral Conversion Using Deep Neural Networks Trained with Multi-Source Speakers

Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval

Voice Conversion Using Generative Trained Deep Neural Networks with Multiple Frame Spectral Envelopes

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

A Study on Low-Latency Recognition-Synthesis-Based Any-to-One Voice Conversion

Improving the Performance of HMM-based Voice Conversion Using Context Clustering Decision Tree and Appropriate Regression Matrix Format.

Residual Speaker Representation for One-Shot Voice Conversion

The USTC System for Voice Conversion Challenge 2016: Neural Network Based Approaches for Spectrum, Aperiodicity and F0 Conversion

Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning

Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis

A Modularized Neural Network with Language-Specific Output Layers for Cross-lingual Voice Conversion