Abstract: Long short-term memory (LSTM) has proven effective in modeling sequential data and plays a central role in speech enhancement by capturing temporal dependencies in speech signals. However, it may struggle to capture long-term temporal dependencies accurately. This paper introduces a variable-neurons-based LSTM designed to capture long-term temporal dependencies by reducing the neuron representation in layers with no loss of data. A skip connection between nonadjacent layers is added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, making it well suited to real-time processing without relying on future information. The models are trained on combined acoustic feature sets for improved performance and estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Comprehensive evaluation using perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) showed that the proposed LSTM architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiated this performance. The proposed model showed a 16.21% improvement in STOI and a 0.69 improvement in PESQ on the TIMIT database. Similarly, on the LibriSpeech database, STOI and PESQ improved by 16.41% and 0.71 over the noisy mixtures. The proposed LSTM architecture outperforms deep neural networks (DNNs) under different stationary and nonstationary background noise conditions. An automatic speech recognition (ASR) system is trained on the enhanced speech with the Kaldi toolkit to evaluate word error rate (WER). Used as a front end, the proposed LSTM notably reduced WERs, achieving a 15.13% WER across different noisy backgrounds.
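The two training targets named in the abstract, the IRM and the IBM, have widely used textbook definitions that can be sketched per time-frequency bin. The sketch below follows those common definitions and is not taken from the paper: the function names, the exponent `beta = 0.5`, and the 0 dB local criterion `lc_db` are illustrative assumptions, and the abstract does not state which values the authors used.

```python
import math

def irm_bin(clean_power, noise_power, beta=0.5):
    """Ideal ratio mask for one time-frequency bin (common definition).

    beta=0.5 is an assumed, commonly used exponent, not a value from the paper.
    """
    total = clean_power + noise_power
    if total == 0.0:
        return 0.0  # silent bin: nothing to keep
    return (clean_power / total) ** beta

def ibm_bin(clean_power, noise_power, lc_db=0.0):
    """Ideal binary mask for one bin: 1 if the local SNR exceeds lc_db, else 0.

    lc_db (the local criterion) is an assumed threshold, not from the paper.
    """
    if clean_power == 0.0:
        return 0.0
    if noise_power == 0.0:
        return 1.0
    snr_db = 10.0 * math.log10(clean_power / noise_power)
    return 1.0 if snr_db > lc_db else 0.0

def mask_grid(clean, noise, mask_fn):
    """Apply a per-bin mask function over two equally shaped 2-D power grids."""
    return [
        [mask_fn(s, n) for s, n in zip(clean_row, noise_row)]
        for clean_row, noise_row in zip(clean, noise)
    ]

# Hypothetical 2x2 power spectrograms (frames x frequency bins).
clean = [[4.0, 1.0], [0.0, 9.0]]
noise = [[1.0, 4.0], [2.0, 0.0]]
irm = mask_grid(clean, noise, irm_bin)
ibm = mask_grid(clean, noise, ibm_bin)
```

At enhancement time, an estimated mask of this kind is typically multiplied element-wise with the noisy magnitude spectrogram before resynthesis. The IRM gives a soft gain in [0, 1], while the IBM makes a hard keep-or-discard decision per bin.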

A Novel Temporal Attentive-Pooling based Convolutional Recurrent Architecture for Acoustic Signal Enhancement

Attention-Based Deep Spiking Neural Networks for Temporal Credit Assignment Problems

Deep Neural Network Based Noised Asian Speech Enhancement and Its Implementation on a Hearing Aid App

Explore Relative and Context Information with Transformer for Joint Acoustic Echo Cancellation and Speech Enhancement

Target Speaker Extraction Using Attention-Enhanced Temporal Convolutional Network

Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement

Adaptive Memory-Controlled Self-Attention for Polyphonic Sound Event Detection

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

Spatio-Temporal Attention Pooling for Audio Scene Classification

Multi-scale Convolutional Recurrent Neural Network and Data Augmentation for Polyphonic Sound Event Detection

Double Branches and Stages Neural Network for Joint Acoustic Echo and Noise Suppression

High-Resolution Attention Network with Acoustic Segment Model for Acoustic Scene Classification

Convolutional Recurrent Neural Network with Attention for 3D Speech Enhancement

Efficient Acoustic Echo Suppression with Condition-Aware Training

A Low-Compexity Deep Learning Framework For Acoustic Scene Classification

A Deep Neural Network for Audio Classification with a Classifier Attention Mechanism

Speech Enhancement for Cochlear Implant Recipients using Deep Complex Convolution Transformer with Frequency Transformation

Phase Continuity-Aware Self-Attentive Recurrent Network with Adaptive Feature Selection for Robust VAD

An Attention-Based Time-Frequency Pyramid Pooling Strategy in Deep Convolutional Networks for Acoustic Scene Classification

Enhanced noise resilience in passive tone detection via broad-receptive field complex-valued convolutional neural networks

An Auditory Convolutional Neural Network for Underwater Acoustic Target Timbre Feature Extraction and Recognition