Abstract: In real-time applications, speech enhancement (SE) must deliver high output quality while remaining computationally efficient and producing near-instant outputs. Many deep neural models achieve strong speech quality and intelligibility, but designing efficient, compact models for real-time processing on resource-limited devices remains a challenge. This study presents a compact neural model for speech enhancement that operates in the complex frequency domain and is optimized for resource-limited devices. The proposed model combines a convolutional encoder–decoder with recurrent architectures to learn complex mappings from noisy speech, enabling low-latency causal processing for real-time speech enhancement. Recurrent architectures, namely Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Simple Recurrent Unit (SRU), are incorporated as bottlenecks to capture temporal dependencies and improve SE performance. By representing speech in the complex frequency domain, the proposed model processes both magnitude and phase information. The study further extends the proposed models with attention-gate-based skip connections, which enable the models to focus on relevant information and dynamically weight the important features. Experiments use the WSJ0 database, where clean WSJ0 sentences are mixed with different background noises to create noisy mixtures, and the VoiceBank+DEMAND database. The results show that the proposed models outperform recent benchmark models in speech quality and intelligibility at a lower computational load. STOI and PESQ improve by 21.1% and 1.25 (41.5%), respectively, on the WSJ0 database, and by 4.1% and 1.24 (38.6%), respectively, on the VoiceBank+DEMAND database. The extended models show further improvement in STOI and PESQ in both seen and unseen noisy conditions.
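The statement that a complex-domain model handles both magnitude and phase can be illustrated with a small sketch. It is not the paper's model: the function names, the toy frame, and the fixed gain values are assumptions chosen for illustration, and the abstract does not say whether the network predicts a mask or maps the spectrum directly. The sketch only contrasts a real-valued gain per frequency bin, which rescales magnitude and leaves the noisy phase unchanged, with a complex-valued gain, which also rotates the phase.

```python
import cmath
import math


def dft(frame):
    """Naive DFT of one real-valued frame (a stand-in for one STFT column)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def apply_gain(spec, gain):
    """Multiply each frequency bin by its gain (real or complex)."""
    return [g * s for g, s in zip(gain, spec)]


# Hypothetical toy frame: one sinusoid at bin 1 plus a weaker one at bin 3.
n = 8
frame = [
    math.sin(2 * math.pi * t / n) + 0.3 * math.cos(2 * math.pi * 3 * t / n)
    for t in range(n)
]
spec = dft(frame)

# Real-valued gain (assumed value 0.5): magnitude is halved, phase is kept.
mag_only = apply_gain(spec, [0.5] * n)

# Complex-valued gain (assumed value 0.5 * e^{j*pi/4}): magnitude is halved
# and the phase of every bin is rotated by pi/4.
complex_out = apply_gain(spec, [0.5 * cmath.exp(1j * math.pi / 4)] * n)

k = 1  # inspect the bin that carries the main sinusoid
phase_shift_real = cmath.phase(mag_only[k] / spec[k])
phase_shift_complex = cmath.phase(complex_out[k] / spec[k])
```

In a trained model the per-bin values would come from the network and vary over time and frequency; the constant gains here only make the magnitude and phase effects easy to read off.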

Real-time Speech Enhancement with Dynamic Attention Span

Time Domain Speech Enhancement Using Self-Attention-Based Subspace Projection

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Time-Variance Aware Real-Time Speech Enhancement

VSANet: Real-time Speech Enhancement Based on Voice Activity Detection and Causal Spatial Attention

Monaural Speech Enhancement with Deep Residual-Dense Lattice Network and Attention Mechanism in the Time Domain

Causal speech enhancement using dynamical-weighted loss and attention encoder-decoder recurrent neural network

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement

A Recursive Network with Dynamic Attention for Monaural Speech Enhancement

Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments

Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

Adaptive selection of local and non-local attention mechanisms for speech enhancement

Efficient Encoder-Decoder and Dual-Path Conformer for Comprehensive Feature Learning in Speech Enhancement

Compact Deep Neural Networks for Real-Time Speech Enhancement on Resource-Limited Devices

Selector-Enhancer: Learning Dynamic Selection of Local and Non-local Attention Operation for Speech Enhancement

Improving Monaural Speech Enhancement with Dynamic Scene Perception Module

Incorporation of a modified temporal cepstrum smoothing in both signal-to-noise ratio and speech presence probability estimation for speech enhancement

An Attention-Based Neural Network Approach For Single Channel Speech Enhancement

Embedding Encoder-Decoder with Attention Mechanism for Monaural Speech Enhancement

Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder