Abstract: Speech enhancement is an important method for improving speech quality and intelligibility in noisy environments. An effective speech enhancement model depends on precise modelling of the long-range dependencies of noisy speech, and several recent studies have examined ways to enhance speech by capturing long-term contextual information. The time–frequency (T–F) distribution of speech spectral components is also important for speech enhancement, but it is usually ignored in these studies. Multi-stage learning is an effective way to integrate several deep learning modules in a single system, and its benefit is that the optimization target can be updated iteratively, stage by stage. In this paper, speech enhancement is investigated using a multi-stage structure in which time–frequency attention (TFA) blocks are followed by stacks of squeezed temporal convolutional networks (S-TCN) with exponentially increasing dilation rates. A feature fusion block (FB) is inserted at the input of the later stages to reinject the original information, which reduces the risk of losing speech information in the early stages. The S-TCN blocks are responsible for temporal sequence modelling. The TFA block is a simple but effective network module that explicitly exploits position information to generate a 2D attention map characterizing the salient T–F distribution of speech, using two parallel branches: time-frame attention and frequency attention. Extensive experiments demonstrate that the proposed model consistently improves on existing baselines across two widely used objective metrics, PESQ and STOI. The evaluation results also show that the TFA module significantly improves the robustness of the system to noise.
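As a rough illustration of two ideas named in the abstract, the following is a minimal sketch and not the authors' implementation. It shows (a) a dilation schedule that grows exponentially and the receptive field of the resulting stack, and (b) a 2D attention map built from two parallel branches, one over time frames and one over frequency bins. Several details are assumptions made for the example: the base of 2, one dilated convolution per rate, mean pooling followed by a sigmoid in each branch, the outer product used to combine the branches, and all function names. The learned layers of a real TFA or S-TCN block are omitted.

```python
import math


def dilation_rates(num_layers, base=2):
    """Exponentially increasing dilation rates: 1, 2, 4, ... (base 2 is an assumption)."""
    return [base ** i for i in range(num_layers)]


def receptive_field(kernel_size, dilations):
    """Receptive field, in frames, of a stack with one dilated 1-D conv per rate."""
    return 1 + (kernel_size - 1) * sum(dilations)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tf_attention_map(spec):
    """Build a T x F attention map from parallel time-frame and frequency branches.

    spec is a list of T frames, each a list of F magnitudes.
    Each branch is only mean pooling plus a sigmoid here (assumption);
    a trained module would apply learned layers before the sigmoid.
    """
    num_frames, num_bins = len(spec), len(spec[0])
    # Time-frame branch: pool over frequency, giving one weight per frame.
    time_att = [sigmoid(sum(frame) / num_bins) for frame in spec]
    # Frequency branch: pool over time, giving one weight per frequency bin.
    freq_att = [
        sigmoid(sum(spec[t][f] for t in range(num_frames)) / num_frames)
        for f in range(num_bins)
    ]
    # Combine the two 1-D vectors into a 2-D map via an outer product (assumption).
    return [[ta * fa for fa in freq_att] for ta in time_att]


def apply_attention(spec, att):
    """Re-weight every T-F unit of the input by its attention value."""
    return [[s * a for s, a in zip(s_row, a_row)] for s_row, a_row in zip(spec, att)]


if __name__ == "__main__":
    rates = dilation_rates(6)
    print(rates, receptive_field(3, rates))
    toy_spec = [[0.0, 1.0, 2.0], [4.0, 0.0, 0.0]]  # 2 frames x 3 bins, toy values
    print(apply_attention(toy_spec, tf_attention_map(toy_spec)))
```

Under these assumptions, six layers with kernel size 3 cover 127 frames, which shows how exponential dilation reaches long-range context with few layers, while the attention map assigns a separate weight to every T–F unit.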

Efficient Trainable Front-Ends for Neural Speech Enhancement

A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech

Trainable Adaptive Window Switching for Speech Enhancement

Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling

Trainable Frontend For Robust and Far-Field Keyword Spotting

FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs

On the Importance of Neural Wiener Filter for Resource Efficient Multichannel Speech Enhancement

Accelerator-Aware Training for Transducer-Based Speech Recognition

Ultra-Low Latency Speech Enhancement - A Comprehensive Study

Multi-stage Progressive Learning-Based Speech Enhancement Using Time–Frequency Attentive Squeezed Temporal Convolutional Networks

Convolutional gated recurrent unit networks based real-time monaural speech enhancement

Deep learning-based speaker-adaptive postfiltering with limited adaptation data for embedded text-to-speech synthesis systems

Fast and accurate factorized neural transducer for text adaption of end-to-end speech recognition models

From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers

Decoupled Spatial and Temporal Processing for Resource Efficient Multichannel Speech Enhancement

FastSpeech: Fast, Robust and Controllable Text to Speech

DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement

Efficient Monaural Speech Enhancement using Spectrum Attention Fusion

HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

Towards Ultra-Low-Power Neuromorphic Speech Enhancement with Spiking-FullSubNet

Biomimetic Frontend for Differentiable Audio Processing