Abstract: To address the monaural speech enhancement problem, numerous studies have enhanced speech via operations either in the time domain, on the inner domain learned from the speech mixture, or in the time-frequency domain, on fixed full-band short-time Fourier transform (STFT) spectrograms. Very recently, a few studies on sub-band based speech enhancement have been proposed. By enhancing speech via operations on sub-band spectrograms, those studies demonstrated competitive performance on the DNS2020 benchmark dataset. Although attractive, this new research direction has not been fully explored and there is still room for improvement. In this study, we therefore delve into this research direction and propose a sub-band based speech enhancement system with perceptually-motivated optimization and dual transformations, called PT-FSE. Specifically, our proposed PT-FSE model improves its backbone, a full-band and sub-band fusion model, in three ways. First, we design a frequency transformation module that aims to strengthen the global frequency correlation. Second, a temporal transformation is introduced to capture long-range temporal contexts. Lastly, a novel loss that leverages properties of human auditory perception is proposed to help the model focus on low-frequency enhancement. To validate the effectiveness of our proposed model, extensive experiments are conducted on the DNS2020 dataset. Experimental results show that our PT-FSE system not only achieves substantial improvements over its backbone, but also outperforms the current state of the art (SOTA) while being 27% smaller than the SOTA model. With an average NB-PESQ of 3.57 on the benchmark dataset, our system offers the best speech enhancement results reported to date.

Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition

End-to-end Spoofing Speech Detection and Knowledge Distillation under Noisy Conditions

Bridging the Gap Between Monaural Speech Enhancement and Recognition With Distortion-Independent Acoustic Modeling

Overlapped speech recognition from a jointly learned multi-channel neural speech extraction and representation

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Target Speaker Extraction for Overlapped Multi-Talker Speaker Verification

Speech and Noise Dual-Stream Spectrogram Refine Network with Speech Distortion Loss for Robust Speech Recognition

Optimizing Audio-Visual Speech Enhancement Using Multi-Level Distortion Measures for Audio-Visual Speech Recognition

Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement

Masks Fusion with Multi-Target Learning For Speech Enhancement

Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

A Time Domain Progressive Learning Approach with SNR Constriction for Single-Channel Speech Enhancement and Recognition

Correlated Multi-Level Speech Enhancement for Robust Real-World ASR Applications Using Mask-Waveform-Feature Optimization

Uformer: A Unet Based Dilated Complex & Real Dual-Path Conformer Network for Simultaneous Speech Enhancement and Dereverberation

Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer

CompNet: Complementary Network for Single-Channel Speech Enhancement

Multi-Loss Convolutional Network with Time-Frequency Attention for Speech Enhancement

Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations

A General Unfolding Speech Enhancement Method Motivated by Taylor's Theorem

Speaker Conditioning of Acoustic Models Using Affine Transformation for Multi-Speaker Speech Recognition

A Hybrid Deep-Learning Approach for Single Channel HF-SSB Speech Enhancement