Abstract: To address the monaural speech enhancement problem, numerous studies enhance speech either in the time domain, operating on an inner domain learned from the speech mixture, or in the time-frequency domain, operating on fixed full-band short-time Fourier transform (STFT) spectrograms. Very recently, a few studies on sub-band based speech enhancement have been proposed. By operating on sub-band spectrograms, those studies demonstrated competitive performance on the DNS2020 benchmark dataset. Although attractive, this new research direction has not been fully explored and there is still room for improvement. In this study, we therefore delve into this direction and propose a sub-band based speech enhancement system with perceptually-motivated optimization and dual transformations, called PT-FSE. Specifically, our proposed PT-FSE model improves its backbone, a full-band and sub-band fusion model, in three ways. First, we design a frequency transformation module that aims to strengthen the global frequency correlation. Second, we introduce a temporal transformation to capture long-range temporal contexts. Lastly, we propose a novel loss that leverages properties of human auditory perception to make the model focus on low-frequency enhancement. To validate the effectiveness of our proposed model, we conduct extensive experiments on the DNS2020 dataset. Experimental results show that our PT-FSE system not only achieves substantial improvements over its backbone but also outperforms the current state of the art (SOTA) while being 27% smaller than the SOTA model. With an average NB-PESQ of 3.57 on the benchmark dataset, our system offers the best speech enhancement results reported to date.
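
The abstract does not give the form of the perceptually-motivated loss. As an illustration only, the sketch below shows one way a loss could emphasize low frequencies: a mean squared error on magnitude spectrograms, weighted per frequency bin. The function names, the linear weight schedule, and the `floor` value are all hypothetical and are not taken from PT-FSE.

```python
import numpy as np

def low_freq_weights(num_bins, floor=0.2):
    """Hypothetical weights: 1.0 at the lowest bin, decaying linearly to `floor`."""
    return np.linspace(1.0, floor, num_bins)

def weighted_spectral_loss(est_mag, ref_mag, weights):
    """Mean squared magnitude error, weighted per frequency bin.

    est_mag, ref_mag: arrays of shape (num_bins, num_frames).
    weights: array of shape (num_bins,), broadcast across frames.
    """
    sq_err = (est_mag - ref_mag) ** 2
    return float(np.mean(weights[:, None] * sq_err))

# Toy spectrograms (assumed shapes, not real speech data).
num_bins, num_frames = 8, 4
ref = np.ones((num_bins, num_frames))
w = low_freq_weights(num_bins)

# Same-size error placed in the lowest bin versus the highest bin.
low_err = ref.copy()
low_err[0, :] += 0.5
high_err = ref.copy()
high_err[-1, :] += 0.5

loss_low = weighted_spectral_loss(low_err, ref, w)
loss_high = weighted_spectral_loss(high_err, ref, w)
print(loss_low > loss_high)
```

Under this weighting, an error of the same size costs more in a low-frequency bin than in a high-frequency one, which is the general effect the abstract describes. A perceptually grounded variant would presumably derive the weights from an auditory scale rather than a linear ramp.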

Learnable Spectral Dimension Compression Mapping for Full-Band Speech Enhancement

A two-stage full-band speech enhancement model with effective spectral compression mapping

Convolutional Recurrent MetriCGAN with Spectral Dimension Compression for Full-Band Speech Enhancement

Harmonic enhancement using learnable comb filter for light-weight full-band speech enhancement model

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

S-DCCRN: Super Wide Band DCCRN with Learnable Complex Feature for Speech Enhancement

Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations

A Lightweight and Real-Time Binaural Speech Enhancement Model with Spatial Cues Preservation

Sub-band Knowledge Distillation Framework for Speech Enhancement

Heterogeneous Space Fusion and Dual-Dimension Attention: A New Paradigm for Speech Enhancement

Learning-based personal speech enhancement for teleconferencing by exploiting spatial-spectral features

SE Territory: Monaural Speech Enhancement Meets the Fixed Virtual Perceptual Space Mapping

Adaptive two-channel speech enhancement algorithm based on the modulation spectrum

Efficient Multi-Channel Speech Enhancement with Spherical Harmonics Injection for Directional Encoding

Efficient Encoder-Decoder and Dual-Path Conformer for Comprehensive Feature Learning in Speech Enhancement

A Comprehensive Method to Improve Loudness Compensation and High-Frequency Speech Intelligibility for Digital Hearing Aids

Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder

Efficient Monaural Speech Enhancement using Spectrum Attention Fusion

Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer

Improving Monaural Speech Enhancement by Mapping to Fixed Simulation Space With Knowledge Distillation

Speech Enhancement Using U-Net with Compressed Sensing