Abstract: To address the monaural speech enhancement problem, numerous studies have enhanced speech either in the time domain, operating on an inner domain learned from the speech mixture, or in the time-frequency domain, operating on fixed full-band short-time Fourier transform (STFT) spectrograms. Very recently, a few studies on sub-band based speech enhancement have been proposed. By enhancing speech via operations on sub-band spectrograms, those studies demonstrated competitive performance on the DNS2020 benchmark dataset. Although attractive, this new research direction has not been fully explored, and there is still room for improvement. In this study, we therefore delve into this research direction and propose a sub-band based speech enhancement system with perceptually-motivated optimization and dual transformations, called PT-FSE. Specifically, our proposed PT-FSE model improves its backbone, a full-band and sub-band fusion model, in three ways. First, we design a frequency transformation module that aims to strengthen the global frequency correlation. Second, we introduce a temporal transformation to capture long-range temporal contexts. Lastly, we propose a novel loss that leverages properties of human auditory perception to make the model focus on low-frequency enhancement. To validate the effectiveness of our proposed model, we conduct extensive experiments on the DNS2020 dataset. Experimental results show that our PT-FSE system not only achieves substantial improvements over its backbone, but also outperforms the current state-of-the-art (SOTA) while being 27% smaller than the SOTA. With an average NB-PESQ of 3.57 on the benchmark dataset, our system offers the best speech enhancement results reported to date.
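
To illustrate the general idea of a loss that emphasizes low frequencies, here is a minimal sketch. It is not the loss proposed in the paper, whose formulation the abstract does not give. The linear weighting scheme, the function names `low_freq_weights` and `weighted_spectral_loss`, and the `floor` parameter are all assumptions made for illustration only.

```python
import numpy as np

def low_freq_weights(n_bins, floor=0.2):
    """Hypothetical per-bin weights: 1.0 at the lowest frequency bin,
    decaying linearly to `floor` at the highest bin (an assumption,
    not the paper's perceptual weighting)."""
    return np.linspace(1.0, floor, n_bins)

def weighted_spectral_loss(est_mag, ref_mag, weights):
    """Frequency-weighted mean squared error between two magnitude
    spectrograms of shape (n_bins, n_frames)."""
    err = (est_mag - ref_mag) ** 2
    # Broadcast the per-bin weights across all time frames, then normalize
    # by the total weight so the result is a weighted average.
    total_weight = np.sum(weights) * err.shape[1]
    return float(np.sum(weights[:, None] * err) / total_weight)

# Usage: the same error magnitude costs more in a low bin than in a high bin.
n_bins, n_frames = 4, 3
ref = np.zeros((n_bins, n_frames))
w = low_freq_weights(n_bins)

est_low = ref.copy()
est_low[0, :] = 1.0    # error only in the lowest frequency bin
est_high = ref.copy()
est_high[-1, :] = 1.0  # error only in the highest frequency bin

loss_low = weighted_spectral_loss(est_low, ref, w)
loss_high = weighted_spectral_loss(est_high, ref, w)
```

With these weights, an error in the lowest bin is penalized more heavily than an equal error in the highest bin, which is the behavior a low-frequency-focused objective aims for. A perceptually grounded loss would typically derive the weights from an auditory scale rather than a linear ramp.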

Boosting the Performance of SpEx+ by Attention and Contextual Mechanism

MC-SpEx: Towards Effective Speaker Extraction with Multi-Scale Interfusion and Conditional Speaker Modulation

Target Speaker Extraction by Directly Exploiting Contextual Information in the Time-Frequency Domain

Conditional Diffusion Model for Target Speaker Extraction

X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion

Improving Target Speaker Extraction with Sparse LDA-transformed Speaker Embeddings

X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

NeuroSpex: Neuro-Guided Speaker Extraction with Cross-Modal Attention

Target Speaker Verification with Selective Auditory Attention for Single and Multi-talker Speech

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Complex Recordings

Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations

Multi-Level Speaker Representation for Target Speaker Extraction

Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis

Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information

Selector-Enhancer: Learning Dynamic Selection of Local and Non-local Attention Operation for Speech Enhancement

Selective Listening by Synchronizing Speech with Lips

Binaural Selective Attention Model for Target Speaker Extraction

Gated Cross-Attention for Universal Speaker Extraction: Toward Real-World Applications