Abstract: Background noise and competing talkers are among the main challenges that cochlear implant (CI) users face when understanding speech in naturalistic spaces. These external factors distort the time-frequency (T-F) content of speech signals, including both the magnitude spectrum and the phase. While most existing speech enhancement (SE) solutions focus solely on enhancing the magnitude response, recent research highlights the importance of phase in perceptual speech quality. Motivated by multi-task machine learning, this study proposes a deep complex convolution transformer network (DCCTN) for complex spectral mapping, which simultaneously enhances the magnitude and phase responses of speech. The proposed network leverages a complex-valued U-Net structure with a transformer within the bottleneck layer to capture sufficient low-level detail of contextual information in the T-F domain. To capture the harmonic correlation in speech, DCCTN incorporates a frequency transformation block in the encoder structure of the U-Net architecture. The DCCTN learns a complex transformation matrix to accurately recover speech in the T-F domain from a noisy input spectrogram. Experimental results demonstrate that the proposed DCCTN outperforms existing models such as the convolutional recurrent network (CRN), deep complex convolutional recurrent network (DCCRN), and gated convolutional recurrent network (GCRN) in terms of objective speech intelligibility and quality, for both seen and unseen noise conditions. To evaluate the effectiveness of the proposed SE solution, a formal listener evaluation involving four CI recipients was conducted. Results indicate a significant improvement in speech intelligibility for CI recipients in noisy environments. Additionally, DCCTN demonstrates the capability to suppress highly non-stationary noise without introducing the musical artifacts commonly observed in conventional SE methods.
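The difference between magnitude-only enhancement and joint magnitude-and-phase enhancement can be illustrated with a small sketch. It is not the DCCTN itself: an oracle complex ratio mask stands in for the network output, the T-F bins are random toy values rather than a real spectrogram, and the names `apply_complex_mask`, `oracle_masks`, and `mag_only_est` are hypothetical, introduced only for this example.

```python
import cmath
import random


def apply_complex_mask(noisy_bin, mask_real, mask_imag):
    """Multiply one noisy T-F bin by a complex mask, written in real arithmetic.

    (Yr + jYi) * (Mr + jMi) = (Yr*Mr - Yi*Mi) + j(Yr*Mi + Yi*Mr)
    """
    yr, yi = noisy_bin.real, noisy_bin.imag
    return complex(yr * mask_real - yi * mask_imag,
                   yr * mask_imag + yi * mask_real)


random.seed(0)

# Toy complex T-F bins (assumption: random values stand in for an STFT).
clean = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(8)]
noise = [0.5 * complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(8)]
noisy = [s + n for s, n in zip(clean, noise)]

# Oracle complex mask S / Y: a stand-in for what a trained network would estimate.
oracle_masks = [s / y for s, y in zip(clean, noisy)]
complex_est = [apply_complex_mask(y, m.real, m.imag)
               for y, m in zip(noisy, oracle_masks)]

# Magnitude-only enhancement: ideal clean magnitude combined with the noisy phase.
mag_only_est = [cmath.rect(abs(s), cmath.phase(y)) for s, y in zip(clean, noisy)]

# Largest per-bin error of each estimate against the clean bins.
complex_err = max(abs(e - s) for e, s in zip(complex_est, clean))
mag_only_err = max(abs(e - s) for e, s in zip(mag_only_est, clean))

print("complex-mask error:  ", complex_err)
print("magnitude-only error:", mag_only_err)
```

Even with a perfect magnitude estimate, the magnitude-only result keeps the noisy phase and so leaves a residual error, whereas the complex estimate recovers the clean bins up to floating-point rounding. This is the motivation the abstract gives for enhancing magnitude and phase together.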

Research on Speech Enhancement based on Full-scale Connection

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

FB-MSTCN: A Full-Band Single-Channel Speech Enhancement Method Based on Multi-Scale Temporal Convolutional Network

Monaural Speech Enhancement with Deep Residual-Dense Lattice Network and Attention Mechanism in the Time Domain

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Improving the Intelligibility of Electric and Acoustic Stimulation Speech Using Fully Convolutional Networks Based Speech Enhancement

Speech Enhancement for Cochlear Implant Recipients using Deep Complex Convolution Transformer with Frequency Transformation

Improving the Intelligibility of Speech for Simulated Electric and Acoustic Stimulation Using Fully Convolutional Neural Networks

Monaural Speech Enhancement Based on Spectrogram Decomposition for Convolutional Neural Network-sensitive Feature Extraction

Efficient Encoder-Decoder and Dual-Path Conformer for Comprehensive Feature Learning in Speech Enhancement

An NMF-based MMSE Approach for Single Channel Speech Enhancement Using Densely Connected Convolutional Network

A two-stage full-band speech enhancement model with effective spectral compression mapping

Multi-stage Strength Estimation Network with Cross Attention for Single Channel Speech Enhancement

Speech Enhancement with Fullband-Subband Cross-Attention Network

Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments

Single-Channel Speech Enhancement Algorithm Based on ME-MGCRN in Low Signal-to-Noise Scenario

End-to-End Speech Enhancement Using Fully Convolutional Networks with Skip Connections

Cross-domain Single-channel Speech Enhancement Model with Bi-projection Fusion Module for Noise-robust ASR

FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement

Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations

MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra