Abstract: Background noise and competing talkers are among the main obstacles to speech understanding for cochlear implant (CI) users in naturalistic spaces. These external factors distort the time-frequency (T-F) content of speech signals, including both the magnitude spectrum and the phase. While most existing speech enhancement (SE) solutions focus solely on enhancing the magnitude response, recent research highlights the importance of phase in perceptual speech quality. Motivated by multi-task machine learning, this study proposes a deep complex convolution transformer network (DCCTN) for complex spectral mapping, which simultaneously enhances the magnitude and phase responses of speech. The proposed network leverages a complex-valued U-Net structure with a transformer within the bottleneck layer to capture sufficient low-level detail of contextual information in the T-F domain. To capture the harmonic correlation in speech, DCCTN incorporates a frequency transformation block in the encoder structure of the U-Net architecture. The DCCTN learns a complex transformation matrix to accurately recover speech in the T-F domain from a noisy input spectrogram. Experimental results demonstrate that the proposed DCCTN outperforms existing models such as the convolutional recurrent network (CRN), deep complex convolutional recurrent network (DCCRN), and gated convolutional recurrent network (GCRN) in terms of objective speech intelligibility and quality, for both seen and unseen noise conditions. To evaluate the effectiveness of the proposed SE solution, a formal listener evaluation involving four CI recipients was conducted. Results indicate a significant improvement in speech intelligibility performance for CI recipients in noisy environments. Additionally, DCCTN demonstrates the capability to suppress highly non-stationary noise without introducing the musical artifacts commonly observed in conventional SE methods.
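As a rough illustration of why operating on complex T-F values affects magnitude and phase together, the following NumPy sketch applies a complex-valued mask to a toy noisy spectrogram. It is not the DCCTN architecture: the abstract does not state how the learned complex transformation is applied, so the multiplicative mask, the oracle mask formula, and the names `apply_complex_mask` and `ideal_complex_mask` are assumptions made only for this example.

```python
import numpy as np


def apply_complex_mask(noisy_spec, mask):
    """Element-wise complex product: scales magnitude and rotates phase at once."""
    return mask * noisy_spec


def ideal_complex_mask(clean_spec, noisy_spec, eps=1e-12):
    """Oracle mask M such that M * noisy ~= clean (illustrative assumption only)."""
    return clean_spec * np.conj(noisy_spec) / (np.abs(noisy_spec) ** 2 + eps)


# Toy complex spectrograms with shape (frequency bins, frames); random data, not speech.
rng = np.random.default_rng(0)
shape = (5, 4)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
noisy = clean + noise

mask = ideal_complex_mask(clean, noisy)
enhanced = apply_complex_mask(noisy, mask)

# A magnitude-only method would keep the noisy phase unchanged;
# a complex mask changes it by the angle of the mask.
phase_shift = np.angle(mask)
```

In a learned system the mask (or the mapped spectrogram itself) would come from the network rather than from the clean reference, which is unavailable at inference time.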

TFCN: Temporal-Frequential Convolutional Network for Single-Channel Speech Enhancement

Inter-channel Conv-TasNet for multichannel speech enhancement

Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement

Multi-stage Progressive Learning-Based Speech Enhancement Using Time–Frequency Attentive Squeezed Temporal Convolutional Networks

Tensor-to-Vector Regression for Multi-channel Speech Enhancement based on Tensor-Train Network

2D-to-2D Mask Estimation for Speech Enhancement Based on Fully Convolutional Neural Network

Speech Enhancement Algorithm Based on a Convolutional Neural Network Reconstruction of the Temporal Envelope of Speech in Noisy Environments

Speech Enhancement Using Multi-Stage Self-Attentive Temporal Convolutional Networks

Speech Enhancement for Cochlear Implant Recipients using Deep Complex Convolution Transformer with Frequency Transformation

Multi-Loss Convolutional Network with Time-Frequency Attention for Speech Enhancement

A Multi-scale Subconvolutional U-Net with Time-Frequency Attention Mechanism for Single Channel Speech Enhancement

Time domain speech enhancement with CNN and time-attention transformer

Improving the Intelligibility of Speech for Simulated Electric and Acoustic Stimulation Using Fully Convolutional Neural Networks

Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning

Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement

Residual Convolutional CTC Networks for Automatic Speech Recognition

Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement

Multichannel Speech Enhancement without Beamforming

Improving the Intelligibility of Electric and Acoustic Stimulation Speech Using Fully Convolutional Networks Based Speech Enhancement