Abstract: Background noise and competing talkers are among the main challenges that cochlear implant (CI) users face in understanding speech in naturalistic spaces. These external factors distort the time-frequency (T-F) content of speech signals, including both the magnitude spectrum and the phase. While most existing speech enhancement (SE) solutions focus solely on enhancing the magnitude response, recent research highlights the importance of phase in perceptual speech quality. Motivated by multi-task machine learning, this study proposes a deep complex convolution transformer network (DCCTN) for complex spectral mapping, which simultaneously enhances the magnitude and phase responses of speech. The proposed network leverages a complex-valued U-Net structure with a transformer within the bottleneck layer to capture sufficient low-level detail of contextual information in the T-F domain. To capture the harmonic correlation in speech, DCCTN incorporates a frequency transformation block in the encoder structure of the U-Net architecture. The DCCTN learns a complex transformation matrix to accurately recover speech in the T-F domain from a noisy input spectrogram. Experimental results demonstrate that the proposed DCCTN outperforms existing models such as the convolutional recurrent network (CRN), deep complex convolutional recurrent network (DCCRN), and gated convolutional recurrent network (GCRN) in terms of objective speech intelligibility and quality, for both seen and unseen noise conditions. To evaluate the effectiveness of the proposed SE solution, a formal listener evaluation involving four CI recipients was conducted. Results indicate a significant improvement in speech intelligibility for CI recipients in noisy environments. Additionally, DCCTN demonstrates the capability to suppress highly non-stationary noise without introducing the musical artifacts commonly observed in conventional SE methods.
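To illustrate why working with complex T-F values can alter both magnitude and phase, the following is a minimal sketch, not the DCCTN itself. It assumes an element-wise complex weight is applied to each noisy T-F bin; the abstract does not specify how the learned complex transformation is applied, and the function name `apply_complex_weights` and all numeric values are hypothetical.

```python
import cmath


def apply_complex_weights(noisy_bins, weights):
    """Multiply each noisy T-F bin by a weight, element-wise (illustrative assumption)."""
    if len(noisy_bins) != len(weights):
        raise ValueError("noisy_bins and weights must have the same length")
    return [w * y for w, y in zip(weights, noisy_bins)]


# Hypothetical complex values for three T-F bins (not taken from the paper).
noisy = [complex(1.0, 1.0), complex(0.0, 2.0), complex(-3.0, 0.0)]

# A real-valued weight only rescales the magnitude of each bin.
real_weights = [0.5, 0.5, 0.5]
# A complex weight rescales the magnitude and also rotates the phase.
complex_weights = [cmath.rect(0.5, -cmath.pi / 4)] * 3

mag_only = apply_complex_weights(noisy, real_weights)
mag_and_phase = apply_complex_weights(noisy, complex_weights)

for y, a, b in zip(noisy, mag_only, mag_and_phase):
    print(
        f"noisy |Y|={abs(y):.3f} phase={cmath.phase(y):+.3f} | "
        f"real weight |S|={abs(a):.3f} phase={cmath.phase(a):+.3f} | "
        f"complex weight |S|={abs(b):.3f} phase={cmath.phase(b):+.3f}"
    )
```

With the real weights the phase of each bin is unchanged, whereas the complex weights shift it by the weight's angle. This is the sense in which a magnitude-only method leaves the noisy phase untouched.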

Speech-Declipping Transformer with Complex Spectrogram and Learnable Temporal Features

DDD: A Perceptually Superior Low-Response-Time DNN-based Declipper

DCHT: Deep Complex Hybrid Transformer for Speech Enhancement

DPATD: Dual-Phase Audio Transformer for Denoising

SETransformer: Speech Enhancement Transformer

Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation

Multi-Task Transformer with Input Feature Reconstruction for Dysarthric Speech Recognition

Efficient Transformer for Direct Speech Translation

Ultra Fast Speech Separation Model with Teacher Student Learning

Lightweight Causal Transformer with Local Self-Attention for Real-Time Speech Enhancement

Time domain speech enhancement with CNN and time-attention transformer

Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation

Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement

Uformer: A Unet Based Dilated Complex & Real Dual-Path Conformer Network for Simultaneous Speech Enhancement and Dereverberation

ClST: A Convolutional Transformer Framework for Automatic Modulation Recognition by Knowledge Distillation

Transformer with Bidirectional Decoder for Speech Recognition

Restoring degraded speech via a modified diffusion model

Lightweight Dynamic Sparse Transformer for Monaural Speech Enhancement

Speech Enhancement for Cochlear Implant Recipients using Deep Complex Convolution Transformer with Frequency Transformation