Abstract: The "cocktail party problem", the challenge of isolating individual speech signals from a noisy mixture, has traditionally been addressed with statistical methods. Deep neural networks (DNNs) have since emerged as stronger solutions: they learn the complex relationships between a mixed audio signal and its constituent speech sources, which lets them separate overlapping speech in challenging acoustic environments. Recent speech separation systems have drawn inspiration from the brain's hierarchical processing of sensory information and incorporate top-down attention mechanisms. The top-down attention network (TDANet) uses an encoder–decoder architecture in which a global attention (GA) module derives attention signals from multi-scale input features and uses them to modulate features at each scale. Local attention (LA) layers then convert the modulated signals into high-resolution auditory features. In this study, we propose two modifications to TDANet. First, we replace the fully trainable convolutional encoder with a deterministic, hand-crafted multi-phase gammatone filterbank (MP-GTF), which mimics human hearing. In our experiments, this substitution yielded performance comparable to, or slightly better than, the original TDANet with a trainable encoder. Second, we replace the single multi-head self-attention (MHSA) layer in the global attention module with a transformer encoder block consisting of multiple MHSA layers. To limit GPU memory usage, we introduce a parameter-sharing mechanism, dubbed "Reverse Cycle", across the layers of the transformer-based encoder. Our experiments indicate that with these modifications TDANet achieves separation performance competitive with state-of-the-art techniques while maintaining superior computational efficiency.
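
To make the idea of a deterministic, hand-crafted encoder concrete, the following is a minimal sketch of how a bank of gammatone filters with several phase offsets per center frequency could be generated. It is not the configuration used in this work: the filter order (4), the ERB bandwidth formula, the evenly spaced phases over [0, pi), the peak normalization, and all function names are illustrative assumptions.

```python
import math


def erb(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


def gammatone(fc_hz, phase, length, sample_rate, order=4):
    """Sampled gammatone impulse response t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phase).

    Assumption: order 4 and b = 1.019 * ERB(fc), the classic auditory-model
    choice; the paper's MP-GTF may use different values.
    """
    b = 1.019 * erb(fc_hz)
    taps = []
    for k in range(length):
        t = k / sample_rate
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        taps.append(envelope * math.cos(2.0 * math.pi * fc_hz * t + phase))
    # Scale so the largest absolute tap is 1 (an illustrative normalization).
    peak = max(abs(x) for x in taps) or 1.0
    return [x / peak for x in taps]


def multi_phase_bank(center_freqs, n_phases, length, sample_rate):
    """Build n_phases fixed filters per center frequency.

    Assumption: phases are spaced evenly over [0, pi).
    """
    bank = []
    for fc in center_freqs:
        for p in range(n_phases):
            phase = math.pi * p / n_phases
            bank.append(gammatone(fc, phase, length, sample_rate))
    return bank


if __name__ == "__main__":
    # Hypothetical settings: 3 center frequencies, 4 phases, 16 taps at 8 kHz.
    filters = multi_phase_bank([500.0, 1000.0, 2000.0], 4, 16, 8000)
    print(len(filters), "filters of", len(filters[0]), "taps each")
```

Because every tap is computed from a closed-form expression, such an encoder has no trainable parameters; in a separation network the resulting filters would serve as the fixed kernels of a 1-D convolution over the waveform.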

L-Tcn with the Help of Attention Weighting for the Speech Separation Task in the Reverberation Environment

Two-stage Model and Optimal SI-SNR for Monaural Multi-Speaker Speech Separation in Noisy Environment

Time-Domain Mapping with Convolution Networks for End-to-End Monaural Speech Separation

On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments

Cracking the cocktail party problem by multi-beam deep attractor network

Adaptive Beamforming Based on Interference-Plus-Noise Covariance Matrix Reconstruction for Speech Separation

Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation

Efficient, Cluster-Informed, Deep Speech Separation with Cross-Cluster Information in AD-HOC Wireless Acoustic Sensor Networks

Improving Top-Down Attention Network in Speech Separation by Employing Hand-Crafted Filterbank and Parameter-Sharing Transformer

TFCnet: Time-Frequency Domain Corrector for Speech Separation

Target Speaker Extraction Using Attention-Enhanced Temporal Convolutional Network

Deformable Temporal Convolutional Networks for Monaural Noisy Reverberant Speech Separation

Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation

Speaker and Direction Inferred Dual-channel Speech Separation

Utterance Weighted Multi-Dilation Temporal Convolutional Networks for Monaural Speech Dereverberation

Joint Deep Neural Network for Single-Channel Speech Separation on Masking-Based Training Targets

Gated Recurrent Fusion of Spatial and Spectral Features for Multi-Channel Speech Separation with Deep Embedding Representations.

An Efficient Speech Separation Network Based on Recurrent Fusion Dilated Convolution and Channel Attention

A Spectral-change-aware Loss Function for DNN-based Speech Separation.

A comprehensive study of speech separation: spectrogram vs waveform separation