Abstract: The "cocktail party problem", the challenge of isolating individual speech signals from a noisy mixture, has traditionally been addressed with statistical methods. Deep neural networks (DNNs), which can learn complex mappings between a mixed audio signal and its constituent speech sources, now outperform these methods and can separate overlapping speech in challenging acoustic environments. Recent speech separation systems have drawn inspiration from the brain's hierarchical processing of sensory information by incorporating top-down attention mechanisms. The top-down attention network (TDANet) employs an encoder–decoder architecture in which top-down attention improves feature modulation and separation performance. A global attention (GA) module extracts attention signals from multi-scale input features and uses them to modulate features across the different scales. Local attention (LA) layers then convert these modulated signals into high-resolution auditory features. In this study, we propose two key modifications to TDANet. First, we substitute the fully trainable convolutional encoder with a deterministic, hand-crafted multi-phase gammatone filterbank (MP-GTF), which mimics human hearing. Experimental results showed that this substitution yielded performance comparable to, or even slightly better than, the original TDANet with a trainable encoder. Second, we replace the single multi-head self-attention (MHSA) layer in the global attention module with a transformer encoder block consisting of multiple MHSA layers. To reduce GPU memory usage, we introduce a parameter-sharing mechanism, dubbed "Reverse Cycle", across the layers of the transformer-based encoder. Our experimental findings indicated that these modifications enabled TDANet to achieve separation performance competitive with state-of-the-art techniques while maintaining superior computational efficiency.
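
To make the idea of a deterministic, hand-crafted encoder concrete, the following is a minimal NumPy sketch of a multi-phase gammatone-style filterbank. It is not the configuration used in this work: the function names (`erb`, `gammatone_kernel`, `mp_gtf_bank`, `encode`), the filter order, the geometric spacing of centre frequencies, the bandwidth factor 1.019, the phase grid, and the ReLU after filtering are all illustrative assumptions. Only the standard gammatone impulse-response form and the Glasberg and Moore ERB approximation are taken as given.

```python
import numpy as np


def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore approximation)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def gammatone_kernel(fc, phase, fs, length, order=2):
    """Sampled gammatone impulse response with unit L2 norm.

    g(t) = t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phase)
    The bandwidth b = 1.019 * ERB(fc) is a common choice, assumed here.
    """
    t = np.arange(length) / fs
    b = 1.019 * erb(fc)
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phase)
    return g / np.linalg.norm(g)


def mp_gtf_bank(n_freqs=8, n_phases=4, fs=8000, length=16):
    """Stack kernels for every (centre frequency, phase) pair.

    Centre frequencies are geometrically spaced here for simplicity (assumption);
    phases are spread evenly over [0, pi) (assumption).
    Returns an array of shape (n_freqs * n_phases, length).
    """
    freqs = np.geomspace(100.0, 0.45 * fs, n_freqs)
    phases = np.arange(n_phases) * np.pi / n_phases
    return np.stack([gammatone_kernel(fc, ph, fs, length)
                     for fc in freqs for ph in phases])


def encode(x, bank, stride=8):
    """Fixed (non-trainable) encoder: strided filtering followed by ReLU."""
    length = bank.shape[1]
    frames = np.lib.stride_tricks.sliding_window_view(x, length)[::stride]
    return np.maximum(frames @ bank.T, 0.0)  # shape (n_frames, n_filters)
```

Because every kernel is computed from a closed-form expression, an encoder of this kind has no trainable parameters. Using several phase shifts per centre frequency lets the non-negative output retain information that a single phase would lose after rectification.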

Multi-layer Attention Mechanism Based Speech Separation Model

Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation

A convolutional recurrent neural network with attention framework for speech separation in monaural recordings

Cracking the cocktail party problem by multi-beam deep attractor network

Speaker and Direction Inferred Dual-channel Speech Separation

Two-stage Model and Optimal SI-SNR for Monaural Multi-Speaker Speech Separation in Noisy Environment

A Speaker-Dependent Approach to Separation of Far-Field Multi-Talker Microphone Array Speech for Front-End Processing in the CHiME-5 Challenge

Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning

A Multi-channel Speech Separation System for Unknown Number of Multiple Speakers

Separate-to-Recognize: Joint Multi-target Speech Separation and Speech Recognition for Speaker-attributed ASR

Multi-layer encoder-decoder time-domain single channel speech separation

Speaker Verification based on Single Channel Speech Separation

Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method

Multi-layer Attention Mechanism for Speech Keyword Recognition

A Multi-Stage Triple-Path Method for Speech Separation in Noisy and Reverberant Environments

A Neural State-Space Modeling Approach to Efficient Speech Separation

Speech separation of a target speaker based on deep neural networks

A Neural State-Space Model Approach to Efficient Speech Separation

Improving Top-Down Attention Network in Speech Separation by Employing Hand-Crafted Filterbank and Parameter-Sharing Transformer

PGSS: Pitch-Guided Speech Separation

Rethinking the Separation Layers in Speech Separation Networks