Abstract: In speech separation, time-domain approaches have successfully replaced the time-frequency representation with a latent feature sequence from a learnable encoder. Conventionally, the feature is separated into speaker-specific ones only at the final stage of the network. Instead, we propose a more intuitive strategy that separates features earlier by expanding the feature sequence to the number of speakers as an extra dimension. To achieve this, we present an asymmetric strategy in which the encoder and decoder are partitioned to perform distinct processing in the separation task. The encoder analyzes features, and its output is split into as many sequences as there are speakers to be separated. The separated sequences are then reconstructed by a weight-shared decoder, which also performs cross-speaker processing. Without relying on speaker information, the weight-shared network in the decoder directly learns to discriminate features using a separation objective. In addition, to improve performance, traditional methods have extended the sequence length, leading to the adoption of dual-path models, which handle the much longer sequence by segmenting it into chunks. As an alternative, we introduce global and local Transformer blocks that directly handle long sequences more efficiently, without chunking or dual-path processing. Experimental results demonstrate that the asymmetric structure is effective and that the combination of the proposed global and local Transformer blocks can sufficiently replace the inter- and intra-chunk processing of the dual-path structure. Finally, the model combining both achieves state-of-the-art performance with much less computation on various benchmark datasets.
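
The early-split idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the per-speaker linear projection, the single-matrix decoder, and all names (`split_by_speaker`, `shared_decoder`, `apply_linear`) are assumptions chosen for illustration, and the cross-speaker processing and Transformer blocks mentioned in the abstract are omitted. It only shows the shape change from one feature sequence to one sequence per speaker, followed by a decoder whose weights are shared across speakers.

```python
import random


def apply_linear(weight, frame):
    """Multiply a (rows x cols) weight matrix by one feature frame."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weight]


def split_by_speaker(frames, speaker_weights):
    """Expand (frames, channels) features to (speakers, frames, channels).

    Assumption: one hypothetical linear projection per speaker.
    """
    return [[apply_linear(w, f) for f in frames] for w in speaker_weights]


def shared_decoder(separated, decoder_weight):
    """Run every speaker stream through the same decoder weights."""
    return [[apply_linear(decoder_weight, f) for f in stream]
            for stream in separated]


def random_matrix(rng, rows, cols):
    """Build a rows x cols matrix of Gaussian samples (toy parameters)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
num_speakers, channels, num_frames = 2, 4, 6

# Stand-in for the encoder output: one latent feature sequence.
frames = random_matrix(rng, num_frames, channels)
speaker_weights = [random_matrix(rng, channels, channels)
                   for _ in range(num_speakers)]
decoder_weight = random_matrix(rng, channels, channels)

separated = split_by_speaker(frames, speaker_weights)      # speaker axis added
reconstructed = shared_decoder(separated, decoder_weight)  # same weights per speaker
print(len(separated), len(separated[0]), len(separated[0][0]))
```

Because `decoder_weight` is the only decoder parameter, the parameter count in this sketch does not grow with the number of speakers; only the split step is speaker-specific.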

DE-DPCTnet: Deep Encoder Dual-path Convolutional Transformer Network for Multi-channel Speech Separation

DPTNet-based Beamforming for Speech Separation

Multi-Scale Feature Fusion Transformer Network for End-to-End Single Channel Speech Separation

Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation

Hybrid Attention Transformer Based on Dual-Path for Time-Domain Single-Channel Speech Separation

A New Neural Beamformer for Multi-channel Speech Separation

Parallel-Path Transformer Network for Time-Domain Monaural Speech Separation

DFBNet: Deep Neural Network Based Fixed Beamformer for Multi-channel Speech Separation

Iteratively Refined Multi-Channel Speech Separation

Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation

Single-Channel Speech Separation Focusing on Attention DE.

Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation

Dasformer: Deep Alternating Spectrogram Transformer For Multi/Single-Channel Speech Separation

Don’t Shoot Butterfly with Rifles: Multi-Channel Continuous Speech Separation with Early Exit Transformer

Dual-path Transformer Based Neural Beamformer for Target Speech Extraction

DPATD: Dual-Phase Audio Transformer for Denoising

Multi-layer encoder-decoder time-domain single channel speech separation

Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning

An Efficient Speech Separation Network Based on Recurrent Fusion Dilated Convolution and Channel Attention

Multi-Scale Group Transformer for Long Sequence Modeling in Speech Separation