Abstract: Speech separation is the task of extracting target speech while suppressing background interference. In applications such as video telephony, visual information about the target speaker is available and can be leveraged for multi-speaker speech separation. Most previous multi-speaker separation methods are based on convolutional or recurrent neural networks. Recently, Transformer-based Seq2Seq models have achieved state-of-the-art performance in various tasks, such as neural machine translation (NMT) and automatic speech recognition (ASR). The Transformer has shown an advantage in modeling audio-visual temporal context, since its multi-head attention blocks explicitly assign attention weights. Moreover, the Transformer has no recurrent sub-networks and therefore supports parallel computation over the sequence. In this paper, we propose a novel speaker-independent audio-visual speech separation method based on the Transformer, which can be flexibly applied to an unknown number of speakers with unknown identities. The model receives both audio and visual streams, namely the noisy spectrogram and speaker lip embeddings, and predicts a complex time-frequency mask for the corresponding target speaker. The model consists of three main components: an audio encoder, a visual encoder, and a Transformer-based mask generator. Two encoder structures are investigated and compared: ResNet-based and Transformer-based. The performance of the proposed method is evaluated in terms of source separation and speech quality metrics. Experimental results on the benchmark GRID dataset show the effectiveness of the method on the speaker-independent separation task in multi-talker environments. The model generalizes well to unseen speaker identities and noise types. Though trained only on 2-speaker mixtures, the model achieves reasonable performance when tested on both 2-speaker and 3-speaker mixtures. In addition, the model still shows an advantage over previous audio-visual speech separation work.
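To make the role of the complex time-frequency mask concrete, the following is a minimal sketch of how such a mask could be applied to a mixture spectrogram. It is not the paper's implementation: the function names, the spectrogram shape, and the oracle mask (used here as a stand-in for the output of the Transformer-based mask generator) are all assumptions made for illustration.

```python
import numpy as np


def apply_complex_mask(mixture_spec, mask):
    """Estimate the target spectrogram by element-wise complex multiplication."""
    return mask * mixture_spec


def oracle_complex_mask(target_spec, mixture_spec, eps=1e-12):
    """Hypothetical oracle mask S / Y, written as S * conj(Y) / (|Y|^2 + eps).

    In a trained system the mask would be predicted by the model; the oracle
    is used here only so the example is self-contained.
    """
    return target_spec * np.conj(mixture_spec) / (np.abs(mixture_spec) ** 2 + eps)


# Synthetic complex spectrograms (frequency bins x frames); the shape is an assumption.
rng = np.random.default_rng(0)
shape = (129, 50)
target = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
interferer = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = target + interferer  # 2-speaker mixture in the time-frequency domain

mask = oracle_complex_mask(target, mixture)
estimate = apply_complex_mask(mixture, mask)
```

Because the mask is complex-valued, it can adjust both the magnitude and the phase of each time-frequency bin, whereas a real-valued mask can only rescale the magnitude and reuses the mixture phase.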

Resource-Efficient Separation Transformer

Exploring Self-Attention Mechanisms for Speech Separation

Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments

TransMask: A Compact and Fast Speech Separation Model Based on Transformer

Papez: Resource-Efficient Speech Separation with Auditory Working Memory

Multi-Scale Group Transformer for Long Sequence Modeling in Speech Separation

Don’t Shoot Butterfly with Rifles: Multi-Channel Continuous Speech Separation with Early Exit Transformer

Ultra Fast Speech Separation Model with Teacher Student Learning

Monaural Multi-Speaker Speech Separation Using Efficient Transformer Model

Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation

Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation

SETransformer: Speech Enhancement Transformer

Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation

Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning

LRTD: A Low-rank Transformer with Dynamic Depth and Width for Speech Recognition

Dasformer: Deep Alternating Spectrogram Transformer For Multi/Single-Channel Speech Separation

TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

WA-Transformer: Window Attention-based Transformer with Two-stage Strategy for Multi-task Audio Source Separation

ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers

MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions