Abstract: Speech separation is the task of extracting target speech while suppressing background interference. In applications such as video telephony, visual information about the target speaker is available and can be leveraged for multi-speaker speech separation. Most previous multi-speaker separation methods are based on convolutional or recurrent neural networks. Recently, Transformer-based Seq2Seq models have achieved state-of-the-art performance in various tasks, such as neural machine translation (NMT) and automatic speech recognition (ASR). The Transformer has shown an advantage in modeling audio-visual temporal context, because its multi-head attention blocks explicitly assign attention weights. Moreover, the Transformer has no recurrent sub-networks and therefore supports parallel computation over the sequence. In this paper, we propose a novel speaker-independent audio-visual speech separation method based on the Transformer, which can be flexibly applied to an unknown number and identity of speakers. The model receives both audio and visual streams, namely the noisy spectrogram and speaker lip embeddings, and predicts a complex time-frequency mask for the corresponding target speaker. The model is made up of three main components: an audio encoder, a visual encoder, and a Transformer-based mask generator. Two encoder structures are investigated and compared: ResNet-based and Transformer-based. The performance of the proposed method is evaluated in terms of source separation and speech quality metrics. Experimental results on the benchmark GRID dataset show the effectiveness of the method on the speaker-independent separation task in multi-talker environments. The model generalizes well to unseen speaker identities and noise types. Though trained only on 2-speaker mixtures, the model achieves reasonable performance when tested on 2-speaker and 3-speaker mixtures. In addition, the model still shows an advantage over previous audio-visual speech separation works.
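
The abstract states that the model predicts a complex time-frequency mask for the target speaker but does not give the formula for applying it. A minimal sketch follows, assuming the common convention that the mask is multiplied element-wise with the complex STFT of the mixture. The function names `apply_complex_mask` and `ideal_complex_mask` are hypothetical and introduced only for illustration, and the oracle mask used here is a stand-in for the network output, not the paper's method.

```python
import numpy as np


def apply_complex_mask(mixture_spec: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Estimate the target spectrogram by element-wise complex multiplication.

    Assumption: both arrays are complex and shaped (freq_bins, frames).
    """
    if mixture_spec.shape != mask.shape:
        raise ValueError("mask and spectrogram must have the same shape")
    return mixture_spec * mask


def ideal_complex_mask(target_spec: np.ndarray, mixture_spec: np.ndarray,
                       eps: float = 1e-12) -> np.ndarray:
    """Oracle mask M = S / Y, computed as S * conj(Y) / (|Y|^2 + eps).

    This is only a reference mask for checking the masking step;
    a trained model would predict the mask instead.
    """
    return target_spec * np.conj(mixture_spec) / (np.abs(mixture_spec) ** 2 + eps)


# Toy demonstration with random complex "spectrograms" (not real speech).
rng = np.random.default_rng(0)
shape = (65, 20)  # (frequency bins, time frames), chosen arbitrarily
target = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
interference = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = target + interference

mask = ideal_complex_mask(target, mixture)
estimate = apply_complex_mask(mixture, mask)
print("max abs reconstruction error:", np.max(np.abs(estimate - target)))
```

Because the mask is complex, it can correct both the magnitude and the phase of each time-frequency bin, which a real-valued magnitude mask cannot do.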

Multiresolution and Multimodal Speech Recognition with Transformers

Multimodal Sparse Transformer Network for Audio-Visual Speech Recognition

VSET: A MULTIMODAL TRANSFORMER FOR VISUAL SPEECH ENHANCEMENT

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations

A Bidirectional Context Embedding Transformer for Automatic Speech Recognition

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

Should we hard-code the recurrence concept or learn it instead ? Exploring the Transformer architecture for Audio-Visual Speech Recognition

End-to-End Multi-speaker Speech Recognition with Transformer

Speech Recognition Transformers: Topological-lingualism Perspective

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

AVATAR: Unconstrained Audiovisual Speech Recognition

Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments

Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation

Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

Transformer-Transducers for Code-Switched Speech Recognition

Transavs: End-To-End Audio-Visual Segmentation With Transformer

Transformers with convolutional context for ASR

Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision