AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction

Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, Jun Chen, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Zhiyong Wu, Yujun Wang, Helen Meng

2023-06-25

Abstract: Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a SepFormer-based dual-scale attention model that uses cross- and self-attention to fuse and model audio and visual features. AV-SepFormer splits the audio feature into a number of chunks equal to the length of the visual feature. Self- and cross-attention are then employed to model and fuse the multi-modal features. Furthermore, we use a novel 2D positional encoding that introduces positional information between and within chunks and provides significant gains over the traditional positional encoding. Our model has two key advantages: first, the time granularity of the chunked audio feature is synchronized to the visual feature, which alleviates the harm caused by the mismatch between audio and video sampling rates; second, by combining self- and cross-attention, the feature fusion and speech extraction processes are unified within an attention paradigm. The experimental results show that AV-SepFormer significantly outperforms other existing methods.
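The chunking and fusion steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the zero-padding of the last chunk, the mean-pooling of each chunk into one token, the choice of audio as queries and visual features as keys and values, and the absence of learned projections are all assumptions made for illustration, and the names `chunk_audio` and `cross_attention` are hypothetical.

```python
import math

def chunk_audio(audio_feats, num_visual_frames):
    """Split audio frames into as many chunks as there are visual frames.

    The tail is zero-padded so every chunk has the same length
    (an assumption; the source does not say how remainders are handled).
    """
    dim = len(audio_feats[0])
    chunk_len = math.ceil(len(audio_feats) / num_visual_frames)
    pad = chunk_len * num_visual_frames - len(audio_feats)
    padded = audio_feats + [[0.0] * dim for _ in range(pad)]
    return [padded[i * chunk_len:(i + 1) * chunk_len]
            for i in range(num_visual_frames)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention without learned projections (assumption)."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

# Toy data: 10 audio frames and 4 visual frames, both with feature size 2.
audio = [[float(t), 1.0] for t in range(10)]
visual = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]]

# One chunk per visual frame, so both modalities share a time axis.
chunks = chunk_audio(audio, len(visual))

# Summarize each chunk by its mean (assumption), then attend to the visual frames.
chunk_tokens = [[sum(f[d] for f in c) / len(c) for d in range(2)] for c in chunks]
fused = cross_attention(chunk_tokens, visual, visual)
```

After chunking, `chunks` has one entry per visual frame, so the chunk-level sequence and the visual sequence have the same length and attention between them needs no resampling.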

Multimedia, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses a core problem in Audio-Visual Target Speaker Extraction (AV-TSE): how to integrate audio and video features effectively so that a specific speaker's voice can be extracted more accurately. In particular, it focuses on the mismatch in temporal granularity caused by inconsistent audio and video sampling rates, which can degrade extraction performance. To address this, the authors propose a new model named AV-SepFormer. The model is based on the SepFormer architecture and combines cross-modal attention with self-attention to integrate audio and video information. The main contributions of AV-SepFormer are:

1. **Dual-Scale Design**: The audio features are split into a number of chunks equal to the length of the video features. This synchronizes the audio and video features on the temporal scale and addresses the granularity mismatch between the two modalities.
2. **Cross-Modal Attention Module**: A cross-modal attention mechanism fuses the audio and video features. Compared with directly concatenating features, this approach better captures the correlation between the two modalities.
3. **2D Positional Encoding**: A new 2D positional encoding captures both the position within each chunk and the position between chunks. This helps the model better understand the temporal structure of the chunked features.

Experimental results show that on multiple public datasets such as VoxCeleb2, LRS3, and TCD-TIMIT, AV-SepFormer achieves significant improvements in signal quality (SI-SDR) and perceptual quality (PESQ) compared to other advanced methods. Ablation experiments further validate the effectiveness of the cross-modal attention mechanism and the 2D positional encoding.
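One plausible way to build a 2D positional encoding over chunked features is sketched below. The exact formulation used in AV-SepFormer is not given in this summary, so the construction here is an assumption: a standard sinusoidal encoding of the chunk index is concatenated with a sinusoidal encoding of the within-chunk index. The function names are hypothetical, and `dim` is assumed to be divisible by 4.

```python
import math

def sinusoidal(pos, dim):
    """Standard sinusoidal encoding of an integer position into `dim` values (dim even)."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return enc

def positional_encoding_2d(num_chunks, chunk_len, dim):
    """Return a [num_chunks][chunk_len][dim] nested list of encodings.

    The first half of each vector encodes the chunk index (between chunks),
    the second half encodes the index within the chunk (assumed layout).
    """
    half = dim // 2
    return [[sinusoidal(c, half) + sinusoidal(t, half)
             for t in range(chunk_len)]
            for c in range(num_chunks)]

# Toy usage: 4 chunks of 3 frames each, feature size 8.
pe = positional_encoding_2d(num_chunks=4, chunk_len=3, dim=8)
```

In this layout, two frames at the same offset in different chunks share the second half of their encoding, and two frames in the same chunk share the first half, so both kinds of position remain distinguishable.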
