Abstract: Audio-visual target speech extraction (AV-TSE) is one of the enabling technologies in robotics and many audio-visual applications. One of the challenges of AV-TSE is how to effectively utilize audio-visual synchronization information in the process. AV-HuBERT can be a useful pre-trained model for lip-reading, which has not been adopted by AV-TSE. In this paper, we would like to explore the way to integrate a pre-trained AV-HuBERT into our AV-TSE system. We have good reasons to expect an improved performance. To benefit from the inter and intra-modality correlations, we also propose a novel Mask-And-Recover (MAR) strategy for self-supervised learning. The experimental results on the VoxCeleb2 dataset show that our proposed model outperforms the baselines both in terms of subjective and objective metrics, suggesting that the pre-trained AV-HuBERT model provides more informative visual cues for target speech extraction. Furthermore, through a comparative study, we confirm that the proposed Mask-And-Recover strategy is significantly effective.

What problem does this paper attempt to address?

The problem this paper addresses is how to use audio-visual synchronization information effectively to improve an audio-visual target speech extraction (AV-TSE) system. Specifically, the paper explores integrating the pre-trained AV-HuBERT model into the AV-TSE system and proposes a new self-supervised learning strategy, Mask-And-Recover (MAR), to enhance system performance.

### Main Problems and Challenges

1. **Effective Use of Audio-Visual Synchronization Information**
   - One of the main challenges in AV-TSE is how to use audio-visual synchronization information effectively during extraction.
   - Traditional TSE systems usually rely on audio cues and ignore visual information, especially lip movements.
2. **Improving the Accuracy of Target Speech Extraction**
   - The paper aims to improve the accuracy of target speech extraction by introducing the pre-trained AV-HuBERT model to provide richer visual cues.
   - The pre-trained AV-HuBERT model performs well on lip-reading tasks and can capture audio-visual synchronization information, which benefits the TSE system.
3. **Application of a Self-Supervised Learning Strategy**
   - The proposed MAR strategy aims to further improve the alignment of audio and visual features through a masking and recovery mechanism.
   - MAR is intended to improve the robustness and generalization of the system by masking part of the audio signal and attempting to recover the masked parts.

### Solutions

1. **Integrating the Pre-trained AV-HuBERT Model**
   - The pre-trained AV-HuBERT model is integrated into the TSE system to exploit its ability to capture audio-visual synchronization.
   - This helps the system handle the correspondence between audio and visual information, especially in complex environments.
2. **Proposing the MAR Strategy**
   - During training, MAR masks part of the audio signal and has the system attempt to recover the masked parts.
   - Beyond strengthening self-supervised learning, this is meant to improve target speech extraction through the fusion of multi-modal information.

### Experimental Results

The experimental results show that the proposed AVHuMAR-TSE system outperforms the baseline systems on the VoxCeleb2 dataset. Specifically:

- The AVHuMAR-TSE system improves on the baselines across the reported metrics, including SI-SDR, SDR, PESQ, and STOI.
- For example, SI-SDR increases from 11.728 to 12.331, PESQ from 2.765 to 2.922, and STOI from 0.878 to 0.887.

These results support the effectiveness of the AV-HuBERT model and the MAR strategy, especially for target speech extraction in complex audio environments.
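
To make the SI-SDR figures above easier to interpret, the following is a minimal sketch of the commonly used scale-invariant signal-to-distortion ratio, expressed in dB. It is not taken from the paper: the function name `si_sdr`, the `eps` stabilizer, and the synthetic signals are assumptions chosen for illustration, and the paper's evaluation code may differ in detail.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals of equal length.

    Hypothetical helper for illustration; eps only guards against division by zero.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Project the estimate onto the reference to get the scaled target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference

    # Everything not explained by the scaled reference counts as distortion.
    residual = estimate - target

    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, 16000, endpoint=False)
    clean = np.sin(2.0 * np.pi * 220.0 * t)            # stand-in for target speech
    noisy = clean + 0.1 * rng.standard_normal(t.size)  # stand-in for an extracted signal
    print(f"SI-SDR: {si_sdr(noisy, clean):.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged. On this scale, the reported change from 11.728 to 12.331 corresponds to a gain of about 0.6 dB.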
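
The mask-and-recover idea described above can be illustrated with a small sketch. The summary only states that part of the audio signal is masked and then recovered during training, so everything else here is an assumption: the helper names `mask_spans` and `recovery_loss`, the span-based masking, the zero fill, the `mask_ratio` and `span` defaults, and the mean-squared-error loss are hypothetical choices, not the authors' implementation.

```python
import numpy as np


def mask_spans(features, mask_ratio=0.2, span=5, rng=None):
    """Zero out random contiguous spans of frames.

    features: array of shape (n_frames, n_dims).
    Returns the masked copy and a boolean mask over frames (True = masked).
    Spans may overlap, so the masked fraction is at most roughly mask_ratio.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_frames = features.shape[0]
    mask = np.zeros(n_frames, dtype=bool)

    n_spans = max(1, int(n_frames * mask_ratio / span))
    n_starts = max(1, n_frames - span + 1)  # valid start positions for a span
    starts = rng.choice(n_starts, size=min(n_spans, n_starts), replace=False)
    for start in starts:
        mask[start:start + span] = True

    masked = features.copy()
    masked[mask] = 0.0
    return masked, mask


def recovery_loss(predicted, target, mask):
    """Mean squared error computed on the masked frames only."""
    diff = predicted[mask] - target[mask]
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio_features = rng.standard_normal((100, 8))  # stand-in for audio frame features
    masked, mask = mask_spans(audio_features, rng=rng)

    # A real system would predict the masked frames from the unmasked audio context
    # and the visual stream; here the masked input itself acts as a trivial "prediction".
    print("masked frames:", int(mask.sum()))
    print("loss of trivial prediction:", recovery_loss(masked, audio_features, mask))
```

Restricting the loss to masked frames means the model is scored only on content it could not copy from its input, which is the usual motivation for this family of objectives.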

Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues

Practice of the conformer enhanced AUDIO-VISUAL HUBERT on Mandarin and English

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

Target Active Speaker Detection with Audio-visual Cues

Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction

Incorporating Visual Information Reconstruction into Progressive Learning for Optimizing audio-visual Speech Enhancement

Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction

Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond

Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models

AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues

MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations

Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition

Mutual Alignment between Audiovisual Features for End-to-End Audiovisual Speech Recognition