Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy

Wenxuan Wu, Xueyuan Chen, Xixin Wu, Haizhou Li, Helen Meng
2024-03-24
Abstract: Audio-visual target speech extraction (AV-TSE) is one of the enabling technologies in robotics and many audio-visual applications. One of the challenges of AV-TSE is how to effectively utilize audio-visual synchronization information in the process. AV-HuBERT can be a useful pre-trained model for lip-reading, which has not been adopted by AV-TSE. In this paper, we would like to explore the way to integrate a pre-trained AV-HuBERT into our AV-TSE system. We have good reasons to expect an improved performance. To benefit from the inter and intra-modality correlations, we also propose a novel Mask-And-Recover (MAR) strategy for self-supervised learning. The experimental results on the VoxCeleb2 dataset show that our proposed model outperforms the baselines both in terms of subjective and objective metrics, suggesting that the pre-trained AV-HuBERT model provides more informative visual cues for target speech extraction. Furthermore, through a comparative study, we confirm that the proposed Mask-And-Recover strategy is significantly effective.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to effectively use audio-visual synchronization information to improve an audio-visual target speech extraction (AV-TSE) system. Specifically, the paper explores integrating the pre-trained AV-HuBERT model into the AV-TSE system and proposes a new self-supervised learning strategy, Mask-And-Recover (MAR), to enhance system performance.

### Main Problems and Challenges

1. **Effective Use of Audio-Visual Synchronization Information**
   - One of the main challenges in AV-TSE is how to effectively use audio-visual synchronization information during extraction.
   - Traditional TSE systems usually rely on audio cues but ignore the importance of visual information, especially lip movements.
2. **Improving the Accuracy of Target Speech Extraction**
   - The paper aims to improve the accuracy of target speech extraction by introducing the pre-trained AV-HuBERT model to provide richer visual cues.
   - The pre-trained AV-HuBERT model performs well on lip-reading tasks and can capture audio-visual synchronization information, which benefits the TSE system.
3. **Application of a Self-Supervised Learning Strategy**
   - The proposed Mask-And-Recover (MAR) strategy for self-supervised learning aims to further optimize the alignment of audio-visual features through a masking and recovery mechanism.
   - The MAR strategy enhances the robustness and generalization ability of the system by masking part of the audio signal and attempting to recover the masked parts.

### Solutions

1. **Integrating the Pre-trained AV-HuBERT Model**
   - Integrate the pre-trained AV-HuBERT model into the TSE system to exploit its ability to capture audio-visual synchronization.
   - This better handles the correspondence between audio and visual information, especially in complex environments.
2. **Proposing the MAR Strategy**
   - During training, the MAR strategy masks part of the audio signal and attempts to recover the masked parts, which enhances the robustness of the system.
   - This method not only improves the self-supervised learning ability of the system but also improves target speech extraction through the fusion of multi-modal information.

### Experimental Results

The experimental results show that the proposed AVHuMAR-TSE system outperforms the existing baseline systems on the VoxCeleb2 dataset. Specifically:

- The AVHuMAR-TSE system improves on both subjective and objective metrics; the objective metrics include SI-SDR, SDR, PESQ, and STOI.
- For example, SI-SDR increases from 11.728 to 12.331, PESQ increases from 2.765 to 2.922, and STOI increases from 0.878 to 0.887.

These results support the effectiveness of the AV-HuBERT model and the MAR strategy, especially for target speech extraction in complex audio environments.
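For readers unfamiliar with SI-SDR, the headline metric in the results above, the following is a minimal NumPy sketch of how it is commonly computed. It is not taken from the paper: the function name `si_sdr`, the mean-removal step, and the `eps` stabilizer are assumptions that follow a widespread convention. The last line only restates the reported scores; the gain works out to about 0.6 dB.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative helper)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Remove the mean of each signal (a common convention, assumed here).
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the optimally scaled target.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Everything in the estimate not explained by the target counts as distortion.
    distortion = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)


# Synthetic signals for demonstration only; these are not VoxCeleb2 data.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
mild = si_sdr(clean + 0.1 * noise, clean)
heavy = si_sdr(clean + 0.5 * noise, clean)

# SI-SDR gain reported in the summary above (baseline -> AVHuMAR-TSE), in dB.
reported_gain_db = round(12.331 - 11.728, 3)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.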
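The mask-and-recover idea described under Solutions can be illustrated with a small sketch. This is a generic illustration under stated assumptions, not the paper's implementation: the helper names `mask_spans` and `recovery_loss`, the span length, the mask ratio, zero-filling, and the mean-squared-error objective are all hypothetical choices, and the summary does not say which representation is actually masked.

```python
import numpy as np


def mask_spans(features, mask_ratio=0.2, span=5, rng=None):
    """Zero out random contiguous time spans (hypothetical masking scheme).

    features: array of shape (num_frames, feature_dim).
    Returns the masked copy and a boolean mask over frames.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_frames = features.shape[0]
    mask = np.zeros(num_frames, dtype=bool)
    # Number of spans chosen so that roughly mask_ratio of the frames are hidden.
    num_spans = max(1, int(num_frames * mask_ratio / span))
    starts = rng.integers(0, max(1, num_frames - span + 1), size=num_spans)
    for start in starts:
        mask[start:start + span] = True
    masked = features.copy()
    masked[mask] = 0.0
    return masked, mask


def recovery_loss(recovered, original, mask):
    """Mean squared error computed on the masked frames only (assumed objective)."""
    diff = recovered[mask] - original[mask]
    return float(np.mean(diff ** 2))


# Toy features for demonstration only.
demo_rng = np.random.default_rng(0)
demo_features = demo_rng.standard_normal((100, 8))
demo_masked, demo_mask = mask_spans(demo_features, rng=demo_rng)
```

In a full system, a model would receive the masked sequence together with the visual stream and would be trained to bring `recovery_loss` down, so that the hidden frames must be inferred from the surrounding audio context and the lip movements.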