Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios

Ya Jiang, Qing Wang, Jun Du, Maocheng Hu, Pengfei Hu, Zeyan Liu, Shi Cheng, Zhaoxu Nian, Yuxuan Dong, Mingqi Cai, Xin Fang, Chin-Hui Lee
2024-06-21
Abstract: This study presents an audio-visual information fusion approach to sound event localization and detection (SELD) in low-resource scenarios. We aim to utilize audio and video modality information through cross-modal learning and multi-modal fusion. First, we propose a cross-modal teacher-student learning (TSL) framework to transfer information from an audio-only teacher model, trained on a rich collection of audio data with multiple data augmentation techniques, to an audio-visual student model trained with only a limited set of multi-modal data. Next, we propose a two-stage audio-visual fusion strategy, consisting of an early feature fusion and a late video-guided decision fusion, to exploit synergies between audio and video modalities. Finally, we introduce an innovative video pixel swapping (VPS) technique to extend an audio channel swapping (ACS) method to an audio-visual joint augmentation. Evaluation results on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge data set demonstrate significant improvements in SELD performance. Furthermore, our submission to the SELD task of the DCASE 2023 Challenge ranked first by effectively integrating the proposed techniques into a model ensemble.
Audio and Speech Processing, Signal Processing
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses sound event localization and detection (SELD) through audio-visual information fusion in low-resource, real-world scenarios. Specifically, the researchers aim to improve SELD performance by applying cross-modal learning and multi-modal fusion to the audio and video modalities. The main focus areas are:

1. **Data Scarcity**: In low-resource scenarios, real audio-visual data is very limited (only 3.83 hours here), which makes model training difficult.
2. **Modal Fusion**: Audio and video information must be combined effectively to improve the accuracy of sound event localization and detection.
3. **Data Augmentation**: A Video Pixel Swapping (VPS) method is proposed that extends the Audio Channel Swapping (ACS) method to increase the diversity of multi-modal data.
4. **Model Training**: A cross-modal Teacher-Student Learning (TSL) framework transfers knowledge from a teacher model trained on a large amount of external audio data to a student model trained with only a small amount of multi-modal data.

### Main Contributions

1. **Cross-Modal Teacher-Student Learning Framework**: Knowledge from a teacher model trained on rich external audio data is transferred to the student model, which improves the student's performance in low-resource scenarios.
2. **Two-Stage Audio-Visual Fusion Strategy**: An early feature fusion is combined with a late video-guided decision fusion, which further improves localization accuracy.
3. **Efficient Video Pixel Swapping Method**: VPS extends the audio channel swapping method and increases the amount of multi-modal data eightfold, which makes the student model more robust.

### Experimental Results

The proposed framework significantly improves SELD performance on the DCASE 2023 Challenge dataset. In particular, with Video-Guided Decision Fusion (VGDF), the final single audio-visual student model achieved a SELD score of 0.28, a 15.2% improvement over the second-place entry in the challenge.

### Conclusion

This study proposes a framework for audio-visual information fusion in low-resource, real-world scenarios. It significantly improves sound event localization and detection through cross-modal teacher-student learning, multi-modal fusion, and an innovative data augmentation method. In the DCASE 2023 Challenge, the submission built on this framework ranked first, and it was the only one among all teams in which the audio-visual system outperformed the audio-only system. Future research will explore further cross-modal fusion strategies and inter-modal relationships to make better use of the cues in multi-modal data.
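
The teacher-student idea summarized above can be illustrated with a minimal sketch. It is not the paper's actual objective: the summary does not give the loss, so the blend of a supervised term and a teacher-matching term, the mean-squared-error choice, the function names `mse` and `tsl_loss`, and the weight `alpha` are all assumptions made for illustration.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally long vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def tsl_loss(
    student_out: Sequence[float],
    teacher_out: Sequence[float],
    labels: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Hypothetical teacher-student objective (an assumption, not from the paper).

    Blends a supervised term (student vs. ground truth) with a
    distillation term (student vs. frozen audio-only teacher).
    alpha is an assumed weighting, not a published value.
    """
    supervised = mse(student_out, labels)          # fit the labeled targets
    distillation = mse(student_out, teacher_out)   # imitate the teacher
    return alpha * supervised + (1.0 - alpha) * distillation


# Toy example: one frame, one class, output as a Cartesian (x, y, z) vector.
labels = [1.0, 0.0, 0.0]
teacher_out = [0.9, 0.1, 0.0]   # prediction of the audio-only teacher
student_out = [0.7, 0.1, 0.0]   # prediction of the audio-visual student
loss = tsl_loss(student_out, teacher_out, labels, alpha=0.5)
```

In this sketch the teacher only supplies targets. Its parameters would stay fixed while the student, which also sees video, is trained on the small multi-modal set.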
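
The joint ACS/VPS augmentation relies on transforming the audio channels, the video pixels, and the direction labels consistently. The sketch below shows a single such transformation, a left-right mirror of azimuth; the eight transformations used in the paper are not listed in this summary and are not reproduced here. It assumes first-order Ambisonics audio in (W, Y, Z, X) channel order and a 360° equirectangular video frame with azimuth 0 at the horizontal center. All function names are hypothetical.

```python
from typing import List, Tuple

FoaSample = Tuple[float, float, float, float]  # (W, Y, Z, X) at one time step
Frame = List[List[int]]                        # one video frame as rows of pixels


def mirror_audio(foa: List[FoaSample]) -> List[FoaSample]:
    """Mirror azimuth in FOA audio: Y is proportional to sin(azimuth), so it flips sign."""
    return [(w, -y, z, x) for (w, y, z, x) in foa]


def mirror_video(frame: Frame) -> Frame:
    """Mirror azimuth in an equirectangular frame by reversing each pixel row."""
    return [row[::-1] for row in frame]


def mirror_label(azimuth_deg: float, elevation_deg: float) -> Tuple[float, float]:
    """Apply the same mirror to the direction-of-arrival label."""
    return (-azimuth_deg, elevation_deg)


def mirror_example(
    foa: List[FoaSample], frames: List[Frame], doa: Tuple[float, float]
) -> Tuple[List[FoaSample], List[Frame], Tuple[float, float]]:
    """Transform audio, video, and label together so they stay aligned."""
    return mirror_audio(foa), [mirror_video(f) for f in frames], mirror_label(*doa)


# Toy example: one audio sample, one 2x3 frame, a source at azimuth 45 degrees.
foa = [(1.0, 0.5, 0.0, 0.5)]
frames = [[[1, 2, 3], [4, 5, 6]]]
doa = (45.0, 10.0)
aug_foa, aug_frames, aug_doa = mirror_example(foa, frames, doa)
```

Applying the mirror twice returns the original example, which is a quick sanity check that the three modality-specific operations describe the same spatial transformation.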