Abstract: This study presents an audio-visual information fusion approach to sound event localization and detection (SELD) in low-resource scenarios. We aim to exploit audio and video modality information through cross-modal learning and multi-modal fusion. First, we propose a cross-modal teacher-student learning (TSL) framework to transfer information from an audio-only teacher model, trained on a rich collection of audio data with multiple data augmentation techniques, to an audio-visual student model trained with only a limited set of multi-modal data. Next, we propose a two-stage audio-visual fusion strategy, consisting of an early feature fusion and a late video-guided decision fusion, to exploit synergies between the audio and video modalities. Finally, we introduce a video pixel swapping (VPS) technique that extends the audio channel swapping (ACS) method to joint audio-visual augmentation. Evaluation results on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge data set demonstrate significant improvements in SELD performance. Furthermore, our submission to the SELD task of the DCASE 2023 Challenge ranked first by effectively integrating the proposed techniques into a model ensemble.
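To make the teacher-student idea in the abstract concrete, here is a minimal sketch of a distillation-style objective. It is not the paper's actual loss: the function name `tsl_loss`, the use of mean squared error for both terms, and the weighting parameter `alpha` are all assumptions chosen for illustration. The student is pulled toward the ground-truth labels and, at the same time, toward the outputs of the frozen audio-only teacher.

```python
import numpy as np

def tsl_loss(student_out, teacher_out, target, alpha=0.5):
    """Hypothetical teacher-student objective (illustration only).

    student_out: predictions of the audio-visual student
    teacher_out: predictions of the frozen audio-only teacher on the same clip
    target:      ground-truth labels, same shape as the predictions
    alpha:       assumed weight between the supervised and distillation terms
    """
    # Supervised term: how far the student is from the labels.
    supervised = np.mean((student_out - target) ** 2)
    # Distillation term: how far the student is from the teacher.
    distill = np.mean((student_out - teacher_out) ** 2)
    return alpha * supervised + (1.0 - alpha) * distill
```

With `alpha=1.0` the teacher is ignored and the student is trained on labels alone; lowering `alpha` gives the teacher more influence, which may help when labeled multi-modal data is scarce.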

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses sound event localization and detection (SELD) through audio-visual information fusion in low-resource real-world scenarios. Specifically, the researchers aim to improve SELD performance by applying cross-modal learning and multimodal fusion to the audio and video modalities. The main focus areas are:

1. **Data Scarcity**: In low-resource scenarios, real audio-visual data is very limited (only 3.83 hours), which makes model training difficult.
2. **Modal Fusion**: How to combine audio and video information effectively to improve the accuracy of sound event localization and detection.
3. **Data Augmentation**: A Video Pixel Swap (VPS) method is proposed, extending the Audio Channel Swap (ACS) method to increase the diversity of multimodal data.
4. **Model Training**: A cross-modal Teacher-Student Learning (TSL) framework is designed to transfer knowledge from a teacher model trained on a large amount of external audio data to a student model trained on only a small amount of multimodal data.

### Main Contributions

1. **Cross-Modal Teacher-Student Learning Framework**: Transferring the knowledge of a teacher model trained on rich external audio data to the student model improves the student's performance in low-resource scenarios.
2. **Two-Stage Audio-Visual Fusion Strategy**: Early feature fusion followed by late video-guided decision fusion further improves localization accuracy.
3. **Efficient Video Pixel Swap Method**: Extending the audio channel swap method increases the amount of multimodal data eightfold and makes the student model more robust.

### Experimental Results

The experimental results show that the proposed framework significantly improves SELD performance on the DCASE 2023 challenge dataset. In particular, with the Video-Guided Decision Fusion (VGDF) method, the final single audio-visual student model achieved a SELD score of 0.28, which is 15.2% better than the second-place entry in the challenge.

### Conclusion

This study proposes a framework for audio-visual information fusion in low-resource real-world scenarios. It significantly improves sound event localization and detection through cross-modal teacher-student learning, multimodal fusion, and a new data augmentation method. In the DCASE 2023 challenge, the framework ranked among the top entries and was the only submission, across all teams, in which the audio-visual system outperformed the audio-only system. Future research will explore further cross-modal fusion strategies and inter-modal relationships to make better use of the cues in multimodal data.
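The joint ACS and VPS augmentation described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact transformation set: it assumes first-order Ambisonics audio with channels in W, Y, Z, X order, an equirectangular 360-degree frame with azimuth 0 at the horizontal centre, and one hypothetical family of eight spatial transforms (azimuth reflection, a 180-degree azimuth shift, and elevation reflection, each on or off). The function names `augment` and `eight_variants` are made up for this example. The point is that every channel swap on the audio side is paired with the pixel swap and label change that describe the same rotation or reflection.

```python
import itertools
import numpy as np

def augment(foa, frame, azi, ele, flip_azi, shift_pi, flip_ele):
    """Apply one joint audio/video/label transform (illustrative assumptions).

    foa:   (4, T) array, channels assumed in W, Y, Z, X order
    frame: (H, W_px, 3) equirectangular image, azimuth 0 assumed at the centre
    azi, ele: source direction in degrees
    """
    w, y, z, x = foa
    if flip_azi:   # azimuth -> -azimuth
        y = -y                      # Y follows sin(azimuth), so it changes sign
        frame = frame[:, ::-1]      # mirror the frame left-right
        azi = -azi
    if shift_pi:   # azimuth -> azimuth + 180
        x, y = -x, -y               # both horizontal components change sign
        frame = np.roll(frame, frame.shape[1] // 2, axis=1)  # shift by half the width
        azi = azi + 180
    if flip_ele:   # elevation -> -elevation
        z = -z                      # Z follows sin(elevation)
        frame = frame[::-1]         # mirror the frame top-bottom
        ele = -ele
    azi = (azi + 180) % 360 - 180   # wrap azimuth into [-180, 180)
    return np.stack([w, y, z, x]), frame, azi, ele

def eight_variants(foa, frame, azi, ele):
    """Return all 2 x 2 x 2 = 8 variants; the first one is the unchanged input."""
    return [augment(foa, frame, azi, ele, *flags)
            for flags in itertools.product([False, True], repeat=3)]
```

Counting the identity, this hypothetical family yields eight versions of every clip, which is one way an eightfold increase in multimodal data could arise; the patterns actually used in the paper may differ.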

Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios
