Enhanced Sound Event Localization and Detection in Real 360-degree audio-visual soundscapes

Adrian S. Roman, Baladithya Balamurugan, Rithik Pothuganti
2024-01-29
Abstract: This technical report details our work towards building an enhanced audio-visual sound event localization and detection (SELD) network. We build on top of the audio-only SELDnet23 model and adapt it to be audio-visual by merging both audio and video information prior to the gated recurrent unit (GRU) of the audio-only network. Our model leverages YOLO and DETIC object detectors. We also build a framework that implements audio-visual data augmentation and audio-visual synthetic data generation. We deliver an audio-visual SELDnet system that outperforms the existing audio-visual SELD baseline.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
This paper asks how to improve the performance of Sound Event Localization and Detection (SELD) in real 360-degree audio-visual scenes. Specifically, it addresses the following challenges:

1. **Multi-modal information fusion**: Traditional SELD systems rely mainly on audio, but in real environments sound events usually have both an audible and a visible component. Using audio alone can therefore lead to inaccurate localization and detection. To address this, the paper combines audio and video information.
2. **Data scarcity**: Real-world audio-visual datasets are relatively limited, which restricts how well the model can be trained. The paper therefore introduces data augmentation techniques and synthetic data generation methods to increase the diversity and quantity of training data.
3. **Limitations of existing models**: Existing audio-visual SELD models have limited performance, especially in complex scenes. The paper improves overall performance by changing the model architecture, in particular the visual branch.

### Main contributions

1. **Data augmentation techniques**: Audio Channel Swap (ACS) and Video Pixel Swap (VPS) are used to increase the quantity and diversity of the original data.
2. **Synthetic data generation**: A 360-degree audio-visual synthetic data generator is built to produce spatial audio and video data.
3. **Model architecture improvement**:
   - Object detectors such as YOLO and DETIC are introduced to strengthen the visual branch.
   - A multi-head self-attention (MHSA) layer is added to improve the model's ability to localize and detect sound sources.
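
To make the ACS/VPS idea concrete, the sketch below applies one spatial transformation, a left-right reflection, consistently to the audio, the video, and the labels. It is an illustration under stated assumptions, not the paper's implementation: the summary above does not list which transformations the authors use. The function names `encode_foa` and `reflect_azimuth` are hypothetical, the audio is assumed to be first-order Ambisonics in W, Y, Z, X channel order, and the video is assumed to be equirectangular with azimuth 0 at the center column.

```python
import numpy as np

def encode_foa(signal, az_deg, el_deg):
    """Encode a mono signal as an idealized FOA plane wave (hypothetical helper).

    Assumed channel order: W, Y, Z, X.
    """
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    return np.stack([
        signal,                            # W: omnidirectional
        signal * np.sin(az) * np.cos(el),  # Y: left-right
        signal * np.sin(el),               # Z: up-down
        signal * np.cos(az) * np.cos(el),  # X: front-back
    ])

def reflect_azimuth(foa, frames, azimuth_deg):
    """Mirror the scene left-right: azimuth -> -azimuth.

    foa:         (4, n_samples) array, W, Y, Z, X order (assumption)
    frames:      (n_frames, height, width, 3) equirectangular video (assumption)
    azimuth_deg: array of azimuth labels in degrees
    """
    foa_aug = foa.copy()
    foa_aug[1] = -foa_aug[1]                   # only Y depends on sin(azimuth)
    frames_aug = frames[:, :, ::-1, :].copy()  # flip pixel columns horizontally
    az_aug = -np.asarray(azimuth_deg, dtype=float)
    return foa_aug, frames_aug, az_aug

# Usage: a source at azimuth 30 degrees ends up at -30 degrees in all three outputs.
rng = np.random.default_rng(0)
sig = rng.standard_normal(16)
foa = encode_foa(sig, 30.0, 10.0)
frames = rng.integers(0, 256, size=(2, 4, 8, 3), dtype=np.uint8)
foa_aug, frames_aug, az_aug = reflect_azimuth(foa, frames, np.array([30.0]))
```

The point of the sketch is that the audio channels, the pixels, and the labels must all undergo the same geometric change; augmenting only one modality would leave the training example spatially inconsistent.
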
### Summary

With these improvements, the audio-visual SELD system proposed in the paper outperforms the existing audio-visual baseline on multiple evaluation metrics, especially in direction-of-arrival (DoA) estimation as measured by localization error (LE) and localization recall (LR).
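
As a rough illustration of what a localization error measures, the sketch below computes the angle between a reference and an estimated direction of arrival, each given as (azimuth, elevation) in degrees. It is a minimal sketch only: the function names are hypothetical, and the full SELD metric additionally involves matching predictions to references per class and averaging, which is omitted here.

```python
import numpy as np

def to_unit_vector(az_deg, el_deg):
    """Convert (azimuth, elevation) in degrees to a Cartesian unit vector."""
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    return np.array([
        np.cos(az) * np.cos(el),
        np.sin(az) * np.cos(el),
        np.sin(el),
    ])

def angular_error_deg(ref, est):
    """Great-circle angle in degrees between two (azimuth, elevation) pairs."""
    u = to_unit_vector(*ref)
    v = to_unit_vector(*est)
    # Clip guards against rounding pushing the dot product outside [-1, 1].
    cos_angle = np.clip(np.dot(u, v), -1.0, 1.0)
    return float(np.rad2deg(np.arccos(cos_angle)))

# Usage: directions on either side of the rear seam are compared correctly.
err = angular_error_deg((170.0, 0.0), (-170.0, 0.0))
```

Working with unit vectors rather than subtracting azimuths directly avoids wrap-around problems at ±180°: in the usage line, the two directions are 20° apart, not 340°.
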