Abstract: This technical report details our work towards building an enhanced audio-visual sound event localization and detection (SELD) network. We build on top of the audio-only SELDnet23 model and adapt it to be audio-visual by merging both audio and video information prior to the gated recurrent unit (GRU) of the audio-only network. Our model leverages YOLO and DETIC object detectors. We also build a framework that implements audio-visual data augmentation and audio-visual synthetic data generation. We deliver an audio-visual SELDnet system that outperforms the existing audio-visual SELD baseline.
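To make "merging both audio and video information prior to the GRU" concrete, here is a minimal NumPy sketch. It assumes the merge is a per-frame concatenation along the feature axis and that video features are aligned to the audio frame rate by nearest-frame repetition. The abstract specifies neither choice, and the function name `fuse_audio_visual` and all shapes are hypothetical.

```python
import numpy as np


def fuse_audio_visual(audio_feats, video_feats):
    """Concatenate audio and visual features frame by frame.

    audio_feats: (T_audio, D_audio) features from the audio front end.
    video_feats: (T_video, D_video) features derived from the video,
                 e.g. from object-detector outputs (an assumption).
    Returns an array of shape (T_audio, D_audio + D_video), i.e. the
    sequence a recurrent layer such as a GRU would then consume.
    """
    audio_feats = np.asarray(audio_feats, dtype=np.float32)
    video_feats = np.asarray(video_feats, dtype=np.float32)
    t_audio, t_video = audio_feats.shape[0], video_feats.shape[0]

    # Map every audio frame to the video frame covering the same time span.
    idx = np.minimum((np.arange(t_audio) * t_video) // t_audio, t_video - 1)

    # Concatenation along the feature axis is the assumed fusion operator.
    return np.concatenate([audio_feats, video_feats[idx]], axis=1)


audio = np.random.rand(50, 128)   # 50 audio frames, 128 features each
video = np.random.rand(10, 14)    # 10 video frames, 14 features each
fused = fuse_audio_visual(audio, video)
print(fused.shape)  # (50, 142)
```

Fusing before the recurrent layer lets the temporal model see both modalities at every step. Other operators, such as addition after a projection or cross-attention, would fit the same description.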

What problem does this paper attempt to address?

The problem this paper addresses is how to improve Sound Event Localization and Detection (SELD) in real 360-degree audio-visual scenes. Specifically, the paper proposes solutions to the following challenges:

1. **Multi-modal information fusion**: Traditional SELD systems rely mainly on audio, but in real environments sound events usually have both an audible and a visible component, so using audio alone may lead to inaccurate localization and detection. The paper addresses this by combining audio and video information.
2. **Data scarcity**: Real-world audio-visual datasets are relatively limited, which restricts how well a model can be trained. The paper therefore introduces data augmentation and synthetic data generation to increase the diversity and quantity of training data.
3. **Limitations of existing models**: The performance of existing audio-visual SELD models is limited, especially in complex scenes. The paper improves overall performance by revising the model architecture, particularly the visual branch.

### Main contributions

1. **Data augmentation techniques**: Audio Channel Swap (ACS) and Video Pixel Swap (VPS) are applied to increase the quantity and diversity of the original data.
2. **Synthetic data generation**: A 360-degree audio-visual synthetic data generator is built to produce spatial audio and matching video.
3. **Model architecture improvement**:
   - Object detectors such as YOLO and DETIC are introduced to strengthen the visual branch.
   - A multi-head self-attention (MHSA) layer is added to improve the model's ability to localize and detect sound sources.
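The idea behind pairing ACS with VPS is that any spatial transform applied to the audio must also be applied to the video and to the direction labels. The sketch below illustrates this for two simple transforms only: an azimuth reflection and a 180-degree azimuth rotation. It assumes first-order Ambisonics audio in ACN channel order (W, Y, Z, X), equirectangular video with azimuth 0 at the horizontal centre of the frame, and an even frame width. The function names and shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

# Assumed FOA channel order (ACN): W, Y, Z, X.
W_CH, Y_CH, Z_CH, X_CH = 0, 1, 2, 3


def wrap_deg(azimuth):
    """Wrap azimuth values in degrees to the interval [-180, 180)."""
    return (azimuth + 180.0) % 360.0 - 180.0


def augment(foa, frames, azimuth_deg, mode):
    """Apply one matched transform to audio, video and azimuth labels.

    foa:         (4, n_samples) first-order Ambisonics signal.
    frames:      (n_frames, height, width, 3) equirectangular video.
    azimuth_deg: source azimuth labels in degrees.
    mode:        "reflect"   maps azimuth to -azimuth,
                 "rotate180" maps azimuth to azimuth + 180.
    """
    foa = np.array(foa, dtype=np.float32)  # copy, so the input is untouched
    frames = np.asarray(frames)
    azimuth_deg = np.asarray(azimuth_deg, dtype=np.float64)

    if mode == "reflect":
        foa[Y_CH] *= -1.0             # Y follows sin(azimuth): sign flips
        frames = frames[:, :, ::-1]   # mirror the pixel columns
        azimuth_deg = -azimuth_deg
    elif mode == "rotate180":
        foa[[Y_CH, X_CH]] *= -1.0     # sin and cos of azimuth both flip sign
        # Circularly shift the pixel columns by half the frame width.
        frames = np.roll(frames, frames.shape[2] // 2, axis=2)
        azimuth_deg = azimuth_deg + 180.0
    else:
        raise ValueError(f"unknown mode: {mode}")

    return foa, frames, wrap_deg(azimuth_deg)
```

The full ACS family also contains transforms that exchange the X and Y channels (90-degree rotations) and that flip elevation through the Z channel. Their video counterparts depend on the pixel-to-azimuth convention of the recording, so they are left out of this sketch.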
### Summary

With these improvements, the audio-visual SELD system proposed in the paper outperforms the existing audio-visual baseline on multiple evaluation metrics, especially in direction-of-arrival (DoA) estimation as measured by localization error (LE) and localization recall (LR).
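For readers unfamiliar with the localization error, it is built on the angle between a predicted and a reference direction of arrival. The sketch below computes that angle for a single pair of directions given as azimuth and elevation in degrees. The function names are hypothetical, and averaging over matched predictions, as a full metric would do, is omitted.

```python
import numpy as np


def unit_vector(azimuth_deg, elevation_deg):
    """Convert azimuth/elevation in degrees to a Cartesian unit vector."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return np.array([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)])


def angular_error_deg(pred_az, pred_el, ref_az, ref_el):
    """Angle in degrees between a predicted and a reference direction."""
    cos_angle = np.dot(unit_vector(pred_az, pred_el),
                       unit_vector(ref_az, ref_el))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


# A prediction 90 degrees away in azimuth, both on the horizontal plane.
print(angular_error_deg(10.0, 0.0, 100.0, 0.0))
```

A lower average angle means better localization, while localization recall counts how many reference events were detected at all.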

Enhanced Sound Event Localization and Detection in Real 360-degree audio-visual soundscapes

Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection

Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization

Dynamic Kernel Convolution Network with Scene-dedicate Training for Sound Event Localization and Detection

DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection

Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios

PSELDNets: Pre-trained Neural Networks on Large-scale Synthetic Datasets for Sound Event Localization and Detection

Sound Event Localization and Detection for Real Spatial Sound Scenes: Event-Independent Network and Data Augmentation Chains

Polyphonic sound event localization and detection using channel-wise FusionNet

The NERCSLIP-USTC System for the L3DAS23 Challenge Task2: 3D Sound Event Localization and Detection (SELD)

MVANet: Multi-Stage Video Attention Network for Sound Event Localization and Detection with Source Distance Estimation

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Squeeze-and-Excite ResNet-Conformers for Sound Event Localization, Detection, and Distance Estimation for DCASE 2024 Challenge

Joint Spatio-Temporal-Frequency Representation Learning for Improved Sound Event Localization and Detection

Sound Event Localization and Detection Using Parallel Multi-attention Enhancement

Sound Event Localization and Detection Using Imbalanced Real and Synthetic Data via Multi-Generator

Leveraging Geometrical Acoustic Simulations of Spatial Room Impulse Responses for Improved Sound Event Detection and Localization

Improving Sound Event Localization and Detection with Class-Dependent Sound Separation for Real-World Scenarios

w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training

An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection

Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks