Abstract: While the direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded with a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also provides human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.

What problem does this paper attempt to address?

The paper addresses how to use multichannel audio and video information to localize and detect sound events more accurately in real-life scenes. Specifically, it proposes an audio-visual sound event localization and detection (SELD) task, which combines audio and video data to estimate the temporal activation and direction of arrival (DOA) of target sound events. The authors also introduce a new audio-visual dataset, STARSS23, to support research on this task.

### Specific description of the problem

1. **Existing challenges**:
   - Existing SELD systems rely mainly on multichannel audio data and ignore visual information related to the sound sources.
   - In real-life scenes, sound events usually originate from visible objects. For example, footstep sounds come from the feet of walkers, and knocking sounds come from doors.
   - Most current SELD datasets use synthetic audio data, which struggles to reflect the complexity and diversity of real-life scenes.
2. **Proposed method**:
   - A new audio-visual SELD task is proposed, which combines multichannel audio and video information to estimate the temporal activation and DOA of sound events.
   - The STARSS23 dataset is introduced. It contains multichannel audio, video, and spatiotemporal annotations of sound events, and represents real-life scenes better than synthetic data.
3. **Dataset characteristics**:
   - STARSS23 contains more than 7 hours of multichannel audio and video recordings, covering 16 rooms and 57 participants.
   - The dataset has 13 types of target sound events, such as speech, music, and footstep sounds, and provides activity, DOA, and distance labels for each frame.
   - The video and audio are well aligned and can be used for training and evaluating audio-visual SELD systems.
4. **Experimental verification**:
   - The paper develops and tests an audio-visual SELD system, demonstrating that visual information improves SELD performance.
   - The system is evaluated with multiple metrics: the error rate of location-aware detection (ER20°), the F-score of location-aware detection (F20°), the localization error of class-aware localization (LE_CD), and the localization recall of class-aware localization (LR_CD).

### Summary

The main contribution of the paper is an SELD task that combines audio and video information, together with STARSS23, a dataset of real-life scenes built for this task. The experiments show that audio-visual fusion benefits sound event localization and detection. This provides new ideas and tools for future research, with particular value in fields such as intelligent environment perception and smart home applications.
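The location-aware metrics named above (ER20° and F20°) count a detection as correct only when its class matches the reference and its DOA lies close enough to the reference DOA. The sketch below illustrates that matching rule under stated assumptions: the function names are hypothetical, DOAs are taken to be (azimuth, elevation) pairs in degrees, and the 20° threshold is read off the metric names rather than from any official evaluation code.

```python
import math


def doa_to_unit_vector(azimuth_deg, elevation_deg):
    """Convert an (azimuth, elevation) pair in degrees to a Cartesian unit vector.

    Assumed convention: azimuth is measured in the horizontal plane,
    elevation upward from that plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )


def angular_error_deg(doa_a, doa_b):
    """Return the angle in degrees between two (azimuth, elevation) DOAs."""
    va = doa_to_unit_vector(*doa_a)
    vb = doa_to_unit_vector(*doa_b)
    dot = sum(a * b for a, b in zip(va, vb))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    dot = max(-1.0, min(1.0, dot))
    return math.degrees(math.acos(dot))


def is_location_aware_match(pred_class, pred_doa, ref_class, ref_doa,
                            threshold_deg=20.0):
    """True if the classes agree and the DOAs differ by at most threshold_deg.

    The 20 degree default is an assumption taken from the metric names.
    """
    same_class = pred_class == ref_class
    return same_class and angular_error_deg(pred_doa, ref_doa) <= threshold_deg


if __name__ == "__main__":
    # A footstep prediction 15 degrees away from the reference is accepted.
    print(is_location_aware_match("footstep", (25.0, 0.0), "footstep", (10.0, 0.0)))
    # The same prediction 30 degrees away is rejected.
    print(is_location_aware_match("footstep", (40.0, 0.0), "footstep", (10.0, 0.0)))
```

Comparing unit vectors rather than raw angle differences avoids wrap-around problems at the ±180° azimuth boundary; a full evaluation would additionally need to associate predictions with references per frame, which this sketch leaves out.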

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
