STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji
2023-11-14
Abstract: While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also serves human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.
Sound, Computer Vision and Pattern Recognition, Multimedia, Audio and Speech Processing, Image and Video Processing
What problem does this paper attempt to address?
This paper addresses how to use multichannel audio and video information to localize and detect sound events in real-life scenes more accurately. Specifically, it proposes an audio-visual sound event localization and detection (SELD) task, which combines audio and video data to estimate the temporal activation and direction of arrival (DOA) of target sound events. The authors also introduce a new audio-visual dataset, STARSS23, to support research on this task.

### Specific description of the problem

1. **Existing challenges**:
   - Existing SELD systems rely mainly on multichannel audio data and ignore visual information about the sound sources.
   - In real-life scenes, sound events usually originate from visible objects. For example, footstep sounds come from the feet of a walker, and knocking sounds come from a door.
   - Most current SELD datasets use synthetic audio data, which does not fully reflect the complexity and diversity of real-life scenes.
2. **Proposed method**:
   - A new audio-visual SELD task is proposed, which combines multichannel audio and video information to estimate the temporal activation and DOA of sound events.
   - The STARSS23 dataset is introduced. It contains multichannel audio, video, and spatiotemporal annotations of sound events, and better reflects real-life scenes.
3. **Dataset characteristics**:
   - STARSS23 contains more than 7 hours of multichannel audio and video recordings, covering 16 rooms and 57 participants.
   - The dataset has 13 target sound event classes, such as speech, music, and footsteps, and provides activity, DOA, and distance labels for each frame.
   - The video and audio are well aligned and can be used for training and evaluating audio-visual SELD systems.
4. **Experimental verification**:
   - The paper develops and tests an audio-visual SELD system, demonstrating that visual information improves SELD performance.
   - The system is evaluated with several metrics: the error rate (ER_20°) and F-score (F_20°) of location-aware detection, and the localization error (LE_CD) and localization recall (LR_CD) of class-aware localization.

### Summary

The main contributions of the paper are an SELD task that combines audio and video information and a large-scale dataset of real-life scenes, STARSS23, built for that task. The experiments demonstrate the benefit of audio-visual fusion for sound event localization and detection. This provides new ideas and tools for future research, and is especially valuable in fields such as intelligent environment perception and smart home applications.
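
To make the location-aware detection metrics mentioned above more concrete, the following is a minimal sketch of the underlying test, not the paper's evaluation code. It assumes the 20° subscript in the metric names denotes an angular threshold, that a DOA is given as azimuth and elevation in degrees, and that a prediction counts as correct when its class matches the reference and its DOA lies within the threshold. The function names and the tuple layout are hypothetical.

```python
import math


def angular_distance_deg(az1, el1, az2, el2):
    """Great-circle angle in degrees between two DOAs (azimuth, elevation in degrees)."""
    a1, e1, a2, e2 = map(math.radians, (az1, el1, az2, el2))
    cos_angle = (math.sin(e1) * math.sin(e2)
                 + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


def is_location_aware_hit(pred, ref, threshold_deg=20.0):
    """Return True if class labels match and the DOAs are within the threshold.

    pred, ref: hypothetical (class_label, azimuth_deg, elevation_deg) tuples.
    threshold_deg: assumed 20 degrees, following the metric names.
    """
    if pred[0] != ref[0]:
        return False
    return angular_distance_deg(pred[1], pred[2], ref[1], ref[2]) <= threshold_deg
```

Under these assumptions, a predicted footstep at 10° azimuth would match a reference footstep at 25° azimuth (15° apart) but not one at 40° azimuth (30° apart). A full metric would additionally aggregate such hits over frames and classes.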