HAAC: Hierarchical Audio Augmentation Chain for ACCDOA Described Sound Event Localization and Detection

Shichao Wu, Yongru Wang, Zhengxi Hu, Jingtai Liu
DOI: https://doi.org/10.1016/j.apacoust.2023.109541
IF: 3.614
2023-01-01
Applied Acoustics
Abstract: The goal of sound event localization and detection (SELD) is to detect the temporal activity of a known set of sound events and to localize them in space. We argue that a large audio dataset is essential for a deep neural network-based SELD system trained as a supervised task. Nonetheless, gathering and annotating such datasets is a costly and time-intensive process. Hence, various data augmentation methods have attracted attention as a way to increase sample diversity from limited collections. In this paper, we propose to augment the limited audio samples for the deep neural network-based SELD system in two ways. First, we propose the hierarchical audio augmentation chain (HAAC) for the SELD task described with the activity-coupled Cartesian direction of arrival (ACCDOA) output representation. It consists of three waveform and spectrogram augmentation techniques, carefully assembled in sequence: feature map augmentation, audio channel swapping, and finally sample mixup. Second, we propose to augment the training samples by generating more simulated audio samples, and we make the list of selected sound events publicly available to the community. Experiments on the STARSS22 dataset showed that our HAAC audio augmentation chain greatly improved SELD performance, increasing the sound event detection score by 24% and decreasing the localization error by 12.1 degrees. We demonstrate that it is a simple yet effective approach compared to other data augmentation methods. Moreover, training with additional simulated audio samples, generated by convolving selected sound events with spatial room impulse responses (SRIRs), greatly improved SELD performance.
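Two of the operations named in the abstract, generating a simulated sample by convolving a sound event with an SRIR and blending two samples with mixup, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `spatialize` and `mixup`, the array shapes, and the choice to blend the targets linearly with the same weight as the inputs are all assumptions made for illustration.

```python
import numpy as np


def spatialize(event, srir):
    """Convolve a mono sound event with each channel of a multichannel SRIR.

    event: shape (n_samples,), a mono recording (hypothetical input).
    srir:  shape (n_channels, ir_len), one impulse response per microphone channel.
    Returns an array of shape (n_channels, n_samples + ir_len - 1).
    """
    return np.stack([np.convolve(event, h) for h in srir])


def mixup(x1, y1, x2, y2, lam):
    """Blend two samples and their targets with a weight lam in [0, 1].

    Assumption: the targets (for example ACCDOA vectors) are blended linearly
    with the same weight as the inputs; the paper may handle them differently.
    """
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    event = rng.standard_normal(1000)      # stand-in for a mono sound event
    srir = rng.standard_normal((4, 200))   # stand-in for a 4-channel SRIR
    simulated = spatialize(event, srir)
    print(simulated.shape)
```

In practice the mixing weight is commonly drawn from a Beta distribution per sample pair, but the weight used in the paper is not stated in the abstract, so it is left as an explicit argument here.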