On Local Temporal Embedding for Semi-Supervised Sound Event Detection
Lijian Gao, Qirong Mao, Ming Dong
DOI: https://doi.org/10.1109/taslp.2024.3369529
2024-01-01
Abstract: The semi-supervised sound event detection (SSED) task requires recognizing the categories of events and marking each event's onset and offset times in a mixed audio recording, using a small amount of weakly labeled data and a large amount of unlabeled data. Exploring local temporal information, i.e., local discrimination and local correlations in the time domain, is therefore essential for SSED and, in particular, for precise event boundary detection. In addition, because manually labeled datasets are scarce, SSED requires exploiting unlabeled data effectively to reduce overfitting, typically through regularization techniques. Recently, self-supervised learning has provided a viable way to leverage unlabeled data for effective feature learning in various downstream tasks. In this paper, we propose LTE-Net, a novel multitask framework that learns the Local Temporal Embedding for SSED. Specifically, LTE-Net first locally down-samples the input spectrogram and learns token embeddings with a high temporal resolution (i.e., local discrimination). Then, LTE-Net models the local correlations among the token embeddings through self-supervised masked spectrogram modeling. Finally, a novel joint (self- and semi-supervision) regularization framework is employed to train LTE-Net so that it leverages unlabeled data effectively in SSED. Extensive experiments on the DCASE 2019, 2020 and 2021 SSED datasets show that LTE-Net significantly outperforms existing methods, with performance gains of 2.1 to 8.7, 2.1 to 3.9 and 1.2 to 6.1 on the evaluation sets of the 2019, 2020 and 2021 datasets, respectively.
engineering, electrical & electronic; acoustics
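
The masked spectrogram modeling step named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `mask_time_frames` and `masked_mse`, the 25% mask ratio, the span length of 4 frames, and the choice to zero out masked frames are all assumptions made for illustration. The sketch only shows the general idea of hiding contiguous time frames and scoring a reconstruction on the hidden frames alone.

```python
import numpy as np


def mask_time_frames(spec, mask_ratio=0.25, span=4, seed=0):
    """Zero out random contiguous spans of time frames (hypothetical helper).

    spec: array of shape (n_frames, n_mels).
    Returns (masked_spec, mask), where mask is a boolean vector over frames.
    """
    rng = np.random.default_rng(seed)
    n_frames = spec.shape[0]
    mask = np.zeros(n_frames, dtype=bool)
    n_target = int(round(mask_ratio * n_frames))
    # Add spans until at least the requested number of frames is masked.
    while mask.sum() < n_target:
        start = rng.integers(0, max(1, n_frames - span + 1))
        mask[start:start + span] = True
    masked = spec.copy()
    masked[mask] = 0.0  # assumption: masked frames are replaced by zeros
    return masked, mask


def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked frames only."""
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))


# Toy input: 100 frames by 64 mel bins of random values.
spec = np.random.default_rng(1).random((100, 64))
masked, mask = mask_time_frames(spec)
# A trivial "model" that predicts zeros everywhere, scored on hidden frames.
baseline_loss = masked_mse(masked, spec, mask)
```

In a self-supervised setup of this kind, a network would receive `masked` and be trained to minimize `masked_mse` against `spec`, which pushes it to infer each hidden frame from its temporal neighbors.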