Improved Self-Consistency Training with Selective Feature Fusion for Sound Event Detection

Mingyu Wang, Yunlong Li, Ying Hu
DOI: https://doi.org/10.1109/ICICSP59554.2023.10390568
2023-01-01
Abstract: Sound event detection (SED) is the joint task of identifying the categories and time boundaries of sound events within an audio clip. In this paper, we propose an improved self-consistency training (ISCT) strategy for semi-supervised SED based on the Mean Teacher (MT) method. The teacher and student models each adopt two branches with the same CRNN structure, and the two branches aid training through consistency regularization. The ISCT strategy adds a self-consistency loss to the MT loss to improve the generalization performance of the model. A selective feature fusion (SFF) module is applied in the shallow layers of the feature extraction part to selectively fuse features of different scales. A parallel attention (PA) module is applied in the deep layers of the feature extraction part to obtain richer high-level features through channel-wise and spatial-wise attention. Ablation experiments verify the effectiveness of the proposed ISCT strategy and of the SFF and PA modules. In addition, compared with four other methods, the proposed method achieves competitive performance on the DCASE 2020 Task 4 dataset.
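The training objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss definitions, so everything here is an assumption: the teacher weights follow the standard Mean Teacher exponential moving average, both consistency terms use mean squared error, the self-consistency term is taken between the two branches of the student, and the names `ema_update`, `isct_style_loss`, `w_mt` and `w_sc` are hypothetical. It is not the authors' implementation.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float],
               alpha: float = 0.999) -> List[float]:
    """Standard Mean Teacher step: teacher weights track an exponential
    moving average of the student weights (alpha is an assumed value)."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equally long prediction vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def isct_style_loss(sup_loss: float,
                    student_b1: List[float], student_b2: List[float],
                    teacher_b1: List[float], teacher_b2: List[float],
                    w_mt: float = 1.0, w_sc: float = 1.0) -> float:
    """Supervised loss + MT consistency + self-consistency.

    Assumption: the MT term compares each student branch with the matching
    teacher branch, and the self-consistency term compares the two student
    branches with each other. The weights w_mt and w_sc are hypothetical.
    """
    mt_loss = mse(student_b1, teacher_b1) + mse(student_b2, teacher_b2)
    sc_loss = mse(student_b1, student_b2)
    return sup_loss + w_mt * mt_loss + w_sc * sc_loss
```

In this sketch the self-consistency term vanishes when both student branches produce identical predictions, so it only penalizes disagreement between branches that share the same CRNN structure.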