Learning Temporal Relations from Semantic Neighbors for Acoustic Scene Classification

Liwen Zhang, Jiqing Han, Ziqiang Shi
DOI: https://doi.org/10.1109/LSP.2020.2996085
2020-01-01
IEEE Signal Processing Letters
Abstract: Convolutional networks have achieved state-of-the-art performance on Acoustic Scene Classification (ASC). Given the Log-Mel spectrogram of an audio sample, such a network extracts useful semantic content within a receptive field of limited range by stacking local convolutional operations. However, the temporal relations between different receptive fields are not captured explicitly. In this letter, we propose an end-to-end 3D Convolutional Neural Network (CNN) for ASC, named SeNoT-Net, which generates effective audio representations by capturing temporal relations from semantic neighbors of different receptive fields over time. SeNoT-Net treats the Log-Mel spectrogram as an ordered segment-level sequence. For each segment, a residual block produces semantic feature maps; the semantic neighbors over time (SeNoT) module is then applied to capture the relations between each feature point in the feature maps and its top-$k$ semantic neighbors. The proposed SeNoT-Net outperforms most state-of-the-art CNN models on both the DCASE 2018 and 2019 ASC datasets.
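The top-$k$ neighbor selection mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure or where neighbors are searched, so cosine similarity and a search restricted to the next segment are assumptions made here, and the names `cosine` and `topk_semantic_neighbors` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def topk_semantic_neighbors(segments, k=2):
    """For every feature point in segment t, return the indices of its k most
    similar feature points in segment t+1 (assumed neighbor scope).

    segments: list of T ordered segments, each a list of feature vectors.
    Returns a list of length T-1; entry t holds one index list per point.
    """
    result = []
    for cur, nxt in zip(segments, segments[1:]):
        per_point = []
        for u in cur:
            sims = [cosine(u, v) for v in nxt]
            # Rank candidate points from most to least similar.
            ranked = sorted(range(len(nxt)), key=lambda j: sims[j], reverse=True)
            per_point.append(ranked[:k])
        result.append(per_point)
    return result


# Toy input: two segments with 2-dimensional feature points.
segments = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [3.0, 0.1], [-1.0, 0.0]],
]
neighbors = topk_semantic_neighbors(segments, k=2)
```

In a full model, the selected neighbors would feed a learned relation function; that step is omitted here because the abstract does not describe it.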