Attention-based convolutional neural networks for acoustic scene classification.

Zhao Ren, Qiuqiang Kong, Kun Qian, Mark D Plumbley, Björn W Schuller
2018-01-01
Abstract: We propose a convolutional neural network (CNN) model based on an attention pooling method to classify ten different acoustic scenes, participating in the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018), which includes data from one device (subtask A) and data from three different devices (subtask B). The log mel spectrogram images of the audio waves are first forwarded to convolutional layers, and then fed into an attention pooling layer to reduce the feature dimension and achieve classification. From an attention perspective, we build a weighted evaluation of the features, instead of simple max pooling or average pooling. On the official development set of the challenge, the best accuracy of subtask A is 72.6%, which is an improvement of 12.9% when compared with the official baseline (p < .001 in a one-tailed z-test). For subtask B, the best result of our attention-based CNN is a significant improvement over the baseline as well, in which the accuracies are 71.8%, 58.3%, and 58.3% for the three devices A to C (p < .001 for device A, p < .01 for device B, and p < .05 for device C).
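To make the idea of a "weighted evaluation of the features" concrete, the following is a minimal NumPy sketch of one common form of attention pooling. It is an illustration under stated assumptions, not the authors' implementation: the two linear branches (`w_cla`, `w_att`), the softmax normalisations, and all shapes are hypothetical choices, and the random `features` array merely stands in for the flattened output of the convolutional layers.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(features, w_cla, w_att):
    """Pool per-position class scores with learned attention weights.

    features: (positions, channels) array, assumed to be the flattened
              time-frequency output of the convolutional layers.
    w_cla, w_att: (channels, classes) weights of two hypothetical
              linear branches (classification and attention).
    Returns a (classes,) vector.
    """
    cla = softmax(features @ w_cla, axis=1)  # class probabilities at each position
    att = softmax(features @ w_att, axis=0)  # per-class weights over positions, summing to 1
    return (att * cla).sum(axis=0)           # attention-weighted average over positions

# Toy input: 20 positions, 8 channels, 10 scene classes (all assumed sizes).
rng = np.random.default_rng(0)
features = rng.standard_normal((20, 8))
w_cla = rng.standard_normal((8, 10))
w_att = rng.standard_normal((8, 10))

pooled = attention_pool(features, w_cla, w_att)
predicted_scene = int(pooled.argmax())
```

In this sketch, setting `w_att` to zeros gives every position the same weight, so the result reduces to plain average pooling; non-zero attention weights let the model emphasise the positions it finds most informative, which is the contrast with max or average pooling drawn in the abstract.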