Frequency-based CNN and attention module for acoustic scene classification

Nisan Aryal, Sang-Woong Lee
DOI: https://doi.org/10.1016/j.apacoust.2023.109411
IF: 3.614
2023-05-17
Applied Acoustics
Abstract: Acoustic scene classification (ASC) is an audio classification task that identifies the environment in which a sound was recorded. Audio-related machine learning algorithms suffer from the device mismatch problem: a model trained on audio recorded with one device fails to generalize to audio recorded with another. In this study, a novel convolutional neural network, the frequency-aware convolutional neural network (FACNN), is introduced to solve the device mismatch problem by focusing on the frequency information of the audio samples. In addition, an attention module, the frequency attention network (FANet), is introduced to generate an attention map from the frequency information of the input feature maps. FANet helps the FACNN focus on the important frequency information, thus improving performance. The proposed method is trained on the TAU Urban Acoustic Scenes 2019 Mobile development dataset and the TAU Urban Acoustic Scenes 2020 Mobile development dataset. It achieves a state-of-the-art accuracy of 75.99% on the TAU Urban Acoustic Scenes 2019 Mobile development dataset and a competitive accuracy of 72.6% on the TAU Urban Acoustic Scenes 2020 Mobile development dataset. FANet is also compared with the convolutional block attention module (CBAM) and the squeeze-and-excitation network (SENet). The results show that FANet can mitigate the device mismatch problem by improving performance on unseen devices.
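The abstract does not describe how FANet is built, so the following is only a minimal sketch of the general idea of an attention map over frequency, not the authors' architecture. It assumes a feature map indexed as (channels, frequency bins, time frames), and it assumes that each frequency bin is summarized by average pooling and gated with a sigmoid. The function name `frequency_attention` and both of those choices are hypothetical.

```python
import math


def frequency_attention(feature_map):
    """Reweight a (channels, freq_bins, time_frames) feature map per frequency bin.

    Hypothetical illustration: the average pooling and sigmoid gate are
    assumptions, not the published FANet design.
    """
    n_ch = len(feature_map)
    n_freq = len(feature_map[0])
    n_time = len(feature_map[0][0])

    # Average over channels and time to get one descriptor per frequency bin.
    descriptors = [
        sum(feature_map[c][f][t] for c in range(n_ch) for t in range(n_time))
        / (n_ch * n_time)
        for f in range(n_freq)
    ]

    # Squash each descriptor into (0, 1) to form the frequency attention map.
    attention = [1.0 / (1.0 + math.exp(-d)) for d in descriptors]

    # Scale every value by the weight of its frequency bin.
    weighted = [
        [
            [feature_map[c][f][t] * attention[f] for t in range(n_time)]
            for f in range(n_freq)
        ]
        for c in range(n_ch)
    ]
    return weighted, attention
```

In this sketch every time frame and channel within a frequency bin shares one weight, so the gate can only emphasize or suppress whole frequency bands. A learned module would typically place trainable layers between the pooling step and the gate.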
acoustics