Environmental Sound Classification Via Time–Frequency Attention and Framewise Self-Attention-Based Deep Neural Networks

Bo Wu, Xiao-Ping Zhang
DOI: https://doi.org/10.1109/jiot.2021.3098464
IF: 10.6
2022-01-01
IEEE Internet of Things Journal
Abstract: Environmental sound classification (ESC) is crucial to understanding the surroundings in Internet of Things (IoT) applications. State-of-the-art deep learning approaches perform poorly on ESC in the presence of clutter interference, which is common in IoT scenarios. In this article, we present a novel deep neural network framework based on time-frequency attention and framewise self-attention (TFFS-DNN). It consists of two major novel architectures: 1) a gradient and latent feature-based DNN that generates our time-frequency attention, which can accurately locate the relevant time-frequency (i.e., spectral) features, and 2) a self-attention normalization DNN that generates our framewise self-attentions, which properly indicate the relevance of frames. By conjoining these two distinct and complementary kinds of attention with spectrograms, TFFS-DNN is able to identify the importance or relevance of the sounds in terms of time, frequency, and frame, which helps in distinguishing clutter such as background sounds and, to some extent, in model interpretation. Thus, the proposed TFFS-DNN can classify environmental sounds with clutter. An evaluation on four real-world environmental sound data sets demonstrates the superior performance of the proposed framework over several state-of-the-art schemes. Notably, we achieve 79.23% classification accuracy on the UrbanSound data set, a raw environmental sound data set that is full of clutter. The ablation study demonstrates a relative 3%-9% improvement in classification accuracy of the proposed framework over the baseline deep model.