Multiclass environmental sound classification model based on adding residual connections to self-attention layers
Mohammed M. Nasef, Mohammed M. Nabil, Amr M. Sauber
DOI: https://doi.org/10.1007/s11042-024-18421-7
IF: 2.577
2024-02-08
Multimedia Tools and Applications
Abstract: Environmental Sound Classification (ESC) is a challenging and crucial task with many important real-world applications. The challenges arise from both the inherent complexity of sound and limitations in the training data. Sound complexity in ESC comes from the fact that some audio frames can be misleading, so a classifier must understand the full sound context. In addition, data imbalance, limited samples, and a large set of classes further complicate model training and limit generalization. This paper proposes a novel Residual Self-Attention (RSA) model for robust end-to-end ESC. The proposed RSA model builds on the Convolutional Self-Attention (CSA) architecture by incorporating residual connections between self-attention layers. This addition enhances information flow and facilitates faster convergence, reducing training time by 26% compared to CSA. Mel-frequency cepstral coefficients (MFCC) are used as input features and Softmax is used for classification. The proposed RSA model is evaluated on three benchmark datasets: the imbalanced UrbanSound8k, the limited-sample ESC-10, and ESC-50, which has 50 classes. The proposed RSA model achieves accuracies of 97.8%, 96.25%, and 93.31% on these datasets respectively, demonstrating its effectiveness in addressing diverse ESC challenges.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
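
The core idea in the abstract, adding a residual (skip) connection around each self-attention layer, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it omits the convolutional front end and MFCC extraction, uses single-head scaled dot-product attention as an assumed attention form, and runs on random weights and random stand-in features. The names `self_attention`, `residual_self_attention`, and the sizes `frames`, `dim`, and `n_classes` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over a (frames, dim) input.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v


def residual_self_attention(x, w_q, w_k, w_v):
    # Residual connection: the layer input is added back to the attention output.
    return x + self_attention(x, w_q, w_k, w_v)


rng = np.random.default_rng(0)
frames, dim, n_classes = 8, 16, 10  # hypothetical sizes, not from the paper

# Random stand-in for a sequence of per-frame features (e.g. MFCC-derived).
x = rng.standard_normal((frames, dim))

# Two stacked attention layers, each with its own (assumed) projection weights.
layers = [
    tuple(rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
    for _ in range(2)
]

h = x
for w_q, w_k, w_v in layers:
    h = residual_self_attention(h, w_q, w_k, w_v)

# Pool over frames and classify with softmax, as the abstract describes at a high level.
w_out = rng.standard_normal((dim, n_classes)) * 0.1
probs = softmax(h.mean(axis=0) @ w_out)
```

Because the input is added back at every layer, each layer only has to learn a correction to its input, which is the usual argument for why residual connections ease optimization. The 26% training-time reduction reported in the abstract is the authors' measurement and is not reproduced by this sketch.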