EAViT: External Attention Vision Transformer for Audio Classification

Aquib Iqbal, Abid Hasan Zim, Md Asaduzzaman Tonmoy, Limengnan Zhou, Asad Malik, Minoru Kuribayashi
2024-08-24
Abstract: This paper presents the External Attention Vision Transformer (EAViT) model, a novel approach designed to enhance audio classification accuracy. As digital audio resources proliferate, the demand for precise and efficient audio classification systems has intensified, driven by the need for improved recommendation systems and user personalization in various applications, including music streaming platforms and environmental sound recognition. Accurate audio classification is crucial for organizing vast audio libraries into coherent categories, enabling users to find and interact with their preferred audio content more effectively. In this study, we utilize the GTZAN dataset, which comprises 1,000 music excerpts spanning ten diverse genres. Each 30-second audio clip is segmented into 3-second excerpts to enhance dataset robustness and mitigate overfitting risks, allowing for more granular feature analysis. The EAViT model integrates multi-head external attention (MEA) mechanisms into the Vision Transformer (ViT) framework, effectively capturing long-range dependencies and potential correlations between samples. This external attention (EA) mechanism employs learnable memory units that enhance the network's capacity to process complex audio features efficiently. The study demonstrates that EAViT achieves a remarkable overall accuracy of 93.99%, surpassing state-of-the-art models.
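The segmentation step described in the abstract can be illustrated with a minimal sketch. It assumes non-overlapping 3-second windows and a 22,050 Hz sample rate; neither detail is stated in the abstract, and the function name `segment_clip` is hypothetical. Under those assumptions each 30-second clip yields 10 segments, so 1,000 clips would give 10,000 training excerpts.

```python
def segment_clip(samples, sample_rate, segment_seconds=3.0):
    """Split a 1-D sequence of audio samples into non-overlapping segments.

    Trailing samples that do not fill a whole segment are dropped.
    Non-overlapping windows are an assumption, not a detail from the paper.
    """
    seg_len = int(round(sample_rate * segment_seconds))
    if seg_len <= 0:
        raise ValueError("segment length must be positive")
    n_segments = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]


# Hypothetical 30-second clip (silence as a stand-in) at an assumed sample rate.
sample_rate = 22050
clip = [0.0] * (30 * sample_rate)
segments = segment_clip(clip, sample_rate)
print(len(segments), len(segments[0]))  # 10 66150
```

When segments from the same clip land in both the training and test splits, accuracy can look better than it is, so splitting by original clip before segmenting is a common precaution.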
Sound, Information Retrieval, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses music genre classification by proposing a new model, the External Attention Vision Transformer (EAViT), to improve audio classification accuracy. As digital audio resources continue to grow, so does the demand for precise and efficient audio classification systems, especially in applications such as music streaming platforms and environmental sound recognition. To meet this demand, the authors ran experiments on the GTZAN dataset, which contains 1,000 music clips spanning ten genres. Each 30-second clip was split into 3-second segments to make the dataset more robust and reduce the risk of overfitting. EAViT introduces a multi-head external attention (MEA) mechanism into the Vision Transformer framework to capture long-range dependencies and potential correlations between samples. The external attention mechanism uses learnable memory units to improve the network's ability to process complex audio features. Experimental results show that EAViT achieved an overall accuracy of 93.99% on the GTZAN dataset, surpassing existing state-of-the-art models. The paper also reports precision, recall, and F1 score for each music genre to show how performance varies across classes.
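The external attention mechanism summarized above can be sketched in NumPy. This follows the general external-attention formulation (two small learnable memories shared across all samples, with a softmax over tokens followed by L1 normalization over memory slots); it is not the paper's implementation. The layer sizes, the random stand-ins for learned memories, and the function names are all assumptions, and the input and output linear projections a full layer would normally include are omitted.

```python
import numpy as np


def external_attention(x, mem_k, mem_v, eps=1e-9):
    """Single-head external attention.

    x:     (n_tokens, d) input features
    mem_k: (s, d) key memory (learned in a real model)
    mem_v: (s, d) value memory (learned in a real model)
    """
    attn = x @ mem_k.T                                     # (n_tokens, s)
    attn = attn - attn.max(axis=0, keepdims=True)          # stabilize the softmax
    attn = np.exp(attn)
    attn = attn / attn.sum(axis=0, keepdims=True)          # softmax over tokens
    attn = attn / (attn.sum(axis=1, keepdims=True) + eps)  # L1-normalize over memory slots
    return attn @ mem_v                                    # (n_tokens, d)


def multi_head_external_attention(x, mem_k, mem_v, n_heads):
    """Split channels into heads, share the memories across heads, then re-join."""
    n_tokens, d_model = x.shape
    assert d_model % n_heads == 0
    d_head = d_model // n_heads
    heads = x.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)  # (H, N, d_head)
    out = np.stack([external_attention(h, mem_k, mem_v) for h in heads])
    return out.transpose(1, 0, 2).reshape(n_tokens, d_model)


rng = np.random.default_rng(0)
n_tokens, d_model, n_heads, n_slots = 16, 64, 4, 8  # illustrative sizes only
x = rng.standard_normal((n_tokens, d_model))
# Random values stand in for memories that would be trained by backpropagation.
mem_k = rng.standard_normal((n_slots, d_model // n_heads))
mem_v = rng.standard_normal((n_slots, d_model // n_heads))
y = multi_head_external_attention(x, mem_k, mem_v, n_heads)
print(y.shape)  # (16, 64)
```

Because the memories have a fixed number of slots, the cost of this operation grows linearly with the number of tokens, whereas self-attention compares every token with every other token. The memories are also shared across all inputs, which is how the mechanism can reflect correlations between samples.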