EAViT: External Attention Vision Transformer for Audio Classification

Aquib Iqbal, Abid Hasan Zim, Md Asaduzzaman Tonmoy, Limengnan Zhou, Asad Malik, Minoru Kuribayashi

2024-08-24

Abstract: This paper presents the External Attention Vision Transformer (EAViT) model, a novel approach designed to enhance audio classification accuracy. As digital audio resources proliferate, the demand for precise and efficient audio classification systems has intensified, driven by the need for improved recommendation systems and user personalization in various applications, including music streaming platforms and environmental sound recognition. Accurate audio classification is crucial for organizing vast audio libraries into coherent categories, enabling users to find and interact with their preferred audio content more effectively. In this study, we utilize the GTZAN dataset, which comprises 1,000 music excerpts spanning ten diverse genres. Each 30-second audio clip is segmented into 3-second excerpts to enhance dataset robustness and mitigate overfitting risks, allowing for more granular feature analysis. The EAViT model integrates multi-head external attention (MEA) mechanisms into the Vision Transformer (ViT) framework, effectively capturing long-range dependencies and potential correlations between samples. This external attention (EA) mechanism employs learnable memory units that enhance the network's capacity to process complex audio features efficiently. The study demonstrates that EAViT achieves a remarkable overall accuracy of 93.99%, surpassing state-of-the-art models.
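The segmentation step described in the abstract can be illustrated with a short sketch. This is not the authors' preprocessing code: it assumes non-overlapping windows, a 22,050 Hz sample rate, and a hypothetical helper named `segment_clip`, none of which are stated in the text above.

```python
import numpy as np


def segment_clip(samples: np.ndarray, sample_rate: int,
                 segment_seconds: float = 3.0) -> np.ndarray:
    """Split a 1-D waveform into non-overlapping fixed-length segments.

    Assumption: windows do not overlap, and trailing samples that do not
    fill a whole segment are dropped.
    """
    seg_len = int(round(sample_rate * segment_seconds))
    n_segments = len(samples) // seg_len
    trimmed = samples[: n_segments * seg_len]
    return trimmed.reshape(n_segments, seg_len)


sample_rate = 22050  # assumed sample rate, not given in the abstract
clip = np.zeros(30 * sample_rate, dtype=np.float32)  # stand-in for one 30 s excerpt
segments = segment_clip(clip, sample_rate)
print(segments.shape)  # (10, 66150)
```

Under these assumptions each 30-second excerpt yields ten 3-second training examples, so the 1,000 excerpts would expand to roughly 10,000 examples, all inheriting the genre label of their parent clip.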

Sound, Information Retrieval, Machine Learning, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses music genre classification by proposing a new model, the External Attention Vision Transformer (EAViT), to improve audio classification accuracy. As digital audio resources grow, demand for precise and efficient audio classification systems is increasing, especially in applications such as music streaming platforms and environmental sound recognition. The authors run their experiments on the GTZAN dataset, which contains 1,000 music clips spanning ten genres. Each 30-second clip is split into 3-second segments to make the dataset more robust and reduce the risk of overfitting. EAViT introduces a multi-head external attention (MEA) mechanism into the Vision Transformer framework to capture long-range dependencies and potential correlations between samples. The external attention mechanism uses learnable memory units to strengthen the network's ability to process complex audio features. Experimental results show that EAViT achieves an overall accuracy of 93.99% on the GTZAN dataset, surpassing existing state-of-the-art models. The paper also analyzes the model's precision, recall, and F1 score for each music genre, showing that performance holds up across genres.
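To make the idea of attention over learnable memory units more concrete, here is a minimal NumPy sketch of multi-head external attention. It follows the general external-attention formulation (token-to-memory similarity, double normalization, read-out from a value memory) and is not the authors' implementation. The function name, the shapes, the random weights, and the choice to share the two memories across heads are all illustrative assumptions.

```python
import numpy as np


def multi_head_external_attention(x, w_q, m_k, m_v, w_o, num_heads):
    """Sketch of multi-head external attention for one sequence.

    x:        (N, d) token embeddings (e.g. spectrogram patches)
    w_q, w_o: (d, d) input and output projections
    m_k, m_v: (S, d // num_heads) key and value memories, assumed shared by all heads
    """
    n, d = x.shape
    head_dim = d // num_heads
    # Project tokens and split into heads: (H, N, head_dim).
    q = (x @ w_q).reshape(n, num_heads, head_dim).transpose(1, 0, 2)
    # Similarity between every token and every memory slot: (H, N, S).
    attn = q @ m_k.T
    # Double normalization: softmax over tokens, then L1 over memory slots.
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)
    attn = attn / (attn.sum(axis=2, keepdims=True) + 1e-9)
    # Read from the value memory: (H, N, head_dim).
    out = attn @ m_v
    # Merge heads back to (N, d) and apply the output projection.
    out = out.transpose(1, 0, 2).reshape(n, d)
    return out @ w_o


rng = np.random.default_rng(0)
n_tokens, dim, heads, slots = 16, 32, 4, 8  # illustrative sizes only
x = rng.standard_normal((n_tokens, dim))
w_q = rng.standard_normal((dim, dim)) * 0.1
w_o = rng.standard_normal((dim, dim)) * 0.1
m_k = rng.standard_normal((slots, dim // heads))
m_v = rng.standard_normal((slots, dim // heads))
y = multi_head_external_attention(x, w_q, m_k, m_v, w_o, heads)
print(y.shape)  # (16, 32)
```

Because the memories `m_k` and `m_v` are parameters rather than functions of the input, the same slots are used for every sample, which is one way to read the claim about capturing correlations between samples. With a fixed number of slots, the cost of this sketch also grows linearly with the number of tokens rather than quadratically as in self-attention.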
