From West to East: Who can understand the music of the others better?

Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos
2023-07-19
Abstract: Recent developments in MIR have led to several benchmark deep learning models whose embeddings can be used for a variety of downstream tasks. At the same time, the vast majority of these models have been trained on Western pop/rock music and related styles. This leads to research questions on whether these models can be used to learn representations for different music cultures and styles, or whether we can build similar music audio embedding models trained on data from different cultures or styles. To that end, we leverage transfer learning methods to derive insights about the similarities between the different music cultures to which the data belong. We use two Western music datasets, two traditional/folk datasets coming from eastern Mediterranean cultures, and two datasets belonging to Indian art music. Three deep audio embedding models are trained and transferred across domains, including two CNN-based architectures and one Transformer-based architecture, to perform auto-tagging for each target domain dataset. Experimental results show that competitive performance is achieved in all domains via transfer learning, while the best source dataset varies for each music culture. The implementation and the trained models are both provided in a public repository.
Sound, Computer Vision and Pattern Recognition, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper asks whether deep learning models from Music Information Retrieval (MIR), most of which have been trained on Western pop/rock music, can learn and transfer music audio embeddings across cultures. The researchers use six datasets from three groups of music cultures (two of Western music, two of traditional/folk music from the eastern Mediterranean, and two of Indian art music) and three deep audio embedding models (two based on Convolutional Neural Networks (CNNs) and one based on the Transformer architecture) to perform auto-tagging. Each model is trained on a source dataset and then transferred to a target dataset, which makes it possible to compare how well the models perform in different cultural contexts and which source dataset best supports each target domain. The experimental results show that transfer learning achieves competitive performance in all domains, but the best source dataset varies with the target culture. The study also offers preliminary insights into the similarities between different music cultures.
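The transfer-learning setup summarized above can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation. The "backbone" is a hypothetical stand-in (a fixed random projection) for an embedding model pretrained on a source dataset, the inputs and tags are toy values, and freezing the backbone while training only a new multi-label output layer is one common transfer strategy, assumed here for illustration. All function and variable names are invented for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def frozen_backbone(x, weights):
    """Hypothetical stand-in for a pretrained embedding model.

    The weights are only read here, never updated, which mimics a frozen
    source-domain model used as a feature extractor.
    """
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def train_tag_head(embeddings, labels, n_tags, lr=0.5, epochs=200):
    """Train a new sigmoid output layer (one unit per tag) on frozen embeddings."""
    dim = len(embeddings[0])
    W = [[0.0] * dim for _ in range(n_tags)]
    b = [0.0] * n_tags
    for _ in range(epochs):
        for e, y in zip(embeddings, labels):
            for t in range(n_tags):
                p = sigmoid(b[t] + sum(w * ei for w, ei in zip(W[t], e)))
                g = p - y[t]  # gradient of binary cross-entropy w.r.t. the logit
                for j in range(dim):
                    W[t][j] -= lr * g * e[j]
                b[t] -= lr * g
    return W, b


def predict_tags(e, W, b):
    """Return one independent probability per tag (multi-label auto-tagging)."""
    return [sigmoid(b[t] + sum(w * ei for w, ei in zip(W[t], e)))
            for t in range(len(W))]


# Toy demo: all values below are placeholders, not data from the paper.
random.seed(0)
backbone_weights = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
snapshot = [row[:] for row in backbone_weights]  # to confirm the backbone stays frozen

target_inputs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.7], [0.0, 0.8, 0.9]]
target_labels = [[1, 0], [1, 0], [0, 1], [0, 1]]  # two hypothetical tags

embeddings = [frozen_backbone(x, backbone_weights) for x in target_inputs]
W, b = train_tag_head(embeddings, target_labels, n_tags=2)
scores = predict_tags(embeddings[0], W, b)
```

Each tag gets its own sigmoid output because auto-tagging is multi-label: a recording can carry several tags at once, so the outputs are not forced to sum to one. Repeating this procedure with backbones trained on different source datasets, and comparing the resulting target-domain scores, is the kind of comparison the summary describes.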