Abstract: Music information is often conveyed or recorded across multiple data modalities, including but not limited to audio, images, text and scores. However, music information retrieval research has almost exclusively focused on single-modality recognition, requiring the development of separate models for each modality. Some multi-modal approaches require multiple coexisting modalities as model inputs, constraining their use to the few cases where data from all modalities are available. To the best of our knowledge, no existing model can take inputs from varying modalities, e.g. images or sounds, and classify them into unified music categories. We explore the use of cross-modal retrieval as a pretext task to learn modality-agnostic representations, which can then be used as inputs to classifiers that are independent of modality. We select instrument classification as an example task for our study, as both visual and audio components provide relevant semantic information. We train music instrument classifiers that can take either images or sounds as input and perform comparably to sound-only or image-only classifiers. Furthermore, we explore the case where labeled data for a given modality is limited, and the impact on performance of using labeled data from other modalities. In a zero-shot setting, we achieve almost 70% of the performance of the best-performing system. We provide a detailed analysis of experimental results to understand the potential and limitations of the approach, and discuss future steps towards modality-agnostic classifiers.

Classification and study of music genres with multimodal Spectro-Lyrical Embeddings for Music (SLEM)

Multilingual Music Genre Embeddings for Effective Cross-Lingual Music Item Annotation

A Survey on Music Genre Classification Using Multimodal Information Processing and Retrieval

Statistical and Visual Analysis of Audio, Text, and Image Features for Multi-Modal Music Genre Recognition

Enhancing Music Mood Recognition with LLMs and Audio Signal Processing: A Multimodal Approach

Modeling the Music Genre Perception across Language-Bound Cultures

Graph-Based Multimodal Music Mood Classification in Discriminative Latent Space

Improving Music Genre Classification from Multi-Modal Properties of Music and Genre Correlations Perspective

Exploring modality-agnostic representations for music classification

Exploiting Synchronized Lyrics And Vocal Features For Music Emotion Detection

Multimodal Music Mood Classification by Fusion of Audio and Lyrics

MCLEMCD: multimodal collaborative learning encoder for enhanced music classification from dances

Machine learning for music genre: multifaceted review and experimentation with audioset

Brazilian Lyrics-Based Music Genre Classification Using a BLSTM Network

A Novel Multi-Task Learning Method for Symbolic Music Emotion Recognition

Exploring Genre and Success Classification through Song Lyrics using DistilBERT: A Fun NLP Venture

Music Genre Classification using Large Language Models

A multimodal deep learning algorithm for polyphonic music applied to music sentiment analysis and generation

A computational lens into how music characterizes genre in film

Leveraging Knowledge Bases And Parallel Annotations For Music Genre Translation

Towards Robust and Truly Large-Scale Audio-Sheet Music Retrieval