Abstract: We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.

What problem does this paper attempt to address?

The problem this paper addresses is how to use the intermediate representations of a single music foundation model (MFM) to improve performance on a range of music downstream tasks. Specifically, the authors propose an MFM named SoniDo, which extracts hierarchical features from target music samples, and use these features to improve a variety of downstream tasks, covering both understanding and generative tasks.

### Specific background of the problem

1. **Requirements for music foundation models**:
   - Although many foundation models exist for language processing (such as BERT and GPT), the music field still lacks powerful foundation models that can handle multiple music downstream tasks.
   - Music downstream tasks fall into two categories: understanding tasks (such as music tagging and transcription) and generative tasks (such as mixing and mastering).
2. **Limitations of existing methods**:
   - Existing multi-task models must include all target tasks at training time, which limits their flexibility.
   - Extracting features with a large pre-trained model and injecting them into smaller task-specific models can effectively improve performance, but existing methods focus mainly on music understanding tasks and offer limited support for generative tasks.

### Main contributions of the paper

1. **Proposing the SoniDo model**:
   - SoniDo is a generative model based on multi-level Transformers, with a multi-level hierarchical encoder that extracts hierarchical features from music samples.
   - These features serve as a general-purpose booster for various music downstream tasks, including both understanding and generative tasks.
2. **Verifying the effectiveness of hierarchical representation**:
   - The authors hypothesize that hierarchical representations provide a useful information hierarchy for all downstream tasks, and they test this hypothesis experimentally.
   - Specifically, the intermediate representations of SoniDo benefit not only understanding tasks but also significantly improve performance on generative tasks.
3. **Experimental verification**:
   - The authors run experiments on several representative tasks: music tagging, music transcription, music source separation, and music mixing.
   - The results show that the features extracted by SoniDo yield significant performance improvements on these tasks, and even reach a new state-of-the-art (SOTA) level.

### Summary

The core question of this paper is how to use the intermediate representations of a single music foundation model to improve performance on multiple music downstream tasks. By proposing SoniDo and validating it with extensive experiments, the authors demonstrate the importance and effectiveness of hierarchical feature representations in music processing. This research paves the way for more effective and accessible music processing solutions.
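The feature-injection idea mentioned in the summary (features from a frozen pre-trained model are fed into a smaller task-specific model) can be illustrated with a toy sketch. This is not SoniDo's architecture or API: `extract_hierarchical_features` and `inject_features` are hypothetical names, the "encoder" is a hand-written RMS-energy stand-in for learned embeddings, and the hop sizes, sample rate, and feature dimensions are assumptions chosen only to show coarse, middle, and fine time resolutions being aligned and concatenated with a task model's own input.

```python
import numpy as np

def extract_hierarchical_features(audio, hop_sizes=(64, 256, 1024)):
    """Hypothetical stand-in for a frozen foundation model's multi-level encoder.

    Each level summarizes the waveform at a different time resolution by
    taking the RMS energy over non-overlapping windows of `hop` samples.
    A real model would return learned embeddings instead.
    """
    features = []
    for hop in hop_sizes:
        n_frames = len(audio) // hop
        frames = audio[: n_frames * hop].reshape(n_frames, hop)
        rms = np.sqrt(np.mean(frames ** 2, axis=1, keepdims=True))
        features.append(rms)  # shape: (n_frames, 1)
    return features

def inject_features(task_input, features):
    """Resample every level to the task model's frame count and concatenate."""
    n_frames = task_input.shape[0]
    aligned = []
    for feat in features:
        # Nearest-neighbour mapping from task frames to this level's frames.
        idx = np.arange(n_frames) * feat.shape[0] // n_frames
        aligned.append(feat[idx])
    return np.concatenate([task_input] + aligned, axis=1)

rng = np.random.default_rng(0)
audio = rng.standard_normal(16384)            # about 1 s at an assumed 16 kHz
task_input = rng.standard_normal((128, 40))   # 128 frames of 40 task features
levels = extract_hierarchical_features(audio)
boosted = inject_features(task_input, levels)
print(boosted.shape)  # (128, 43)
```

The downstream model would then be trained on `boosted` instead of `task_input`, while the feature extractor stays fixed. Concatenation is only one possible injection scheme and is used here because it is the simplest to show.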

Music Foundation Model as Generic Booster for Music Downstream Tasks
