Music Foundation Model as Generic Booster for Music Downstream Tasks

WeiHsiang Liao, Yuhta Takida, Yukara Ikemiya, Zhi Zhong, Chieh-Hsin Lai, Giorgio Fabbro, Kazuki Shimada, Keisuke Toyama, Kinwai Cheuk, Marco Martinez, Shusuke Takahashi, Stefan Uhlich, Taketo Akama, Woosung Choi, Yuichiro Koyama, Yuki Mitsufuji
2024-11-02
Abstract: We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.
Sound, Information Retrieval, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The problem this paper addresses is how to use the intermediate representations of a single music foundation model (MFM) to improve performance on a range of music downstream tasks. Specifically, the authors propose an MFM named SoniDo, which extracts hierarchical features from target music samples, and use these features to improve a variety of downstream tasks, covering both understanding and generative tasks.

### Specific background of the problem

1. **Requirements for music foundation models**:
   - Although there are many foundation models for language processing (such as BERT and GPT), the music field still lacks powerful foundation models that can handle multiple music downstream tasks.
   - Music downstream tasks fall into two categories: understanding tasks (such as music tagging and transcription) and generative tasks (such as mixing and mastering).
2. **Limitations of existing methods**:
   - Existing multi-task models must include all target tasks in the training stage, which limits their flexibility.
   - Extracting features with a large pre-trained model and injecting them into smaller task-specific models can improve performance, but existing methods focus mainly on music understanding tasks and offer limited support for generative tasks.

### Main contributions of the paper

1. **Proposing the SoniDo model**:
   - SoniDo is a generative model based on multi-level Transformers, with a multi-level hierarchical encoder that extracts hierarchical features from music samples.
   - These features can serve as a general-purpose booster for various music downstream tasks, including both understanding and generative tasks.
2. **Verifying the effectiveness of hierarchical representation**:
   - The authors hypothesize that hierarchical representations provide an effective information hierarchy for all downstream tasks and support this hypothesis through experiments.
   - Specifically, the intermediate representations of SoniDo benefit not only understanding tasks but also significantly improve performance on generative tasks.
3. **Experimental verification**:
   - The authors run experiments on multiple representative tasks: music tagging, music transcription, music source separation, and music mixing.
   - The results show that the features extracted by SoniDo yield significant performance improvements on these tasks and even reach a new state-of-the-art (SOTA) level.

### Summary

The core question of this paper is how to use the intermediate representations of a single music foundation model to improve performance on multiple music downstream tasks. By proposing the SoniDo model and verifying it through extensive experiments, the authors demonstrate the importance and effectiveness of hierarchical feature representations in music processing. This research paves the way for more effective and accessible music processing solutions.
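
The idea of injecting foundation-model features into a smaller task-specific model can be illustrated with a minimal sketch. This is not SoniDo's actual pipeline: the function names (`align_to_frames`, `inject_features`), the feature dimensions, the assumption that each hierarchy level has a different temporal resolution, and the choice of nearest-neighbour alignment followed by channel-wise concatenation are all hypothetical, and random arrays stand in for real extracted features.

```python
from typing import Sequence

import numpy as np


def align_to_frames(features: np.ndarray, num_frames: int) -> np.ndarray:
    """Nearest-neighbour resample a (T_level, D) sequence to (num_frames, D)."""
    level_len = features.shape[0]
    # Map every target frame to the source frame that covers it in time.
    idx = np.arange(num_frames) * level_len // num_frames
    return features[idx]


def inject_features(task_input: np.ndarray,
                    hierarchy: Sequence[np.ndarray]) -> np.ndarray:
    """Concatenate time-aligned hierarchical features onto the task input.

    task_input: (T, F) frame-wise input of the task-specific model.
    hierarchy:  per-level feature arrays, each of shape (T_level, D_level).
    """
    num_frames = task_input.shape[0]
    aligned = [align_to_frames(level, num_frames) for level in hierarchy]
    return np.concatenate([task_input, *aligned], axis=1)


# Stand-in data (assumption): 64 input frames with 128 bins, plus three
# feature levels from a frozen foundation model at decreasing time resolution.
rng = np.random.default_rng(0)
task_input = rng.standard_normal((64, 128))
hierarchy = [
    rng.standard_normal((64, 16)),  # fine level
    rng.standard_normal((16, 16)),  # middle level
    rng.standard_normal((4, 16)),   # coarse level
]

boosted = inject_features(task_input, hierarchy)
print(boosted.shape)  # (64, 176)
```

In this sketch the task-specific model would simply consume `boosted` in place of `task_input`. Concatenation is only one possible injection scheme, chosen here because it leaves the original input untouched.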