LP-MusicCaps: LLM-Based Pseudo Music Captioning

SeungHeon Doh, Keunwoo Choi, Jongpil Lee, Juhan Nam
2023-07-31
Abstract: Automatic music captioning, which generates natural language descriptions for given music tracks, holds significant potential for enhancing the understanding and organization of large volumes of musical data. Despite its importance, researchers face challenges due to the costly and time-consuming collection process of existing music-language datasets, which are limited in size. To address this data scarcity issue, we propose the use of large language models (LLMs) to artificially generate the description sentences from large-scale tag datasets. This results in approximately 2.2M captions paired with 0.5M audio clips. We term it Large Language Model based Pseudo music caption dataset, or LP-MusicCaps for short. We conduct a systematic evaluation of the large-scale music captioning dataset with various quantitative evaluation metrics used in the field of natural language processing as well as human evaluation. In addition, we trained a transformer-based music captioning model with the dataset and evaluated it under zero-shot and transfer-learning settings. The results demonstrate that our proposed approach outperforms the supervised baseline model.
Sound, Information Retrieval, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses data scarcity in automatic music captioning, the task of generating natural language descriptions for given music tracks. Such descriptions help with understanding and organizing large music collections, but existing music-language datasets are small because collecting them is costly and time-consuming. The authors propose using large language models (LLMs) to generate descriptive sentences from the tags in large-scale music tag datasets, which greatly increases the amount of paired data available for training. The resulting pseudo caption dataset, LP-MusicCaps, contains approximately 2.2 million captions paired with 500,000 audio clips. The authors evaluate the generated captions systematically, using quantitative metrics from natural language processing together with human evaluation. They also train a Transformer-based music captioning model on the dataset and evaluate it under zero-shot and transfer-learning settings, where the proposed approach outperforms the supervised baseline model. The main contribution is therefore a practical way to relieve data scarcity in music captioning, supported by an empirical study of the quality of the generated data and of a model trained on it.
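The tag-to-caption step can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the prompt wording, the function names `build_prompt` and `tags_to_pseudo_captions`, and the stand-in `echo_llm` are hypothetical and are not taken from the paper. In practice, `generate` would wrap a call to an actual LLM.

```python
from typing import Callable, Dict, List

# Hypothetical instruction wording; the paper's actual prompts are not shown here.
PROMPT_TEMPLATE = (
    "Write a one-sentence description of a music track "
    "using only these tags: {tags}."
)


def build_prompt(tags: List[str]) -> str:
    """Turn a clip's tag list into an instruction prompt for an LLM."""
    cleaned = [t.strip().lower() for t in tags if t.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty tag is required")
    # Drop duplicate tags while keeping their original order.
    unique = list(dict.fromkeys(cleaned))
    return PROMPT_TEMPLATE.format(tags=", ".join(unique))


def tags_to_pseudo_captions(
    tag_dataset: Dict[str, List[str]],
    generate: Callable[[str], str],
) -> Dict[str, str]:
    """Map each clip id to a caption produced by `generate` from its tags."""
    return {
        clip_id: generate(build_prompt(tags))
        for clip_id, tags in tag_dataset.items()
    }


def echo_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; it only restates the tags in the prompt."""
    tags = prompt.split(": ", 1)[1].rstrip(".")
    return f"A track described as {tags}."


if __name__ == "__main__":
    demo = {"clip_001": ["Rock", " guitar ", "rock"]}
    print(tags_to_pseudo_captions(demo, echo_llm))
```

Passing the generator in as a callable keeps the sketch independent of any particular LLM provider, so the same loop could be reused with different models or prompt variants.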