LP-MusicCaps: LLM-Based Pseudo Music Captioning

SeungHeon Doh, Keunwoo Choi, Jongpil Lee, Juhan Nam

2023-07-31

Abstract: Automatic music captioning, which generates natural language descriptions for given music tracks, holds significant potential for enhancing the understanding and organization of large volumes of musical data. Despite its importance, researchers face challenges due to the costly and time-consuming collection process of existing music-language datasets, which are limited in size. To address this data scarcity issue, we propose the use of large language models (LLMs) to artificially generate the description sentences from large-scale tag datasets. This results in approximately 2.2M captions paired with 0.5M audio clips. We term it Large Language Model based Pseudo music caption dataset, shortly, LP-MusicCaps. We conduct a systematic evaluation of the large-scale music captioning dataset with various quantitative evaluation metrics used in the field of natural language processing as well as human evaluation. In addition, we trained a transformer-based music captioning model with the dataset and evaluated it under zero-shot and transfer-learning settings. The results demonstrate that our proposed approach outperforms the supervised baseline model.

Sound, Information Retrieval, Multimedia, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses data scarcity in automatic music captioning, the task of generating natural language descriptions for given music segments. Such descriptions help with understanding and organizing large amounts of music data, but existing music-language datasets are small, and building them is costly and time-consuming. The authors therefore propose using large language models (LLMs) to generate descriptive sentences automatically from the tags in large-scale music tag datasets. Because tag datasets are far larger than human-written caption datasets, this greatly increases the amount of data available for training. The resulting pseudo music caption dataset, LP-MusicCaps, contains approximately 2.2 million captions paired with 500,000 audio clips. The authors also evaluate the quality of the LLM-generated captions systematically, using quantitative metrics from natural language processing together with human evaluation, and they train a Transformer-based music captioning model on the dataset, evaluating it under zero-shot and transfer-learning settings. The reported results show that the proposed approach outperforms the supervised baseline model. In summary, the paper's main contribution is a method for alleviating data scarcity in automatic music captioning, supported by an empirical study of its effectiveness.
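The tag-to-caption step described above can be illustrated with a minimal Python sketch. It only shows how a tag list might be turned into a text prompt for an LLM. The instruction wording, the function names (`normalize_tags`, `build_caption_prompt`, `captions_per_clip`), and the example tags are all assumptions made for illustration, not the authors' actual prompts or code, and no LLM is called. The last helper simply divides the two dataset sizes quoted above.

```python
from typing import Iterable, List

# Hypothetical instruction text; the paper's real prompts are not given here.
DEFAULT_INSTRUCTION = (
    "Write a one-sentence description of a music clip that has the "
    "following tags. Use only information implied by the tags."
)


def normalize_tags(tags: Iterable[str]) -> List[str]:
    """Lowercase, collapse whitespace, and drop empty or duplicate tags."""
    seen = set()
    cleaned = []
    for tag in tags:
        t = " ".join(tag.strip().lower().split())
        if t and t not in seen:
            seen.add(t)
            cleaned.append(t)
    return cleaned


def build_caption_prompt(
    tags: Iterable[str], instruction: str = DEFAULT_INSTRUCTION
) -> str:
    """Assemble a plain-text prompt that asks an LLM to caption the tags."""
    cleaned = normalize_tags(tags)
    if not cleaned:
        raise ValueError("at least one non-empty tag is required")
    return f"{instruction}\nTags: {', '.join(cleaned)}\nCaption:"


def captions_per_clip(num_captions: int, num_clips: int) -> float:
    """Average number of captions per audio clip."""
    return num_captions / num_clips


if __name__ == "__main__":
    # Example tags are made up for this sketch.
    print(build_caption_prompt(["Jazz", "saxophone ", "jazz", "slow tempo"]))
    # Ratio of the approximate dataset sizes quoted in the summary.
    print(round(captions_per_clip(2_200_000, 500_000), 1))
```

The returned prompt string would be sent to whichever LLM is in use. With the approximate figures quoted above, the ratio works out to roughly 4.4 captions per clip.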

MidiCaps: A large-scale MIDI dataset with text captions

AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models

Generating Symbolic Music from Natural Language Prompts using an LLM-Enhanced Dataset

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

ALCAP: Alignment-Augmented Music Captioner

Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

Empowering the Deaf and Hard of Hearing Community: Enhancing Video Captions Using Large Language Models

MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models

LLM-AD: Large Language Model based Audio Description System

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

CompCap: Improving Multimodal Large Language Models with Composite Captions

Music Discovery Dialogue Generation Using Human Intent Analysis and Large Language Models

CapText: Large Language Model-based Caption Generation From Image Context and Description

Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions

Music autotagging as captioning

Joint Music and Language Attention Models for Zero-shot Music Tagging

Soundscape Captioning using Sound Affective Quality Network and Large Language Model

MusicLM: Generating Music From Text

Video-driven musical composition using large language model with memory-augmented state space