Abstract: With the emergence of audio-language models, constructing large-scale paired audio-language datasets has become essential yet challenging for model development, primarily due to the time-intensive and labour-heavy demands involved. While large language models (LLMs) have improved the efficiency of synthetic audio caption generation, current approaches struggle to effectively extract and incorporate detailed audio information. In this paper, we propose an automated pipeline that integrates audio-language models for fine-grained content extraction, LLMs for synthetic caption generation, and a contrastive language-audio pretraining (CLAP) model-based refinement process to improve the quality of captions. Specifically, we employ prompt chaining techniques in the content extraction stage to obtain accurate and fine-grained audio information, and we use the refinement process to mitigate potential hallucinations in the generated captions. Leveraging the AudioSet dataset and the proposed approach, we create AudioSetCaps, a dataset comprising 1.9 million audio-caption pairs, the largest audio-caption dataset at the time of writing. Models trained with AudioSetCaps achieve state-of-the-art performance on audio-text retrieval, with R@1 scores of 46.3% for text-to-audio and 59.7% for audio-to-text retrieval, and on automated audio captioning, with a CIDEr score of 84.8. As our approach has shown promising results with AudioSetCaps, we create another dataset containing 4.1 million synthetic audio-language pairs based on the YouTube-8M and VGGSound datasets. To facilitate research in audio-language learning, we have made our pipeline, datasets with 6 million audio-language pairs, and pre-trained models publicly available at https://github.com/JishengBai/AudioSetCaps.
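
As an illustration of the refinement idea described in the abstract, the following minimal sketch keeps a generated caption only if its text embedding is sufficiently similar to the audio embedding. It is not the authors' implementation: the function names, the threshold value of 0.5, and the toy two-dimensional vectors are all assumptions made for illustration, and in practice the embeddings would come from a CLAP model's audio and text encoders.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def refine_caption(audio_emb, candidates, threshold=0.5):
    """Pick the candidate caption best aligned with the audio.

    candidates: list of (caption, text_embedding) pairs.
    threshold:  hypothetical cut-off (assumed value, not from the paper).
    Returns (caption, score); caption is None when even the best
    candidate falls below the threshold and should be regenerated.
    """
    scored = [(cosine_similarity(audio_emb, emb), cap) for cap, emb in candidates]
    best_score, best_caption = max(scored, key=lambda pair: pair[0])
    if best_score >= threshold:
        return best_caption, best_score
    return None, best_score


# Toy embeddings standing in for CLAP audio/text encoder outputs.
audio = [1.0, 0.0]
candidates = [
    ("a dog barks", [0.9, 0.1]),
    ("piano music plays", [0.0, 1.0]),
]
caption, score = refine_caption(audio, candidates)
```

Here the first candidate is retained because its embedding points in nearly the same direction as the audio embedding. Raising `threshold` rejects more captions, trading dataset size for audio-text alignment.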

MidiCaps: A large-scale MIDI dataset with text captions

Text2midi: Generating Symbolic Music from Captions

LP-MusicCaps: LLM-Based Pseudo Music Captioning

Generating Symbolic Music from Natural Language Prompts using an LLM-Enhanced Dataset

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

MusicScore: A Dataset for Music Score Modeling and Generation

FakeMusicCaps: a Dataset for Detection and Attribution of Synthetic Music Generated via Text-to-Music Models

Emotion4MIDI: a Lyrics-based Emotion-Labeled Symbolic Music Dataset

GiantMIDI-Piano: A Large-Scale MIDI Dataset for Classical Piano Music

MusicTM-Dataset for Joint Representation Learning among Sheet Music, Lyrics, and Musical Audio

Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions

MusicLM: Generating Music From Text

AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

Symbolic Music Data Version 1.0

PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

MuseCoco: Generating Symbolic Music from Text

TextCaps: a Dataset for Image Captioning with Reading Comprehension