Abstract: The AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, in the audio representation learning community, the present audio-language datasets suffer from limitations such as insufficient volume, simplistic content, and arduous collection procedures. To tackle these challenges, we present an innovative and automatic audio caption generation pipeline based on a series of public tools or APIs, and construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.9M audio-text pairs. To demonstrate the effectiveness of the proposed dataset, we train popular models on our dataset and show performance improvement on various downstream tasks, namely audio-language retrieval, audio captioning, and environment classification. In addition, we establish a novel test set and provide a benchmark for audio-text tasks. The proposed dataset will be released at https://auto-acd.github.io/.

What problem does this paper attempt to address?

This paper addresses three key limitations of existing audio-language datasets in the field of audio representation learning:

1. **Insufficient data volume**: Existing audio-language datasets such as Clotho and AudioCaps are small in scale, which makes it difficult to train large-scale models on them.
2. **Simple content**: The audio in these datasets usually contains only 1 to 3 sound events, and the descriptions are relatively simple and do not provide rich context.
3. **Difficult collection**: Collecting high-quality audio descriptions is cumbersome and costly; it relies on manual annotation and is difficult to scale.

To solve these problems, the authors propose an automated audio captioning pipeline and construct a large-scale, high-quality audio-language dataset named Auto-ACD. The dataset contains more than 1.9 million audio-text pairs and aims to improve audio representation learning in the following ways:

- **Automatically generated high-quality descriptions**: A series of public tools or APIs (such as visual scene understanding, object detection, and environmental classification) is used to generate detailed audio descriptions automatically, reducing the need for manual annotation.
- **Rich content information**: The descriptions cover not only the sound type and its source, but also details such as sound attributes and where the sound occurs, which increases their richness and diversity.
- **Large-scale data support**: The scale of the dataset is increased significantly by extracting audio from existing large-scale video datasets and generating corresponding text descriptions.

In addition, to verify the effectiveness of Auto-ACD, the authors train a variety of popular models on this dataset and show performance improvements on multiple downstream tasks (audio-language retrieval, audio captioning, and environment classification). They also establish a new test set, providing a benchmark for audio-text tasks. In summary, the paper addresses the limitations of existing datasets in data volume, content complexity, and collection efficiency by constructing a large-scale, high-quality audio-language dataset, with the aim of advancing audio representation learning.
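The idea of combining outputs from several public tools into a single caption request can be illustrated with a minimal sketch. This is not the authors' implementation: the `ClipClues` container, its field names, the `build_caption_prompt` helper, the prompt wording, and the example values are all hypothetical, and the step that would send the prompt to a text-generation model is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClipClues:
    """Hypothetical container for per-clip outputs of off-the-shelf tools."""
    sound_events: List[str] = field(default_factory=list)  # e.g. from an audio tagger
    objects: List[str] = field(default_factory=list)       # e.g. from an object detector
    scene: str = ""                                        # e.g. from a scene classifier
    frame_caption: str = ""                                # e.g. from an image captioner


def build_caption_prompt(clues: ClipClues) -> str:
    """Merge whichever clues are available into one text instruction.

    Empty clues are skipped so the prompt only contains evidence
    that a tool actually produced for this clip.
    """
    lines = ["Write one sentence describing what can be heard in this clip."]
    if clues.sound_events:
        lines.append("Sound events: " + ", ".join(clues.sound_events))
    if clues.objects:
        lines.append("Visible objects: " + ", ".join(clues.objects))
    if clues.scene:
        lines.append("Scene: " + clues.scene)
    if clues.frame_caption:
        lines.append("Frame caption: " + clues.frame_caption)
    return "\n".join(lines)


# Made-up example values, for illustration only.
example = ClipClues(
    sound_events=["rain", "thunder"],
    objects=["car"],
    scene="street",
)
prompt = build_caption_prompt(example)
print(prompt)
```

Keeping the clue collection separate from the prompt construction means any individual tool could be swapped out without touching the text-generation step, which is one plausible reason to structure such a pipeline this way.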

A Large-scale Dataset for Audio-Language Representation Learning
