A Large-scale Dataset for Audio-Language Representation Learning

Luoyi Sun, Xuenan Xu, Mengyue Wu, Weidi Xie
2023-10-03
Abstract: The AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, in the audio representation learning community, the present audio-language datasets suffer from limitations such as insufficient volume, simplistic content, and arduous collection procedures. To tackle these challenges, we present an innovative and automatic audio caption generation pipeline based on a series of public tools or APIs, and construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.9M audio-text pairs. To demonstrate the effectiveness of the proposed dataset, we train popular models on our dataset and show performance improvement on various downstream tasks, namely audio-language retrieval, audio captioning, and environment classification. In addition, we establish a novel test set and provide a benchmark for audio-text tasks. The proposed dataset will be released at https://auto-acd.github.io/.
Sound, Computer Vision and Pattern Recognition, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses three limitations of existing audio-language datasets in audio representation learning:

1. **Insufficient data volume**: Existing audio-language datasets such as Clotho and AudioCaps are too small to support the training of large-scale models.
2. **Simple content**: The audio in these datasets usually contains only 1 to 3 sound events, and the descriptions are correspondingly simple, providing little contextual information.
3. **Difficult collection**: Collecting high-quality audio descriptions is cumbersome and costly; it relies on manual annotation and is hard to scale.

To solve these problems, the authors propose an automatic audio caption generation pipeline and use it to construct a large-scale, high-quality audio-language dataset named Auto-ACD. The dataset contains more than 1.9 million audio-text pairs and aims to improve audio representation learning in the following ways:

- **Automatically generated high-quality descriptions**: A series of public tools or APIs (such as visual scene understanding, object detection, and environmental classification) is used to generate detailed audio descriptions automatically, reducing the need for manual annotation.
- **Rich content information**: The descriptions cover not only the sound type and its source but also details such as sound attributes and where the sound occurs, making them richer and more diverse.
- **Large-scale data support**: The dataset is scaled up substantially by extracting audio from existing large-scale video datasets and generating corresponding text descriptions.

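The idea of combining outputs from several tools into a single caption request can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `AudioClues` container, its field names, the `build_caption_prompt` helper, and the prompt wording are hypothetical and are not taken from the paper or from any specific tool. The sketch only shows how per-clip clues might be gathered into one text prompt that a language model could turn into a caption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AudioClues:
    """Hypothetical container for per-clip tool outputs (not the paper's schema)."""
    audio_tags: List[str] = field(default_factory=list)        # e.g. from an audio tagger
    visual_caption: str = ""                                   # e.g. from a scene-understanding model
    detected_objects: List[str] = field(default_factory=list)  # e.g. from an object detector
    scene_label: str = ""                                      # e.g. from an environment classifier


def build_caption_prompt(clues: AudioClues) -> str:
    """Assemble whichever clues are present into one instruction for a text generator."""
    parts = []
    if clues.audio_tags:
        parts.append("Sound events: " + ", ".join(clues.audio_tags))
    if clues.visual_caption:
        parts.append("Visual scene: " + clues.visual_caption)
    if clues.detected_objects:
        parts.append("Objects: " + ", ".join(clues.detected_objects))
    if clues.scene_label:
        parts.append("Environment: " + clues.scene_label)
    if not parts:
        raise ValueError("no clues available for this clip")
    # Assumed instruction wording; a real pipeline would tune this.
    header = "Write one sentence describing what can be heard, including where it occurs."
    return header + "\n" + "\n".join(parts)


if __name__ == "__main__":
    example = AudioClues(
        audio_tags=["dog barking", "wind"],
        detected_objects=["dog"],
        scene_label="park",
    )
    print(build_caption_prompt(example))
```

Keeping the clues in a typed container and skipping empty fields means a clip with partial tool output still yields a usable prompt, while a clip with no clues fails loudly rather than producing an empty request.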
In addition, to verify the effectiveness of Auto-ACD, the authors train several popular models on the dataset and report performance improvements on multiple downstream tasks (audio-language retrieval, audio captioning, and environment classification). They also establish a new test set that serves as a benchmark for audio-text tasks. In summary, the paper addresses the limitations of existing datasets in data volume, content complexity, and collection efficiency by constructing a large-scale, high-quality audio-language dataset, with the aim of advancing audio representation learning.