Abstract: With the emergence of audio-language models, constructing large-scale paired audio-language datasets has become essential for model development, yet it remains challenging, primarily because of the time and labour involved. While large language models (LLMs) have improved the efficiency of synthetic audio caption generation, current approaches struggle to extract and incorporate detailed audio information. In this paper, we propose an automated pipeline that integrates audio-language models for fine-grained content extraction, LLMs for synthetic caption generation, and a contrastive language-audio pretraining (CLAP) model-based refinement process to improve caption quality. Specifically, we employ prompt chaining in the content extraction stage to obtain accurate and fine-grained audio information, and we use the refinement process to mitigate potential hallucinations in the generated captions. Leveraging the AudioSet dataset and the proposed approach, we create AudioSetCaps, a dataset comprising 1.9 million audio-caption pairs, the largest audio-caption dataset at the time of writing. Models trained with AudioSetCaps achieve state-of-the-art performance on audio-text retrieval, with R@1 scores of 46.3% for text-to-audio and 59.7% for audio-to-text retrieval, and on automated audio captioning, with a CIDEr score of 84.8. As our approach has shown promising results with AudioSetCaps, we create another dataset containing 4.1 million synthetic audio-language pairs based on the YouTube-8M and VGGSound datasets. To facilitate research in audio-language learning, we have made our pipeline, datasets with 6 million audio-language pairs, and pre-trained models publicly available at https://github.com/JishengBai/AudioSetCaps.
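
The refinement stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that refinement means scoring each candidate caption against the audio clip by cosine similarity of their embeddings and discarding captions that fall below a threshold. The function names (`cosine_similarity`, `refine_caption`), the threshold of 0.5, the retry limit, and the toy three-dimensional embeddings are all hypothetical; a real pipeline would obtain embeddings from a CLAP model and candidate captions from an LLM.

```python
import math
from typing import Callable, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def refine_caption(
    audio_embedding: Sequence[float],
    generate_caption: Callable[[int], str],          # hypothetical stand-in for the LLM stage
    embed_text: Callable[[str], Sequence[float]],    # hypothetical stand-in for a CLAP text encoder
    threshold: float = 0.5,                          # assumed value, not taken from the paper
    max_attempts: int = 3,                           # assumed value, not taken from the paper
) -> Optional[str]:
    """Return the first candidate caption whose similarity to the audio
    reaches the threshold, or None if every attempt is rejected."""
    for attempt in range(max_attempts):
        caption = generate_caption(attempt)
        score = cosine_similarity(audio_embedding, embed_text(caption))
        if score >= threshold:
            return caption
    return None


# Toy data standing in for real CLAP embeddings.
toy_text_embeddings = {
    "a dog barks while rain falls": [0.9, 0.1, 0.0],
    "a piano plays softly": [0.0, 0.2, 0.9],
}
candidates = ["a piano plays softly", "a dog barks while rain falls"]
audio_embedding = [1.0, 0.0, 0.0]

accepted = refine_caption(
    audio_embedding,
    generate_caption=lambda attempt: candidates[attempt % len(candidates)],
    embed_text=toy_text_embeddings.__getitem__,
)
print(accepted)
```

In this toy run the first candidate is unrelated to the audio vector and is rejected, so the second candidate is accepted. Returning `None` when no caption passes is one possible design choice; a pipeline could instead keep the highest-scoring caption.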

Language-based Audio Retrieval with GPT-Augmented Captions and Self-Attended Audio Clips

Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets

Bridging Language Gaps in Audio-Text Retrieval

Retrieval-Augmented Text-to-Audio Generation

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

Audio–text retrieval based on contrastive learning and collaborative attention mechanism

Audio Retrieval with WavText5K and CLAP Training

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

RECAP: Retrieval-Augmented Audio Captioning

Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval

Text-based Audio Retrieval by Learning from Similarities between Audio Captions

AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts

Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation

Introducing Auxiliary Text Query-modifier to Content-based Audio Retrieval

Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions

Distance Sampling-based Paraphraser Leveraging ChatGPT for Text Data Manipulation

Self Attention Re-encoding and Linguistic Ability Preserving for Context-Aware Video Captioning

AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Exploring the Role of Audio in Video Captioning

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation