Language-based Audio Retrieval with GPT-Augmented Captions and Self-Attended Audio Clips

Fuyu Gu, Yang Gu, Yiyan Xu, Haoran Sun, Yushan Pan, Shengchen Li, Haiyang Zhang
DOI: https://doi.org/10.1109/cscwd61410.2024.10580534
2024-01-01
Abstract: With the explosion of user-generated content in recent years, efficient methods for organizing multimedia databases by content and retrieving relevant items have become essential. Language-based audio retrieval seeks to find relevant audio clips given natural language queries. However, datasets developed specifically for this task are scarce, and their language annotations often carry biases, leading to unsatisfactory retrieval accuracy. In this work, we propose a novel framework for language-based audio retrieval that aims to: 1) use GPT-generated text to augment audio captions, thereby improving language diversity; 2) employ audio self-attention mechanisms to capture intricate acoustic features and temporal dependencies. Experiments on two public datasets, containing both short- and long-duration audio clips, demonstrate that our framework achieves significant performance improvements over other methods. Specifically, it achieves a 27% increase in mean average precision (mAP) on the Clotho dataset and a 31% improvement in mAP on the AudioCaps dataset compared with the baseline.
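Since the reported gains are stated in mean average precision (mAP), a minimal sketch of how that metric is commonly computed for a retrieval task may help. This is an illustration only, not the authors' evaluation code: the function names, the clip identifiers, and the absence of a rank cutoff are all assumptions, and the abstract does not say whether a cutoff (such as the top 10 results) was applied.

```python
from typing import Sequence, Set


def average_precision(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Average precision for one text query.

    ranked_ids: audio clip ids ordered from most to least similar (hypothetical ids).
    relevant:   ids of the clips that truly match the query.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, clip_id in enumerate(ranked_ids, start=1):
        if clip_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant clip appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Sequence[Sequence[str]], relevants: Sequence[Set[str]]
) -> float:
    """Mean of the per-query average precision over all queries."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Two toy queries with made-up clip ids.
    rankings = [["a", "b", "c"], ["x", "y"]]
    relevants = [{"a", "c"}, {"y"}]
    print(round(mean_average_precision(rankings, relevants), 4))
```

In this toy case the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean rewards rankings that place matching clips near the top.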