DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval

Yifei Xin, Xuxin Cheng, Zhihong Zhu, Xusheng Yang, Yuexian Zou
2024-10-17
Abstract: Existing audio-text retrieval (ATR) methods are essentially discriminative models that aim to maximize the conditional likelihood, represented as p(candidates|query). Nevertheless, this methodology fails to consider the intrinsic data distribution p(query), leading to difficulties in discerning out-of-distribution data. In this work, we attempt to tackle this constraint through a generative perspective and model the relationship between audio and text as their joint probability p(candidates,query). To this end, we present a diffusion-based ATR framework (DiffATR), which models ATR as an iterative procedure that progressively generates the joint distribution from noise. Throughout its training phase, DiffATR is optimized from both generative and discriminative viewpoints: the generator is refined through a generation loss, while the feature extractor benefits from a contrastive loss, thus combining the merits of both methodologies. Experiments on the AudioCaps and Clotho datasets show superior performance, verifying the effectiveness of our approach. Notably, without any alterations, our DiffATR consistently exhibits strong performance in out-of-domain retrieval settings.
Sound, Information Retrieval, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the limitations of existing audio-text retrieval (ATR) methods on unseen or out-of-domain data. Existing ATR methods are mainly discriminative models that maximize the conditional likelihood \( p(\text{candidates} \mid \text{query}) \). This objective does not account for the distribution of the data itself, \( p(\text{query}) \), which leads to poor performance on out-of-domain data.

To address this, the authors introduce DiffATR, a generative framework based on diffusion models that models the joint probability distribution \( p(\text{candidates}, \text{query}) \) between audio and text. DiffATR treats ATR as an iterative process that gradually generates the joint distribution from noise, and it combines generative and discriminative optimization to improve the model's generalization and adaptability.

### Main contributions:

1. **A generative perspective on ATR, proposed for the first time**: The paper proposes a new ATR framework (DiffATR) based on diffusion models, which models ATR as an iterative process of gradually generating a joint distribution from noise.
2. **Significantly improved retrieval performance on multiple ATR benchmark datasets**: The approach can also be generalized to different baseline models.
3. **Strong performance on out-of-domain ATR tasks**: DiffATR maintains good performance without additional adjustment.

With these contributions, DiffATR achieves better results on standard ATR tasks and generalizes better to out-of-domain data.
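
The distinction between the conditional \( p(\text{candidates} \mid \text{query}) \) and the joint \( p(\text{candidates}, \text{query}) \) can be illustrated with a toy probability table. The sketch below is not DiffATR and contains no diffusion model. The table `joint`, its values, and the helper names `marginal_query` and `conditional` are all hypothetical, chosen only to show that normalizing by \( p(\text{query}) \) discards how likely the query itself is.

```python
import numpy as np

# Hypothetical joint distribution p(candidate, query), made up for illustration.
# Rows: 3 candidates. Columns: 2 queries.
# Query 0 is common; query 1 is rare (a stand-in for an out-of-distribution query).
joint = np.array([
    [0.60, 0.008],
    [0.25, 0.001],
    [0.14, 0.001],
])


def marginal_query(joint: np.ndarray) -> np.ndarray:
    """p(query): sum the joint over all candidates."""
    return joint.sum(axis=0)


def conditional(joint: np.ndarray) -> np.ndarray:
    """p(candidate | query) = p(candidate, query) / p(query)."""
    return joint / marginal_query(joint)


p_query = marginal_query(joint)  # how likely each query is under the data
p_cond = conditional(joint)      # what a discriminative model targets

# Each column of the conditional sums to 1, regardless of how rare the query is,
# so the conditional alone cannot signal that query 1 is unusual.
print("p(query):", p_query)
print("p(candidate | query 1):", p_cond[:, 1])

# The joint keeps both pieces of information: p(c, q) = p(c | q) * p(q).
reconstructed = p_cond * p_query
print("joint recovered:", np.allclose(reconstructed, joint))
```

With these made-up numbers, the conditional for the rare query is roughly (0.8, 0.1, 0.1), which looks confident even though \( p(\text{query}) \) is only about 0.01. A model of the joint retains that second quantity, which is the motivation the summary gives for a generative formulation.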