DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval

Yifei Xin, Xuxin Cheng, Zhihong Zhu, Xusheng Yang, Yuexian Zou

2024-10-17

Abstract: Existing audio-text retrieval (ATR) methods are essentially discriminative models that aim to maximize the conditional likelihood, represented as p(candidates|query). Nevertheless, this methodology fails to consider the intrinsic data distribution p(query), leading to difficulties in discerning out-of-distribution data. In this work, we attempt to tackle this constraint through a generative perspective and model the relationship between audio and text as their joint probability p(candidates,query). To this end, we present a diffusion-based ATR framework (DiffATR), which models ATR as an iterative procedure that progressively generates the joint distribution from noise. Throughout its training phase, DiffATR is optimized from both generative and discriminative viewpoints: the generator is refined through a generation loss, while the feature extractor benefits from a contrastive loss, thus combining the merits of both methodologies. Experiments on the AudioCaps and Clotho datasets show superior performance and verify the effectiveness of our approach. Notably, without any alterations, our DiffATR consistently exhibits strong performance in out-of-domain retrieval settings.

Sound, Information Retrieval, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses the limitations of existing audio-text retrieval (ATR) methods on unseen or out-of-domain data. Existing ATR methods are mainly discriminative models that maximize the conditional likelihood \( p(\text{candidates} \mid \text{query}) \). This objective does not account for the distribution of the query data itself, \( p(\text{query}) \), which makes out-of-distribution inputs hard to recognize and leads to poor performance on out-of-domain data.

To address this, the authors introduce DiffATR, a generative framework based on a diffusion model. It models the joint probability \( p(\text{candidates}, \text{query}) \) of audio and text, and treats ATR as an iterative process that gradually generates this joint distribution from noise. Training combines a generative and a discriminative objective to improve the model's generalization and adaptability.

### Main contributions:

1. **Approach the ATR task from a generative perspective for the first time**: the paper proposes a new diffusion-based ATR framework (DiffATR), which models ATR as an iterative process of gradually generating a joint distribution from noise.
2. **Significantly improve retrieval performance on multiple ATR benchmark datasets**, with gains that carry over to different baseline models.
3. **Perform strongly on out-of-domain ATR tasks**, maintaining good performance without additional adjustment.

Together, these results indicate that DiffATR improves on standard ATR tasks and generalizes better to out-of-domain data.
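The generative and discriminative training signals mentioned above can be illustrated with a small, self-contained sketch. This is not the authors' implementation. It assumes a symmetric InfoNCE form for the contrastive loss, a DDPM-style linear noise schedule for the forward diffusion step, and a mean-squared-error generation loss. All function names, the temperature, and the schedule constants are hypothetical choices made for illustration only.

```python
import math
import random


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_total = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_total for v in row]


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over a square audio-text similarity matrix.

    sim[i][j] is the similarity of audio i and text j; matched pairs lie
    on the diagonal. This is the discriminative view, p(candidates | query).
    The temperature value is an assumed default, not taken from the paper.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        a2t = log_softmax([sim[i][j] / temperature for j in range(n)])
        t2a = log_softmax([sim[j][i] / temperature for j in range(n)])
        loss -= 0.5 * (a2t[i] + t2a[i])
    return loss / n


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Product of (1 - beta_s) for s = 1..t under an assumed linear schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def add_noise(x0, t, num_steps, rng):
    """Forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, eps ~ N(0, 1)."""
    a = alpha_bar(t, num_steps)
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


def generation_loss(predicted, target):
    """Mean squared error between a denoised estimate and the clean target."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)
```

In this sketch, `x0` would be a target distribution over candidates for one query (for example a one-hot vector marking the matched item), `add_noise(x0, t, num_steps, random.Random(0))` produces its noised version at step `t`, and a learned denoiser (omitted here) would be trained so that `generation_loss` between its output and `x0` is small. The total objective would then be the sum of `generation_loss` and `contrastive_loss`, mirroring the split between generator and feature extractor described in the abstract.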

