GPA: Global and Prototype Alignment for Audio-Text Retrieval

Yuxin Xie, Zhihong Zhu, Xianwei Zhuang, Liming Liang, Zhichang Wang, Yuexian Zou
DOI: https://doi.org/10.21437/interspeech.2024-1642
2024-01-01
Abstract: Recent Audio-Text Retrieval (ATR) models have made steady progress by modeling semantic interaction between audio and text pairs. To move beyond this coarse-grained global interaction, we must tackle the more challenging fine-grained cross-modal interactions between audio and text. In this paper, we present GPA, an ATR method that achieves both Global (coarse-grained) and Prototype (fine-grained) Alignment. Specifically, in addition to the vanilla global contrast between audio and text pairs, we model the frames in audio and the words in text as prototypes and align them to generate a prototype similarity matrix. Based on this matrix, we introduce a Learnable Attention Similarity Scoring module, which takes into account the information across different prototype pairs to obtain the retrieval score. Finally, we apply the Sinkhorn-Knopp algorithm to refine the retrieval score. Experimental results on two benchmark datasets show superior performance, supporting the efficacy of the proposed GPA.
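The prototype-level pipeline described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the function names, the cosine similarity, the temperatures, and the iteration count are all assumptions made for illustration, and `attention_score` is a fixed softmax pooling that merely stands in for the paper's Learnable Attention Similarity Scoring module, whose actual form is not given here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def prototype_similarity(audio_frames, text_words):
    """Cosine similarity between every audio frame and every word.

    audio_frames: (n_frames, dim), text_words: (n_words, dim).
    Returns an (n_frames, n_words) prototype similarity matrix.
    """
    return l2_normalize(audio_frames) @ l2_normalize(text_words).T


def attention_score(sim, temperature=0.1):
    """Pool a similarity matrix into one scalar with softmax weights.

    Hypothetical stand-in for a learnable scoring module: larger
    similarities receive larger weights, so the result is a weighted
    average of the entries of `sim`.
    """
    weights = np.exp((sim - sim.max()) / temperature)
    weights /= weights.sum()
    return float((weights * sim).sum())


def retrieval_scores(audio_batch, text_batch):
    """Score every (audio clip, caption) pair in a batch."""
    return np.array([
        [attention_score(prototype_similarity(a, t)) for t in text_batch]
        for a in audio_batch
    ])


def sinkhorn_knopp(scores, n_iters=50, temperature=1.0):
    """Alternately normalize rows and columns of exp(scores / temperature).

    For a square score matrix the result approaches a doubly stochastic
    matrix, which discourages one caption from dominating many clips.
    """
    k = np.exp((scores - scores.max()) / temperature)
    for _ in range(n_iters):
        k /= k.sum(axis=1, keepdims=True)  # each row sums to 1
        k /= k.sum(axis=0, keepdims=True)  # each column sums to 1
    return k


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three clips with different frame counts and three captions, dim 16.
    audio_batch = [rng.normal(size=(n, 16)) for n in (20, 35, 12)]
    text_batch = [rng.normal(size=(n, 16)) for n in (7, 5, 9)]
    scores = retrieval_scores(audio_batch, text_batch)
    print(sinkhorn_knopp(scores))
```

In this sketch the Sinkhorn-Knopp step is applied to the full batch score matrix after scoring, which matches the abstract's description of modifying the retrieval score but is only one plausible way to place it.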