Multi-modal Fine-grained Retrieval with Local and Global Cross-Attention.
Qiaosong Chen, Ye Zhang, Junzhuo Liu, Zhixiang Wang, Xin Deng, Jin Wang
DOI: https://doi.org/10.1109/ICUFN57995.2023.10200668
2023-01-01
Abstract: Cross-modal retrieval lets a user submit a sample of any modality as a query and returns related samples from the other modalities. Current cross-modal retrieval methods focus mainly on coarse-grained retrieval, which is far from satisfactory in practical applications. Fine-grained retrieval, however, poses many difficulties, such as the heterogeneity gap and the semantic gap between multi-modal data, the difficulty of measuring similarity, and the small differences between fine-grained sample features. To overcome these limitations, we propose a novel multi-modal fine-grained retrieval method with the LAGC-Attention module, which fully extracts and fuses feature information from different modalities and represents it in a common space. Specifically, we use local and global cross self-attention to extract neighboring and global context information for the data of each modality, which greatly enhances the feature representation capability of each modality (image, text, audio, video) and, in particular, reduces the gap between the feature distributions of different modalities. Finally, extensive experiments and ablation studies demonstrate that our method achieves state-of-the-art performance on the public dataset PKU FG-XMedia.
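The difference between local (neighboring) and global context can be illustrated with a minimal sketch. This is not the LAGC-Attention module itself, whose details the abstract does not give; it only assumes plain scaled dot-product self-attention, with a hypothetical `window` parameter restricting each position to its neighbors for the local branch, and simple concatenation of the two branches as the fusion step. All function names here are illustrative.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, window=None):
    """Scaled dot-product attention over lists of vectors.

    window=None -> global: every query attends to every position.
    window=w    -> local: query i attends only to positions j
                   with |i - j| <= w (assumed neighborhood rule).
    """
    d = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        idx = [j for j in range(len(keys))
               if window is None or abs(i - j) <= window]
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in idx]
        weights = softmax(scores)
        out.append([sum(w * values[j][k] for w, j in zip(weights, idx))
                    for k in range(len(values[0]))])
    return out

def local_global_features(tokens, window=1):
    # Assumed fusion: concatenate the local and global context vectors.
    local_ctx = attention(tokens, tokens, tokens, window=window)
    global_ctx = attention(tokens, tokens, tokens)
    return [l + g for l, g in zip(local_ctx, global_ctx)]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = local_global_features(tokens, window=1)
```

Each output vector has twice the input dimension: the first half summarizes a position's neighborhood, the second half summarizes the whole sequence. A real system would add learned projections and a shared embedding space across modalities, which this sketch omits.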