Bi-Directional Image-Text Retrieval with Position Attention and Similarity Filtering

Wei Zhang, Xiaowei Xu, Ye Tao, Xiaodong Wang, Chenglin Wang, Zhimin Wei
DOI: https://doi.org/10.1109/icivc55077.2022.9886989
2022-01-01
Abstract: Cross-modal retrieval has become an active research area in recent years and is an important direction for the future development of information retrieval. Some current methods improve image-text retrieval either by exploring more comprehensive global image-text alignment or by capturing fine-grained local alignment between regions and words. However, these methods do not exploit additional useful information that could yield more accurate matching scores. In this paper, we propose a position attention and similarity filtering network for image-text retrieval. Specifically, we enhance visual-text joint embedding learning with both global and local alignments. We then use position attention to exploit the location information of objects in images, which establishes more reliable relationships between images and sentences. In addition, a similarity filtering mechanism selectively focuses on important and representative alignments while setting aside meaningless ones, so that the alignments are integrated effectively. Experiments on the public datasets Flickr30K and MS-COCO validate the effectiveness and superiority of our method.