Multi-level Local Alignment and Semantic Matching Network for Image-Text Retrieval

Jiang Zhukai, Lian Zhichao
DOI: https://doi.org/10.1007/978-3-031-15934-3_18
2022-01-01
Abstract: Image-text retrieval is a challenging task at the intersection of vision and language. Existing methods mainly compute the similarity of image-text pairs by aligning image regions with text words. Although these methods, which rely on fine-grained local features, achieve good results, they explore only the correspondence between salient objects and ignore the deeper semantic information expressed by the image and the text as a whole. We therefore propose a novel multi-level local alignment and semantic matching network (MLASM) that introduces a multi-level semantic matching module after local alignment. This module supplies the model with richer semantic information for understanding the complex correlations between images and texts. Experimental results on two benchmark datasets, Flickr30K and MS-COCO, show that MLASM achieves state-of-the-art performance.
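To make the notion of region-word local alignment concrete, the following is a minimal sketch of one common way such a similarity can be computed. It is not the MLASM architecture described in the paper: the function names `cosine` and `local_alignment_score`, the max-over-regions and mean-over-words aggregation, and the toy feature vectors are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def local_alignment_score(regions, words):
    """Illustrative image-text score (an assumption, not the paper's formula):
    for each word, take its best-matching region, then average over words."""
    if not regions or not words:
        raise ValueError("need at least one region and one word")
    best_per_word = [max(cosine(w, r) for r in regions) for w in words]
    return sum(best_per_word) / len(best_per_word)


# Toy, made-up features: 3 image regions and 2 words in a shared 3-d space.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
words = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
score = local_alignment_score(regions, words)
```

A score of this kind depends only on the best region for each word, which illustrates the limitation the abstract points out: it captures correspondences between salient objects and words but carries no signal about the meaning of the image or sentence as a whole.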