Image-text Retrieval with Main Semantics Consistency
Yi Xie, Yangtao Wang, Yanzhao Xie, Xin Tan, Jingjing Li, Xiaocui Li, Weilong Peng, Maobin Tang, Meie Fang
DOI: https://doi.org/10.1145/3627673.3679619
2024-01-01
Abstract: Image-text retrieval (ITR) is one of the primary tasks in cross-modal retrieval and serves as a crucial bridge between computer vision and natural language processing. Significant progress has been made toward global and local alignment between images and texts by mapping both modalities into a common space and establishing correspondences between them. However, the rich semantic content of an image can cause false matches, in which the matched text ignores the image's main semantics and focuses instead on secondary or unrelated semantics. To address this issue, this paper proposes a semantically optimized approach with a novel Main Semantics Consistency (MSC) loss function, which aims to rank the images (or texts) that are semantically most similar to the given query at the top of the retrieval results. First, in each batch of image-text pairs, we separately compute (i) the image-image similarity, i.e., the similarity between every pair of images, (ii) the text-text similarity, i.e., the similarity between the group of texts belonging to one image and the group of texts belonging to another image, and (iii) the image-text similarity, i.e., the similarity between each image and each text. MSC then aligns these image-image, image-text, and text-text similarities, based on the observation that the main semantics of two images will be close if their text descriptions are highly semantically consistent. In this way, we capture the main semantics of each image so that it can be matched with its corresponding texts, prioritizing the most semantically related retrieval results. Extensive experiments on MSCOCO and FLICKR30K verify that MSC outperforms state-of-the-art (SOTA) image-text retrieval methods. The source code is released on GitHub: https://github.com/xyi007/MSC.
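The three per-batch similarities described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' MSC implementation: the abstract does not give the loss formula, so the cosine similarity, the mean pooling over caption groups, the squared-difference penalty, and the names `batch_similarities` and `consistency_penalty` are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def batch_similarities(img_emb, txt_emb):
    """Compute the three similarity matrices for one batch.

    img_emb: (B, D) array, one embedding per image.
    txt_emb: (B, K, D) array, K caption embeddings per image.
    Cosine similarity and mean pooling are assumptions of this sketch.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    B, K, D = txt.shape
    flat = txt.reshape(B * K, D)

    # (i) image-image: similarity between every pair of images.
    sim_ii = img @ img.T  # (B, B)

    # (ii) text-text: similarity between the caption group of image i and
    # the caption group of image j, pooled here by a mean over caption pairs.
    pair = flat @ flat.T  # (B*K, B*K)
    sim_tt = pair.reshape(B, K, B, K).mean(axis=(1, 3))  # (B, B)

    # (iii) image-text: similarity between each image and each caption.
    sim_it = img @ flat.T  # (B, B*K)
    return sim_ii, sim_tt, sim_it


def consistency_penalty(sim_ii, sim_tt, sim_it, K):
    """Hypothetical alignment term: pull image-image and (group-pooled)
    image-text similarities toward the text-text similarities."""
    B = sim_ii.shape[0]
    sim_it_group = sim_it.reshape(B, B, K).mean(axis=2)  # (B, B)
    return float(
        np.mean((sim_ii - sim_tt) ** 2) + np.mean((sim_it_group - sim_tt) ** 2)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 8))       # 4 images, 8-dim embeddings
    captions = rng.normal(size=(4, 5, 8))  # 5 captions per image
    s_ii, s_tt, s_it = batch_similarities(images, captions)
    print(s_ii.shape, s_tt.shape, s_it.shape)
    print(consistency_penalty(s_ii, s_tt, s_it, K=5))
```

In this sketch the penalty is zero when the three similarity structures agree exactly, which mirrors the abstract's intuition that two images should be close when their descriptions are semantically consistent. The released repository should be consulted for the actual loss.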