Image-text Retrieval via Preserving Main Semantics of Vision

Xu Zhang, Xinzheng Niu, Philippe Fournier-Viger, Xudong Dai
2023-04-28
Abstract: Image-text retrieval is one of the major tasks of cross-modal retrieval. Several approaches for this task map images and texts into a common space to create correspondences between the two modalities. However, due to the content (semantics) richness of an image, redundant secondary information in an image may cause false matches. To address this issue, this paper presents a semantic optimization approach, implemented as a Visual Semantic Loss (VSL), to assist the model in focusing on an image's main content. This approach is inspired by how people typically annotate the content of an image by describing its main content. Thus, we leverage the annotated texts corresponding to an image to assist the model in capturing the main content of the image, reducing the negative impact of secondary content. Extensive experiments on two benchmark datasets (MSCOCO and Flickr30K) demonstrate the superior performance of our method. The code is available at: https://github.com/ZhangXu0963/VSL
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses a key issue in image-text retrieval: when an image contains secondary content, the model may fail to capture the image's main semantics, which leads to incorrect matches.

#### Specific Problem Description:

1. **Cross-modal Retrieval**: The task requires building a common representation space for images and texts and learning how the two modalities align, so that the similarity of image-text pairs can be measured accurately.
2. **Main Semantic Capture**: Because images carry rich semantic information, secondary content can interfere with the model's capture of the main semantics and cause incorrect matches. Existing methods usually do not distinguish between main and secondary content in an image.

#### Solution Overview:

The paper proposes a new metric learning method, Visual Semantic Loss (VSL), which uses the annotated text of an image to help the model capture the image's main semantics, thereby improving the accuracy of image-text retrieval.

#### Main Contributions:

1. **Identifying Challenges**: It points out that texts and images are matched incorrectly in cross-modal retrieval when the model focuses on secondary content instead of the main content.
2. **Proposing a Method**: It designs a new Visual Semantic Loss (VSL) function, which uses the annotated text of an image to help the model capture the main semantics, improving the consistency between the text and the main content of the image.
3. **Experimental Validation**: Extensive quantitative and qualitative experiments on two commonly used datasets (MSCOCO and Flickr30K) demonstrate the effectiveness and superiority of the proposed method.
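
#### Illustrative Sketch: Matching in a Common Space

To make the idea of a common representation space more concrete, the sketch below scores image-text pairs with cosine similarity and computes a generic hinge-based triplet ranking loss with hardest in-batch negatives. This is not the paper's VSL, whose formulation is not given in this summary. It only illustrates the kind of metric learning objective that such a loss would complement. The function names, the toy embeddings, and the margin of 0.2 are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(image_embs, text_embs):
    """sim[i][j] is the similarity of image i and text j in the common space."""
    return [[cosine(img, txt) for txt in text_embs] for img in image_embs]


def triplet_ranking_loss(sim, margin=0.2):
    """Generic hinge-based ranking loss (an assumption, NOT the paper's VSL).

    sim[i][i] is taken to be the matched pair; for each pair the hardest
    non-matching text and the hardest non-matching image are penalized
    when they come within `margin` of the matched score.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_text - pos)
        total += max(0.0, margin + hardest_image - pos)
    return total / n


def rank_texts(sim_row):
    """Text indices for one image, sorted from most to least similar."""
    return sorted(range(len(sim_row)), key=lambda j: sim_row[j], reverse=True)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings; real models use learned encoders.
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[0.9, 0.1], [0.2, 0.8]]
    sim = similarity_matrix(images, texts)
    print(rank_texts(sim[0]), triplet_ranking_loss(sim))
```

In this framing, a false match caused by secondary content would show up as a non-matching text scoring close to, or above, the matched text for an image. That is the situation the ranking loss penalizes.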