Image-text Retrieval via Preserving Main Semantics of Vision

Xu Zhang,Xinzheng Niu,Philippe Fournier-Viger,Xudong Dai

2023-04-28

Abstract: Image-text retrieval is one of the major tasks of cross-modal retrieval. Several approaches for this task map images and texts into a common space to create correspondences between the two modalities. However, due to the content (semantics) richness of an image, redundant secondary information in an image may cause false matches. To address this issue, this paper presents a semantic optimization approach, implemented as a Visual Semantic Loss (VSL), to assist the model in focusing on an image's main content. This approach is inspired by how people typically annotate the content of an image by describing its main content. Thus, we leverage the annotated texts corresponding to an image to assist the model in capturing the main content of the image, reducing the negative impact of secondary content. Extensive experiments on two benchmark datasets (MSCOCO and Flickr30K) demonstrate the superior performance of our method. The code is available at: https://github.com/ZhangXu0963/VSL.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses a key issue in image-text retrieval: when an image contains secondary content, the model may fail to capture the image's main semantics, which leads to incorrect matches.

#### Specific Problem Description:

1. **Cross-modal Retrieval**: The task requires constructing a common representation space for images and texts and learning the alignment between them, so that the similarity of image-text pairs can be measured accurately.
2. **Main Semantic Capture**: Images carry rich semantic information, so secondary content may interfere with the model's capture of the main semantics and lead to incorrect matches. Existing methods usually do not distinguish between main and secondary content in images.

#### Solution Overview:

The paper proposes a new metric learning method, Visual Semantic Loss (VSL), which uses the annotated texts of an image to help the model capture the image's main semantics, thereby improving the accuracy of image-text retrieval.

#### Main Contributions:

1. **Identifying Challenges**: It points out that texts and images are mismatched in cross-modal retrieval when the model focuses on secondary content rather than the main content of an image.
2. **Proposing a Method**: It designs a new Visual Semantic Loss (VSL) function, which uses the annotated texts of an image to help the model capture the main semantics, enhancing the consistency between the text and the main content of the image.
3. **Experimental Validation**: Extensive quantitative and qualitative experiments on two commonly used datasets (MSCOCO and Flickr30K) demonstrate the effectiveness and superiority of the proposed method.
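To make the "common representation space" idea concrete, here is a minimal sketch of the generic setup that this line of work builds on: cosine similarity between image and text embeddings, a hinge-based ranking loss over mismatched pairs, and Recall@K for evaluation. This is not the paper's VSL, whose formulation is not given in this summary. The function names, the margin of 0.2, and the toy embeddings are all illustrative assumptions.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length so a dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def similarity_matrix(image_embs, text_embs):
    # sim[i][j] = cosine similarity between image i and text j.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def hinge_ranking_loss(sim, margin=0.2):
    # Generic bidirectional hinge loss (assumed baseline, not VSL).
    # Pair (k, k) is treated as the matched image-text pair.
    n = len(sim)
    loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            loss += max(0.0, margin - pos + sim[i][j])  # image i vs. wrong text j
            loss += max(0.0, margin - pos + sim[j][i])  # text i vs. wrong image j
    return loss / n

def recall_at_k(sim, k=1):
    # Fraction of images whose matched text ranks within the top k.
    n = len(sim)
    hits = 0
    for i in range(n):
        ranked = sorted(range(n), key=lambda j: sim[i][j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / n

# Toy 2-D embeddings (hypothetical values, for illustration only).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8]]
sim = similarity_matrix(image_embs, text_embs)
print(recall_at_k(sim, k=1), hinge_ranking_loss(sim))
```

In a setup like this, an image dominated by secondary content would yield an embedding that scores highly against the wrong caption, which is the failure mode the summary describes; a loss such as VSL would add a term that counteracts it.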

Image-text Retrieval with Main Semantics Consistency

Image-Text Retrieval with Cross-Modal Semantic Importance Consistency

Semantic Completion: Enhancing Image-Text Retrieval with Information Extraction and Compression

Image-Text Embedding Learning Via Visual and Textual Semantic Reasoning

Cross-Modal Image-Text Retrieval with Semantic Consistency

Visual Semantic Reasoning for Image-Text Matching

Multi-Task Visual Semantic Embedding Network for Image-Text Retrieval

Commonsense-Guided Semantic and Relational Consistencies for Image-Text Retrieval

Super Visual Semantic Embedding for Cross-Modal Image-Text Retrieval

Beyond visual semantics: Exploring the role of scene text in image understanding

Multilateral Semantic Relations Modeling for Image Text Retrieval

Rethinking Benchmarks for Cross-modal Image-text Retrieval

Multi-view and region reasoning semantic enhancement for image-text retrieval

Consensus-Aware Visual-Semantic Embedding for Image-Text Matching

Image Retrieval Based on Visual Semantics and RSSVM

Multi-Head Attention Driven Dynamic Visual-Semantic Embedding for Enhanced Image-Text Matching

Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval

Multi-view visual semantic embedding for cross-modal image–text retrieval

Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking

Visual context learning based on textual knowledge for image-text retrieval