Multi-view and region reasoning semantic enhancement for image-text retrieval

Wengang Cheng, Ziyi Han, Di He, Lifang Wu
DOI: https://doi.org/10.1007/s00530-024-01383-z
IF: 3.9
2024-06-17
Multimedia Systems
Abstract: Image and text retrieval is a crucial topic in the fields of language and vision. The key to successful image-text retrieval is achieving accurate cross-modal representation and capturing the essential correlations between images and sentences or between words and regions. While existing work has designed intricate interactions to capture these correlations, challenges remain due to inadequate feature representations, such as insufficient text descriptions of images and ambiguous region representations. To address these challenges, we propose a novel approach, multi-view and region reasoning semantic enhancement, for image and text retrieval, which aims to enhance the semantic representation of features from both the textual and visual modalities. Specifically, considering that an image can have multiple corresponding texts from different perspectives, with each text describing a single view, we devise a multi-view textual semantic enhancement module. This module takes advantage of the positive textual cues provided by the corresponding image to make up for the limited knowledge in single-text views and to produce a comprehensive image-based textual representation. Then, to address the semantic diversity of an image, we design a region reasoning semantic enhancement module that employs a graph structure to integrate both semantic and spatial reasoning knowledge from different regions, thereby clarifying the semantics of image regions and enhancing the overall semantic understanding of these regions. Extensive experiments and analyses demonstrate the superior performance of the proposed method on the Flickr30K and MSCOCO datasets, validating the effectiveness of the proposed solution.
computer science, information systems, theory & methods
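The abstract describes the region reasoning module only at a high level: a graph over image regions that combines semantic and spatial knowledge. As a rough illustration of that general idea, and not the authors' implementation, the sketch below builds two edge sets over a handful of regions (semantic affinity from feature dot products, spatial affinity from bounding-box overlap), mixes them, and adds the aggregated neighbor features back to each region. The function names, the IoU-based spatial edges, the mixing weight `alpha`, and the residual update are all assumptions made for this example.

```python
import math


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(row):
    """Scale a non-negative row to sum to 1 (left unchanged if it sums to 0)."""
    total = sum(row)
    return [v / total for v in row] if total > 0 else list(row)


def region_reasoning(feats, boxes, alpha=0.5):
    """Hypothetical one-step graph update over image regions.

    feats: list of region feature vectors (lists of floats, equal length)
    boxes: list of (x1, y1, x2, y2) boxes, one per region
    alpha: assumed weight balancing semantic vs. spatial edges
    """
    n = len(feats)
    dim = len(feats[0])
    # Semantic edges: softmax over pairwise dot products of region features.
    semantic = [
        softmax([sum(x * y for x, y in zip(feats[i], feats[j])) for j in range(n)])
        for i in range(n)
    ]
    # Spatial edges: row-normalized IoU between region boxes.
    spatial = [
        normalize([iou(boxes[i], boxes[j]) for j in range(n)])
        for i in range(n)
    ]
    enhanced = []
    for i in range(n):
        weights = [alpha * semantic[i][j] + (1 - alpha) * spatial[i][j] for j in range(n)]
        aggregated = [sum(weights[j] * feats[j][d] for j in range(n)) for d in range(dim)]
        # Residual connection: keep the original feature and add neighbor context.
        enhanced.append([f + a for f, a in zip(feats[i], aggregated)])
    return enhanced
```

A real system would typically use learned projections and detector features instead of raw dot products; this sketch only shows how semantic and spatial edges could be merged into one message-passing step.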