Scene graph semantic inference for image and text matching

Jiaming Pei, Kaiyang Zhong, Zhi Yu, Lukun Wang, Kuruva Lakshmanna
DOI: https://doi.org/10.1145/3563390
IF: 1.471
2022-09-14
ACM Transactions on Asian and Low-Resource Language Information Processing
Abstract: With the rapid development of information technology, the volume of image and text data has increased dramatically. Image and text matching techniques enable computers to understand information from both the visual and textual modalities and to match them based on semantic content. Existing methods focus on co-occurrence statistics of visual and textual objects and learn only coarse-level associations. However, the lack of intramodal semantic inference leads to the failure of fine-level association between modalities. Scene graphs can capture the interactions between visual and textual objects and model intramodal semantic associations, which are crucial for understanding the scenes contained in images and text. In this paper, we propose a novel scene graph semantic inference network (SGSIN) for image and text matching that effectively learns fine-level semantic information in vision and text to help bridge cross-modal discrepancies. Specifically, we design two matching modules and construct scene graphs within each matching module to aggregate neighborhood information, refine the semantic representation of each object, and achieve fine-level alignment of the visual and textual modalities. We perform extensive experiments on Flickr30k and MSCOCO and achieve state-of-the-art results, which validate the advantages of our proposed approach.
computer science, artificial intelligence
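The abstract describes refining each object's representation by aggregating information from its scene graph neighbors, then aligning the two modalities at a fine level. The following is a minimal sketch of that general idea only, not the SGSIN architecture itself: the abstract does not specify the aggregation rule or the similarity function, so the mean aggregation, the cosine similarity, and the names `mean_aggregate`, `cosine`, and `match_score` are all illustrative assumptions.

```python
import math


def mean_aggregate(features, edges):
    """One round of neighborhood aggregation over an undirected graph.

    Assumption: each node's refined vector is the plain average of its own
    vector and its neighbors' vectors (the paper may use a learned rule).
    """
    neighbors = {i: [i] for i in range(len(features))}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    refined = []
    for i, own in enumerate(features):
        group = [features[j] for j in neighbors[i]]
        refined.append(
            [sum(v[d] for v in group) / len(group) for d in range(len(own))]
        )
    return refined


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_score(image_nodes, text_nodes):
    """Fine-level score: each text node is paired with its most similar
    image node, and the best similarities are averaged (an assumption)."""
    best = [max(cosine(t, r) for r in image_nodes) for t in text_nodes]
    return sum(best) / len(best)


# Hypothetical toy data: three image regions and two text objects,
# each with a 2-dimensional feature and one relation edge per graph.
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_edges = [(0, 1)]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
text_edges = [(0, 1)]

score = match_score(
    mean_aggregate(image_feats, image_edges),
    mean_aggregate(text_feats, text_edges),
)
print(f"image-text match score: {score:.3f}")
```

In this sketch the graph structure changes the object vectors before matching, so two objects that look dissimilar in isolation can score higher once their related neighbors are taken into account.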