HSGMP: Heterogeneous Scene Graph Message Passing for Cross-modal Retrieval

Yu Duan, Yun Xiong, Yao Zhang, Yuwei Fu, Yangyong Zhu
DOI: https://doi.org/10.1145/3460426.3463650
2021-01-01
Abstract: Semantic relationship information is important for the image-text retrieval task. Existing work usually extracts relationship information by computing pairwise relationship values, which rarely yields meaningful semantic relationships. A more reasonable approach is to convert each modality into a scene graph, thereby modeling relationships explicitly. A scene graph is a graph data structure that models the scene of a modality. It contains two concepts: objects and relationships. In the image modality, an object is an image region and a relationship is the predicate between image regions. In the text modality, an object is an entity and a relationship is the association between entities, also known as a semantic relationship. In the image-text retrieval task, both objects and relationships are important, and a key challenge is obtaining semantic information. In this paper, images and text are represented as two kinds of scene graphs, a visual scene graph and a textual scene graph, which are then combined into a Heterogeneous Scene Graph (HSG). By explicitly modeling relationships with a directed graph, information can be passed edge-wise. To further extract semantic information, we introduce the metapath, which extracts specific semantic information along a specified path. Moreover, we propose Heterogeneous Message Passing (HMP) to propagate information along the metapath. After message passing, the similarity of the two modalities can be represented as the similarity of the graphs. Experiments show that the model achieves competitive results on Flickr30K and MSCOCO, which indicates that our approach has advantages in image-text retrieval.
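The pipeline the abstract describes (typed object and relationship nodes, directed edges, message passing restricted to a metapath, then graph-level similarity) can be illustrated with a toy sketch. This is not the paper's HMP: the mean aggregation, the mean-pooled graph embedding, the cosine similarity, the two-dimensional features, and every name below (`pass_messages`, `graph_embedding`, `cosine`, the example graphs) are assumptions chosen only to make the idea concrete.

```python
import math

# Hypothetical toy scene graphs: each node has a type and a feature vector.
# Directed edges run object -> relationship -> object.
visual_nodes = {
    "man":    ("object",       [1.0, 0.0]),
    "riding": ("relationship", [0.0, 1.0]),
    "horse":  ("object",       [1.0, 1.0]),
}
textual_nodes = {
    "man":   ("object",       [0.9, 0.1]),
    "rides": ("relationship", [0.1, 0.9]),
    "horse": ("object",       [0.8, 1.0]),
}
visual_edges = [("man", "riding"), ("riding", "horse")]
textual_edges = [("man", "rides"), ("rides", "horse")]

def pass_messages(nodes, edges, metapath):
    """One round of mean aggregation (an assumed stand-in for HMP).

    A message travels along an edge only if its (source type, target type)
    pair appears as consecutive steps of the metapath.
    """
    allowed = set(zip(metapath, metapath[1:]))
    incoming = {name: [] for name in nodes}
    for src, dst in edges:
        if (nodes[src][0], nodes[dst][0]) in allowed:
            incoming[dst].append(nodes[src][1])
    updated = {}
    for name, (kind, feat) in nodes.items():
        msgs = incoming[name]
        if msgs:
            mean = [sum(col) / len(msgs) for col in zip(*msgs)]
            # Average the node's own feature with the aggregated message.
            feat = [(a + b) / 2 for a, b in zip(feat, mean)]
        updated[name] = (kind, feat)
    return updated

def graph_embedding(nodes):
    """Mean-pool node features into a single graph vector (assumption)."""
    feats = [feat for _, feat in nodes.values()]
    return [sum(col) / len(feats) for col in zip(*feats)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

metapath = ("object", "relationship", "object")
visual = pass_messages(visual_nodes, visual_edges, metapath)
textual = pass_messages(textual_nodes, textual_edges, metapath)
similarity = cosine(graph_embedding(visual), graph_embedding(textual))
print(f"image-text similarity: {similarity:.3f}")
```

In this sketch the metapath acts purely as a filter on which typed edges may carry messages; a learned model would replace the fixed averaging with trainable, type-specific transformations.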