Rethinking referring relationships from a perspective of mask-level relational reasoning
Chengyang Li, Liping Zhu, Gangyi Tian, Yi Hou, Heng Zhou
DOI: https://doi.org/10.1016/j.patcog.2022.109044
IF: 8
2022-09-27
Pattern Recognition
Abstract: Referring relationship aims to localize the subject and object entities in an image according to a text triple <subject, predicate, object>. Previous methods use iterative attention that shifts between image regions to model the predicate. However, the predicate is sometimes implicit and difficult to represent in the image domain, and modeling it with convolutions is overly simple and ill-suited to expressing it. Moreover, the relational reasoning information in the text itself is not fully utilized. To this end, we rethink referring relationships from a mask-level relational reasoning perspective to improve model interpretability. For text-to-image reasoning, we design Mask Generate and Mask Transfer modules to fully integrate the text priors into the reasoning and prediction of masks. For image-to-text reasoning, we propose an unsupervised triple reconstruction method to guide text-to-image reasoning and improve multimodal generalization. Through bi-directional reasoning between image and text, the proposed method, MRR, fully conforms to the multimodal relational reasoning process. Experiments show that MRR achieves state-of-the-art performance on two referring relationship datasets, VRD and Visual Genome.
computer science, artificial intelligence; engineering, electrical & electronic