Referring Expression Grounding by Marginalizing Scene Graph Likelihood
Daqing Liu, Hanwang Zhang, Zheng-Jun Zha, Fanglin Wang
2019-01-01
Abstract: We focus on the task of grounding referring expressions in images, e.g., localizing "the white truck in front of a yellow one". To resolve this task fundamentally, one should first find the contextual objects (e.g., the "yellow" truck) and then exploit them to disambiguate the referent from other similar objects, using the attributes and relationships (e.g., "white", "yellow", "in front of"). However, it is extremely challenging to train such a model, because ground-truth annotations of the contextual objects and their relationships are usually missing due to the prohibitive annotation cost. Therefore, nearly all existing methods evade this joint grounding and reasoning process and instead resort to a holistic association between the sentence and region features. As a result, they suffer from the heavy parameter count of fully-connected layers, poor interpretability, and limited generalization to unseen expressions. In this paper, we tackle this challenge by training and performing inference with the proposed Marginalized Scene Graph Likelihood (MSGL). Specifically, we use a scene graph: a graphical representation parsed from the referring expression, where the nodes are objects with attributes and the edges are relationships. Thanks to the conditional random field (CRF) built on the scene graph, we can ground every object to its corresponding region and reason with the unlabeled contexts by marginalizing them out using sum-product belief propagation. Overall, the proposed MSGL is effective and interpretable: on three benchmarks, MSGL consistently outperforms state-of-the-art methods while offering a complete grounding of all the objects in a sentence.
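To make the marginalization step concrete, here is a minimal sketch for the smallest possible scene graph: one referent node, one context node, and one relationship edge. It is an illustration under stated assumptions, not the paper's implementation. The function names, the three candidate regions, and all potential values are hypothetical; in MSGL the potentials would come from learned models, which are not described in this abstract.

```python
from itertools import product


def referent_marginal(unary_ref, unary_ctx, pairwise):
    """Marginal over referent regions with the unlabeled context summed out.

    unary_ref[i]   : compatibility of region i with the referent phrase
    unary_ctx[j]   : compatibility of region j with the context phrase
    pairwise[i][j] : compatibility of (referent=i, context=j) with the relation
    All values are hypothetical toy potentials (non-negative scores).
    """
    n = len(unary_ref)
    # Sum-product message sent from the context node to the referent node.
    message = [
        sum(unary_ctx[j] * pairwise[i][j] for j in range(n)) for i in range(n)
    ]
    belief = [unary_ref[i] * message[i] for i in range(n)]
    z = sum(belief)  # partition function of this two-node graph
    return [b / z for b in belief]


def brute_force_marginal(unary_ref, unary_ctx, pairwise):
    """Same marginal by enumerating the full joint, used only as a check."""
    n = len(unary_ref)
    joint = {
        (i, j): unary_ref[i] * unary_ctx[j] * pairwise[i][j]
        for i, j in product(range(n), repeat=2)
    }
    z = sum(joint.values())
    return [sum(joint[(i, j)] for j in range(n)) / z for i in range(n)]


# Toy image with three candidate regions (assumed for illustration):
# 0 = white truck in front of the yellow one, 1 = another white truck,
# 2 = the yellow truck.
unary_ref = [0.9, 0.9, 0.1]   # "white truck"
unary_ctx = [0.1, 0.1, 0.9]   # "yellow one"
pairwise = [                  # "in front of"
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.1],
    [0.5, 0.5, 0.0],
]

marginal = referent_marginal(unary_ref, unary_ctx, pairwise)
best_region = max(range(len(marginal)), key=marginal.__getitem__)
```

The two white trucks have identical unary scores, so only the message from the context node separates them. On a tree-structured scene graph with more nodes, the same message computation would be repeated from the leaves toward the referent.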