Language-Conditioned Region Proposal and Retrieval Network for Referring Expression Comprehension

Yanwei Xie, Daqing Liu, Xuejin Chen, Zheng-Jun Zha
DOI: https://doi.org/10.1145/3463945.3469055
2021-01-01
Abstract: Referring expression comprehension (REC) is a multi-modal task that aims to localize target regions in images according to language descriptions. Existing methods fall into two categories: proposal-based methods and proposal-free methods. Proposal-based methods first detect all candidate objects in the image and then retrieve the target among those objects based on the language description, while proposal-free methods directly locate the region based on the language without any region proposals. However, proposal-based methods rely on separate region proposal networks that are not well suited to this task, and proposal-free methods cannot perform the fine-grained visual-language alignment needed for higher precision. To overcome these drawbacks, we propose a language-conditioned region proposal and retrieval network that first detects only those regions related to the language and then retrieves the target region by compositional reasoning on the language. Specifically, the proposed network consists of a language-conditioned region proposal network (LC-RPN) that detects language-related regions, and a language-conditioned region retrieval network (LC-RRN) that performs region retrieval with a full understanding of the language. A pre-training mechanism is proposed to teach the model language decomposition and vision-language alignment. Experimental results demonstrate that the proposed method achieves leading performance with high inference speed on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks.
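The two-stage idea in the abstract (propose only language-related regions, then retrieve the target among them) can be illustrated with a toy sketch. This is not the paper's LC-RPN or LC-RRN: the function names `propose` and `retrieve`, the hand-made feature vectors, and the use of cosine similarity as a scoring function are all assumptions made purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def propose(regions, query_vec, keep=2):
    """Stage 1 stand-in (hypothetical): keep the regions most related
    to the whole expression instead of every detected object."""
    ranked = sorted(
        regions, key=lambda r: cosine(r["feat"], query_vec), reverse=True
    )
    return ranked[:keep]


def retrieve(candidates, phrase_vecs):
    """Stage 2 stand-in (hypothetical): score each candidate against
    every phrase of the decomposed expression and pick the best sum."""
    return max(
        candidates,
        key=lambda r: sum(cosine(r["feat"], p) for p in phrase_vecs),
    )


# Made-up region features and boxes (x1, y1, x2, y2); not real data.
regions = [
    {"box": (10, 20, 50, 80), "feat": [0.9, 0.1, 0.0]},
    {"box": (60, 20, 120, 90), "feat": [0.8, 0.6, 0.0]},
    {"box": (0, 0, 200, 40), "feat": [0.0, 0.1, 0.9]},
]
query_vec = [1.0, 0.5, 0.0]                        # whole-expression embedding
phrase_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # decomposed phrases

candidates = propose(regions, query_vec, keep=2)
target = retrieve(candidates, phrase_vecs)
print(target["box"])
```

In this toy setup the third region is unrelated to the query and is dropped in the first stage, so the second stage only compares the two remaining candidates phrase by phrase.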