Towards Unsupervised Referring Expression Comprehension with Visual Semantic Parsing

Yaodong Wang, Zhong Ji, Di Wang, Yanwei Pang, Xuelong Li
DOI: https://doi.org/10.1016/j.knosys.2023.111318
IF: 8.139
2024-01-12
Knowledge-Based Systems
Abstract: Referring Expression Comprehension (REC) is the task of localizing, with a bounding box, the specific object in an image that a given referring query describes. Existing fully-supervised or weakly-supervised REC methods rely on either manually annotated regional coordinates or query texts. In this paper, we propose an unsupervised training paradigm for the REC task that does not require any manually annotated data. Specifically, we introduce a Visual-Semantic-Parsing-based Unsupervised Referring Expression Comprehension framework (VUREC), which leverages a Visual Semantic Parser (VSP) as its core module to recognize the rich semantic information in images and construct pseudo region-query pairs as the training supervision for REC. The VSP comprises a Scene Graph Parser (SGP) and a Visual Concept Detector (VCD), which together detect the locations, categories, and attributes of objects in images, as well as the visual relationships among them. Furthermore, we present a Referring Expression Reasoning (RER) model that utilizes a Multi-Modal Cascade Attention Decoder (MCAD) for fine-grained multi-modality fusion and directly regresses the four coordinates of the referred object. Experimental results on three benchmark datasets, RefCOCO, RefCOCO+ and RefCOCOg, demonstrate the effectiveness of the proposed method.
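To make the idea of pseudo region-query pairs concrete, the following is a minimal sketch of how detected objects, attributes, and relationships might be turned into training pairs. It is an illustration under stated assumptions, not the paper's implementation: the names `DetectedObject`, `describe`, and `build_pseudo_pairs`, the input layout, and the simple "attributes + category" and "subject predicate object" templates are all hypothetical, since the abstract does not specify how queries are composed.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format
Relation = Tuple[int, str, int]          # (subject index, predicate, object index)


@dataclass
class DetectedObject:
    """Hypothetical container for one parsed object (location, category, attributes)."""
    box: Box
    category: str
    attributes: List[str] = field(default_factory=list)


def describe(obj: DetectedObject) -> str:
    """Compose a noun phrase from attributes and category (assumed template)."""
    return " ".join(obj.attributes + [obj.category])


def build_pseudo_pairs(
    objects: List[DetectedObject], relations: List[Relation]
) -> List[Tuple[Box, str]]:
    """Build (region, query) pairs from parsed objects and relationships."""
    # One attribute-level query per detected object.
    pairs = [(obj.box, describe(obj)) for obj in objects]
    # One relationship-level query per relation, grounded on the subject's box.
    for subj_idx, predicate, obj_idx in relations:
        query = f"{describe(objects[subj_idx])} {predicate} {describe(objects[obj_idx])}"
        pairs.append((objects[subj_idx].box, query))
    return pairs


objects = [
    DetectedObject((10, 20, 110, 220), "dog", ["brown"]),
    DetectedObject((0, 150, 300, 240), "sofa"),
]
relations = [(0, "on", 1)]
pairs = build_pseudo_pairs(objects, relations)
for box, query in pairs:
    print(box, query)
```

In this sketch each relationship query is paired with the subject's box, so a REC model trained on such pairs would learn to regress the region of the phrase's head noun.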
computer science, artificial intelligence