Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression

Jingcheng Ke, Dele Wang, Jun-Cheng Chen, I-Hong Jhuo, Chia-Wen Lin, Yen-Yu Lin
2024-09-05
Abstract: One common belief is that with complex models and pre-training on large-scale datasets, transformer-based methods for referring expression comprehension (REC) perform much better than existing graph-based methods. We observe that since most graph-based methods adopt an off-the-shelf detector to locate candidate objects (i.e., regions detected by the object detector), they face two challenges that result in subpar performance: (1) the presence of significant noise caused by numerous irrelevant objects during reasoning, and (2) inaccurate localization outcomes attributed to the provided detector. To address these issues, we introduce a plug-and-adapt module guided by sub-expressions, called dynamic gate constraint (DGC), which can adaptively disable irrelevant proposals and their connections in graphs during reasoning. We further introduce an expression-guided regression strategy (EGR) to refine location prediction. Extensive experimental results on the RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, RefClef, and Ref-reasoning datasets demonstrate the effectiveness of the DGC module and the EGR strategy in consistently boosting the performance of various graph-based REC methods. Without any pre-training, the proposed graph-based method achieves better performance than the state-of-the-art (SOTA) transformer-based methods.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
The paper addresses the performance gap between graph-based and transformer-based methods in referring expression comprehension (REC), the task of locating the object in an image that a natural-language expression refers to. The authors observe that current graph-based methods face two main challenges:

1. **Noise during reasoning**: Most graph-based methods rely on an off-the-shelf object detector to produce candidate objects, so the graph contains many irrelevant objects that introduce significant noise during reasoning.
2. **Inaccurate localization**: The final bounding box is limited by the accuracy of the detector's proposals, so the target object is often not localized precisely.

To address these issues, the authors propose two components:

- **Dynamic Gate Constraint (DGC) module**: A plug-and-adapt module that, guided by sub-expressions, adaptively disables irrelevant proposals (graph nodes) and their connections during reasoning, reducing the impact of noise.
- **Expression-guided Regression (EGR) strategy**: A regression step that uses the expression to refine the predicted location, mitigating the inaccurate localization inherited from the detector.

Experiments on RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, RefClef, and Ref-reasoning show that DGC and EGR consistently improve the performance of various graph-based REC methods. Without any pre-training, the proposed graph-based method outperforms state-of-the-art transformer-based methods.
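
The idea of disabling irrelevant proposals and their connections can be illustrated with a small Python sketch. This is not the authors' implementation: the function names (`cosine`, `gate_graph`), the cosine-similarity score, the fixed `threshold`, and the toy 2-D features are all assumptions made for illustration, and the gate described in the paper is learned and dynamic rather than a hard-coded cutoff.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def gate_graph(node_feats, adjacency, sub_expr, threshold=0.5):
    """Hypothetical gating step: keep a proposal only if it matches the sub-expression.

    node_feats: one feature vector per detected proposal (graph node).
    adjacency:  square 0/1 matrix of edges between proposals.
    sub_expr:   embedding of the current sub-expression (assumed to live in the
                same space as the node features).
    Returns (gates, pruned_adjacency), where gates[i] is 1 for kept nodes.
    """
    gates = [1 if cosine(f, sub_expr) >= threshold else 0 for f in node_feats]
    n = len(node_feats)
    # An edge survives only if both of its endpoints are still enabled.
    pruned = [
        [adjacency[i][j] * gates[i] * gates[j] for j in range(n)]
        for i in range(n)
    ]
    return gates, pruned


# Toy example: three proposals, fully connected; the third is unrelated
# to the sub-expression and should be switched off along with its edges.
node_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
adjacency = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
sub_expr = [1.0, 0.0]

gates, pruned = gate_graph(node_feats, adjacency, sub_expr)
print(gates)   # [1, 1, 0]
print(pruned)  # [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```

Masking edges by the product of both endpoint gates ensures that a disabled proposal can neither send nor receive messages in later graph reasoning steps, which is the noise-reduction effect the summary attributes to DGC.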