QueryMatch: A Query-based Contrastive Learning Framework for Weakly Supervised Visual Grounding
Shengxin Chen, Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Guannan Jiang, Rongrong Ji
DOI: https://doi.org/10.1145/3664647.3681058
2024-01-01
Abstract: Visual grounding is the task of locating the object referred to by a natural language description. To reduce annotation costs, recent work has turned to one-stage weakly supervised methods for visual grounding, which typically adopt the anchor-text matching paradigm. Despite the efficiency of this paradigm, we find that anchor representations are often noisy and insufficient to describe object information, which inevitably hinders vision-language alignment. In this paper, we propose a novel query-based one-stage framework for weakly supervised visual grounding, namely QueryMatch. Unlike previous work, QueryMatch represents candidate objects with a set of query features, which inherently establish accurate one-to-one associations with visual objects. QueryMatch thus reformulates weakly supervised visual grounding as a query-text matching problem, which can be optimized via query-based contrastive learning. Building on QueryMatch, we further propose a strategy for effective weakly supervised learning, namely Active Query Selection (AQS). AQS enhances query-based contrastive learning by actively selecting high-quality query features, which greatly benefits the weakly supervised learning of QueryMatch. To validate our approach, we conduct extensive experiments on three benchmark datasets across two grounding tasks, i.e., referring expression comprehension (REC) and referring expression segmentation (RES). Experimental results not only show the state-of-the-art performance of QueryMatch on both tasks, e.g., over +5% IoU@0.5 on RefCOCO in REC and over +20% mIoU on RefCOCO in RES, but also confirm the effectiveness of AQS in weakly supervised learning. Source code is available at https://github.com/TensorThinker/QueryMatch.
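The query-text matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the loss form, the scoring rule, or the AQS selection criterion. The sketch assumes an InfoNCE-style loss over a batch, scores an image-text pair by its best-matching query under cosine similarity, and uses a top-k filter on a confidence score as a placeholder for query selection. All function names, the `temperature` value, and the `k` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (plain lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def select_top_k(queries, scores, k):
    """Placeholder for query selection (assumption, not the paper's AQS rule):
    keep the k queries with the highest confidence scores."""
    order = sorted(range(len(queries)), key=lambda i: scores[i], reverse=True)
    return [queries[i] for i in order[:k]]


def image_text_score(queries, text):
    """Score an image-text pair by its best-matching query (assumed rule)."""
    return max(cosine(q, text) for q in queries)


def query_text_contrastive_loss(batch_queries, batch_texts, temperature=0.1):
    """InfoNCE-style loss (assumed form): each text should match the queries
    of its own image better than the queries of other images in the batch."""
    n = len(batch_texts)
    total = 0.0
    for i in range(n):
        logits = [
            image_text_score(batch_queries[j], batch_texts[i]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += -(logits[i] - log_denom)
    return total / n


def ground(queries, text):
    """At inference, return the index of the query most similar to the text."""
    return max(range(len(queries)), key=lambda i: cosine(queries[i], text))
```

Under these assumptions, only image-text pairs are needed for training, and a box or mask would be read off the query returned by `ground`. The batch must contain at least two images for the loss to carry any signal, since negatives come from the other images.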