Visual Grounding With Joint Multimodal Representation and Interaction
Hong Zhu, Qingyang Lu, Lei Xue, Mogen Xue, Guanglin Yuan, Bineng Zhong
DOI: https://doi.org/10.1109/tim.2023.3324362
IF: 5.6
2023-11-10
IEEE Transactions on Instrumentation and Measurement
Abstract: This article tackles the challenging yet significant task of grounding a natural language query to the corresponding region of an image. The main challenge in visual grounding is to model the correspondence between the visual context and the semantic concept referred to by the language expression, i.e., multimodal fusion. However, current fusion module designs have an inherent deficiency: they cannot unify visual and linguistic feature embeddings into the same semantic space. To address this issue, we present a novel and effective visual grounding framework based on joint multimodal representation and interaction (JMRI). Specifically, we propose to perform image–text alignment in a multimodal embedding space learned by a large-scale foundation model, so as to obtain semantically unified joint representations. Furthermore, a transformer-based deep interactor is designed to capture intramodal and intermodal correlations, enabling our model to highlight localization-relevant cues for accurate reasoning. By freezing the pretrained vision-language foundation model and updating only the other modules, we achieve the best performance with the lowest training cost. Extensive experimental results on five benchmark datasets, with quantitative and qualitative analysis, show that the proposed method performs favorably against state-of-the-art methods.
engineering, electrical & electronic; instruments & instrumentation
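
The abstract describes aligning image and text features in one shared embedding space so that a query can be matched to a region. The toy sketch below illustrates only that general idea and is not the JMRI implementation: the three-dimensional vectors are made-up stand-ins for embeddings that a frozen foundation model would produce, and the names `cosine`, `softmax`, `ground`, and the `temperature` value are hypothetical choices for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=0.1):
    """Numerically stable softmax; the temperature is an assumed value."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ground(text_emb, region_embs):
    """Return the index of the best-matching region and per-region weights.

    Assumes text and region embeddings already live in the same space,
    which is the property a joint multimodal representation provides.
    """
    sims = [cosine(text_emb, r) for r in region_embs]
    weights = softmax(sims)
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, weights

# Made-up embeddings: one query and three candidate image regions.
text_emb = [0.9, 0.1, 0.0]
region_embs = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0],
]

best, weights = ground(text_emb, region_embs)
print("best region index:", best)
print("region weights:", [round(w, 4) for w in weights])
```

In this toy setting the second region points in nearly the same direction as the query, so it receives almost all of the weight. A learned interactor, as the abstract describes, would replace this fixed similarity rule with trainable attention over both modalities.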