MetaVG: A Meta-Learning Framework for Visual Grounding

Chao Su, Zhi Li, Tianyi Lei, Dezhong Peng, Xu Wang
DOI: https://doi.org/10.1109/lsp.2023.3344374
2024-01-01
IEEE Signal Processing Letters
Abstract: Visual grounding aims to localize objects in images using natural language expressions. This task can be challenging when the distributions of the training and testing sets differ significantly. Existing methods tend to focus excessively on the training set, which can lead to overfitting, especially in small-sample scenarios. To address this issue, in this letter, we present a novel meta-learning-based training framework for visual grounding called MetaVG. Our approach leverages bi-level optimization to adapt quickly to the target task, thereby alleviating the overfitting issue. To train MetaVG effectively, we propose a novel training mechanism called Random Uncorrelated Meta-training (RUM). During data separation, this mechanism randomly loads uncorrelated batches as the support and query sets, respectively, and then uses bi-level optimization to train the model directly on visual grounding datasets. Comprehensive experiments on four widely used datasets, as well as in small-sample scenarios, validate the efficacy of MetaVG.
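The abstract does not give the update rule, so the following is only a minimal sketch of the general idea of bi-level optimization with randomly drawn support and query batches, not the authors' implementation. It assumes a toy one-parameter regression model in place of a grounding network, a first-order approximation of the outer gradient, and independent random sampling as one possible reading of "uncorrelated" batches. All names (`grad_mse`, `meta_step`, `train`) and the learning rates are hypothetical.

```python
import random

def grad_mse(w, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def meta_step(w, support, query, inner_lr=0.05, outer_lr=0.05):
    """One bi-level update (first-order approximation, assumed here)."""
    # Inner level: adapt a temporary copy of the weight on the support batch.
    w_fast = w - inner_lr * grad_mse(w, support)
    # Outer level: evaluate the adapted weight on the query batch and
    # apply that gradient to the original weight.
    return w - outer_lr * grad_mse(w_fast, query)

def train(data, steps=500, batch_size=8, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        # Draw two independent random batches as support and query sets
        # (an assumed stand-in for the random uncorrelated loading in RUM).
        support = rng.sample(data, batch_size)
        query = rng.sample(data, batch_size)
        w = meta_step(w, support, query)
    return w

# Toy data for y = 3x, standing in for a visual grounding dataset.
data = [(k / 10.0, 3.0 * k / 10.0) for k in range(-20, 21)]
w_final = train(data)
```

In this sketch the query gradient is taken at the adapted weight but applied to the original weight, which is what makes the update bi-level rather than two ordinary gradient steps in sequence.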