Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension

Peihan Miao, Wei Su, Gaoang Wang, Xuewei Li, Xi Li
DOI: https://doi.org/10.1109/tip.2023.3334099
IF: 10.6
2024-01-01
IEEE Transactions on Image Processing
Abstract: As an important and challenging problem in vision-language tasks, referring expression comprehension (REC) generally requires a large amount of multi-grained information of visual and linguistic modalities to realize accurate reasoning. In addition, due to the diversity of visual scenes and the variation of linguistic expressions, some hard examples have much more abundant multi-grained information than others. How to aggregate multi-grained information from different modalities and extract abundant knowledge from hard examples is crucial in the REC task. To address aforementioned challenges, in this paper, we propose a Self-paced Multi-grained Cross-modal Interaction Modeling framework, which improves the language-to-vision localization ability through innovations in network structure and learning mechanism. Concretely, we design a transformer-based multi-grained cross-modal attention, which effectively utilizes the inherent multi-grained information in visual and linguistic encoders. Furthermore, considering the large variance of samples, we propose a self-paced sample informativeness learning to adaptively enhance the network learning for samples containing abundant multi-grained information. The proposed framework significantly outperforms state-of-the-art methods on widely used datasets, such as RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame datasets, demonstrating the effectiveness of our method.