Referring Expression Comprehension Based on Cross Modal Feature Fusion and Iterative Reasoning.

Chao Zhang, Wei Wu, Yu Zhao
DOI: https://doi.org/10.1007/978-3-031-46314-3_26
2023-01-01
Abstract: Referring Expression Comprehension is a multimodal task that spans two fields: Computer Vision and Natural Language Processing. Specifically, given an image and a natural language expression, the task is to locate the image region that corresponds to the description. This paper addresses two problems in current methods: they cannot effectively fuse visual and textual features in the multimodal alignment stage, and they cannot effectively utilize visual and textual information in the prediction stage. Two improvements are proposed: multimodal feature fusion and iterative reasoning based on a multimodal attention mechanism. In the multimodal feature fusion stage, three feature fusion modules fuse visual and textual features from different perspectives to obtain rich visual and textual information; in the iterative reasoning stage, visual and textual features are accessed several times to gradually refine the predicted target region. To verify the performance of the proposed method, extensive experiments were conducted on three public datasets.
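The abstract does not give the details of the iterative reasoning stage, so the following is only a minimal sketch of the general idea of accessing visual and textual features several times. It is not the authors' implementation. The function names (`attend`, `iterative_reasoning`), the mean-pooled initial query, the residual update, the feature sizes, and the random inputs are all assumptions made for illustration; the fusion modules and the box regression head are omitted.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attend(query, features):
    """Scaled dot-product attention of one query vector over a feature set.

    query: (d,), features: (n, d). Returns a (d,) context vector.
    """
    scores = features @ query / np.sqrt(query.shape[0])
    return softmax(scores) @ features


def iterative_reasoning(visual, textual, steps=3):
    """Refine a query by attending to textual, then visual features, `steps` times.

    Hypothetical scheme: the query starts as the mean textual feature and is
    updated with a residual connection followed by L2 normalisation.
    """
    query = textual.mean(axis=0)
    history = []
    for _ in range(steps):
        query = query + attend(query, textual)  # revisit the expression
        query = query + attend(query, visual)   # revisit the image regions
        query = query / np.linalg.norm(query)   # keep the scale bounded
        history.append(query.copy())
    return query, history


# Toy inputs (assumed shapes): 49 image regions and 6 word tokens, 16-dim each.
rng = np.random.default_rng(0)
visual = rng.normal(size=(49, 16))
textual = rng.normal(size=(6, 16))
query, history = iterative_reasoning(visual, textual, steps=3)
```

In a full model, the refined `query` would feed a prediction head that regresses the target box; each entry of `history` corresponds to one reasoning step and could be supervised or inspected separately.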