CroMIC-QA: The Cross-Modal Information Complementation Based Question Answering

Shun Qian, Bingquan Liu, Chengjie Sun, Zhen Xu, Lin Ma, Baoxun Wang
DOI: https://doi.org/10.1109/tmm.2023.3326616
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: This paper proposes a new multi-modal question-answering task, named Cross-Modal Information Complementation based Question Answering (CroMIC-QA), to promote exploration of how to bridge the semantic gap between visual and linguistic signals. The task is inspired by a common phenomenon: in most user-generated QA scenarios, the given textual question is incomplete on its own, so the semantics of the text and the accompanying image must be merged to infer the complete real question. In this work, the CroMIC-QA task is first formally defined and compared with the classic Visual Question Answering (VQA) task. On this basis, a dedicated dataset, CroMIC-QA-Agri, is collected from an online QA community in the agriculture domain. A group of experiments is conducted on this dataset, with typical multi-modal deep architectures implemented and compared. The experimental results show that appropriate text/image representations and text-image semantic interaction methods are effective in improving the performance of the framework.