Finetuning Language Models for Multimodal Question Answering

Xin Zhang, Wen Xie, Ziqi Dai, Jun Rao, Haokun Wen, Xuan Luo, Meishan Zhang, Min Zhang
DOI: https://doi.org/10.1145/3581783.3612837
2023-01-01
Abstract: To achieve multimodal intelligence, AI must be able to process and respond to inputs from multimodal sources. However, many current question answering models are limited to specific types of answers, such as yes/no and numbers, and require additional human assessments. Recently, the Visual-Text Question Answering (VTQA) dataset has been proposed to fill this gap. In this paper, we conduct an exhaustive analysis and exploration of this task. Specifically, we implement a T5-based multimodal generative network that overcomes the limitations of the traditional labeling space and provides more freedom in responses. Our approach achieves the best performance in both the English and Chinese tracks of the VTQA challenge.