Abstract: Visual Text Question Answering (VTQA) is a challenging task that requires answering questions about visual content by combining image understanding and language comprehension. The main objective is to develop models that give accurate, relevant answers based on complementary information from both images and text, as well as the semantic meaning of the question. Despite ongoing efforts, the VTQA task presents several challenges, including multimedia alignment, multi-step cross-media reasoning, and handling open-ended questions. This paper introduces a novel generative framework called VTQAGen, which uses a Multi-modal Attention Layer to combine image-text pairs with question inputs, and a BART-based encoder-decoder generative model for reasoning and entity extraction over both images and text. Faster R-CNN is employed to extract visual regions of interest, while BART's encoder is modified to handle multi-modal interaction. The decoder stage uses the shift-predict approach and introduces step-based logits fusion, a step-based ensemble method that improves stability, accuracy, and generalization ability. In the experiments, the proposed VTQAGen demonstrates superior performance on the test set, securing second place in the ACM Multimedia Visual Text Question Answer Challenge.

VTQAGen: BART-based Generative Model For Visual Text Question Answering
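
The abstract does not spell out how step-based logits fusion works, so the following is only a minimal sketch of one plausible reading: at every decoding step, each ensemble member scores the same shared prefix, the per-model logits are averaged, and the fused logits select the next token. The averaging rule, the greedy selection, the `fused_greedy_decode` name, and the toy step functions are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a step function maps the tokens decoded so far
# to next-token logits over the vocabulary.
StepFn = Callable[[Sequence[int]], List[float]]


def fused_greedy_decode(
    step_fns: Sequence[StepFn], bos_id: int, eos_id: int, max_len: int = 16
) -> List[int]:
    """Greedy decoding where logits are fused across models at each step."""
    tokens = [bos_id]
    for _ in range(max_len):
        # Every ensemble member scores the same shared prefix.
        all_logits = [fn(tokens) for fn in step_fns]
        # Assumed fusion rule: element-wise mean over the ensemble.
        fused = [sum(col) / len(col) for col in zip(*all_logits)]
        # Pick the highest-scoring token (first index wins on ties).
        next_id = max(range(len(fused)), key=fused.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


# Toy stand-ins for two fine-tuned decoders over a 4-token vocabulary
# (0 = BOS, 3 = EOS). Both emit EOS once the prefix holds three tokens.
def toy_model_a(prefix: Sequence[int]) -> List[float]:
    return [0.0, 2.0, 0.0, 1.0] if len(prefix) < 3 else [0.0, 0.0, 0.0, 5.0]


def toy_model_b(prefix: Sequence[int]) -> List[float]:
    return [0.0, 0.0, 3.0, 0.5] if len(prefix) < 3 else [0.0, 0.0, 0.0, 5.0]


fused_tokens = fused_greedy_decode([toy_model_a, toy_model_b], bos_id=0, eos_id=3)
single_tokens = fused_greedy_decode([toy_model_a], bos_id=0, eos_id=3)
```

In this toy setup the fused sequence differs from what `toy_model_a` produces alone, because the ensemble shares one prefix and is fused at each step rather than having complete outputs voted on after decoding. That per-step sharing is the property the sketch is meant to show.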