MoCA: Incorporating domain pretraining and cross attention for textbook question answering
Fangzhi Xu, Qika Lin, Jun Liu, Lingling Zhang, Tianzhe Zhao, Qi Chai, Yudai Pan, Yi Huang, Qianying Wang
DOI: https://doi.org/10.1016/j.patcog.2023.109588
IF: 8
2023-04-09
Pattern Recognition
Abstract: Textbook Question Answering (TQA) is a complex multimodal task that requires inferring answers from large context descriptions and abundant diagrams. Compared with Visual Question Answering (VQA), TQA involves a large amount of uncommon terminology and a variety of diagram inputs. This poses new challenges to the representation capability of language models for domain-specific spans. It also requires the model to take full advantage of the complementary information in different diagram types, which pushes the multimodal fusion task to a more complex level. To tackle these issues, we propose a novel model named MoCA, which incorporates Multi-stage domain pretraining and Cross-guided multimodal Attention for the TQA task. First, we introduce a multi-stage domain pretraining module that conducts unsupervised post-pretraining with a span mask strategy, followed by supervised pre-finetuning. For domain post-pretraining in particular, we propose a heuristic generation algorithm to exploit the terminology corpus. Second, to fully consider the rich inputs of context and diagrams, we propose a cross-guided multimodal attention mechanism that updates the features of the text, the question diagram, and the instructional diagram following a progressive strategy. Further, a dual gating mechanism is adopted to improve the model ensemble over three background retrievals. The experimental results show the superiority of our model, which outperforms state-of-the-art methods on both the validation and test splits. Ablation and comparison experiments further verify the effectiveness of each module proposed in our model.
computer science, artificial intelligence; engineering, electrical & electronic
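
The abstract mentions unsupervised post-pretraining with a span mask strategy but gives no algorithmic detail. The following is a minimal sketch of generic contiguous span masking, not the authors' heuristic generation algorithm. The function name `span_mask`, the `[MASK]` placeholder, and the parameters `mask_ratio` and `max_span_len` are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # assumed placeholder token, not taken from the paper


def span_mask(tokens, mask_ratio=0.15, max_span_len=3, seed=0):
    """Mask random contiguous spans covering roughly mask_ratio of tokens.

    Returns (masked_tokens, labels), where labels[i] holds the original
    token at each masked position and None elsewhere.
    """
    rng = random.Random(seed)
    n = len(tokens)
    # Number of tokens to mask, clamped to [1, n] for non-empty input.
    budget = min(n, max(1, round(n * mask_ratio))) if n else 0
    masked = list(tokens)
    labels = [None] * n
    covered = 0
    attempts = 0
    # Bound the number of attempts so the loop always terminates.
    while covered < budget and attempts < 10 * n:
        attempts += 1
        span_len = min(rng.randint(1, max_span_len), budget - covered)
        start = rng.randrange(0, n - span_len + 1)
        span = range(start, start + span_len)
        # Skip spans that overlap a position that is already masked.
        if any(labels[i] is not None for i in span):
            continue
        for i in span:
            labels[i] = tokens[i]
            masked[i] = MASK
        covered += span_len
    return masked, labels


if __name__ == "__main__":
    sentence = "the chloroplast converts light energy into chemical energy"
    masked, labels = span_mask(sentence.split(), mask_ratio=0.3)
    print(masked)
    print(labels)
```

A terminology-aware variant would presumably choose span boundaries from a domain term list rather than uniformly at random; that step is omitted here because the abstract does not describe it.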