Essay-Anchor Attentive Multi-Modal Bilinear Pooling for Textbook Question Answering

Juzheng Li, Hang Su, Jun Zhu, Bo Zhang
DOI: https://doi.org/10.1109/ICME.2018.8486468
2018-01-01
Abstract: Textbook Question Answering (TQA) [1] is a recently proposed task of answering arbitrary questions from middle school curricula. Its particular challenge is understanding long essays in addition to images. Bilinear models [2], [3] are effective at learning high-level associations between questions and images, but handle long essays inefficiently. In this paper, we propose Essay-anchor Attentive Multi-modal Bilinear pooling (EAMB), a novel method that encodes long essays into the joint space of questions and images. The essay-anchors, embedded from the keywords, represent the essay information in a latent space. We propose a novel network architecture that pays special attention to the keywords in the questions, thereby encoding the essay information into the question features and thus into the joint space with the images. We then use bilinear models to extract the multi-modal interactions and obtain the answers. EAMB exploits the redundancy of the pre-trained word embedding space to represent the essay-anchors, which avoids the extra learning difficulty that comes with large network structures. Quantitative and qualitative experiments demonstrate the superior performance of EAMB on the TQA dataset.
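The abstract does not give the network's equations, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that question words are weighted by their similarity to essay-anchor vectors, and that the attended question feature is fused with an image feature through one common low-rank form of bilinear pooling (an elementwise product of two linear projections). The function names, the max-similarity scoring rule, and the projection matrices `U` and `V` are all hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vec) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def anchor_attention(word_vecs, anchor_vecs):
    """Pool question word vectors, weighting each word by its best
    similarity to any essay-anchor (assumed scoring rule)."""
    scores = [max(dot(w, a) for a in anchor_vecs) for w in word_vecs]
    weights = softmax(scores)
    dim = len(word_vecs[0])
    pooled = [sum(wt * w[i] for wt, w in zip(weights, word_vecs))
              for i in range(dim)]
    return pooled, weights


def low_rank_bilinear(q, v, U, V):
    """Low-rank bilinear pooling: project both inputs, then take the
    elementwise product (U and V are hypothetical learned matrices)."""
    return [a * b for a, b in zip(matvec(U, q), matvec(V, v))]


# Toy example: three question words, one essay-anchor, 2-d embeddings.
words = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
anchors = [[0.0, 2.0]]
question_feat, attn = anchor_attention(words, anchors)

identity = [[1.0, 0.0], [0.0, 1.0]]
image_feat = [3.0, 4.0]
joint = low_rank_bilinear(question_feat, image_feat, identity, identity)
```

In this toy setup the second word aligns best with the anchor and therefore receives the largest attention weight, so the pooled question feature leans toward it before fusion with the image feature.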