Transformer-based Sparse Encoder and Answer Decoder for Visual Question Answering

Longkun Peng, Gaoyun An, Qiuqi Ruan
DOI: https://doi.org/10.1109/icsp56322.2022.9965298
2022-01-01
Abstract: Visual Question Answering (VQA) requires jointly understanding images and questions. Existing Transformer-based methods achieve excellent performance by associating questions with image region objects and directly using a special classification token for answer prediction. However, answering a question requires focusing on only a few specific keywords and image regions, and computing attention between every question word and every image region object introduces unnecessary noise. Moreover, the information from the two modalities cannot be fully utilized when the classification token is used directly to predict the answer. To this end, we propose a Transformer-based Sparse Encoder and Answer Decoder (SEAD) model for visual question answering, in which a two-stream sparse Transformer module based on co-attention is built to enhance the most relevant visual features and textual descriptions across modalities. Furthermore, a single-step answer decoder is proposed to fully exploit the information of both modalities in the answer prediction stage, and a strategy is designed that uses the ground truth to correct the visual relevance scores in the decoder so that it focuses on salient objects in the image. Experimental results on the VQA v2.0 benchmark dataset show that our model performs strongly.
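The idea of sparse co-attention can be illustrated with a minimal sketch. The abstract does not say how sparsity is enforced, so the top-k selection below is an assumption made for illustration, not the SEAD method itself; the function name `sparse_cross_attention` and the parameter `k` are hypothetical. Each query (for example, a question word) attends only to its k highest-scoring keys (for example, image regions), and all other attention weights are dropped.

```python
import math


def sparse_cross_attention(queries, keys, values, k):
    """Top-k sparse cross-attention (illustrative assumption, not the paper's exact rule).

    queries: list of query vectors (e.g. question word features)
    keys, values: lists of key/value vectors (e.g. image region features)
    k: number of keys each query is allowed to attend to
    """
    d = len(keys[0])
    dv = len(values[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in keys]
        # Indices of the k largest scores; every other key is masked out.
        keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        # Softmax over the kept scores only (max subtracted for numerical stability).
        m = max(scores[i] for i in keep)
        exp = {i: math.exp(scores[i] - m) for i in keep}
        z = sum(exp.values())
        # Weighted sum of the kept value vectors.
        outputs.append([sum(exp[i] / z * values[i][j] for i in keep) for j in range(dv)])
    return outputs


# Toy example: one question word attending over three image regions.
word = [[1.0, 0.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
region_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
attended = sparse_cross_attention(word, regions, region_values, k=2)
```

With `k=2`, the third region never contributes to `attended`, however the remaining weights are distributed. In a two-stream setting the same function would be applied in both directions, with words querying regions and regions querying words.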