A Study of Visual Question Answering Techniques Based on Collaborative Multi-Head Attention

Yingli Yang, Jingxuan Jin, De Li
DOI: https://doi.org/10.1109/acctcs58815.2023.00037
2023-01-01
Abstract: In visual question answering (VQA), the dominant recent approach is to pre-train a unified model and then fine-tune it. Such a model typically uses a transformer to fuse image and text information. To improve performance on VQA, this paper proposes a transformer architecture based on a collaborative multi-head attention mechanism, which addresses the key/value projection redundancy in the transformer's multi-head attention. In addition, the Swin Transformer is used as the image feature extractor to capture multi-scale image information. Validation experiments on the VQA v2 dataset show that applying collaborative multi-head attention and the Swin Transformer backbone to the VQA model effectively improves answer accuracy.
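The abstract does not spell out the collaborative attention formulation, so the following is only a minimal sketch of the general idea as it is commonly described: all heads share one pair of projection matrices, and each head re-weights the shared dimensions with its own mixing vector instead of learning separate projections. Everything here is an assumption made for illustration, not the authors' implementation: the function and parameter names (`collaborative_attention`, `w_q`, `w_k`, `w_v_heads`, `mixing`), the choice to share the query and key projections while keeping per-head value projections, the scaling factor, and the toy dimensions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Small random matrix, standing in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def collaborative_attention(x, w_q, w_k, w_v_heads, mixing):
    """Self-attention where heads share w_q and w_k (assumed formulation).

    x:         (n, d_in) input tokens
    w_q, w_k:  (d_in, d_shared) projections shared by every head
    w_v_heads: one (d_in, d_v) value projection per head
    mixing:    one length-d_shared mixing vector per head
    """
    q = matmul(x, w_q)                       # computed once for all heads
    k = matmul(x, w_k)                       # computed once for all heads
    k_t = [list(col) for col in zip(*k)]     # (d_shared, n)
    scale = math.sqrt(len(w_q[0]))
    head_outputs = []
    for m_i, w_v in zip(mixing, w_v_heads):
        # Head-specific re-weighting of the shared query dimensions.
        q_i = [[qv * mv for qv, mv in zip(row, m_i)] for row in q]
        scores = matmul(q_i, k_t)            # (n, n)
        attn = [softmax([s / scale for s in row]) for row in scores]
        head_outputs.append(matmul(attn, matmul(x, w_v)))
    # Concatenate the per-head outputs along the feature axis.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(x))]


# Toy usage with hypothetical sizes: 4 tokens, 2 heads.
rng = random.Random(0)
n, d_in, d_shared, d_v, heads = 4, 8, 6, 3, 2
x = rand_matrix(n, d_in, rng)
w_q = rand_matrix(d_in, d_shared, rng)
w_k = rand_matrix(d_in, d_shared, rng)
w_v_heads = [rand_matrix(d_in, d_v, rng) for _ in range(heads)]
mixing = rand_matrix(heads, d_shared, rng)
out = collaborative_attention(x, w_q, w_k, w_v_heads, mixing)
```

In this sketch the shared projections are computed once and only the small mixing vectors differ per head, which is where the reduction in redundant projection parameters would come from. A production model would use a tensor library and a learned output projection, both omitted here for brevity.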