A Lightweight Visual Question Answering Model based on Semantic Similarity

Zhiming He, Jingping Zeng
DOI: https://doi.org/10.1145/3490725.3490736
2021-09-17
Abstract: The key to visual question answering is learning the semantic alignment between image objects and question words. Typical methods use an attention mechanism to achieve this goal. However, calculating the attention weights of image objects and question keywords requires an attention function, which usually needs a large number of parameters. To address this issue, this paper proposes a lightweight visual question answering model based on semantic similarity. First, the image features and question features are mapped to a common visual-semantic space, and a multi-modal semantic similarity matrix is constructed using cosine similarity. Then, the multi-level potential semantic space is further explored by using a multi-channel convolutional neural network to map the semantic similarity matrix into two different attention distributions. Finally, the joint representation of image and text is learned through multimodal fusion and fed into a classifier to predict the correct answer. The proposed method thus achieves co-attention with very few parameters. The experimental results show that the proposed model can effectively learn multimodal semantic alignment with a small number of parameters and achieves competitive or better performance than state-of-the-art methods on the VQA v2.0 dataset.
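The first two steps of the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes the image-region and question-word features have already been projected into a shared space, and it builds the cosine similarity matrix from them. The paper maps this matrix to two attention distributions with a multi-channel convolutional neural network; the sketch swaps in a parameter-free stand-in (max-pooling followed by softmax) purely for illustration. All function names and toy feature values are hypothetical.

```python
import math

def cosine_similarity_matrix(image_feats, word_feats):
    """Return S where S[i][j] is the cosine similarity of region i and word j."""
    def unit(v):
        # L2-normalise; fall back to 1.0 to avoid dividing by zero
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]
    regions = [unit(v) for v in image_feats]
    words = [unit(w) for w in word_feats]
    return [[sum(a * b for a, b in zip(r, w)) for w in words] for r in regions]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def co_attention(sim):
    """Assumed stand-in for the paper's multi-channel CNN:
    max-pool the similarity matrix along each axis, then apply softmax."""
    image_attn = softmax([max(row) for row in sim])       # one weight per region
    word_attn = softmax([max(col) for col in zip(*sim)])  # one weight per word
    return image_attn, word_attn

# Toy features, assumed to be already projected into a shared 3-d space.
image_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
word_feats = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

sim = cosine_similarity_matrix(image_feats, word_feats)
image_attn, word_attn = co_attention(sim)
```

In this toy case the first region matches the first word exactly, so both receive the largest attention weight. The similarity step itself has no learned parameters, which is consistent with the abstract's claim that co-attention can be obtained cheaply; the learned part in the paper lies in the projections and the convolutional mapping, which this sketch does not reproduce.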