Simple and Effective Visual Question Answering in a Single Modality

Yuetan Lin, Zhangyang Pang, Yanan Li, Donghui Wang
DOI: https://doi.org/10.1109/icip.2016.7532764
2016-01-01
Abstract: Visual question answering (VQA) is a result of major advances in computer vision and natural language processing, and it requires a deep understanding of images and questions and effective integration of the two. Current work on VQA simply concatenates visual and textual features or compares them via dot product, which cannot eliminate the semantic difference between them. We argue for transferring the VQA problem into a single modality and propose a simple and effective baseline method that uses the properties of Long Short-Term Memory (LSTM) to filter, from generic descriptions of the image, the particular information specified by the question. We provide thorough analysis and extensive experiments on the VQA benchmark dataset to compare the performance of different methods and demonstrate the effectiveness of the proposed method.
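For illustration only, the two fusion strategies the abstract attributes to earlier work (concatenating features, or comparing them via dot product) can be sketched as below. This is a minimal sketch and not the authors' code: the function names `concat_fusion` and `dot_fusion` and the toy feature values are assumptions made for this example.

```python
# Illustrative sketch only: toy vectors stand in for real image/question features.

def concat_fusion(visual, textual):
    """Join the visual and textual feature vectors end to end."""
    return list(visual) + list(textual)


def dot_fusion(visual, textual):
    """Compare the two feature vectors via their dot product."""
    if len(visual) != len(textual):
        raise ValueError("dot product needs vectors of equal length")
    return sum(v * t for v, t in zip(visual, textual))


visual = [0.5, 1.0, -2.0]   # hypothetical image features
textual = [2.0, 0.0, 1.0]   # hypothetical question features

fused = concat_fusion(visual, textual)   # one longer vector, modalities kept side by side
score = dot_fusion(visual, textual)      # a single similarity value
```

In both cases the visual and textual features remain in their own representation spaces, which is the semantic difference the abstract says these approaches cannot eliminate.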