Hierarchical Attention Networks for Multimodal Machine Learning

Haotian Liang, Zhanqing Wang
DOI: https://doi.org/10.1088/1742-6596/2218/1/012020
2022-03-01
Journal of Physics: Conference Series
Abstract: The Visual Question Answering (VQA) task is to infer the correct answer to a free-form question about a given image. The task is challenging because it requires a model to handle both visual and textual information. Most successful approaches to VQA rely on attention mechanisms, which can capture inter-modal and intra-modal dependencies. In this paper, we propose a new attention-based model for VQA. We use question information to guide the model to concentrate on specific regions and attributes and to reason hierarchically toward the answer. We also propose a multi-modal fusion strategy based on the co-attention method to fuse visual and textual information. Extensive experiments on the VQA-v2.0 dataset show that, under the same experimental conditions, our method outperforms some state-of-the-art methods.