Dynamic Alternative Attention for Visual Question Answering

Liu Xumeng, Guo Wenya, Zhang Yuhao, Zhang Ying
DOI: https://doi.org/10.1007/978-3-031-20309-1_33
2022-01-01
Abstract: In recent years, researchers have focused on Visual Question Answering (VQA) because of its numerous real-world applications. Visual attention mechanisms are widely used to assist answer prediction by selecting important image regions. Nevertheless, few works consider how the model progressively selects informative regions. To simulate the dynamic reasoning process of human beings, the existing method, AiR-M, decomposes answer prediction into a sequence of reasoning steps, each consisting of a reasoning operation and a corresponding attention map. However, AiR-M ignores the fact that the number of reasoning steps varies across questions and pads the reasoning step sequence with invalid steps, which introduces inaccurate information into answer prediction and thus limits model performance. In this paper, we propose a Dynamic Alternative Attention model ($\textrm{DA}^{2}$) to address this problem. Specifically, $\textrm{DA}^{2}$ consists of a feature extraction module, denoted $\textrm{DA}^{2}$-f, and a training module, denoted $\textrm{DA}^{2}$-t. $\textrm{DA}^{2}$-f provides the answer prediction process with more accurate visual information by adaptively filtering out the visual regions of invalid steps. $\textrm{DA}^{2}$-t improves model training by masking out the attention maps corresponding to invalid steps in the objective function. Experimental results on the GQA dataset verify the effectiveness of the proposed method.
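The idea of masking out invalid steps in the objective function can be illustrated with a minimal sketch. The abstract does not give the actual loss, so this sketch assumes a per-step cross-entropy between predicted and target attention maps, and assumes the valid steps come first with padding at the end. The function names and the `num_valid_steps` parameter are hypothetical, chosen only for illustration.

```python
import math


def masked_attention_loss(pred_maps, target_maps, num_valid_steps, eps=1e-8):
    """Cross-entropy between attention maps, averaged over valid steps only.

    pred_maps, target_maps: lists of T attention maps, each a list of
        R region weights (assumed layout, not taken from the paper).
    num_valid_steps: number of real (non-padded) reasoning steps.
    """
    total = 0.0
    for step, (pred, target) in enumerate(zip(pred_maps, target_maps)):
        if step >= num_valid_steps:
            continue  # padded (invalid) step: contributes nothing to the loss
        total += -sum(t * math.log(p + eps) for p, t in zip(pred, target))
    # Normalise by the number of valid steps, guarding against zero.
    return total / max(num_valid_steps, 1)


def unmasked_attention_loss(pred_maps, target_maps, eps=1e-8):
    """Baseline for comparison: every step, including padding, is penalised."""
    return masked_attention_loss(pred_maps, target_maps, len(pred_maps), eps)
```

With the mask in place, whatever the model predicts at a padded step has no effect on the loss, whereas the unmasked baseline penalises the model for failing to match a meaningless padded target.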