KM4: Visual reasoning via Knowledge Embedding Memory Model with Mutual Modulation

Wenbo Zheng, Lan Yan, Chao Gou, Fei-Yue Wang
DOI: https://doi.org/10.1016/j.inffus.2020.10.007
IF: 18.6
2021-03-01
Information Fusion
Abstract: Visual reasoning is a special kind of visual question answering that is essentially multi-step and compositional and requires intensive text-visual interaction. The most important and challenging problem in visual reasoning is designing an effective and robust model. Two challenges must be overcome. First, textual and visual information must be considered jointly to make accurate inferences. Second, existing deep learning-based works are often too specific to a particular task. To address these issues, we propose a knowledge memory embedding model with mutual modulation for visual reasoning. This approach not only learns knowledge-based embeddings derived from a key–value memory network to make full, joint use of textual and visual information, but also exploits prior knowledge, through knowledge-based representation learning, to improve performance and to extend to other general reasoning tasks. Experimental results on four benchmarks show that the proposed approach significantly improves performance compared with other state-of-the-art methods and confirm the robustness of our model. Most importantly, we apply our model to four reasoning tasks and show experimentally that it effectively supports relational reasoning and improves performance on several tasks and datasets.
computer science, artificial intelligence, theory & methods
What problem does this paper attempt to address?
This paper addresses two main challenges in visual reasoning:

1. **Joint consideration of text and visual information**: To reason accurately, the model must consider text and visual information simultaneously. It therefore needs to fuse the information from the image and the question text effectively before making inferences.
2. **Generalization ability of the model**: Existing deep-learning-based visual reasoning models are often too specific to one task and hard to generalize to other visual reasoning tasks. Designing a robust model that carries over to other tasks is therefore an important challenge.

To solve these problems, the authors propose a new Knowledge-embedded Memory Model (KM4), which improves visual reasoning in two ways:

- **Mutual Modulation**: By alternately modulating image and language information at each step, the model makes fuller use of text and visual information.
- **Knowledge-based Key-Value Memory Network**: By introducing a knowledge base, the model uses prior knowledge to enhance representation learning, which improves performance and supports other general reasoning tasks.

Specifically, the main contributions of the KM4 model are:

- An end-to-end, robust, and effective knowledge-embedded memory model that explicitly makes full use of text and visual information throughout the process.
- A novel key-value memory network that significantly improves performance on visual reasoning tasks and, through knowledge-based representation learning, lets the model generalize to other reasoning tasks.

The experimental results show that the KM4 model outperforms existing methods on four benchmark datasets (CLEVR, NLVR, NLVR2, and GQA), with an average accuracy of 99.9% on CLEVR. The model also achieves state-of-the-art results on four reasoning tasks (Diagnosing Visual Reasoning, Referring Expression Comprehension, Relational and Analogical Visual Reasoning, and Visual Entailment).

In summary, this paper aims to solve the problems of information fusion and model generalization in visual reasoning by introducing knowledge embedding and mutual modulation mechanisms, thereby improving the performance and robustness of visual reasoning models.
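To make the two mechanisms named above more concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary does not give the KM4 equations, so the code assumes a standard key-value memory read (softmax attention over keys, weighted sum of values) and a FiLM-style scale-and-shift as a stand-in for "modulation". All function names, the `params` layout, and the `(1 + gamma)` parameterization are hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def memory_read(query, keys, values):
    """Generic key-value memory read (assumed form, not taken from the paper).

    Scores each key against the query, normalizes the scores with softmax,
    and returns the attention-weighted sum of the values plus the weights.
    """
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    read = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return read, weights


def modulate(features, conditioning, scale_w, shift_w):
    """FiLM-style modulation (assumption): features * (1 + gamma) + beta.

    gamma and beta are linear functions of the conditioning vector;
    scale_w and shift_w each hold one weight row per feature.
    """
    gamma = [dot(row, conditioning) for row in scale_w]
    beta = [dot(row, conditioning) for row in shift_w]
    return [f * (1.0 + g) + b for f, g, b in zip(features, gamma, beta)]


def mutual_modulation(visual, text, params, steps=2):
    """Alternate the direction of modulation at each step.

    Text first modulates the visual features, then the updated visual
    features modulate the text features. `params` is a hypothetical dict
    of weight matrices keyed by direction.
    """
    for _ in range(steps):
        visual = modulate(visual, text, params["t2v_scale"], params["t2v_shift"])
        text = modulate(text, visual, params["v2t_scale"], params["v2t_shift"])
    return visual, text
```

In this sketch, a query that strongly matches one key returns a read vector close to that key's value, which is how prior knowledge stored in memory could be pulled into the representation. With all-zero weight matrices the modulation is the identity, so the alternating loop leaves both feature vectors unchanged; learned weights would let each modality rescale and shift the other.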