Abstract: Visual reasoning is a special kind of visual question answering that is inherently multi-step and compositional and requires intensive text-visual interaction. The central challenge of visual reasoning is to design an effective and robust model. Two obstacles stand in the way. First, textual and visual information must be considered jointly to make accurate inferences. Second, existing deep learning-based approaches are often too specific to a particular task. To address these issues, we propose a knowledge memory embedding model with mutual modulation for visual reasoning. The approach learns knowledge-based embeddings derived from a key–value memory network to make full, joint use of textual and visual information, and it exploits prior knowledge through knowledge-based representation learning so that the model can be applied to other general reasoning tasks. Experimental results on four benchmarks show that the proposed approach significantly improves performance over other state-of-the-art methods and confirm the robustness of our model. Most importantly, we apply our model to four reasoning tasks and show experimentally that it effectively supports relational reasoning and improves performance across several tasks and datasets.

What problem does this paper attempt to address?

The problems that this paper attempts to solve are two main challenges in Visual Reasoning:

1. **Joint consideration of text and visual information**: In order to perform visual reasoning accurately, the model must consider text and visual information simultaneously. This means that the model needs to effectively fuse the information from the image and the question text and make accurate inferences.
2. **Generalization ability of the model**: Existing deep-learning-based visual reasoning models are often too specific to a certain task and difficult to generalize to other visual reasoning tasks. Therefore, designing a robust visual reasoning model that can be generalized to other tasks is an important challenge.

To solve these problems, the author proposes a new Knowledge-embedded Memory Model (KM4), which improves the effect of visual reasoning in the following ways:

- **Mutual Modulation**: By alternately modulating image and language information at each step, the model can make fuller use of text and visual information.
- **Knowledge-based Key-Value Memory Network**: By introducing a knowledge base, the model can use prior knowledge to enhance representation learning, thereby improving performance and supporting other general reasoning tasks.

Specifically, the main contributions of the KM4 model include:

- Proposing an end-to-end, robust and effective knowledge-embedded memory model, which explicitly makes full use of text and visual information throughout the process.
- Designing a novel key-value memory network, which significantly improves the performance of visual reasoning tasks and enables the model to be generalized to other reasoning tasks through knowledge-based representation learning.

The experimental results show that the KM4 model outperforms existing methods on four benchmark datasets (CLEVR, NLVR, NLVR2 and GQA), especially achieving an average accuracy of 99.9% on the CLEVR dataset. In addition, the model has also achieved state-of-the-art results on four reasoning tasks (Diagnosing Visual Reasoning, Referring Expression Comprehension, Relational and Analogical Visual Reasoning and Visual Entailment).

In summary, this paper aims to solve the problems of information fusion and model generalization in visual reasoning by introducing knowledge embedding and mutual modulation mechanisms, thereby improving the performance and robustness of visual reasoning models.
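
The two mechanisms named in this summary, a key-value memory read and alternating (mutual) modulation of the two modalities, can be illustrated with a rough sketch. This is not the paper's implementation: the summary gives no equations, so the FiLM-style scale-and-shift form of `modulate`, the scalar weights `scale_w` and `shift_w`, and every function name below are assumptions chosen for illustration. The memory read follows the generic key-value pattern (softmax over query-key similarities, then a weighted sum of values), and features are plain Python lists to keep the example dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def kv_memory_read(query, keys, values):
    """Generic key-value memory read: address by key, return weighted values."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    read = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return read, weights


def modulate(target, source, scale_w, shift_w):
    """Assumed FiLM-style modulation: scale and shift `target` element-wise
    with coefficients computed from `source` (hypothetical form)."""
    return [(1.0 + scale_w * s) * t + shift_w * s for t, s in zip(target, source)]


def mutual_modulation(text, image, steps=2, scale_w=0.1, shift_w=0.1):
    """Alternate the direction of modulation at each step."""
    for _ in range(steps):
        image = modulate(image, text, scale_w, shift_w)  # language modulates vision
        text = modulate(text, image, scale_w, shift_w)   # vision modulates language
    return text, image


# Toy usage: two memory slots, then one round of mutual modulation.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
read, weights = kv_memory_read([5.0, 0.0], keys, values)
text_out, image_out = mutual_modulation([1.0, 2.0], [0.5, -1.0], steps=1)
```

In the toy usage the query aligns with the first key, so nearly all attention weight falls on the first slot and the read vector is close to the first value. In a real model the keys and values would be learned knowledge embeddings and the modulation coefficients would come from learned projections rather than fixed scalars.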

KM4: Visual reasoning via Knowledge Embedding Memory Model with Mutual Modulation

Knowledge-Embedded Mutual Guidance for Visual Reasoning

Webly Supervised Knowledge-Embedded Model for Visual Reasoning

Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering

Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering

Multi-Level Knowledge Injecting for Visual Commonsense Reasoning

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for Knowledge-based Visual Question Answering

Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks

Find The Gap: Knowledge Base Reasoning For Visual Question Answering

Explicit Knowledge Incorporation for Visual Reasoning

Explicit Knowledge-based Reasoning for Visual Question Answering

Perceptual Visual Reasoning with Knowledge Propagation

Toward Accurate Visual Reasoning with Dual-Path Neural Module Networks

Knowledge-Enhanced Visual Question Answering with Multi-modal Joint Guidance

See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning

Question guided multimodal receptive field reasoning network for fact-based visual question answering

Learning Visual Knowledge Memory Networks for Visual Question Answering

KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for visual commonsense reasoning

ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory