Abstract: Knowledge-based visual question answering requires external knowledge beyond visible content to answer the question correctly. One limitation of existing methods is that they focus more on modeling the inter-modal and intra-modal correlations, which entangles complex multimodal clues by implicit embeddings and lacks interpretability and generalization ability. The key challenge to solve the above problem is to separate the information and process it separately at the functional level. By reusing each processing unit, the generalization ability of the model to deal with different data can be increased. In this paper, we propose Independent Inference Units (IIU) for fine-grained multi-modal reasoning to decompose intra-modal information by the functionally independent units. Specifically, IIU processes each semantic-specific intra-modal clue by an independent inference unit, which also collects complementary information by communication from different units. To further reduce the impact of redundant information, we propose a memory update module to maintain semantic-relevant memory along with the reasoning process gradually. In comparison with existing non-pretrained multi-modal reasoning models on standard datasets, our model achieves a new state-of-the-art, enhancing performance by 3%, and surpassing basic pretrained multi-modal models. The experimental results show that our IIU model is effective in disentangling intra-modal clues as well as reasoning units to provide explainable reasoning evidence. Our code is available at https://github.com/Lilidamowang/IIU.

What problem does this paper attempt to address?

The paper targets three weaknesses in how existing knowledge-based Visual Question Answering (VQA) methods handle inter-modal and intra-modal correlations: complexity, lack of interpretability, and insufficient generalization ability. Specifically:

1. **Limitations of Existing Methods**: Current methods mainly focus on modeling inter-modal and intra-modal correlations. As a result, complex multi-modal clues become entangled in implicit embeddings, which hurts both interpretability and generalization.
2. **Key Challenges**: Overcoming these problems requires separating the information and processing each part separately at the functional level, which improves the model's generalization ability and interpretability.
3. **Solutions**: The authors propose Independent Inference Units (IIU) for fine-grained multi-modal reasoning, which decompose intra-modal information through functionally independent units. IIU assigns an independent inference unit to each semantic-specific intra-modal clue, and each unit collects complementary information by communicating with the other units. To further reduce the impact of redundant information, the authors also propose a memory update module that gradually maintains semantically relevant memory during the reasoning process.

### Specific Improvement Points

- **Information Separation at the Functional Level**: Separating information at the functional level, processing each part separately, and reusing each processing unit improves the model's generalization ability.
- **Reducing the Impact of Redundant Information**: The memory update module keeps semantic information from being distorted during reasoning and further reduces the impact of redundant information.
- **Enhanced Interpretability**: Visualizing which units are activated for different kinds of input information demonstrates the model's interpretability.

### Experimental Results

Compared with existing non-pretrained multi-modal reasoning models, the IIU model improves performance by 3% on standard datasets and outperforms basic pretrained multi-modal models. The experimental results show that the IIU model is effective in disentangling intra-modal clues and providing interpretable reasoning evidence.

### Summary

This paper addresses the complexity and insufficient generalization ability of existing knowledge-based VQA methods by introducing the IIU model, which improves both performance and interpretability.
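
The three ingredients described above (independent units, communication between units, and a gradually updated memory) can be illustrated with a small toy sketch. This is not the authors' implementation, which is available in their repository. Every name here (`InferenceUnit`, `communicate`, `update_memory`, `reason`), the dot-product attention used for communication, and the sigmoid gate used for the memory update are assumptions chosen only to make the idea concrete.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class InferenceUnit:
    """One functionally independent unit with its own (hypothetical) weights."""
    def __init__(self, dim, rng):
        self.W = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    def process(self, clue):
        # Each unit transforms only its own clue; no parameters are shared.
        return [math.tanh(v) for v in matvec(self.W, clue)]

def communicate(states):
    """Each unit attends over the other units and adds their weighted message.

    Assumes at least two units, so every unit has someone to attend to.
    """
    dim = len(states[0])
    scale = math.sqrt(dim)
    updated = []
    for i, h in enumerate(states):
        others = [s for j, s in enumerate(states) if j != i]
        weights = softmax([dot(h, o) / scale for o in others])
        message = [sum(w * o[d] for w, o in zip(weights, others)) for d in range(dim)]
        updated.append([a + b for a, b in zip(h, message)])
    return updated

def update_memory(memory, states, question):
    """Gated update (an assumed form): question-relevant units contribute more."""
    dim = len(memory)
    relevance = softmax([dot(question, s) for s in states])
    candidate = [sum(r * s[d] for r, s in zip(relevance, states)) for d in range(dim)]
    gate = 1.0 / (1.0 + math.exp(-dot(question, candidate)))
    return [(1.0 - gate) * m + gate * c for m, c in zip(memory, candidate)]

def reason(clues, question, steps=2, seed=0):
    """Run a few reasoning steps, reusing the same units at every step."""
    rng = random.Random(seed)
    dim = len(question)
    units = [InferenceUnit(dim, rng) for _ in clues]
    memory = [0.0] * dim
    states = clues
    for _ in range(steps):
        states = [u.process(s) for u, s in zip(units, states)]
        states = communicate(states)
        memory = update_memory(memory, states, question)
    return memory

# Three toy "clues" (e.g. visual, semantic, and knowledge features) and a question.
clues = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [-0.3, 0.3, 0.1, 0.0]]
question = [0.2, 0.1, -0.1, 0.3]
memory = reason(clues, question)
```

In this sketch the weights are random and untrained, so the resulting `memory` vector carries no meaning. The point is only the data flow: clues stay in separate units, information crosses between units through an explicit communication step, and the memory changes gradually under a gate rather than being overwritten.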

IIU: Independent Inference Units for Knowledge-based Visual Question Answering

II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering

Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

DIEM: Decomposition-Integration Enhancing Multimodal Insights

Perceptual Visual Reasoning with Knowledge Propagation

KM4: Visual reasoning via Knowledge Embedding Memory Model with Mutual Modulation

Explicit Knowledge-based Reasoning for Visual Question Answering

Knowledge-aware image understanding with multi-level visual representation enhancement for visual question answering

Integrating Neural-Symbolic Reasoning With Variational Causal Inference Network for Explanatory Visual Question Answering

Joint Answering and Explanation for Visual Commonsense Reasoning

Interpretable Visual Question Answering Referring to Outside Knowledge

Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios

Knowledge Condensation and Reasoning for Knowledge-based VQA

Knowledge-Enhanced Visual Question Answering with Multi-modal Joint Guidance

Detection-based Intermediate Supervision for Visual Question Answering

DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering

Question guided multimodal receptive field reasoning network for fact-based visual question answering

Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering

Visual Question Answering With Dense Inter- and Intra-Modality Interactions

See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning