DIEM: Decomposition-Integration Enhancing Multimodal Insights

Xinyi Jiang, Guoming Wang, Junhao Guo, Juncheng Li, Wenqiao Zhang, Rongxing Lu, Siliang Tang
DOI: https://doi.org/10.1109/cvpr52733.2024.02578
2024-01-01
Abstract: In image question answering, abundant and sometimes redundant information makes it challenging to precisely match and integrate information from text and images. In this paper, we propose Decomposition-Integration Enhancing Multimodal Insight (DIEM), which first decomposes the given question and image into multiple subquestions and several sub-images, aiming to isolate specific elements for more focused analysis. We then integrate these sub-elements by matching each subquestion with its relevant sub-images, while also retaining the original image, to construct a comprehensive answer to the original question without losing sight of the overall context. This strategy mirrors the human cognitive process of simplifying complex problems into smaller components for individual analysis, followed by an integration of these insights. We implement DIEM on the LLaVA-v1.5 model and evaluate its performance on ScienceQA and MM-Vet. Experimental results indicate that our method boosts accuracy in most question classes of ScienceQA (+2.03% on average), especially in the image modality (+3.40%). On MM-Vet, our method raises the MM-Vet score from 31.1 to 32.4. These findings highlight DIEM's effectiveness in harmonizing the complexities of multimodal data, demonstrating its ability to enhance accuracy and depth in image question answering through its decomposition-integration process.
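The decompose, match, and integrate flow described in the abstract can be illustrated with a minimal toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the names `SubImage`, `decompose_question`, `match`, and `integrate` are hypothetical, the question is split on the word "and" instead of by a model, sub-images are stand-in labeled crop boxes, and matching is plain keyword overlap. The only behavior taken from the abstract is that each subquestion is paired with its relevant sub-images while the original image is retained.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class SubImage:
    """Hypothetical stand-in for a cropped region of the original image."""
    label: str   # assumed text tag describing what the region shows
    box: tuple   # assumed (x0, y0, x1, y1) crop coordinates


def decompose_question(question: str) -> list[str]:
    """Toy decomposition: split on ' and ' (a real system would use a model)."""
    parts = [p.strip(" ?") for p in question.split(" and ")]
    return [p + "?" for p in parts if p]


def match(subquestion: str, sub_images: list[SubImage]) -> list[SubImage]:
    """Toy matching: keep regions whose label appears as a word in the subquestion."""
    words = set(subquestion.lower().strip("?").split())
    return [s for s in sub_images if s.label.lower() in words]


def integrate(question: str, sub_images: list[SubImage]) -> list[dict]:
    """Pair each subquestion with its matched regions and keep the full image."""
    plan = []
    for sq in decompose_question(question):
        plan.append({
            "subquestion": sq,
            "sub_images": match(sq, sub_images),
            "keep_original_image": True,  # the abstract says the original is retained
        })
    return plan


if __name__ == "__main__":
    regions = [SubImage("cat", (0, 0, 50, 50)), SubImage("dog", (50, 0, 100, 50))]
    for step in integrate("What color is the cat and where is the dog?", regions):
        print(step["subquestion"], [s.label for s in step["sub_images"]])
```

In an actual system, each entry of the resulting plan would be passed to a multimodal model together with the original image; that step is omitted here because the abstract does not specify how it is done.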