Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images

Zhiyuan Li, Heng Wang, Dongnan Liu, Chaoyi Zhang, Ao Ma, Jieting Long, Weidong Cai
2024-08-30
Abstract: Large Language Models (LLMs) have showcased exceptional ability in causal reasoning from textual information. However, will these causalities remain straightforward for Vision Large Language Models (VLLMs) when only visual hints are provided? Motivated by this, we propose a novel Multimodal Causal Reasoning benchmark, namely MuCR, to challenge VLLMs to infer semantic cause-and-effect relationship when solely relying on visual cues such as action, appearance, clothing, and environment. Specifically, we introduce a prompt-driven image synthesis approach to create siamese images with embedded semantic causality and visual cues, which can effectively evaluate VLLMs' causal reasoning capabilities. Additionally, we develop tailored metrics from multiple perspectives, including image-level match, phrase-level understanding, and sentence-level explanation, to comprehensively assess VLLMs' comprehension abilities. Our extensive experiments reveal that the current state-of-the-art VLLMs are not as skilled at multimodal causal reasoning as we might have hoped. Furthermore, we perform a comprehensive analysis to understand these models' shortcomings from different views and suggest directions for future research. We hope MuCR can serve as a valuable resource and foundational benchmark in multimodal causal reasoning research. The project is available at: https://github.com/Zhiyuan-Li-John/MuCR
Computer Vision and Pattern Recognition, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper addresses the lack of adequate evaluation of how well current Vision Large Language Models (VLLMs) perform multimodal causal reasoning. Specifically, the authors propose and construct a new benchmark, MuCR (Multimodal Causal Reasoning benchmark), to challenge and evaluate the ability of VLLMs to perform causal reasoning when relying solely on visual cues.

#### Background and Motivation

1. **Limitations of Existing Benchmarks**:
   - **Lack of Visual Modality**: Existing causal reasoning benchmarks focus mainly on the text modality and cannot fully evaluate the visual understanding ability of VLLMs.
   - **Lack of Multi-Image Understanding**: Current causal reasoning tasks usually involve only a single image and fail to evaluate causal reasoning across multiple images.
   - **Lack of Causal Relationship Questions**: Some existing multi-image understanding benchmarks do not include causal relationship questions, so they cannot comprehensively evaluate the causal reasoning ability of VLLMs.
2. **Research Motivation**:
   - The authors ask whether VLLMs, when relying solely on visual cues, can reach a level of causal reasoning similar to what Large Language Models (LLMs) achieve in the text modality.
   - By introducing MuCR, the authors aim to evaluate the multimodal causal reasoning capabilities of VLLMs more comprehensively and to provide a valuable resource for future research.

#### Main Features of the MuCR Benchmark

1. **Siamese Image Generation**:
   - A prompt-driven image synthesis method generates siamese images embedded with semantic causal relationships and visual cues, so that the causal reasoning ability of VLLMs can be evaluated effectively.
2. **Multi-Level Evaluation**:
   - **Image-Level Evaluation**: Evaluates the ability of VLLMs to identify visual cues and semantic causal relationships in images.
   - **Phrase-Level Evaluation**: Tests the ability of VLLMs to distinguish the correct cue phrases.
   - **Sentence-Level Evaluation**: Evaluates the ability of VLLMs to explain causal relationships.
3. **Customized Metrics**:
   - Tailored evaluation metrics cover multiple perspectives (image-level matching, phrase-level understanding, and sentence-level explanation) to comprehensively assess the comprehension ability of VLLMs.

#### Experimental Results and Analysis

1. **Experimental Setup**:
   - Multiple open-source and proprietary VLLMs are evaluated on the MuCR benchmark, including BLIP2, OpenFlamingo, InstructBLIP, MiniGPT4, LLaVA, Claude, Gemini, and the GPT-4 series.
   - The models are also evaluated under different combinations of image input forms.
2. **Experimental Results**:
   - Current state-of-the-art VLLMs still have limited capabilities in multimodal causal reasoning; open-source models perform especially poorly.
   - Proprietary models such as the GPT-4 series perform better but still fall short of human-level performance.
3. **Analysis and Discussion**:
   - General LLM enhancement strategies (such as chain-of-thought and in-context learning) have a limited impact on MuCR performance, and sometimes even a negative one.
   - The multi-image input form may be a promising research direction and can significantly improve the performance of VLLMs.
   - Case studies show that the main problems of open-source models lie in visual perception and hallucination, while proprietary models are easily swayed by strong causal knowledge priors in the language model, which leads them to neglect visual evidence.
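
The enhancement strategies named in the analysis differ only in how the question is posed to the model. As a rough illustration, the sketch below builds three prompt variants for the same question. The function names, the prompt wording, and the example question are hypothetical and are not taken from the paper; the actual prompts used by the authors may look quite different.

```python
from typing import List, Tuple


def direct_prompt(question: str) -> str:
    """Plain zero-shot prompt: ask for the answer letter only."""
    return f"{question}\nAnswer with a single letter."


def chain_of_thought_prompt(question: str) -> str:
    """Chain-of-thought variant: ask the model to reason before answering."""
    return (
        f"{question}\n"
        "Let's think step by step about the visual cues, "
        "then give the final answer as a single letter."
    )


def in_context_prompt(question: str, examples: List[Tuple[str, str]]) -> str:
    """In-context learning variant: prepend solved (question, answer) demos."""
    demos = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{demos}\n\nQ: {question}\nA:"


if __name__ == "__main__":
    # Hypothetical question wording, for illustration only.
    q = "Image 1 shows a cause. Which candidate image (A-D) shows its effect?"
    print(direct_prompt(q))
    print(chain_of_thought_prompt(q))
    print(in_context_prompt(q, [("An earlier solved question", "B")]))
```

Keeping the question fixed and varying only the wrapper makes it easier to attribute any change in accuracy to the prompting strategy itself, which is the kind of comparison the analysis describes.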

#### Conclusion

By introducing the MuCR benchmark, the authors propose a new evaluation framework for multimodal causal reasoning, reveal the current deficiencies of VLLMs in this area, and suggest directions for future research. MuCR can be used not only to evaluate the causal reasoning ability of VLLMs but also as a reference for improving these models.
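
To make the multi-level evaluation more concrete, the following minimal sketch aggregates accuracy per evaluation level. It assumes that each question is multiple-choice and scored by exact match on the chosen option, which is a simplification: the `Record` schema and the level names are hypothetical, and sentence-level explanations would likely need a separate judging procedure that this sketch does not cover.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Record(NamedTuple):
    """One scored question (hypothetical schema, not the benchmark's format)."""
    level: str      # e.g. "image" or "phrase"
    predicted: str  # option chosen by the model
    gold: str       # reference option


def accuracy_by_level(records: Iterable[Record]) -> Dict[str, float]:
    """Return the fraction of exact-match answers for each evaluation level."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for r in records:
        total[r.level] += 1
        # Compare case-insensitively and ignore surrounding whitespace.
        if r.predicted.strip().upper() == r.gold.strip().upper():
            correct[r.level] += 1
    return {level: correct[level] / total[level] for level in total}


if __name__ == "__main__":
    demo = [
        Record("image", "A", "A"),
        Record("image", "b", "C"),
        Record("phrase", " d ", "D"),
    ]
    print(accuracy_by_level(demo))
```

Reporting the levels separately, rather than as one pooled score, keeps a model's ability to match images apart from its ability to pick the right cue phrase.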