Abstract: We present CausalVLR (Causal Visual-Linguistic Reasoning), an open-source toolbox containing a rich set of state-of-the-art causal relation discovery and causal inference methods for various visual-linguistic reasoning tasks, such as VQA, image/video captioning, medical report generation, and model generalization and robustness. These methods are included in the toolbox as PyTorch implementations running on NVIDIA computing systems. The toolbox includes training and inference code and also provides model weights. We believe this toolbox is by far the most complete visual-linguistic causal reasoning toolbox. We hope that the toolbox and benchmark can serve the growing research community by providing a flexible toolkit for re-implementing existing methods and developing new causal reasoning methods. Code and models are available at https://github.com/HCPLab-SYSU/CausalVLR. The project is under active development by HCP-Lab's contributors and we will keep this document updated.

What problem does this paper attempt to address?

The main problem this paper addresses is that current large language models (LLMs) rely heavily on fitting broad knowledge distributions in multimodal tasks and therefore capture spurious correlations between modalities. This makes it difficult for a model to learn reliable reasoning chains (Chain-of-Thought, CoT) that reflect the underlying causal relationships in multimodal knowledge, which limits its generalization and cognitive abilities. Specifically, the paper points out:

1. **The problem of spurious correlations**: Current LLMs are prone to capturing spurious cross-modal correlations rather than true causal relationships when handling vision-language tasks. For example, in tasks such as visual question answering (VQA), image/video caption generation, and medical report generation, the model may make incorrect inferences because of biases in the data.
2. **Insufficient generalization ability**: Because they lack an understanding of causal relationships, existing models perform poorly on new data or under environmental changes.
3. **Limited cognitive ability**: Existing models struggle with in-depth cognitive reasoning and cannot understand the causal connections between things as humans do.

To solve these problems, the paper proposes the CausalVLR toolkit, which aims to improve model performance on vision-language tasks by introducing causal reasoning methods. CausalVLR contains a series of state-of-the-art causal relation discovery and causal inference methods that apply to various vision-language reasoning tasks, such as VQA, image/video caption generation, medical report generation, and model generalization and robustness. These methods can help a model learn more reliable causal relationships, thereby improving its generalization ability and cognitive level.

### Main contributions

1. **Modular design**: The vision-language reasoning framework is decomposed into separate components, so users can build customized reasoning frameworks according to their needs.
2. **Support for multiple frameworks**: The toolkit supports popular vision-language reasoning frameworks.
3. **High efficiency**: All basic modules and operations are executed on the GPU for performance.
4. **State-of-the-art methods**: Drawing on the experience of the HCP-Lab team, the toolkit provides recent causal reasoning and vision-language reasoning algorithms and continues to update them.

Through these contributions, the CausalVLR toolkit aims to give researchers a flexible and powerful tool for re-implementing existing methods and developing new causal reasoning methods, thereby advancing the field of vision-language causal reasoning.
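To make the spurious-correlation problem described above concrete, here is a minimal sketch in plain Python. It is not taken from CausalVLR and does not use its API. It uses backdoor adjustment, a standard causal inference technique chosen here only for illustration. The variables and all probabilities are hypothetical assumptions: `Z` stands for a dataset bias (confounder), `X` for a visual cue, and `Y` for the predicted answer, and `Y` is constructed so that it does not depend on `X` at all.

```python
from itertools import product

# Hypothetical toy model (all numbers are illustrative assumptions):
# Z = dataset bias / confounder, X = visual cue, Y = predicted answer.
P_Z = {0: 0.5, 1: 0.5}
P_X1_GIVEN_Z = {0: 0.1, 1: 0.9}  # P(X=1 | Z=z)
P_Y1_GIVEN_Z = {0: 0.2, 1: 0.8}  # P(Y=1 | Z=z); Y ignores X by construction


def joint(x, y, z):
    """P(X=x, Y=y, Z=z) under the toy model."""
    px = P_X1_GIVEN_Z[z] if x == 1 else 1 - P_X1_GIVEN_Z[z]
    py = P_Y1_GIVEN_Z[z] if y == 1 else 1 - P_Y1_GIVEN_Z[z]
    return P_Z[z] * px * py


def p_y1_given_x(x):
    """Observational P(Y=1 | X=x), which mixes in the confounder."""
    num = sum(joint(x, 1, z) for z in P_Z)
    den = sum(joint(x, y, z) for y, z in product((0, 1), P_Z))
    return num / den


def p_y1_do_x(x):
    """Backdoor adjustment: sum_z P(Y=1 | X=x, Z=z) * P(Z=z)."""
    total = 0.0
    for z in P_Z:
        p_y1_given_xz = joint(x, 1, z) / (joint(x, 0, z) + joint(x, 1, z))
        total += p_y1_given_xz * P_Z[z]
    return total


observational_gap = p_y1_given_x(1) - p_y1_given_x(0)
interventional_gap = p_y1_do_x(1) - p_y1_do_x(0)
print(f"observational gap:  {observational_gap:.3f}")
print(f"interventional gap: {abs(interventional_gap):.3f}")
```

Under these assumed numbers, the observational gap is about 0.48 even though `X` has no causal effect on `Y`, while the adjusted (interventional) gap is essentially zero. A model that fits only the observational distribution would treat the cue `X` as predictive; this is the kind of shortcut that causal methods try to remove.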

CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning
