CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning

Yang Liu, Weixing Chen, Guanbin Li, Liang Lin
DOI: https://doi.org/10.48550/arXiv.2306.17462
2023-12-13
Abstract: We present CausalVLR (Causal Visual-Linguistic Reasoning), an open-source toolbox containing a rich set of state-of-the-art causal relation discovery and causal inference methods for various visual-linguistic reasoning tasks, such as VQA, image/video captioning, medical report generation, model generalization and robustness, etc. These methods have been included in the toolbox with PyTorch implementations for NVIDIA computing systems. It not only includes training and inference code, but also provides model weights. We believe this toolbox is by far the most complete visual-linguistic causal reasoning toolbox. We hope that the toolbox and benchmark can serve the growing research community by providing a flexible toolkit for re-implementing existing methods and developing new causal reasoning methods. Code and models are available at https://github.com/HCPLab-SYSU/CausalVLR. The project is under active development by HCP-Lab's contributors and we will keep this document updated.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper attempts to solve is that current large language models (LLMs) rely too heavily on fitting broad knowledge distributions in multimodal tasks and therefore capture spurious correlations between modalities. This makes it hard for a model to learn reliable chain-of-thought (CoT) reasoning that reflects the underlying causal relationships in multimodal knowledge, which limits its generalization and cognitive abilities. Specifically, the paper points out:

1. **The problem of spurious correlations**: Current LLMs tend to capture spurious cross-modal correlations rather than true causal relationships when handling vision-language tasks. For example, in visual question answering (VQA), image/video captioning, and medical report generation, a model may draw incorrect inferences because of biases in the data.
2. **Insufficient generalization ability**: Because they lack an understanding of causal relationships, existing models perform poorly on new data or under environmental changes.
3. **Limited cognitive ability**: Existing models struggle with in-depth cognitive reasoning and cannot understand the causal connections between things the way humans do.

To solve these problems, the paper proposes the CausalVLR toolkit, which aims to improve model performance on vision-language tasks by introducing causal reasoning methods. CausalVLR contains a set of state-of-the-art causal relation discovery and causal inference methods applicable to various vision-language reasoning tasks, such as VQA, image/video captioning, medical report generation, and model generalization and robustness. These methods can help a model learn more reliable causal relationships, thereby improving its generalization ability and cognitive level.

### Main contributions

1. **Modular design**: The vision-language reasoning framework is decomposed into separate components, so users can build customized reasoning frameworks according to their needs.
2. **Support for multiple frameworks**: The toolkit supports popular vision-language reasoning frameworks.
3. **High efficiency**: All basic modules and operations run on the GPU for performance.
4. **State-of-the-art methods**: Drawing on the experience of the HCP-Lab team, the toolkit provides recent causal reasoning and vision-language reasoning algorithms and keeps them updated.

Through these contributions, the CausalVLR toolkit aims to give researchers a flexible and powerful tool for re-implementing existing methods and developing new causal reasoning methods, thereby advancing the field of vision-language causal reasoning.
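
To make the modular-design idea concrete, here is a minimal sketch of how a reasoning pipeline can be assembled from interchangeable components through a registry and a config list. It is an illustration only and does not reproduce CausalVLR's actual API: the names `REGISTRY`, `register`, `build_pipeline`, and the three toy components are hypothetical, and the "features" are plain Python lists rather than GPU tensors.

```python
from typing import Callable, Dict

# Hypothetical registry (not CausalVLR's API): component name -> factory function.
REGISTRY: Dict[str, Callable[..., Callable[[dict], dict]]] = {}


def register(name: str):
    """Decorator that records a component factory under the given name."""
    def decorator(factory):
        REGISTRY[name] = factory
        return factory
    return decorator


@register("visual_encoder")
def make_visual_encoder(scale: float = 1.0):
    # Toy stand-in for an image/video encoder: rescales raw pixel values.
    def encode(sample: dict) -> dict:
        out = dict(sample)
        out["visual_feat"] = [scale * p for p in sample["pixels"]]
        return out
    return encode


@register("text_encoder")
def make_text_encoder():
    # Toy stand-in for a language encoder: one number (word length) per word.
    def encode(sample: dict) -> dict:
        out = dict(sample)
        out["text_feat"] = [float(len(w)) for w in sample["question"].split()]
        return out
    return encode


@register("fusion")
def make_fusion():
    # Toy stand-in for a cross-modal reasoning head: sums both feature sets.
    def fuse(sample: dict) -> dict:
        out = dict(sample)
        out["score"] = sum(sample["visual_feat"]) + sum(sample["text_feat"])
        return out
    return fuse


def build_pipeline(config: list) -> Callable[[dict], dict]:
    """Instantiate each configured component and chain them in order."""
    stages = [REGISTRY[c["type"]](**c.get("args", {})) for c in config]

    def run(sample: dict) -> dict:
        for stage in stages:
            sample = stage(sample)
        return sample
    return run


# Swapping a component means editing the config, not the pipeline code.
config = [
    {"type": "visual_encoder", "args": {"scale": 0.5}},
    {"type": "text_encoder"},
    {"type": "fusion"},
]
pipeline = build_pipeline(config)
result = pipeline({"pixels": [2.0, 4.0], "question": "what is this"})
print(result["score"])  # 13.0
```

In a design like this, replacing the fusion stage with a causal-intervention module would only require registering a new factory and changing one config entry, which is the kind of customization the modular-design contribution describes.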