Abstract: In practical statistical causal discovery (SCD), embedding domain expert knowledge as constraints into the algorithm is significant for creating consistent, meaningful causal models, despite the challenges in systematic acquisition of the background knowledge. To overcome these challenges, this paper proposes a novel methodology for causal inference, in which SCD methods and knowledge based causal inference (KBCI) with a large language model (LLM) are synthesized through "statistical causal prompting (SCP)" for LLMs and prior knowledge augmentation for SCD. Experiments have revealed that GPT-4 can cause the output of the LLM-KBCI and the SCD result with prior knowledge from LLM-KBCI to approach the ground truth, and that the SCD result can be further improved if GPT-4 undergoes SCP. Furthermore, by using an unpublished real-world dataset, we have demonstrated that the background knowledge provided by the LLM can improve SCD on this dataset, even if this dataset has never been included in the training data of the LLM. The proposed approach can thus address challenges such as dataset biases and limitations, illustrating the potential of LLMs to improve data-driven causal inference across diverse scientific domains.

What problem does this paper attempt to address?

The paper addresses how to integrate domain expert knowledge into statistical causal discovery (SCD) effectively, so that the resulting causal models are more consistent and meaningful. Existing SCD methods can discover causal relationships from data automatically, but without background knowledge they may produce inaccurate results. Real-world phenomena often violate the assumptions built into SCD algorithms, and experimental, systematically collected datasets sufficient for causal inference are very difficult to obtain. Observational datasets are easier to obtain, but they are vulnerable to selection bias and measurement error. Improving the performance of SCD algorithms is therefore an important challenge, especially for closed data that was not part of the pre-training datasets of large language models (LLMs).

The paper proposes a new methodology that combines LLM knowledge with SCD methods through "statistical causal prompting (SCP)", with the aim of improving the accuracy and reliability of causal models. The method first analyzes the dataset with an SCD method without background knowledge. It then uses the LLM to generate detailed knowledge about the causal relationships between variables and, through SCP, to evaluate the probabilities of those relationships. Finally, these probabilities are converted into a prior knowledge matrix and fed back into the SCD process, improving the causal discovery result.

With this method, the paper shows that LLMs can provide background knowledge that helps SCD even when the dataset has never been included in the LLM's pre-training data. It also reports that the combination can significantly improve the statistical validity of causal models, especially for biased datasets. This indicates that LLMs have great potential for enhancing data-driven causal inference and can improve the quality of causal models across multiple scientific fields.
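The step in which elicited probabilities become a prior knowledge matrix can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the summary above: the function name `probabilities_to_prior`, the two thresholds, the row/column convention, and the encoding (1 for a required edge, 0 for a forbidden edge, -1 for undecided).

```python
from typing import List


def probabilities_to_prior(
    prob: List[List[float]],
    forbid_below: float = 0.05,   # assumed threshold, not from the paper summary
    require_above: float = 0.95,  # assumed threshold, not from the paper summary
) -> List[List[int]]:
    """Turn LLM-elicited causal probabilities into a prior knowledge matrix.

    Assumed convention: prob[i][j] is the elicited probability that
    variable j causes variable i. The output uses a hypothetical encoding:
    1 = edge required, 0 = edge forbidden, -1 = left to the SCD algorithm.
    """
    n = len(prob)
    prior = [[-1] * n for _ in range(n)]
    for i in range(n):
        if len(prob[i]) != n:
            raise ValueError("probability matrix must be square")
        for j in range(n):
            if i == j:
                prior[i][j] = 0  # a variable is never its own cause
            elif prob[i][j] >= require_above:
                prior[i][j] = 1  # confident the relationship exists
            elif prob[i][j] < forbid_below:
                prior[i][j] = 0  # confident the relationship is absent
            # otherwise keep -1: the data decide
    return prior


# Made-up probabilities for three variables, for illustration only.
example_prob = [
    [0.00, 0.01, 0.50],
    [0.99, 0.00, 0.02],
    [0.60, 0.97, 0.00],
]
example_prior = probabilities_to_prior(example_prob)
for row in example_prior:
    print(row)
```

Leaving mid-range probabilities undecided keeps the LLM from overriding the data where its judgment is uncertain; only confident judgments constrain the second SCD run.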

Integrating Large Language Models in Causal Discovery: A Statistical Causal Approach

Bridging Causal Discovery and Large Language Models: A Comprehensive Survey of Integrative Approaches and Future Directions

Large Language Model for Causal Decision Making

From Query Tools to Causal Architects: Harnessing Large Language Models for Advanced Causal Discovery from Data

Large Language Models for Constrained-Based Causal Discovery

Causal Dataset Discovery with Large Language Models

Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

Evaluating Large Language Models for Causal Modeling

Causality for Large Language Models

LLM4Causal: Democratized Causal Tools for Everyone via Large Language Model

Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey

Is Knowledge All Large Language Models Needed for Causal Reasoning?

LLM-initialized Differentiable Causal Discovery

Multi-Agent Causal Discovery Using Large Language Models

Cause and Effect: Can Large Language Models Truly Understand Causality?

Large Language Models are Effective Priors for Causal Graph Discovery

Causal Inference with Large Language Model: A Survey

Counterfactual Causal Inference in Natural Language with Large Language Models

Can We Utilize Pre-trained Language Models within Causal Discovery Algorithms?

Discovery of the Hidden World with Large Language Models

Causal Structure Learning Supervised by Large Language Model