Integrating Large Language Models in Causal Discovery: A Statistical Causal Approach

Masayuki Takayama, Tadahisa Okuda, Thong Pham, Tatsuyoshi Ikenoue, Shingo Fukuma, Shohei Shimizu, Akiyoshi Sannai
2024-05-22
Abstract: In practical statistical causal discovery (SCD), embedding domain expert knowledge as constraints into the algorithm is significant for creating consistent, meaningful causal models, despite the challenges in systematic acquisition of the background knowledge. To overcome these challenges, this paper proposes a novel methodology for causal inference, in which SCD methods and knowledge-based causal inference (KBCI) with a large language model (LLM) are synthesized through "statistical causal prompting (SCP)" for LLMs and prior knowledge augmentation for SCD. Experiments have revealed that GPT-4 can cause the output of the LLM-KBCI and the SCD result with prior knowledge from LLM-KBCI to approach the ground truth, and that the SCD result can be further improved if GPT-4 undergoes SCP. Furthermore, by using an unpublished real-world dataset, we have demonstrated that the background knowledge provided by the LLM can improve SCD on this dataset, even if this dataset has never been included in the training data of the LLM. The proposed approach can thus address challenges such as dataset biases and limitations, illustrating the potential of LLMs to improve data-driven causal inference across diverse scientific domains.
Machine Learning, Artificial Intelligence, Methodology
What problem does this paper attempt to address?
This paper addresses how to integrate domain expert knowledge into statistical causal discovery (SCD) effectively, so that the resulting causal models are more consistent and meaningful. Existing SCD methods can discover causal relationships from data automatically, but without background knowledge they may produce inaccurate results: real-world phenomena often violate the assumptions built into SCD algorithms, and experimental or systematically collected datasets sufficient for causal inference are very difficult to obtain. Observational datasets are easier to obtain but are vulnerable to selection bias and measurement error. Improving the performance of SCD algorithms is therefore an important challenge, especially for closed data that is not included in the pre-training datasets of large language models (LLMs).
The paper proposes a new methodology that uses "statistical causal prompting (SCP)" to combine LLM knowledge with SCD methods, aiming to improve the accuracy and reliability of causal models. Specifically, the method first analyzes the dataset with an SCD method without background knowledge. It then uses the LLM to generate detailed knowledge about the causal relationships between variables and, through SCP, to evaluate the probabilities of those relationships. Finally, these probabilities are converted into a prior knowledge matrix and fed back into the SCD process, thereby improving the causal discovery results.
With this method, the paper shows that an LLM can provide background knowledge that helps SCD even when the dataset has never been included in the LLM's pre-training data, and reports that the combination can improve the statistical validity of causal models, especially for biased datasets. This indicates that LLMs have considerable potential to enhance data-driven causal inference and to improve the quality of causal models across multiple scientific fields.
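The step in which LLM-assessed probabilities become a prior knowledge matrix can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not a detail confirmed by the text above: the function name `probabilities_to_prior_knowledge`, the thresholds `lower` and `upper`, the orientation of the matrix, and the encoding (0 for a forbidden edge, 1 for a required edge, -1 for no prior knowledge).

```python
from typing import List

# Hypothetical encoding of prior knowledge entries (an assumption for this sketch).
FORBIDDEN = 0   # the LLM considers the causal relationship very unlikely
FORCED = 1      # the LLM considers the causal relationship very likely
UNKNOWN = -1    # no constraint; the SCD algorithm decides from the data


def probabilities_to_prior_knowledge(
    prob: List[List[float]],
    lower: float = 0.05,   # illustrative threshold, not taken from the summary
    upper: float = 0.95,   # illustrative threshold, not taken from the summary
) -> List[List[int]]:
    """Turn LLM-assessed causal probabilities into a constraint matrix.

    prob[i][j] is assumed to be the probability that variable j
    directly causes variable i.
    """
    n = len(prob)
    if any(len(row) != n for row in prob):
        raise ValueError("prob must be a square matrix")
    if not 0.0 <= lower < upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower < upper <= 1")

    pk = [[UNKNOWN] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                pk[i][j] = FORBIDDEN      # a variable does not cause itself
            elif prob[i][j] < lower:
                pk[i][j] = FORBIDDEN
            elif prob[i][j] >= upper:
                pk[i][j] = FORCED
    return pk


# Made-up probabilities for three variables, for demonstration only.
example_probs = [
    [0.00, 0.02, 0.50],
    [0.97, 0.00, 0.10],
    [0.40, 0.99, 0.00],
]
example_pk = probabilities_to_prior_knowledge(example_probs)
for row in example_pk:
    print(row)
```

Leaving mid-range probabilities as "unknown" keeps the constraint set small: only relationships the LLM is confident about restrict the second SCD run, and the data decide the rest.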