Abstract: The ability to understand causality significantly impacts the competence of large language models (LLMs) in output explanation and counterfactual reasoning, as causality reveals the underlying data distribution. However, the lack of a comprehensive benchmark currently limits the evaluation of LLMs' causal learning capabilities. To fill this gap, this paper develops CausalBench based on data from the causal research community, enabling comparative evaluations of LLMs against traditional causal learning algorithms. To provide a comprehensive investigation, we offer three tasks of varying difficulty: correlation, causal skeleton, and causality identification. Evaluations of 19 leading LLMs reveal that, while closed-source LLMs show potential for simple causal relationships, they significantly lag behind traditional algorithms on larger-scale networks ($>50$ nodes). Specifically, LLMs struggle with collider structures but excel at chain structures, especially at long-chain causality analogous to Chain-of-Thought techniques. This supports current prompting approaches while suggesting directions to enhance LLMs' causal reasoning capability. Furthermore, CausalBench incorporates background knowledge and training data into prompts to thoroughly unlock LLMs' text-comprehension ability during evaluation. The findings indicate that LLMs understand causality through semantic associations with distinct entities, rather than directly from contextual information or numerical distributions.
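To make the chain and collider structures named in the abstract concrete, here is a minimal sketch that enumerates both patterns in a small directed graph. The toy edge list and the function names `find_chains` and `find_colliders` are hypothetical, chosen only for illustration; they are not taken from CausalBench or its datasets.

```python
from itertools import combinations


def find_chains(edges):
    """Return sorted (a, b, c) triples where a -> b -> c and a != c."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, set()).add(dst)
    chains = []
    for a, b in edges:
        for c in children.get(b, ()):
            if c != a:
                chains.append((a, b, c))
    return sorted(chains)


def find_colliders(edges):
    """Return sorted (a, b, c) triples where a -> b <- c and a, c are not adjacent.

    Only unshielded colliders are reported; each pair of parents is
    listed once, in alphabetical order.
    """
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    adjacent = {frozenset(edge) for edge in edges}
    colliders = []
    for b, ps in parents.items():
        for a, c in combinations(sorted(ps), 2):
            if frozenset((a, c)) not in adjacent:
                colliders.append((a, b, c))
    return sorted(colliders)


# Hypothetical toy graph (not a CausalBench dataset).
edges = [("smoking", "tar"), ("tar", "cancer"), ("genetics", "cancer")]

print(find_chains(edges))     # one chain: smoking -> tar -> cancer
print(find_colliders(edges))  # one collider: genetics -> cancer <- tar
```

In a chain, information flows through the middle node; in a collider, two otherwise unrelated causes share one effect. That difference is one plausible reason the two structures pose different difficulties for a model.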

What problem does this paper attempt to address?

The paper attempts to address the current lack of a comprehensive benchmark for evaluating the causal learning capabilities of large language models (LLMs). Specifically, existing research has the following shortcomings:

1. **Limited dataset scale**: Most studies use causal networks either from private datasets or with a very limited number of nodes, which restricts the evaluation of LLMs' performance on large-scale datasets.
2. **Single evaluation task**: Most studies focus only on identifying pairwise causal relationships, neglecting tasks such as correlation identification and more complex causal network identification, which can better demonstrate LLMs' ability to understand causal relationships at different levels of difficulty.
3. **Information-poor prompt formats**: Prompts in existing studies usually contain only variable names. They lack rich semantic information and fail to fully utilize LLMs' capabilities in long-text understanding and prior knowledge integration.
4. **Limited types of LLMs evaluated**: Most studies evaluate only a few types of LLMs, which may affect the generality and representativeness of the evaluation results.

To fill these gaps, the paper proposes a comprehensive benchmark named CausalBench, aimed at systematically evaluating the causal learning capabilities of LLMs through diversified datasets, multi-level evaluation tasks, and rich prompt formats. The main advantages of CausalBench include:

- **Diversified datasets**: Covers 15 commonly used real-world causal learning datasets, with node counts ranging from 5 to 109.
- **Multi-level evaluation tasks**: Includes three tasks (correlation identification, causal skeleton identification, and causal relationship identification) to evaluate LLMs' causal learning capabilities at different levels of difficulty.
- **Rich prompt formats**: Provides four prompt formats, namely variable names, variable names + training data, variable names + background knowledge, and a combination of the three, to evaluate LLMs' performance under different information scopes.
- **Showcasing LLMs' upper limits**: Evaluates causal relationships of different scales and complexities to show the upper limits of LLMs in causal learning.

Through these improvements, CausalBench not only provides a comprehensive framework for evaluating the causal learning capabilities of LLMs but also offers a valuable reference for future research.
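The four prompt formats described above can be sketched as a single builder whose optional arguments switch the extra information on or off. This is an illustrative sketch only: the function name `build_prompt`, the section labels, and the question wording are assumptions, not the templates used in CausalBench.

```python
def build_prompt(variables, training_data=None, background=None,
                 question="Which pairs of variables are directly causally related?"):
    """Assemble a prompt from variable names plus optional extras.

    variables     -- list of variable names (always included)
    training_data -- optional list of dicts mapping variable name -> value
    background    -- optional dict mapping variable name -> description
    question      -- task question; the default wording is a placeholder
    """
    parts = ["Variables: " + ", ".join(variables)]
    if background is not None:
        parts.append("Background knowledge:")
        for name in variables:
            if name in background:
                parts.append(f"- {name}: {background[name]}")
    if training_data is not None:
        parts.append("Training data (one row per sample):")
        parts.append(", ".join(variables))
        for row in training_data:
            parts.append(", ".join(str(row[name]) for name in variables))
    parts.append("Question: " + question)
    return "\n".join(parts)


# Hypothetical inputs, for illustration only.
names = ["rain", "wet_grass"]
rows = [{"rain": 1, "wet_grass": 1}, {"rain": 0, "wet_grass": 0}]
notes = {"rain": "whether it rained overnight",
         "wet_grass": "whether the lawn is wet in the morning"}

names_only = build_prompt(names)
with_data = build_prompt(names, training_data=rows)
with_background = build_prompt(names, background=notes)
combined = build_prompt(names, training_data=rows, background=notes)
```

Keeping the variable names fixed and varying only the optional sections makes it possible to attribute a change in model behaviour to the added information rather than to a change in wording.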

CausalBench: A Comprehensive Benchmark for Causal Learning Capability of LLMs
