CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge

Tianshi Zheng, Jiaxin Bai, Yicheng Wang, Tianqing Fang, Yue Guo, Yauwai Yim, Yangqiu Song
2024-07-30
Abstract: While large language models (LLMs) have demonstrated impressive capabilities across various natural language processing tasks by acquiring rich factual knowledge from their broad training data, their ability to synthesize and logically reason with this knowledge in complex ways remains underexplored. In this work, we present a systematic evaluation of state-of-the-art LLMs' complex logical reasoning abilities through a novel benchmark of automatically generated complex reasoning questions over general domain and biomedical knowledge graphs. Our extensive experiments, employing diverse in-context learning techniques, reveal that LLMs excel at reasoning over general world knowledge but face significant challenges with specialized domain-specific knowledge. We find that prompting with explicit Chain-of-Thought demonstrations can substantially improve LLM performance on complex logical reasoning tasks with diverse logical operations. Interestingly, our controlled evaluations uncover an asymmetry where LLMs display proficiency at set union operations, but struggle considerably with set intersections - a key building block of logical reasoning. To foster further work, we will publicly release our evaluation benchmark and code.
Computation and Language
What problem does this paper attempt to address?
The paper evaluates how well large language models (LLMs) perform complex logical reasoning, particularly multi-step reasoning that must combine several pieces of factual knowledge. To this end, the authors construct a new benchmark framework, CLR-Fact (Complex Logical Reasoning over Factual Knowledge), which systematically assesses state-of-the-art LLMs on complex logical reasoning over facts drawn from knowledge graphs.

The core contributions of the paper can be summarized as follows:

1. **CLR-Fact Evaluation Framework**: A novel framework for assessing how well LLMs reason logically over combinations of facts. It supports various reasoning patterns and knowledge graphs from different domains.
2. **Comprehensive Evaluation Benchmark**: A benchmark of 5,200 complex reasoning questions covering 26 different logical patterns. The questions require multi-step logical operations such as intersection, union, negation, and multi-hop reasoning across entities and relations in knowledge graphs. The benchmark includes general domain knowledge from Freebase and specialized biomedical knowledge extracted from PrimeKG.
3. **Extensive Experimental Evaluation**: Experiments on eight state-of-the-art LLMs using various in-context learning techniques. Additional controlled experiments probe the models' core capabilities on individual set operations, which form the basis of complex logical reasoning.

The research findings indicate that:

- LLMs reason well over general world knowledge but face significant challenges with specialized domain knowledge (e.g., biomedical facts).
- Models perform poorly on questions involving negation or set complement operations, indicating limitations in understanding and reasoning about negative statements and set exclusion.
- Models perform well on set union operations but struggle considerably with set intersection operations, an asymmetry between combining sets and identifying their common elements.
- Chain-of-Thought (CoT) prompting substantially improves model performance on complex questions that require multi-step logical reasoning.

Overall, this work reveals the strengths and limitations of current LLMs in complex logical reasoning and provides insights for future research. To facilitate further work, the authors plan to publicly release the benchmark and code.
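To make the logical operations named above concrete (multi-hop reasoning, intersection, union, and negation), here is a minimal sketch of how such questions map onto set operations over knowledge-graph triples. It is only an illustration and not the authors' question-generation pipeline: the toy triples, the relation names, and the helpers `build_index` and `project` are all assumptions introduced for this example.

```python
from collections import defaultdict

# Toy (head, relation, tail) triples. Illustrative only; not taken from
# the CLR-Fact benchmark, Freebase, or PrimeKG.
TRIPLES = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Interstellar", "directed_by", "Christopher Nolan"),
    ("Titanic", "directed_by", "James Cameron"),
    ("Inception", "stars", "Leonardo DiCaprio"),
    ("Titanic", "stars", "Leonardo DiCaprio"),
    ("Interstellar", "stars", "Matthew McConaughey"),
    ("Christopher Nolan", "born_in", "London"),
    ("James Cameron", "born_in", "Kapuskasing"),
]


def build_index(triples):
    """Hypothetical helper: index triples in both directions."""
    forward = defaultdict(set)  # (head, relation) -> tails
    inverse = defaultdict(set)  # (tail, relation) -> heads
    for head, relation, tail in triples:
        forward[(head, relation)].add(tail)
        inverse[(tail, relation)].add(head)
    return forward, inverse


def project(entities, relation, index):
    """Follow `relation` from every entity in the set and pool the results."""
    reached = set()
    for entity in entities:
        reached |= index.get((entity, relation), set())
    return reached


forward, inverse = build_index(TRIPLES)

nolan_films = project({"Christopher Nolan"}, "directed_by", inverse)
cameron_films = project({"James Cameron"}, "directed_by", inverse)
dicaprio_films = project({"Leonardo DiCaprio"}, "stars", inverse)

# Multi-hop: where was the director of Inception born?
two_hop = project(project({"Inception"}, "directed_by", forward), "born_in", forward)

# Intersection: films directed by Nolan AND starring DiCaprio.
both = nolan_films & dicaprio_films

# Union: films directed by Nolan OR by Cameron.
either = nolan_films | cameron_films

# Negation (set difference): films starring DiCaprio NOT directed by Nolan.
excluded = dicaprio_films - nolan_films

print(sorted(two_hop))   # ['London']
print(sorted(both))      # ['Inception']
print(sorted(either))    # ['Inception', 'Interstellar', 'Titanic']
print(sorted(excluded))  # ['Titanic']
```

A symbolic engine computes these answers exactly by set algebra. An LLM, by contrast, has to recall each intermediate set from its parameters before combining them, which is one way to read the reported gap between union and intersection questions.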