Abstract: Large language models (LLMs) have revolutionized many areas (e.g., natural language processing and software engineering) by achieving state-of-the-art performance on a wide range of downstream tasks. With the aim of achieving robust and general artificial intelligence, there has been a surge of interest in investigating the reasoning ability of LLMs. However, the textual and numerical reasoning benchmarks adopted by previous works are rather shallow and simple, so it is hard to conclude that LLMs possess strong reasoning ability merely because they achieve positive results on these benchmarks. Recent efforts, which evaluate LLMs on reinforcement learning benchmarks, have demonstrated that they are poor at solving sequential decision-making problems that require common-sense planning. In this work, we conduct an in-depth assessment of several state-of-the-art LLMs' reasoning ability based on the inductive logic programming (ILP) benchmark, which is broadly recognized as a representative and challenging measurement for evaluating logic program induction/synthesis systems, as it requires inducing strict cause-effect logic to achieve robust deduction on independent and identically distributed (IID) and out-of-distribution (OOD) test samples. Our evaluations illustrate that, compared with neural program induction systems, which are much smaller in model size, the state-of-the-art LLMs are much poorer in reasoning ability, achieving much lower performance and generalization under either natural language prompting or truth-value matrix prompting.

What problem does this paper attempt to address?

This paper evaluates the relational reasoning capabilities of large language models (LLMs). In particular, it asks how LLMs perform on tasks requiring logical reasoning compared with neural program induction (NPI) models. The paper points out that although LLMs have made remarkable progress in natural language processing, software engineering, and other fields, and have performed well on many benchmarks, those benchmarks are often relatively simple and cannot comprehensively assess reasoning ability. The authors therefore use inductive logic programming (ILP) benchmarks to evaluate the relational reasoning capabilities of several state-of-the-art LLMs in depth, in order to measure their performance on complex logical reasoning tasks more accurately.

Specifically, the paper focuses on the following research questions:

1. How well do LLMs perform relational reasoning under standard natural language prompts?
2. How well do LLMs perform relational reasoning under truth-value matrix prompts?
3. Can the latest prompting techniques effectively improve the relational reasoning capabilities of LLMs?

To answer these questions, the authors designed a general evaluation pipeline, consisting of a sample generator, a modality compiler, and an evaluation module, to assess LLMs and NPI models on two benchmark tasks: family tree reasoning and general graph reasoning. The evaluation results show that although LLMs perform well on some relatively simple tasks, their performance is significantly inferior to that of NPI models on tasks requiring complex logical reasoning. This indicates that current LLMs still have much room for improvement in relational reasoning.
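The three-stage pipeline described above (sample generator, modality compiler, evaluation module) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every function name is hypothetical, the choice of "grandparent" as the target relation is an assumption made for the example, and the paper's actual sampling scheme, prompt wording, and metrics are not given in this summary.

```python
import random


def generate_family(num_people, seed=0):
    """Sample generator (hypothetical): a random, acyclic parent relation.

    parent[a][c] is True when person a is a parent of person c. Each person
    can only have parents with a smaller index, so no cycles can occur.
    """
    rng = random.Random(seed)
    parent = [[False] * num_people for _ in range(num_people)]
    for child in range(1, num_people):
        for p in rng.sample(range(child), k=min(2, child)):
            parent[p][child] = True
    return parent


def grandparent_truth(parent):
    """Ground truth for the assumed target relation: a is a grandparent of c."""
    n = len(parent)
    return [
        [any(parent[a][m] and parent[m][c] for m in range(n)) for c in range(n)]
        for a in range(n)
    ]


def to_natural_language(parent):
    """Modality compiler, variant 1: one sentence per known fact."""
    n = len(parent)
    return [
        f"P{a} is a parent of P{c}."
        for a in range(n)
        for c in range(n)
        if parent[a][c]
    ]


def to_truth_matrix(parent):
    """Modality compiler, variant 2: the relation as a 0/1 matrix in text form."""
    return "\n".join(" ".join("1" if v else "0" for v in row) for row in parent)


def accuracy(predicted, target):
    """Evaluation module: fraction of (a, c) pairs predicted correctly."""
    n = len(target)
    correct = sum(
        predicted[a][c] == target[a][c] for a in range(n) for c in range(n)
    )
    return correct / (n * n)
```

In use, the natural language sentences or the matrix text would be placed in a prompt, the model's answer parsed back into a boolean matrix, and that matrix scored with `accuracy` against `grandparent_truth`. Evaluating on a larger `num_people` than any in-context examples used is one simple way to probe out-of-distribution generalization under these assumptions.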

LLMs for Relational Reasoning: How Far are We?

Concise and Organized Perception Facilitates Large Language Models for Deductive Reasoning.

Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs

CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

Can Large Language Models Reason? A Characterization via 3-SAT

Can LLMs Reason in the Wild with Programs?

Leveraging LLMs for Hypothetical Deduction in Logical Inference: A Neuro-Symbolic Approach

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

Reliable Reasoning Beyond Natural Language

Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs

Beyond LLMs: Advancing the Landscape of Complex Reasoning

Conditional and Modal Reasoning in Large Language Models

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models

Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning

Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning

Case Study: Testing Model Capabilities in Some Reasoning Tasks

Are LLMs the Master of All Trades? : Exploring Domain-Agnostic Reasoning Skills of LLMs