Abstract: Logical reasoning plays a fundamental role in knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). However, whether LLMs can effectively handle logical reasoning, which requires gradual cognitive inference similar to that of human intelligence, remains an open question. This paper aims to bridge this gap with comprehensive evaluations. First, to make the evaluation systematic, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive and mixed-form reasoning settings. For comprehensiveness, we include three representative early-era LLMs and four trending LLMs. Second, unlike previous evaluations that rely only on simple metrics (e.g., \emph{accuracy}), we propose fine-level evaluations in both objective and subjective manners, covering answers and explanations through four metrics: \emph{answer correctness}, \emph{explain correctness}, \emph{explain completeness} and \emph{explain redundancy}. Additionally, to uncover the logical flaws of LLMs, we attribute problematic cases to five error types along two dimensions, i.e., the \emph{evidence selection process} and the \emph{reasoning process}. Third, to avoid the influence of knowledge bias and focus purely on benchmarking the logical reasoning capability of LLMs, we propose a new dataset with neutral content. Based on these in-depth evaluations, we finally form a general evaluation scheme for logical reasoning capability along six dimensions (i.e., \emph{Correct}, \emph{Rigorous}, \emph{Self-aware}, \emph{Active}, \emph{Oriented} and \emph{No hallucination}). The scheme reflects the pros and cons of LLMs and suggests directions for future work.

Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions

LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

LLMs for Relational Reasoning: How Far are We?

Large Language Models Are Not Strong Abstract Reasoners

A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

Can Large Language Models Reason? A Characterization via 3-SAT

Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Can Large Language Models Act as Symbolic Reasoners?

Reliable Reasoning Beyond Natural Language

CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge

Benchmarking Large Language Models for Math Reasoning Tasks

Coupling Large Language Models with Logic Programming for Robust and General Reasoning from Text

Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

Reasoning with Large Language Models, a Survey

Case Study: Testing Model Capabilities in Some Reasoning Tasks

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

GLoRE: Evaluating Logical Reasoning of Large Language Models

CLR-Bench: Evaluating Large Language Models in College-level Reasoning