Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus

Seungpil Lee, Woochang Sim, Donghyeon Shin, Wongyu Seo, Jiwon Park, Seokki Lee, Sanha Hwang, Sejin Kim, Sundong Kim

2024-09-13

Abstract: The existing methods for evaluating the inference abilities of Large Language Models (LLMs) have been results-centric, making it difficult to assess the inference process. We introduce a new approach using the Abstraction and Reasoning Corpus (ARC) dataset to evaluate the inference and contextual understanding abilities of large language models in a process-centric manner. ARC demands rigorous logical structures for problem-solving, making it a benchmark that facilitates the comparison of model inference abilities with humans. Experimental results confirm that while large language models possess weak inference abilities, they still lag in terms of logical coherence, compositionality, and productivity. Our experiments highlight the reasoning capabilities of LLMs, proposing development paths for achieving human-level reasoning.

Computation and Language, Artificial Intelligence, Emerging Technologies, Symbolic Computation

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper explores the reasoning capabilities of large language models (LLMs) and evaluates their reasoning and contextual understanding using the Abstraction and Reasoning Corpus (ARC) dataset. Specifically, it addresses the following core issues:

1. **Evaluation Methods for Reasoning Ability**:
   - Existing evaluation methods are results-centric, which makes the reasoning process itself difficult to assess. The paper therefore introduces a process-centric method that uses the ARC dataset to evaluate the reasoning abilities of LLMs.
2. **Logical Coherence, Compositionality, and Productivity**:
   - Based on the "Language of Thought Hypothesis" (LoTH), the paper evaluates the reasoning abilities of LLMs along three aspects: logical coherence, compositionality, and productivity. These three aspects are considered key components of human reasoning.
3. **Advantages of ARC as a Benchmark**:
   - Solving ARC tasks requires extracting compositional semantics and combining them, which is consistent with the view of LoTH. ARC also allows tasks to be modified and generated, so evaluation objectives can be adjusted flexibly.
4. **Differences in Reasoning Abilities Between LLMs and Humans**:
   - Experimental results show that although LLMs exhibit basic understanding abilities in some aspects, they still fall short of humans in logical coherence, compositionality, and productivity. The paper therefore proposes directions for improving the reasoning abilities of LLMs.

In summary, by systematically analyzing the performance of LLMs on the ARC dataset, the paper identifies deficiencies in their reasoning abilities and proposes development paths toward human-level reasoning.
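To make the notion of compositionality on ARC more concrete, the following is a minimal sketch, not the paper's evaluation code. It assumes the publicly documented ARC task layout (a `train` and a `test` list of `input`/`output` grid pairs, with cells holding color indices 0 to 9). The toy task, the two primitive transformations, and the helper names `compose` and `solves_task` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # assumed ARC-style grid: rows of color indices 0-9
Transform = Callable[[Grid], Grid]


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror each row left-to-right (hypothetical primitive)."""
    return [list(reversed(row)) for row in grid]


def rotate_clockwise(grid: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise (hypothetical primitive)."""
    return [list(col) for col in zip(*grid[::-1])]


def compose(*steps: Transform) -> Transform:
    """Chain primitives into one program, applied left to right."""
    def run(grid: Grid) -> Grid:
        for step in steps:
            grid = step(grid)
        return grid
    return run


def solves_task(program: Transform, pairs: List[Dict[str, Grid]]) -> bool:
    """A program counts as correct only if every output matches exactly."""
    return all(program(pair["input"]) == pair["output"] for pair in pairs)


# Hypothetical toy task in the assumed ARC layout (not taken from the dataset).
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]},
        {"input": [[1, 2], [3, 4]], "output": [[4, 2], [3, 1]]},
    ],
    "test": [
        {"input": [[7, 0], [0, 8]], "output": [[8, 0], [0, 7]]},
    ],
}

# Neither primitive alone explains the training pairs; their composition does.
candidate = compose(flip_horizontal, rotate_clockwise)
print(solves_task(flip_horizontal, toy_task["train"]))   # False
print(solves_task(rotate_clockwise, toy_task["train"]))  # False
print(solves_task(candidate, toy_task["train"]))         # True
print(solves_task(candidate, toy_task["test"]))          # True
```

In this sketch the rule must be inferred from a few demonstration pairs and then applied to an unseen grid, and success depends on combining simpler steps. A process-centric evaluation would look at which steps a model chooses and in what order, rather than only at whether the final grid matches.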

Concise and Organized Perception Facilitates Large Language Models for Deductive Reasoning

Towards Reasoning in Large Language Models: A Survey

Large Language Models Are Not Strong Abstract Reasoners

Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Disentangling Memory and Reasoning Ability in Large Language Models

CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge

Intelligence Analysis of Language Models

Large Language Models Are In-Context Semantic Reasoners Rather Than Symbolic Reasoners

Reasoning in Large Language Models: A Geometric Perspective

Can Large Language Models Act as Symbolic Reasoners?

"I'd Like to Have an Argument, Please": Argumentative Reasoning in Large Language Models

Case Study: Testing Model Capabilities in Some Reasoning Tasks

Can Large Language Models Reason? A Characterization via 3-SAT

Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs

LLMs for Relational Reasoning: How Far are We?

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

Coupling Large Language Models with Logic Programming for Robust and General Reasoning from Text

LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations

A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences