Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus

Seungpil Lee, Woochang Sim, Donghyeon Shin, Wongyu Seo, Jiwon Park, Seokki Lee, Sanha Hwang, Sejin Kim, Sundong Kim
2024-09-13
Abstract: The existing methods for evaluating the inference abilities of Large Language Models (LLMs) have been results-centric, making it difficult to assess the inference process. We introduce a new approach using the Abstraction and Reasoning Corpus (ARC) dataset to evaluate the inference and contextual understanding abilities of large language models in a process-centric manner. ARC demands rigorous logical structures for problem-solving, making it a benchmark that facilitates the comparison of model inference abilities with humans. Experimental results confirm that while large language models possess weak inference abilities, they still lag in terms of logical coherence, compositionality, and productivity. Our experiments highlight the reasoning capabilities of LLMs, proposing development paths for achieving human-level reasoning.
Computation and Language, Artificial Intelligence, Emerging Technologies, Symbolic Computation
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper explores the reasoning capabilities of large language models (LLMs) and uses the Abstraction and Reasoning Corpus (ARC) dataset to evaluate their abilities in reasoning and contextual understanding. Specifically, it addresses the following core issues:

1. **Evaluation Methods for Reasoning Ability**:
   - Existing evaluation methods are results-centric, which makes it difficult to assess the reasoning process itself. The paper therefore introduces a process-centric method that uses the ARC dataset to evaluate the reasoning abilities of LLMs.
2. **Logical Coherence, Compositionality, and Productivity**:
   - Based on the "Language of Thought Hypothesis" (LoTH), the paper evaluates the reasoning abilities of LLMs along three aspects: logical coherence, compositionality, and productivity. These three aspects are considered key components of human reasoning.
3. **Advantages of ARC as a Benchmark**:
   - Solving ARC tasks requires extracting compositional semantics and combining them, which is consistent with the views of LoTH. Additionally, ARC allows tasks to be modified and generated, so evaluation objectives can be adjusted flexibly.
4. **Differences in Reasoning Abilities Between LLMs and Humans**:
   - Experimental results show that although LLMs exhibit basic understanding abilities in some aspects, they still fall short of humans in logical coherence, compositionality, and productivity. The paper therefore proposes directions for improving the reasoning abilities of LLMs.

In summary, by systematically analyzing and evaluating the performance of LLMs on the ARC dataset, this paper reveals deficiencies in the reasoning abilities of LLMs and proposes directions for improvement.
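
To make the compositionality point concrete, the following is a minimal Python sketch of what combining primitive steps on an ARC-style grid can look like. It is an illustration only, not the paper's evaluation code: the helper names (`flip_horizontal`, `transpose`, `compose`, `solves`) and the toy demonstration pairs are assumptions introduced here. It relies only on the fact that ARC tasks are given as input/output pairs of small integer grids.

```python
from typing import Callable, List, Sequence, Tuple

# An ARC grid is a small 2D array of color indices (integers 0-9).
Grid = List[List[int]]
Transform = Callable[[Grid], Grid]


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror every row left to right."""
    return [list(reversed(row)) for row in grid]


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


def compose(*steps: Transform) -> Transform:
    """Chain primitive transformations, applied left to right."""
    def program(grid: Grid) -> Grid:
        for step in steps:
            grid = step(grid)
        return grid
    return program


def solves(program: Transform, pairs: Sequence[Tuple[Grid, Grid]]) -> bool:
    """Check that a candidate program reproduces every demonstration pair."""
    return all(program(inp) == out for inp, out in pairs)


# Hypothetical toy task (not taken from ARC): rotate the grid 90 degrees clockwise.
demo_pairs = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[0, 5, 0], [0, 0, 0]], [[0, 0], [0, 5], [0, 0]]),
]

# Neither primitive solves the task alone, but their composition does.
rotate_cw = compose(transpose, flip_horizontal)
print(solves(flip_horizontal, demo_pairs))
print(solves(transpose, demo_pairs))
print(solves(rotate_cw, demo_pairs))
```

In this toy setting, a solver has to find a combination of known steps rather than a single matching step, which is the kind of ability the compositionality criterion targets. A process-centric evaluation would inspect the intermediate steps as well as the final grid.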