Do Large Language Models Solve ARC Visual Analogies Like People Do?

Gustaw Opiełka, Hannes Rosenbusch, Veerle Vijverberg, Claire E. Stevenson

2024-05-13

Abstract: The Abstraction Reasoning Corpus (ARC) is a visual analogical reasoning test designed for humans and machines (Chollet, 2019). We compared human and large language model (LLM) performance on a new child-friendly set of ARC items. Results show that both children and adults outperform most LLMs on these tasks. Error analysis revealed a similar "fallback" solution strategy in LLMs and young children, where part of the analogy is simply copied. In addition, we found two other error types, one based on seemingly grasping key concepts (e.g., Inside-Outside) and the other based on simple combinations of analogy input matrices. On the whole, "concept" errors were more common in humans, and "matrix" errors were more common in LLMs. This study sheds new light on LLM reasoning ability and the extent to which we can use error analyses and comparisons with human development to understand how LLMs solve visual analogies.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper examines how large language models (LLMs) perform on visual analogy problems and compares their performance with that of humans, particularly children. The study focuses on the Abstraction Reasoning Corpus (ARC), a visual analogical reasoning test designed for both humans and machines. The researchers created two simplified, child-friendly sets of ARC items, KidsARC-Simple and KidsARC-Concept, intended for testing children as well as LLMs.

Key findings of the paper include:

- Both children and adults outperform most LLMs on these visual analogy tasks.
- Error analysis revealed a "fallback" strategy in LLMs similar to that of young children, in which part of the analogy is simply copied.
- The study identified two further error types: one in which the solver seemingly grasps a key concept (e.g., Inside-Outside) but deviates in execution, and another based on simple combinations of the analogy input matrices.
- "Concept" errors are more common among human participants, while "matrix" errors are more common in LLMs.
- These findings offer new insight into the reasoning abilities of LLMs and how they approach visual analogy problems.

In summary, the paper shows how humans and LLMs differ in solving visual analogy tasks and, through comparative error analysis, highlights the limitations of LLMs on such tasks.
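
To make the "fallback" copy error concrete, the following is a minimal sketch of how a response to an analogy of the form A : B :: C : ? might be checked for copying. It is an illustration only, not the authors' analysis code: the function name `classify_response`, the label strings, and the representation of grids as nested lists of integers are all assumptions made for this example.

```python
from typing import List

# Assumption: each analogy part is a small grid of integer color codes.
Grid = List[List[int]]


def classify_response(a: Grid, b: Grid, c: Grid, target: Grid, response: Grid) -> str:
    """Label a response to the analogy A : B :: C : ? (hypothetical labels)."""
    if response == target:
        return "correct"
    # "Fallback" copy: the answer duplicates one of the given analogy parts.
    for name, grid in (("A", a), ("B", b), ("C", c)):
        if response == grid:
            return f"copy_{name}"
    # Concept and matrix errors would need task-specific checks not shown here.
    return "other"


# Toy item: the rule moves the colored cell from the left to the right.
a, b = [[1, 0]], [[0, 1]]
c, target = [[2, 0]], [[0, 2]]

print(classify_response(a, b, c, target, [[0, 2]]))
print(classify_response(a, b, c, target, [[2, 0]]))
print(classify_response(a, b, c, target, [[2, 2]]))
```

In this toy item, a response identical to C is labeled as a copy, while any other incorrect grid falls into a residual category that a fuller analysis would split further.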

Do large language models solve verbal analogies like children do?

Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus

Evaluating the Robustness of Analogical Reasoning in Large Language Models

Large Language Models Are Not Strong Abstract Reasoners

Intelligence Analysis of Language Models

Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models

Abstract Visual Reasoning Enabled by Language

Beneath Surface Similarity: Large Language Models Make Reasonable Scientific Analogies after Structure Abduction

LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations

Language models show human-like content effects on reasoning tasks

Language models, like humans, show content effects on reasoning tasks

Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning

Semantic Structure-Mapping in LLM and Human Analogical Reasoning

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey

Evaluating the Deductive Competence of Large Language Models

KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models

Evidence from counterfactual tasks supports emergent analogical reasoning in large language models

Can Large Language Models Act as Symbolic Reasoners?

ARN: Analogical Reasoning on Narratives

Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models