Do Large Language Models Solve ARC Visual Analogies Like People Do?

Gustaw Opiełka, Hannes Rosenbusch, Veerle Vijverberg, Claire E. Stevenson
2024-05-13
Abstract: The Abstraction and Reasoning Corpus (ARC) is a visual analogical reasoning test designed for humans and machines (Chollet, 2019). We compared human and large language model (LLM) performance on a new child-friendly set of ARC items. Results show that both children and adults outperform most LLMs on these tasks. Error analysis revealed a similar "fallback" solution strategy in LLMs and young children, where part of the analogy is simply copied. In addition, we found two other error types, one based on seemingly grasping key concepts (e.g., Inside-Outside) and the other based on simple combinations of analogy input matrices. On the whole, "concept" errors were more common in humans, and "matrix" errors were more common in LLMs. This study sheds new light on LLM reasoning ability and the extent to which we can use error analyses and comparisons with human development to understand how LLMs solve visual analogies.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper examines how well large language models (LLMs) solve visual analogy problems and compares their performance with that of humans, particularly children. It focuses on the Abstraction and Reasoning Corpus (ARC), a task set designed to evaluate visual analogical reasoning in both humans and machines. The researchers created two simplified, child-friendly ARC item sets, KidsARC-Simple and KidsARC-Concept, so that children as well as LLMs could be tested on the same items.

Key findings of the paper include:
- Both children and adults outperform most LLMs on these visual analogy tasks.
- Error analysis revealed a "fallback" strategy shared by LLMs and young children, in which part of the analogy is simply copied as the answer.
- The study identified two further error types: one in which the solver seems to grasp the key concept (e.g., Inside-Outside) but executes it incorrectly, and another based on simple combinations of the analogy's input matrices.
- "Concept" errors were more common among human participants, while "matrix" errors were more common among LLMs.
- These findings offer new insight into LLM reasoning ability and into how far error analyses and comparisons with human development can explain the way LLMs approach visual analogies.

In summary, the paper uses comparative error analysis to show where humans and LLMs differ in solving visual analogies and where LLMs fall short on such tasks.
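To make the error categories above more concrete, here is a minimal Python sketch of how a response to an analogy A : B :: C : ? might be labelled automatically. It is not the authors' coding scheme. The function names `overlay` and `classify_response`, the convention that 0 marks a background cell, and the choice of a cell-wise overlay as the "simple combination" of input matrices are all assumptions made for illustration. "Concept" errors are not detected here, since judging them likely requires item-specific knowledge of the intended concept.

```python
from itertools import permutations
from typing import List, Optional

Grid = List[List[int]]  # assumed encoding: 0 = background, other ints = colours


def overlay(top: Grid, bottom: Grid) -> Optional[Grid]:
    """Cell-wise overlay: keep top's cell unless it is background (0).

    Returns None when the two grids differ in shape.
    """
    if len(top) != len(bottom):
        return None
    if any(len(rt) != len(rb) for rt, rb in zip(top, bottom)):
        return None
    return [
        [t if t != 0 else b for t, b in zip(rt, rb)]
        for rt, rb in zip(top, bottom)
    ]


def classify_response(a: Grid, b: Grid, c: Grid, target: Grid, response: Grid) -> str:
    """Label a response as 'correct', 'copy', 'matrix', or 'other'.

    'copy'   : the response duplicates one of the input grids (A, B, or C).
    'matrix' : the response equals an overlay of two input grids
               (one hypothetical kind of simple matrix combination).
    """
    if response == target:
        return "correct"
    inputs = [a, b, c]
    if any(response == grid for grid in inputs):
        return "copy"
    for x, y in permutations(inputs, 2):
        if overlay(x, y) == response:
            return "matrix"
    return "other"


# Toy item: a single coloured cell moves one step to the right.
A = [[1, 0], [0, 0]]
B = [[0, 1], [0, 0]]
C = [[0, 0], [1, 0]]
TARGET = [[0, 0], [0, 1]]

print(classify_response(A, B, C, TARGET, C))                 # copies C
print(classify_response(A, B, C, TARGET, [[0, 1], [1, 0]]))  # overlay of B and C
```

Checking for an exact copy before checking combinations mirrors the idea that copying is the simplest fallback; a fuller analysis would need additional combination rules and human judgement for concept-based errors.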