Studying and improving reasoning in humans and machines

Nicolas Yax, Hernan Anlló, Stefano Palminteri

2023-09-22

Abstract: In the present study, we investigate and compare reasoning in large language models (LLMs) and humans using a selection of cognitive psychology tools traditionally dedicated to the study of (bounded) rationality. To do so, we presented new variants of classical cognitive experiments to human participants and an array of pretrained LLMs, and cross-compared their performance. Our results showed that most of the included models presented reasoning errors akin to those frequently ascribed to error-prone, heuristic-based human reasoning. Notwithstanding this superficial similarity, an in-depth comparison between humans and LLMs indicated important differences from human-like reasoning, with model limitations disappearing almost entirely in more recent LLM releases. Moreover, we show that while it is possible to devise strategies to induce better performance, humans and machines are not equally responsive to the same prompting schemes. We conclude by discussing the epistemological implications and challenges of comparing human and machine behavior for both artificial intelligence and cognitive psychology.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to study and compare the reasoning abilities of large language models (LLMs) and humans. Specifically, the authors use a series of cognitive psychology tools, particularly classic cognitive experiments, to assess whether these models exhibit reasoning errors and cognitive biases similar to those of humans.

**The main issues include:**

1. **Do LLMs exhibit bounded rationality like humans?**
   - The authors designed new variants of cognitive tests to avoid models generating memorized answers from classic tests already present in the training data, thereby genuinely evaluating the models' reasoning abilities.
2. **Are there differences in reasoning abilities among different versions of LLMs?**
   - The authors compared GPT models of varying complexity and tuning levels (such as GPT-3, GPT-3.5, and GPT-4) to explore the impact of model complexity on reasoning performance.
3. **Can prompt engineering improve the reasoning performance of LLMs?**
   - The authors tested different prompting strategies, such as "chain of thought" and "in-context learning," to evaluate whether these methods can enhance the models' reasoning accuracy.
4. **Do humans and LLMs respond consistently to the same prompting strategies?**
   - The authors compared the performance of humans and models under different prompting conditions to explore the similarities and differences in their reasoning processes.

Through these studies, the authors hope not only to better understand the cognitive abilities of LLMs but also to explore whether these models can be used as tools for studying human decision-making and reasoning.
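To make the prompting comparison above more concrete, here is a minimal Python sketch of how a single test item might be wrapped in a baseline, "chain of thought", or in-context prompt, and how a free-text answer might be sorted into correct, intuitive (heuristic), or other. Everything in it is an assumption for illustration: the item wording, the numbers, the prompt templates, and the names `build_prompt` and `classify` are hypothetical and are not taken from the paper.

```python
import re
from fractions import Fraction

# Hypothetical item in the style of a cognitive reflection test. The wording
# and numbers are illustrative only and do not come from the paper.
ITEM = (
    "A notebook and a pen cost 2.20 in total. "
    "The notebook costs 2.00 more than the pen. "
    "How much does the pen cost?"
)
CORRECT = Fraction("0.10")    # pen + (pen + 2.00) = 2.20, so pen = 0.10
INTUITIVE = Fraction("0.20")  # tempting shortcut: 2.20 - 2.00


def build_prompt(item, strategy="baseline", examples=()):
    """Wrap an item in one of three assumed prompt templates."""
    if strategy == "baseline":
        return f"Q: {item}\nA:"
    if strategy == "chain_of_thought":
        # A commonly used zero-shot cue; the paper's exact wording may differ.
        return f"Q: {item}\nA: Let's think step by step."
    if strategy == "in_context":
        # Prepend solved (question, answer) pairs before the target item.
        shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in examples)
        return f"{shots}Q: {item}\nA:"
    raise ValueError(f"unknown strategy: {strategy}")


def classify(response):
    """Label a free-text response by the last number it contains.

    This is a crude heuristic: it ignores units, so "10 cents" is not
    recognised as 0.10.
    """
    numbers = re.findall(r"\d+(?:\.\d+)?", response)
    if not numbers:
        return "other"
    value = Fraction(numbers[-1])
    if value == CORRECT:
        return "correct"
    if value == INTUITIVE:
        return "intuitive"
    return "other"
```

Under these assumptions, the same `classify` function could be applied to both human and model responses, so that the share of "intuitive" answers can be compared across prompting conditions.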

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey

(Ir)rationality and cognitive biases in large language models

Analysis of hybrid imaging techniques

Birth defects among children born to a population occupationally exposed to pesticides in Colombia.

Language models show human-like content effects on reasoning tasks

A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models

Can Large Language Models Reason? A Characterization via 3-SAT

Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning

Towards Reasoning in Large Language Models: A Survey

Reasoning with Large Language Models, a Survey

Does Reasoning Emerge? Examining the Probabilities of Causation in Large Language Models

Case Study: Testing Model Capabilities in Some Reasoning Tasks

A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

When Do Program-of-Thought Works for Reasoning?

Comparing Rationality Between Large Language Models and Humans: Insights and Open Questions

Rational Metareasoning for Large Language Models

Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Advances in reasoning by prompting large language models: A survey

Uncovering the Data-Related Limits of Human Reasoning Research: An Analysis based on Recommender Systems