Studying and improving reasoning in humans and machines

Nicolas Yax, Hernan Anlló, Stefano Palminteri
2023-09-22
Abstract: In the present study, we investigate and compare reasoning in large language models (LLMs) and humans using a selection of cognitive psychology tools traditionally dedicated to the study of (bounded) rationality. To do so, we presented to human participants and an array of pretrained LLMs new variants of classical cognitive experiments, and cross-compared their performances. Our results showed that most of the included models presented reasoning errors akin to those frequently ascribed to error-prone, heuristic-based human reasoning. Notwithstanding this superficial similarity, an in-depth comparison between humans and LLMs indicated important differences with human-like reasoning, with model limitations disappearing almost entirely in more recent LLM releases. Moreover, we show that while it is possible to devise strategies to induce better performance, humans and machines are not equally responsive to the same prompting schemes. We conclude by discussing the epistemological implications and challenges of comparing human and machine behavior for both artificial intelligence and cognitive psychology.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to study and compare the reasoning abilities of large language models (LLMs) and humans. Specifically, the authors use a series of cognitive psychology tools, particularly classic cognitive experiments, to assess whether these models exhibit reasoning errors and cognitive biases similar to those of humans.

**The main issues include:**

1. **Do LLMs exhibit bounded rationality like humans?**
   - The authors designed new variants of cognitive tests to avoid models generating memorized answers from classic tests already present in the training data, thereby genuinely evaluating the models' reasoning abilities.
2. **Are there differences in reasoning abilities among different versions of LLMs?**
   - The authors compared GPT models of varying complexity and tuning levels (such as GPT-3, GPT-3.5, and GPT-4) to explore the impact of model complexity on reasoning performance.
3. **Can prompt engineering improve the reasoning performance of LLMs?**
   - The authors tested different prompting strategies, such as "chain of thought" and "in-context learning," to evaluate whether these methods can enhance the models' reasoning accuracy.
4. **Do humans and LLMs respond consistently to the same prompting strategies?**
   - The authors compared the performance of humans and models under different prompting conditions to explore the similarities and differences in their reasoning processes.

Through these studies, the authors hope not only to better understand the cognitive abilities of LLMs but also to explore whether these models can be used as tools for studying human decision-making and reasoning.
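To make the prompting conditions mentioned above more concrete, the following is a minimal Python sketch of how a baseline prompt, a "chain of thought" prompt, and an "in-context learning" prompt might be built for a single test item, and how a response might be sorted into correct, intuitive (heuristic), or other. Everything here is an illustrative assumption rather than the authors' materials: the test item is a hypothetical variant in the style of a classic cognitive reflection question, the prompt wording (including "Let's think step by step.") is a commonly used phrasing and not necessarily what the paper used, and the `Item` class and all function names are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """A single test item (hypothetical structure, not from the paper)."""
    question: str
    intuitive_answer: str  # the tempting, heuristic-based (wrong) answer
    correct_answer: str


# Hypothetical item written for this sketch; it is NOT taken from the paper.
# Pen = p, notebook = p + 2.00, so 2p + 2.00 = 2.20 and p = 0.10.
ITEM = Item(
    question=(
        "A notebook and a pen cost $2.20 in total. The notebook costs "
        "$2.00 more than the pen. How much does the pen cost?"
    ),
    intuitive_answer="$0.20",
    correct_answer="$0.10",
)


def baseline_prompt(item: Item) -> str:
    """Plain question with no added guidance."""
    return f"Q: {item.question}\nA:"


def chain_of_thought_prompt(item: Item) -> str:
    """Same question, but the answer is seeded with a step-by-step cue."""
    return f"Q: {item.question}\nA: Let's think step by step."


def in_context_prompt(item: Item, examples: list[tuple[str, str]]) -> str:
    """Prepend solved (question, answer) pairs before the target question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\n{baseline_prompt(item)}"


def classify(response: str, item: Item) -> str:
    """Crude substring-based scoring: 'correct', 'intuitive', or 'other'."""
    if item.correct_answer in response:
        return "correct"
    if item.intuitive_answer in response:
        return "intuitive"
    return "other"
```

The same prompt strings could in principle be shown to human participants and sent to a model, which is what makes a like-for-like comparison across prompting conditions possible. The substring check in `classify` is deliberately simplistic; a real study would need more careful answer extraction.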